Re: [HACKERS] Hash Functions - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: [HACKERS] Hash Functions
Date:
Msg-id: 20170601182522.hu57alga5e5ctn5b@alap3.anarazel.de
In response to: Re: [HACKERS] Hash Functions (Robert Haas <robertmhaas@gmail.com>)
Responses:
  Re: [HACKERS] Hash Functions
  Re: [HACKERS] Hash Functions
  Re: [HACKERS] Hash Functions
List: pgsql-hackers
On 2017-06-01 13:59:42 -0400, Robert Haas wrote:
> I'm not actually aware of an instance where this has bitten anyone,
> even though it seems like it certainly could have and maybe should've
> gotten somebody at some point. Has anyone else?

Two comments: First, citus has been doing hash partitioning and
append/range partitioning for a while now, and I'm not aware of anyone
being bitten by this (although there've been plenty of other things ;)),
even though there've been cases of upgrading to different collations &
encodings. Secondly, I think that's to a significant degree caused by
the fact that in practice people far more often partition on types like
int4/int8/date/timestamp/uuid rather than text - there's rarely a good
reason to do the latter.

> Furthermore, neither range nor list partitioning depends on properties
> of the hardware, like how wide integers are, or whether they are
> stored big-endian. A naive approach to hash partitioning would depend
> on those things. That's clearly worse.

I don't think our current int4/8 hash functions depend on
FLOAT8PASSBYVAL.

> 3. Implement portable hash functions (Jeff Davis or me, not sure
> which). Andres scoffed at this idea, but I still think it might have
> legs. Coming up with a hashing algorithm for integers that produces
> the same results on big-endian and little-endian systems seems pretty
> feasible, even with the additional constraint that it should still be
> fast.

Just to clarify: I don't think it's a problem to do so for integers and
most other simple scalar types. There are plenty of hash algorithms
that are endianness-independent, and the rest is just a bit of care.

Where I see a lot more issues is doing so for more complex types like
arrays, jsonb, postgis geometry/geography types and the like, where the
fast and simple implementation is to just hash the entire datum - and
that'll very commonly not be portable at all, due to differences in
padding and type widths.

> My personal guess is that most people will prefer the fast
> hash functions over the ones that solve their potential future
> migration problems, but, hey, options are good.

I'm pretty sure that will be the case. I'm not sure that adding
infrastructure to allow for something that nobody will use in practice
is a good idea. If there ends up being demand for it, we can still go
there.

I think the number of people who migrate between architectures is low
enough that this isn't going to be a very common issue. Having some
feasible way around this is important, but I don't think we should
optimize heavily for it by developing new infrastructure / complicating
the experience for the 'normal' uses.

Greetings,

Andres Freund
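PS: To make the "just a bit of care" part concrete, here's a minimal
sketch of an endianness-independent integer hash. This is not
PostgreSQL's actual hashint8()/hash_any(); the function names and the
choice of 32-bit FNV-1a are purely illustrative. The trick is to
canonicalize the value to a fixed byte order first, then run any
byte-oriented hash over the canonical bytes:

#include <stdint.h>

static uint32_t
fnv1a(const unsigned char *buf, int len)
{
	/* plain 32-bit FNV-1a over a byte buffer */
	uint32_t	h = 2166136261u;

	for (int i = 0; i < len; i++)
	{
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

uint32_t
portable_hash_int64(int64_t value)
{
	unsigned char buf[8];
	uint64_t	v = (uint64_t) value;

	/* serialize to little-endian bytes regardless of host byte order */
	for (int i = 0; i < 8; i++)
		buf[i] = (unsigned char) (v >> (i * 8));

	return fnv1a(buf, 8);
}

Whereas with the "hash the entire datum" shortcut for composite types,
the raw bytes aren't canonical to begin with. Hypothetical example:

typedef struct
{
	int32_t		a;	/* most 64-bit ABIs insert 4 padding bytes here */
	int64_t		b;
} two_fields;

two_fields	val = {.a = 1, .b = 2};

/*
 * This hashes the padding bytes too, so the result varies across ABIs
 * (and isn't even deterministic unless the padding was zeroed):
 */
uint32_t	h = fnv1a((const unsigned char *) &val, sizeof(val));

A portable version would have to hash field by field in a canonical
byte order, which is exactly the extra work being discussed above.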