Thread: Combining hash values
Hi hackers,

Andres Freund asked me off-list what I thought about this part of execGrouping.c, which builds a hash from the hashes of several datums in a loop:

    /* rotate hashkey left 1 bit at each step */
    hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
    ...
    hashkey ^= hkey;

I think we should consider improving it and putting it somewhere central.

1.  The same code also appears in nodeHash.c, and we also need to do the same thing in a couple of new projects that my EDB colleagues and I are working on for proposal as 10.0 features, based on DSM-backed hash tables.  So I think it would make sense to define a hash_combine function or macro to be reused by all such places rather than repeating it everywhere.

2.  I suspect that this algorithm for combining hashes is weak and could amplify weaknesses in the hash functions feeding it.  Compare Boost's hash_combine, which does some more work before XORing:

    seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);

That constant approximates the golden ratio (as a fraction of the 32-bit hash space), and it also appears in our hash_any and hash_uint32 functions.  I don't claim to understand the mathematics, but I think this may have to do with TAOCP section 6, Theorem S and exercises 8 and 9, though I'm not sure whether it's being used for the same purpose in algorithms that add it (is it just some random bits? [1][2]) as in algorithms that multiply by it [3][4].

Perhaps we could write a reusable hash_combine function/macro using existing code from hashfunc.c, or use the formula from Boost, or something else, to improve the quality of our hash combinations.  It would be very interesting to hear from someone with expertise in this area.

Hash indexes don't currently support multi-column keys.  If they ever do, they would also benefit from high quality combination.  Assuming we'd use the same algorithm there too, changing the combination algorithm after we start persisting the results to disk in hash indexes would obviously be difficult.  There doesn't seem to be any reason why we can't change it before then.

[1] http://stackoverflow.com/questions/4948780/magic-number-in-boosthash-combine
[2] https://xkcd.com/221/
[3] http://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a
[4] http://community.haskell.org/~simonmar/base/src/Data-HashTable.html

-- 
Thomas Munro
http://www.enterprisedb.com
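For concreteness, here is a minimal sketch of the kind of centralized helper being proposed, assuming the Boost formula were the one adopted.  The function name, its eventual home, and the standalone wrapper below are hypothetical; only the formula itself comes from Boost.

    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    /*
     * Hypothetical sketch of a reusable hash_combine() using Boost's formula.
     * A real patch would presumably use PostgreSQL's uint32 typedef and live
     * in a shared header; uint32_t and the demo main() are here only to keep
     * the example self-contained.
     */
    static inline uint32_t
    hash_combine(uint32_t seed, uint32_t hash)
    {
        seed ^= hash + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }

    int
    main(void)
    {
        /* Combine per-column hashes the way execGrouping.c's loop might. */
        uint32_t hashkey = 0;
        uint32_t column_hashes[] = {0x0000002a, 0x00000007};   /* made-up inputs */

        for (size_t i = 0; i < sizeof(column_hashes) / sizeof(column_hashes[0]); i++)
            hashkey = hash_combine(hashkey, column_hashes[i]);

        printf("combined: %08" PRIx32 "\n", hashkey);
        return 0;
    }

The calling loop mirrors the existing rotate-and-XOR loop; only the mixing step changes.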
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> 2. I suspect that this algorithm for combining hashes is weak, and
> could amplify weaknesses in the hash functions feeding it.

Very possibly, but ...

> Compare Boost's hash_combine, which does some more work before XORing:
>
> seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);

I can't help being reminded of Knuth's story about how he tried to invent the world's best random number generator, and was disappointed when it almost immediately converged to a short repeating sequence.  If there's any actual theoretical basis to the above, I'd be interested to see it.  But as-is, the use of addition rather than XOR looks fishy, and the way the old seed is shifted around looks more likely to result in cancellation than anything else.

> That constant approximates the golden ratio (as a fraction of the 32
> bit hash space), and it also appears in our hash_any and hash_uint32
> functions.

I think it's just somebody's idea of a magic random number.  Your link

> [3] http://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a

provides some reasons to think it might be a good choice for a very specific application, but this is not that application --- in particular, it doesn't involve multiplying by that constant.

			regards, tom lane
On Mon, Aug 1, 2016 at 4:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> 2. I suspect that this algorithm for combining hashes is weak, and
>> could amplify weaknesses in the hash functions feeding it.
>
> Very possibly, but ...

Concrete example: suppose a clever data type author defines a perfect hash function that maps values to small integers.  In terms of hash collisions, this performs optimally in a single-column hash join, agg, index etc by definition, and seems like an entirely reasonable thing to want to do, but it performs terribly with Postgres's hash combination algorithm as soon as you use several columns.  The stuffing is all up one end of the cushion, which is OK for a perfect hash on its own, but we do very little to spread it around when combining.  For two columns that hash to 8-bit integers, we map 16 bits of information to only 9 bits:

    postgres=# select count(*) as distinct_outputs,
    postgres-#        min(collisions),
    postgres-#        max(collisions),
    postgres-#        avg(collisions),
    postgres-#        stddev(collisions)
    postgres-# from (
    postgres(#   select s1.i # (s2.i << 1),
    postgres(#          count(*) as collisions
    postgres(#   from generate_series(0, 255) as s1(i)
    postgres(#   cross join generate_series(0, 255) as s2(i)
    postgres(#   group by 1
    postgres(# ) ss;
    ┌──────────────────┬─────┬─────┬──────────────────────┬────────┐
    │ distinct_outputs │ min │ max │         avg          │ stddev │
    ├──────────────────┼─────┼─────┼──────────────────────┼────────┤
    │              512 │ 128 │ 128 │ 128.0000000000000000 │      0 │
    └──────────────────┴─────┴─────┴──────────────────────┴────────┘
    (1 row)

The Boost combiner does better.  If I have this right:

    postgres=# create or replace function hash_combine(seed bigint, hash bigint)
    postgres-# returns bigint as
    postgres-# $$
    postgres$# select (seed # (hash + 2654435769 + (seed << 6) + (seed >> 2)))
    postgres$#        % 4294967296;
    postgres$# $$ language sql;
    CREATE FUNCTION
    postgres=#
    postgres=# select count(*) as distinct_outputs,
    postgres-#        min(collisions),
    postgres-#        max(collisions),
    postgres-#        avg(collisions),
    postgres-#        stddev(collisions)
    postgres-# from (
    postgres(#   select hash_combine(hash_combine(0, s1.i), s2.i),
    postgres(#          count(*) as collisions
    postgres(#   from generate_series(0, 255) as s1(i)
    postgres(#   cross join generate_series(0, 255) as s2(i)
    postgres(#   group by 1
    postgres(# ) ss;
    ┌──────────────────┬─────┬─────┬────────────────────┬────────────────────────┐
    │ distinct_outputs │ min │ max │        avg         │         stddev         │
    ├──────────────────┼─────┼─────┼────────────────────┼────────────────────────┤
    │            16583 │   1 │   9 │ 3.9519990351564856 │ 0.70952439864334926106 │
    └──────────────────┴─────┴─────┴────────────────────┴────────────────────────┘
    (1 row)

Not as good as I hoped (maybe there is a bug in my query?), but not a 128-car pile-up.

There are probably other kinds of specialised (or poorly written) hash functions that will work well or acceptably on their own, but not so well when combined without good mixing.  Like Boost, we are dealing with hashes produced by arbitrary hash functions that can be supplied by user code, not just by the high quality built-in function from Bob Jenkins.

>> Compare Boost's hash_combine, which does some more work before XORing:
>>
>> seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
>
> I can't help being reminded of Knuth's story about how he tried to invent
> the world's best random number generator, and was disappointed when
> it almost immediately converged to a short repeating sequence.  If
> there's any actual theoretical basis to the above, I'd be interested
> to see it.  But as-is, the use of addition rather than XOR looks fishy,
> and the way the old seed is shifted around looks more likely to result
> in cancellation than anything else.

I provided boost::hash_combine as an example only because it's widely known, used in C++ circles, and familiar to me, but so far I have only taken it on authority that it works.  Here's what I managed to dig up on it this morning:

According to C++ N3876[1], a proposal to add std::hash_combine (but not a specific algorithm) to the C++ standard library, Boost's algorithm comes from a paper called "Methods for Identifying Versioned and Plagiarised Documents (2003)"[2], where you can see this recipe for shifting and summing as equation 1, with some discussion and pointers to other papers.

But I see in N3876's feedback (see the comment from Jeffrey Yasskin) that Bob Jenkins' mixer (or at least an aspect of it: not requiring intermediate state to fit in the final output size) is discussed as a potential implementation.  We already have that in our source tree... perhaps we should use it.

>> That constant approximates the golden ratio (as a fraction of the 32
>> bit hash space), and it also appears in our hash_any and hash_uint32
>> functions.
>
> I think it's just somebody's idea of a magic random number.  Your link
>
>> [3] http://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a
>
> provides some reasons to think it might be a good choice for a very
> specific application, but this is not that application --- in particular,
> it doesn't involve multiplying by that constant.

Yeah.  About its use in his 1997 algorithm, Bob Jenkins says: "The golden ratio really is an arbitrary value.  Its purpose is to avoid mapping all zeros to all zeros."

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf
[2] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.2680

-- 
Thomas Munro
http://www.enterprisedb.com
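To make "perhaps we should use it" concrete, a combiner built on that mixer might look like the sketch below.  The rotation constants are transcribed from lookup3's final() step, the mixer hash_any() is based on; the function name and the choice of initial value for c are assumptions, not anything that exists in the tree.

    #include <stdint.h>

    #define rot(x, k) (((x) << (k)) | ((x) >> (32 - (k))))

    /*
     * Hypothetical sketch: combine two 32-bit hashes by pushing them through
     * the final mixing step of Bob Jenkins' lookup3.  The point is to reuse a
     * mixer we already ship rather than invent a new formula.
     */
    static inline uint32_t
    jenkins_combine(uint32_t a, uint32_t b)
    {
        uint32_t c = 0x9e3779b9;        /* arbitrary nonzero initial value */

        /* lookup3's final(a, b, c) */
        c ^= b; c -= rot(b, 14);
        a ^= c; a -= rot(c, 11);
        b ^= a; b -= rot(a, 25);
        c ^= b; c -= rot(b, 16);
        a ^= c; a -= rot(c, 4);
        b ^= a; b -= rot(a, 14);
        c ^= b; c -= rot(b, 24);

        return c;
    }

Whether this beats the simpler shift-and-add formula in practice would of course need measurement; it costs a few more instructions per combination.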
Surely combining multiple hashes is the same problem as hashing a block of memory?  Shouldn't we just use the same algorithm as hash_any()?

I was originally going to suggest using a crc to combine, but iirc we changed hash_any() a while back and decided against using crc.  I don't know if that was wise but wouldn't want to suggest relitigating that.
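That suggestion could be as small as treating the already-computed per-column hashes as a block of memory and handing it to hash_any().  A hypothetical wrapper is sketched below; hash_any() and DatumGetUInt32() are the existing PostgreSQL routines, the wrapper itself is not.

    #include "postgres.h"
    #include "access/hash.h"

    /*
     * Hypothetical wrapper, not existing code: combine two 32-bit hashes by
     * feeding them back through the general-purpose Jenkins-based hasher.
     */
    static inline uint32
    hash_combine_via_hash_any(uint32 a, uint32 b)
    {
        uint32 hashes[2] = {a, b};

        return DatumGetUInt32(hash_any((const unsigned char *) hashes,
                                       (int) sizeof(hashes)));
    }

That buys hash_any()'s mixing quality at the cost of a full hash call per combination, so comparing it against a cheaper dedicated combiner would be part of any real proposal.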
On 1 August 2016 at 08:19, Greg Stark <stark@mit.edu> wrote:
> Surely combining multiple hashes is the same problem as hashing a block of
> memory? Shouldn't we just use the same algorithm as hash_any()?

Yes, I imagine that should work well.  I suspect that Thomas's proposal would also probably work well, as would a number of other hashing algorithms with reasonable pedigree, such as the one used for array hashing.  I don't have any particular preference, but I do know that what usually turns out to be disastrous is an arbitrary made-up formula like rotating and xor'ing.  The last thing we should attempt to do is invent our own hashing algorithms.

On that subject, while looking at hashfunc.c, I spotted that hashint8() has a very obvious deficiency, which causes disastrous performance with certain inputs:

    create table foo (a bigint);
    insert into foo select (x*2^32)+x from generate_series(1,1000000) g(x);
    analyse foo;

    select * from foo f1, foo f2 where f1.a = f2.a;

With random values, that query (using a hash join) takes around 2 seconds on my machine, but with the values above I estimate it will take 20 hours (although I didn't wait that long!).  The reason is pretty obvious -- all the bigint values above hash to the same value.

I'd suggest using hash_uint32() for values that fit in a 32-bit integer and hash_any() otherwise.

Regards,
Dean
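The collapse is easy to reproduce outside the server.  The sketch below imitates, from memory and simplified, the high/low fold hashint8() performs before calling hash_uint32(): for inputs of the form (x*2^32)+x the two halves are equal, so the XOR cancels and every such value feeds the same input to hash_uint32().

    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    /*
     * Sketch (from memory, simplified) of the fold hashint8() does before
     * calling hash_uint32().  Not a copy of hashfunc.c -- just enough to
     * show why (x*2^32)+x inputs all collapse to one value.
     */
    static uint32_t
    hashint8_fold(int64_t val)
    {
        uint32_t lohalf = (uint32_t) val;
        uint32_t hihalf = (uint32_t) (val >> 32);

        /* the halves are XORed, keeping small values compatible with hashint4 */
        lohalf ^= (val >= 0) ? hihalf : ~hihalf;

        return lohalf;          /* this is what would reach hash_uint32() */
    }

    int
    main(void)
    {
        for (int64_t x = 1; x <= 5; x++)
        {
            int64_t val = (x << 32) + x;

            /* every iteration prints 0: lohalf == hihalf, so the XOR cancels */
            printf("val = %" PRId64 " -> folded input %" PRIu32 "\n",
                   val, hashint8_fold(val));
        }
        return 0;
    }

Dean's suggested fix would presumably keep the hashint4 compatibility for values that fit in 32 bits while letting hash_any() spread the high bits for the rest.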
Greg Stark <stark@mit.edu> writes:
> I was originally going to suggest using a crc to combine but iirc we
> changed hash_any() a while back and decided against using crc. I don't know
> if that was wise but wouldn't want to suggest relitigating that.

Nah, CRCs are designed to solve a different problem, ie detecting single-bit and burst errors with high probability.  In particular, I don't think they make any particular promises with respect to spreading changes into all bits of the result.  That's important for our hash functions because we usually take just the lowest N bits of the result as being adequately randomized.

			regards, tom lane
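The "lowest N bits" point is the practical constraint: when the bucket count is a power of two, bucket selection is typically just a mask of the low-order bits, along the lines of the illustrative snippet below (not a quotation of any particular backend function).

    #include <stdint.h>

    /*
     * Illustrative only: with a power-of-two bucket count, only the
     * low-order bits of the hash ever matter, which is why they have to be
     * well mixed.
     */
    static inline uint32_t
    bucket_for(uint32_t hashvalue, uint32_t nbuckets)   /* nbuckets: power of two */
    {
        return hashvalue & (nbuckets - 1);
    }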
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On that subject, while looking at hashfunc.c, I spotted that
> hashint8() has a very obvious deficiency, which causes disastrous
> performance with certain inputs:

Well, if you're trying to squeeze 64 bits into a 32-bit result, there are always going to be collisions somewhere.

> I'd suggest using hash_uint32() for values that fit in a 32-bit
> integer and hash_any() otherwise.

Perhaps, but this'd break existing hash indexes.  That might not be a fatal objection, but if we're going to put users through that I'd like to think a little bigger in terms of the benefits we get.  I've thought for some time that we need to move to 64-bit hash function results, because the size of problem that's reasonable to use a hash join or hash aggregation for keeps increasing.  Maybe we should do that and fix hashint8 as a side effect.

			regards, tom lane
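If hash functions did grow 64-bit results, the combiner would need a 64-bit analogue.  Purely as a sketch: the constant below is just the golden ratio scaled to 64 bits, chosen by analogy with 0x9e3779b9, and neither it nor the shifts nor the name is a settled choice.

    #include <stdint.h>

    /*
     * Hypothetical 64-bit analogue of the Boost-style combiner sketched
     * earlier.  The shift amounts are carried over unchanged from the
     * 32-bit formula; a real proposal would want to justify them.
     */
    static inline uint64_t
    hash_combine64(uint64_t seed, uint64_t hash)
    {
        seed ^= hash + UINT64_C(0x9e3779b97f4a7c15) + (seed << 6) + (seed >> 2);
        return seed;
    }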
On Mon, Aug 1, 2016 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> On that subject, while looking at hashfunc.c, I spotted that
>> hashint8() has a very obvious deficiency, which causes disastrous
>> performance with certain inputs:
>
> Well, if you're trying to squeeze 64 bits into a 32-bit result, there
> are always going to be collisions somewhere.
>
>> I'd suggest using hash_uint32() for values that fit in a 32-bit
>> integer and hash_any() otherwise.
>
> Perhaps, but this'd break existing hash indexes.  That might not be
> a fatal objection, but if we're going to put users through that
> I'd like to think a little bigger in terms of the benefits we get.
> I've thought for some time that we need to move to 64-bit hash function
> results, because the size of problem that's reasonable to use a hash join
> or hash aggregation for keeps increasing.  Maybe we should do that and fix
> hashint8 as a side effect.

Well, considering that Amit is working on making hash indexes WAL-logged in v10[1], this seems like an awfully good time to get any breakage we want to do out of the way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] https://www.postgresql.org/message-id/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com
On Mon, Aug 1, 2016 at 7:24 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> On 1 August 2016 at 08:19, Greg Stark <stark@mit.edu> wrote:
>> Surely combining multiple hashes is the same problem as hashing a block of
>> memory? Shouldn't we just use the same algorithm as hash_any()?
>
> Yes, I imagine that should work well. I suspect that Thomas's proposal
> would also probably work well, as would a number of other hashing
> algorithms with reasonable pedigree, such as the one used for array
> hashing. I don't have any particular preference, but I do know that
> what usually turns out to be disastrous is an arbitrary made-up
> formula like rotating and xor'ing. The last thing we should attempt to
> do is invent our own hashing algorithms.

+1.  (x << 1) | y isn't the stupidest way of combining hash values anybody's ever invented, but there are surely others that are better.  I don't much care whether we adopt Thomas's proposal or Greg's or something else, but I can't see why we'd stick with (x << 1) | y when better approaches are known.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 1, 2016 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Perhaps, but this'd break existing hash indexes.  That might not be
>> a fatal objection, but if we're going to put users through that
>> I'd like to think a little bigger in terms of the benefits we get.
>> I've thought for some time that we need to move to 64-bit hash function
>> results, because the size of problem that's reasonable to use a hash join
>> or hash aggregation for keeps increasing.  Maybe we should do that and fix
>> hashint8 as a side effect.

> Well, considering that Amit is working on making hash indexes
> WAL-logged in v10[1], this seems like an awfully good time to get any
> breakage we want to do out of the way.

Yeah, the compatibility stakes would go up quite a bit as soon as that happens, because then users might start using hash indexes without the expectation of having to rebuild them at random.  (BTW, this line of reasoning also says that if Amit finds he needs to twiddle the on-disk format a bit to make WAL logging work, it would not be a problem.)

			regards, tom lane