Thread: Hash function for numeric (WIP)
There is currently no support for hashing numerics. This prevents numerics from being hash indexed or used in a hashed aggregation. This patch makes the necessary changes to the catalogs to enable hashing for numerics, and implements a first sketch of a hash function for numerics.

For any two numerics that compare equal, we need to compute the same hash value for both datums, even if their bit patterns differ. This patch computes the hash from the "n_weight" and "n_data" fields of the numeric -- notably, it doesn't use the numeric's "scale" field, since two equal numerics may have different scales.

The hash seems to return the correct results for the simple test cases that I've tried, but I'm not very familiar with the details of the numeric implementation -- can anyone comment on whether this hash function is correct?

Thanks to Sailesh Krishnamurthy for reporting this problem.

-Neil
Attachment
Neil Conway <neilc@samurai.com> writes:
> For any two numerics that compare equal, we need to compute the same
> hash value for both datums, even if their bit patterns differ.

I feel uncomfortable about this proposal because it will compute different hashes for values that differ only in having different numbers of trailing zeroes. Now the numeric.c code is supposed to suppress extra trailing zeroes on output, but that's never been a correctness property ... are we willing to make it one? There are various related cases involving unstripped leading zeroes.

Another point is that sign = NUMERIC_NAN makes it a NAN regardless of any other fields; ignoring the sign does not get the right result here.

Lastly, calling hashint2 seems like overkill; you might as well just XOR the weight into the digit_hash.

			regards, tom lane
I wrote:
> I feel uncomfortable about this proposal because it will compute
> different hashes for values that differ only in having different
> numbers of trailing zeroes. Now the numeric.c code is supposed to
> suppress extra trailing zeroes on output, but that's never been a
> correctness property ... are we willing to make it one?
> There are various related cases involving unstripped leading zeroes.
> Another point is that sign = NUMERIC_NAN makes it a NAN regardless
> of any other fields; ignoring the sign does not get the right result
> here.

Something else I just remembered is that ndigits = 0 makes it a zero regardless of the weight.

Perhaps a sufficiently robust way would be to form the hash as the XOR of each supplied digit, circular-shifted by say 3 times the digit's weight. This is insensitive to leading/trailing zeroes:

	if (is NAN)
		return -1;	/* or any other fixed value */
	hash = 0;
	shift = 3 * weight;
	for (i = 0; i < ndigits; i++)
	{
		thisshift = (shift & 31);
		hash ^= ((uint32) digit[i]) << thisshift;
		if (thisshift > 0)
			hash ^= ((uint32) digit[i]) >> (32 - thisshift);
		shift -= 3;
	}
	return hash;

That might look pretty ugly, but then again hash_any isn't especially cheap.

			regards, tom lane
On Fri, 2007-04-27 at 04:09 -0400, Tom Lane wrote:
> I feel uncomfortable about this proposal because it will compute
> different hashes for values that differ only in having different
> numbers of trailing zeroes. Now the numeric.c code is supposed to
> suppress extra trailing zeroes on output, but that's never been a
> correctness property ... are we willing to make it one?

I don't think that is such an onerous requirement: we could easily add code to enforce this invariant (that might even be worth doing regardless, to verify that the comments remain consistent with reality).

> There are various related cases involving unstripped leading zeroes.

numeric.h claims that leading zeros will also be stripped -- is that not correct?

> Another point is that sign = NUMERIC_NAN makes it a NAN regardless
> of any other fields; ignoring the sign does not get the right result
> here.

I believe the patch was actually correct for NUMERIC_NAN (it already special-cased it).

On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote:
> Something else I just remembered is that ndigits = 0 makes it a zero
> regardless of the weight.

Good point, fixed.

> Perhaps a sufficiently robust way would be to form the hash as the
> XOR of each supplied digit, circular-shifted by say 3 times the
> digit's weight.

-Neil
Sorry for fat-fingering the previous reply -- I wanted to add:

On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote:
> Perhaps a sufficiently robust way would be to form the hash as the
> XOR of each supplied digit, circular-shifted by say 3 times the
> digit's weight.

The only objection I have to this is that it means we need to have another hash function in the backend. The Jenkins hash we use in hash_any() has been studied and we can have at least some confidence in its collision-resistance, etc.

Anyway, attached is a revised version of the hash_any()-based patch.

-Neil
Attachment
Neil Conway <neilc@samurai.com> writes:
> On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote:
>> Perhaps a sufficiently robust way would be to form the hash as the
>> XOR of each supplied digit, circular-shifted by say 3 times the
>> digit's weight.

> The only objection I have to this is that it means we need to have
> another hash function in the backend. The Jenkins hash we use in
> hash_any() has been studied and we can have at least some confidence in
> its collision-resistance, etc.

I'm still not very comfortable with that. You're proposing to add a pretty obvious failure mechanism --- any numeric-returning function that failed to "normalize" its output would now create a subtle, hard-to-find bug. Even if you can promise that all the functions in numeric.c get it right, what of user-written add-ons? And the only return for taking this risk is speculation that the performance of the hash function might be better.

I think if you want to go this way, you should at least provide some evidence that we get a hashing performance benefit in exchange for adding a new restriction on numeric-value validity. Perhaps a suitable test would be to compare the number of hash collisions in a large set of randomly-chosen-but-distinct numeric values.

			regards, tom lane
On Mon, 2007-30-04 at 00:04 -0400, Tom Lane wrote:
> I'm still not very comfortable with that. You're proposing to add a
> pretty obvious failure mechanism --- any numeric-returning function
> that failed to "normalize" its output would now create a subtle,
> hard-to-find bug.

What about teaching hash_numeric() to explicitly ignore leading and trailing zero digits?

> Perhaps a suitable test would be to compare the number of
> hash collisions in a large set of randomly-chosen-but-distinct
> numeric values.

Okay, I did a little testing. I created a table with ~2 million numerics, generated with equal probability by one of the two expressions:

	random()::numeric * ((random() * 100) ^ 20) * 100
	random()::numeric * -100

There were 251 duplicate inputs that I didn't bother eliminating; of course, they should have affected both hash functions identically. Results:

neilc=# create temp table r1 as select hash_numeric(a), count(*) from x group by hash_numeric(a);
neilc=# create temp table r2 as select old_hash_numeric(a), count(*) from x group by old_hash_numeric(a);
neilc=# select count(*) from r1 where count > 1;
 count
--------
 123342
(1 row)

neilc=# select count(*) from r2 where count > 1;
 count
-------
   756
(1 row)

old_hash_numeric() is the hash_any()-based hash; hash_numeric() is my coding of your suggested hash function (see attached patch). So it seems we need a better hash if we're not going to use hash_any(). The analysis required to write a robust hash function from scratch is precisely why I'd prefer to use hash_any() if possible.

-Neil
Attachment
Neil Conway <neilc@samurai.com> writes:
> On Mon, 2007-30-04 at 00:04 -0400, Tom Lane wrote:
>> I'm still not very comfortable with that. You're proposing to add a
>> pretty obvious failure mechanism --- any numeric-returning function
>> that failed to "normalize" its output would now create a subtle,
>> hard-to-find bug.

> What about teaching hash_numeric() to explicitly ignore leading and
> trailing zero digits?

Hm, but apply hash_any() to the remaining digits? That might work, if you are careful about how you factor the weight into it (or just not try to use the weight in the hash).

>> Perhaps a suitable test would be to compare the number of
>> hash collisions in a large set of randomly-chosen-but-distinct
>> numeric values.

> Okay, I did a little testing.
> [ test that totally destroys my proposed hash function ]

OK, so *that* idea doesn't work. How about yours above?

			regards, tom lane
On Thu, 2007-03-05 at 23:57 -0400, Tom Lane wrote:
> Hm, but apply hash_any() to the remaining digits? That might work, if
> you are careful about how you factor the weight into it (or just not try
> to use the weight in the hash).

Attached is a patch that implements this idea. Since leading or trailing zeroes are far from the common case, I think we should still include the weight in the hash when possible: the patch does so when it doesn't find a leading zero in the Numeric. I did some testing by constructing an in-memory Numeric with leading and trailing zeroes, and this approach seems to work.

Unless you see anything else that needs fixing, I'll apply this patch to HEAD in a day or two.

-Neil
Attachment
Neil Conway <neilc@samurai.com> writes:
> On Thu, 2007-03-05 at 23:57 -0400, Tom Lane wrote:
>> Hm, but apply hash_any() to the remaining digits? That might work, if
>> you are careful about how you factor the weight into it (or just not try
>> to use the weight in the hash).

> Attached is a patch that implements this idea. Since leading or trailing
> zeroes are far from the common case, I think we should still include the
> weight in the hash when possible: the patch does so when it doesn't find
> a leading zero in the Numeric.

You can do it always if you simply decrement the weight for each leading zero removed. (Of course, if you end up with no nonzero digits, you need to return a fixed hashcode such as 0, regardless of weight. The patch as given is wrong since it makes the test for no-digits before instead of after removing zeroes.)

It'd be a good idea if you repeat the previous number-of-collisions experiment on this code.

			regards, tom lane
On Sun, 2007-06-05 at 21:30 -0400, Tom Lane wrote:
> You can do it always if you simply decrement the weight for each leading
> zero removed.

On reflection, the patch as given was wrong anyway: if two Numerics are identical except for the presence of leading zeroes, we can't consider the weight when hashing one of them but not the other.

> The patch as given is wrong since it makes the test for no-digits
> before instead of after removing zeroes.)

Okay, how about the attached revision?

> It'd be a good idea if you repeat the previous number-of-collisions
> experiment on this code.

I'll do this shortly.

-Neil
Attachment
On Sun, 2007-06-05 at 21:30 -0400, Tom Lane wrote:
> It'd be a good idea if you repeat the previous number-of-collisions
> experiment on this code.

I repeated the same experiment, and got essentially the same number of collisions (829 collisions on ~2 million randomly generated numerics, with 273 duplicates). Since the modified hash still uses hash_any() and really only differs when there are leading/trailing zeros, this is consistent with what I'd expect.

Patch applied to HEAD.

-Neil
Neil Conway <neilc@samurai.com> writes:
> On Sun, 2007-06-05 at 21:30 -0400, Tom Lane wrote:
>> It'd be a good idea if you repeat the previous number-of-collisions
>> experiment on this code.

> I repeated the same experiment, and got essentially the same number of
> collisions (829 collisions on ~2 million randomly generated numerics,
> with 273 duplicates). Since the modified hash still uses hash_any() and
> really only differs when there are leading/trailing zeros, this is
> consistent with what I'd expect.

Right, given that there presumably weren't any leading/trailing zeroes in your sample, the digit hashing ought to be exactly the same. I was just worried that the slightly different treatment of the weight might somehow invalidate the results.

			regards, tom lane