Thread: Hash function for numeric (WIP)

Hash function for numeric (WIP)

From
Neil Conway
Date:
There is currently no support for hashing numerics. This prevents
numerics from being hash indexed or used in a hashed aggregation. This
patch makes the necessary changes to the catalogs to enable hashing for
numerics, and implements a first sketch at a hash function for numerics.

For any two numerics that compare equal, we need to compute the same
hash value for both datums, even if their bit patterns differ. This
patch computes the hash from the "n_weight" and "n_data" fields of the
numeric -- notably, it doesn't use the numeric's "scale" field, since
two equal numerics may have different scales. The hash seems to return
the correct results for the simple test cases that I've tried, but I'm
not very familiar with the details of the numeric implementation -- can
anyone comment on whether this hash function is correct?

Thanks to Sailesh Krishnamurthy for reporting this problem.

-Neil


Attachment

Re: Hash function for numeric (WIP)

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> For any two numerics that compare equal, we need to compute the same
> hash value for both datums, even if their bit patterns differ.

I feel uncomfortable about this proposal because it will compute
different hashes for values that differ only in having different
numbers of trailing zeroes.  Now the numeric.c code is supposed to
suppress extra trailing zeroes on output, but that's never been a
correctness property ... are we willing to make it one?

There are various related cases involving unstripped leading zeroes.

Another point is that sign = NUMERIC_NAN makes it a NAN regardless
of any other fields; ignoring the sign does not get the right result
here.

Lastly, calling hashint2 seems like overkill, you might as well
just XOR the weight into the digit_hash.

            regards, tom lane

Re: Hash function for numeric (WIP)

From
Tom Lane
Date:
I wrote:
> I feel uncomfortable about this proposal because it will compute
> different hashes for values that differ only in having different
> numbers of trailing zeroes.  Now the numeric.c code is supposed to
> suppress extra trailing zeroes on output, but that's never been a
> correctness property ... are we willing to make it one?

> There are various related cases involving unstripped leading zeroes.

> Another point is that sign = NUMERIC_NAN makes it a NAN regardless
> of any other fields; ignoring the sign does not get the right result
> here.

Something else I just remembered is that ndigits = 0 makes it a zero
regardless of the weight.

Perhaps a sufficiently robust way would be to form the hash as the
XOR of each supplied digit, circular-shifted by say 3 times the
digit's weight.  This is insensitive to leading/trailing zeroes:

    if (is NAN)
        return -1;    // or any other fixed value
    hash = 0;
    shift = 3 * weight;
    for (i = 0; i < ndigits; i++)
    {
        thisshift = (shift & 31);
        hash |= ((uint32) digit[i]) << thisshift;
        if (thisshift > 0)
            hash |= ((uint32) digit[i]) >> (32 - thisshift);
        shift -= 3;
    }
    return hash;

That might look pretty ugly, but then again hash_any isn't especially
cheap.

            regards, tom lane

Re: Hash function for numeric (WIP)

From
Neil Conway
Date:
On Fri, 2007-04-27 at 04:09 -0400, Tom Lane wrote:
> I feel uncomfortable about this proposal because it will compute
> different hashes for values that differ only in having different
> numbers of trailing zeroes.  Now the numeric.c code is supposed to
> suppress extra trailing zeroes on output, but that's never been a
> correctness property ... are we willing to make it one?

I don't think that is such an onerous requirement: we could easily add
code to enforce this invariant (that might even be worth doing
regardless, to verify that the comments remain consistent with reality).

> There are various related cases involving unstripped leading zeroes.

numeric.h claims that leading zeros will also be stripped -- is that not
correct?

> Another point is that sign = NUMERIC_NAN makes it a NAN regardless
> of any other fields; ignoring the sign does not get the right result
> here.

I believe the patch was actually correct for NUMERIC_NAN (it already
special-cased it).

On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote:
> Something else I just remembered is that ndigits = 0 makes it a zero
> regardless of the weight.

Good point, fixed.


> > Perhaps a sufficiently robust way would be to form the hash as the
> > XOR of each supplied digit, circular-shifted by say 3 times the
> > digit's weight.

-Neil



Re: Hash function for numeric (WIP)

From
Neil Conway
Date:
Sorry for fat-fingering the previous reply -- I wanted to add:

On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote:
> Perhaps a sufficiently robust way would be to form the hash as the
> XOR of each supplied digit, circular-shifted by say 3 times the
> digit's weight.

The only objection I have to this is that it means we need to have
another hash function in the backend. The Jenkins hash we use in
hash_any() has been studied and we can have at least some confidence in
its collision-resistance, etc.

Anyway, attached is a revised version of the hash_any()-based patch.

-Neil


Attachment

Re: Hash function for numeric (WIP)

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> On Fri, 2007-04-27 at 10:02 -0400, Tom Lane wrote:
>> Perhaps a sufficiently robust way would be to form the hash as the
>> XOR of each supplied digit, circular-shifted by say 3 times the
>> digit's weight.

> The only objection I have to this is that it means we need to have
> another hash function in the backend. The Jenkins hash we use in
> hash_any() has been studied and we can have at least some confidence in
> its collision-resistance, etc.

I'm still not very comfortable with that.  You're proposing to add a
pretty obvious failure mechanism --- any numeric-returning function
that failed to "normalize" its output would now create a subtle,
hard-to-find bug.  Even if you can promise that all the functions in
numeric.c get it right, what of user-written add-ons?  And the only
return for taking this risk is speculation that the performance of the
hash function might be better.

I think if you want to go this way, you should at least provide some
evidence that we get a hashing performance benefit in exchange for
adding a new restriction on numeric-value validity.  Perhaps a suitable
test would be to compare the number of hash collisions in a large set of
randomly-chosen-but-distinct numeric values.

            regards, tom lane

Re: Hash function for numeric (WIP)

From
Neil Conway
Date:
On Mon, 2007-30-04 at 00:04 -0400, Tom Lane wrote:
> I'm still not very comfortable with that.  You're proposing to add a
> pretty obvious failure mechanism --- any numeric-returning function
> that failed to "normalize" its output would now create a subtle,
> hard-to-find bug.

What about teaching hash_numeric() to explicitly ignore leading and
trailing zero digits?

> Perhaps a suitable test would be to compare the number of
> hash collisions in a large set of randomly-chosen-but-distinct
> numeric values.

Okay, I did a little testing. I created a table with ~2 million
numerics, generated with equal probability by one of the two
expressions:

    random()::numeric * ((random() * 100) ^ 20) * 100
    random()::numeric * -100;

There were 251 duplicate inputs that I didn't bother eliminating; of
course, they should have effected both hash functions identically.
Results:

neilc=# create temp table r1 as select hash_numeric(a), count(*) from x
group by hash_numeric(a);
neilc=# create temp table r2 as select old_hash_numeric(a), count(*)
from x group by old_hash_numeric(a);

neilc=# select count(*) from r1 where count > 1;
 count
--------
 123342
(1 row)

neilc=# select count(*) from r2 where count > 1;
 count
-------
   756
(1 row)

old_hash_numeric() is the hash_any()-based hash, hash_numeric() is my
coding of your suggested hash function (see attached patch).

So it seems we need a better hash if we're not going to use hash_any().
The analysis required to write a robust hash function from scratch is
precisely why I'd prefer to use hash_any() if possible.

-Neil


Attachment

Re: Hash function for numeric (WIP)

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> On Mon, 2007-30-04 at 00:04 -0400, Tom Lane wrote:
>> I'm still not very comfortable with that.  You're proposing to add a
>> pretty obvious failure mechanism --- any numeric-returning function
>> that failed to "normalize" its output would now create a subtle,
>> hard-to-find bug.

> What about teaching hash_numeric() to explicitly ignore leading and
> trailing zero digits?

Hm, but apply hash_any() to the remaining digits?  That might work, if
you are careful about how you factor the weight into it (or just not try
to use the weight in the hash).

>> Perhaps a suitable test would be to compare the number of
>> hash collisions in a large set of randomly-chosen-but-distinct
>> numeric values.

> Okay, I did a little testing.
> [ test that totally destroys my proposed hash function ]

OK, so *that* idea doesn't work.  How about yours above?

            regards, tom lane

Re: Hash function for numeric (WIP)

From
Neil Conway
Date:
On Thu, 2007-03-05 at 23:57 -0400, Tom Lane wrote:
> Hm, but apply hash_any() to the remaining digits?  That might work, if
> you are careful about how you factor the weight into it (or just not try
> to use the weight in the hash).

Attached is a patch that implements this idea. Since leading or trailing
zeroes are far from the common case, I think we should still include the
weight in the hash when possible: the patch does so when it doesn't find
a leading zero in the Numeric. I did some testing by constructing an
in-memory Numeric with leading and trailing zeroes, and this approach
seems to work.

Unless you see anything else that needs fixing, I'll apply this patch to
HEAD in a day or two.

-Neil


Attachment

Re: Hash function for numeric (WIP)

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> On Thu, 2007-03-05 at 23:57 -0400, Tom Lane wrote:
>> Hm, but apply hash_any() to the remaining digits?  That might work, if
>> you are careful about how you factor the weight into it (or just not try
>> to use the weight in the hash).

> Attached is a patch that implements this idea. Since leading or trailing
> zeroes are far from the common case, I think we should still include the
> weight in the hash when possible: the patch does so when it doesn't find
> a leading zero in the Numeric.

You can do it always if you simply decrement the weight for each leading
zero removed.  (Of course, if you end up with no nonzero digits, you
need to return a fixed hashcode such as 0, regardless of weight.  The
patch as given is wrong since it makes the test for no-digits before
instead of after removing zeroes.)

It'd be a good idea if you repeat the previous number-of-collisions
experiment on this code.

            regards, tom lane

Re: Hash function for numeric (WIP)

From
Neil Conway
Date:
On Sun, 2007-06-05 at 21:30 -0400, Tom Lane wrote:
> You can do it always if you simply decrement the weight for each leading
> zero removed.

On reflection, the patch as given was wrong anyway: if two Numerics are
identical except for the presence of leading zeroes, we can't consider
the weight when hashing one of them but not the other.

> The patch as given is wrong since it makes the test for no-digits
> before instead of after removing zeroes.)

Okay, how about the attached revision?

> It'd be a good idea if you repeat the previous number-of-collisions
> experiment on this code.

I'll do this shortly.

-Neil


Attachment

Re: Hash function for numeric (WIP)

From
Neil Conway
Date:
On Sun, 2007-06-05 at 21:30 -0400, Tom Lane wrote:
> It'd be a good idea if you repeat the previous number-of-collisions
> experiment on this code.

I repeated the same experiment, and got essentially the same number of
collisions (829 collisions on ~2 million randomly generated numerics,
with 273 duplicates). Since the modified hash still uses hash_any() and
really only differs when there are leading/trailing zeros, this is
consistent with what I'd expect.

Patch applied to HEAD.

-Neil



Re: Hash function for numeric (WIP)

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> On Sun, 2007-06-05 at 21:30 -0400, Tom Lane wrote:
>> It'd be a good idea if you repeat the previous number-of-collisions
>> experiment on this code.

> I repeated the same experiment, and got essentially the same number of
> collisions (829 collisions on ~2 million randomly generated numerics,
> with 273 duplicates). Since the modified hash still uses hash_any() and
> really only differs when there are leading/trailing zeros, this is
> consistent with what I'd expect.

Right, given that there presumably weren't any leading/trailing zeroes
in your sample, the digit hashing ought to be exactly the same.  I was
just worried that the slightly different treatment of the weight might
somehow invalidate the results.

            regards, tom lane