Re: [HACKERS] Hash Functions - Mailing list pgsql-hackers

From amul sul
Subject Re: [HACKERS] Hash Functions
Date
Msg-id CAAJ_b942x+gE3u2r4Mk41X2wCtWLaGdLotRqDo+_X6ELCktp4g@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Hash Functions  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Hash Functions
List pgsql-hackers
On Tue, Aug 29, 2017 at 11:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Aug 22, 2017 at 8:14 AM, amul sul <sulamul@gmail.com> wrote:
> Attaching patch 0002 for the reviewer's testing.

I think that this 0002 is not something we can think of committing
because there's no guarantee that hash functions will return the same
results on all platforms.  However, what we could and, I think, should
do is hash some representative values of each data type and verify
that hash(x) and hashextended(x, 0) come out the same at least as to
the low-order 32 bits -- and maybe that hashextend(x, 1) comes out
different as to the low-order 32 bits.  The easiest way to do this
seems to be to cast to bit(32).  For example:

SELECT v, hashint4(v)::bit(32) as standard,
        hashint4extended(v, 0)::bit(32) as extended0,
        hashint4extended(v, 1)::bit(32) as extended1
FROM (VALUES (0), (1), (17), (42), (550273), (207112489)) x(v)
WHERE hashint4(v)::bit(32) != hashint4extended(v, 0)::bit(32)
   OR hashint4(v)::bit(32) = hashint4extended(v, 1)::bit(32);

I suggest adding a version of this for each data type.

Thanks for the suggestion, I have updated 0002-patch accordingly.
Using this I found some strange behaviours as follow:

1) standard and extended0 output for the jsonb_hash case is not same.
2) standard and extended0 output for the hash_range case is not same when input
   is int4range(550274, 1550274)  other case in the patch are fine. This can be
   reproducible with other values as well, for e.g. int8range(1570275, 208112489). 

Will look into this tomorrow.

​Another case which I want to discuss is, extended and standard version of
hashfloat4, hashfloat8 & hash_numeric function will have the same output for zero
value irrespective of seed value. Do you think we need a fix for this?


From your latest version of 0001, I get:

rangetypes.c:1297:8: error: unused variable 'rngtypid'
      [-Werror,-Wunused-variable]
        Oid                     rngtypid = RangeTypeGetOid(r);

​Fixed in the attached version.​
 
I suggest not duplicating the comments from each regular function into
the extended function, but just writing /* Same approach as hashfloat8
*/ when the implementation is non-trivial (no need for this if the
extended function is a single line or the original function had no
comments anyway).

Fixed in the attached version.​
 
hash_aclitem seems embarrassingly stupid.  I suggest that we make the
extended version slightly less stupid -- i.e. if the seed is non-zero,
actually call hash_any_extended on the sum and pass the seed through.

          * Reset info about hash function whenever we pick up new info about
          * equality operator.  This is so we can ensure that the hash function
          * matches the operator.
          */
         typentry->flags &= ~(TCFLAGS_CHECKED_HASH_PROC);
+        typentry->flags &= ~(TCFLAGS_CHECKED_HASH_EXTENDED_PROC);

Adjust comment: "about hash function" -> "about hash functions", "hash
functions matches" -> "hash functions match".

Fixed in the attached version.​
 
+extern Datum
+hash_any_extended(register const unsigned char *k, register int
+                  keylen, uint64 seed);

Formatting.

​Fixed in the attached version.​

​Thanks !

Regards,
Amul​

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: [HACKERS] expanding inheritance in partition bound order
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] Parallel worker error