Re: 64-bit hash function for hstore and citext data type - Mailing list pgsql-hackers

From Andrew Gierth
Subject Re: 64-bit hash function for hstore and citext data type
Date
Msg-id 87k1l2r9po.fsf@news-spur.riddles.org.uk
Whole thread Raw
In response to Re: 64-bit hash function for hstore and citext data type  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 >>> I'm inclined to fix this in hstoreUpgrade rather than complicate
 >>> hstore_hash with historical trivia. Also there have been no field
 >>> complaints - I guess it's unlikely that there is much pg 8.4 hstore
 >>> data in the wild that anyone wants to hash.

 >> Changing hstoreUpgrade at this point seems like wasted/misguided effort.

 Tom> Oh, cancel that --- I was having a momentary brain fade about how
 Tom> that function is used. I agree your proposal is sensible.

Here's what I have queued up to push:

-- 
Andrew (irc:RhodiumToad)

From d5890f49da6a77b1325a3f5822c6b092a2cd41ae Mon Sep 17 00:00:00 2001
From: Andrew Gierth <rhodiumtoad@postgresql.org>
Date: Sat, 24 Nov 2018 09:59:49 +0000
Subject: [PATCH] Fix hstore hash function for empty hstores upgraded from 8.4.

Hstore data generated on pg 8.4 and pg_upgraded to current versions
remains in its original on-disk format unless modified. The same goes
for values generated by the addon hstore-new module on pre-9.0
versions. (The hstoreUpgrade function converts old values on the fly
when read in, but the on-disk value is not modified by this.)

Since old-format empty hstores (and hstore-new hstores) have
representations compatible with the new format, hstoreUpgrade thought
it could get away without modifying such values; but this breaks
hstore_hash (and the new hstore_hash_extended) which assumes
bit-perfect matching between semantically identical hstore values.
Only one bit actually differs (the "new version" flag in the count
field) but that of course is enough to break the hash.

Fix by making hstoreUpgrade unconditionally convert all old values to
new format.

Backpatch all the way, even though this changes a hash value in some
cases, because in those cases the hash value is already failing - for
example, a hash join between old- and new-format empty hstores will be
failing to match, or a hash index on an hstore column containing an
old-format empty value will be failing to find the value since it will
be searching for a hash derived from a new-format datum. (There are no
known field reports of this happening, probably because hashing of
hstores has only been useful in limited circumstances and there
probably isn't much upgraded data being used this way.)

Per concerns arising from discussion of commit eb6f29141be. Original
bug is my fault.

Discussion: https://postgr.es/m/60b1fd3b-7332-40f0-7e7f-f2f04f777747%402ndquadrant.com
---
 contrib/hstore/hstore_compat.c | 47 +++++++++++++++++-------------------------
 1 file changed, 19 insertions(+), 28 deletions(-)

diff --git a/contrib/hstore/hstore_compat.c b/contrib/hstore/hstore_compat.c
index b95ce9b4aa..1d4e7484e4 100644
--- a/contrib/hstore/hstore_compat.c
+++ b/contrib/hstore/hstore_compat.c
@@ -238,34 +238,35 @@ hstoreUpgrade(Datum orig)
     HStore       *hs = (HStore *) PG_DETOAST_DATUM(orig);
     int            valid_new;
     int            valid_old;
-    bool        writable;
 
     /* Return immediately if no conversion needed */
-    if ((hs->size_ & HS_FLAG_NEWVERSION) ||
-        hs->size_ == 0 ||
+    if (hs->size_ & HS_FLAG_NEWVERSION)
+        return hs;
+
+    /* Do we have a writable copy? If not, make one. */
+    if ((void *) hs == (void *) DatumGetPointer(orig))
+        hs = (HStore *) PG_DETOAST_DATUM_COPY(orig);
+
+    if (hs->size_ == 0 ||
         (VARSIZE(hs) < 32768 && HSE_ISFIRST((ARRPTR(hs)[0]))))
+    {
+        HS_SETCOUNT(hs, HS_COUNT(hs));
+        HS_FIXSIZE(hs, HS_COUNT(hs));
         return hs;
+    }
 
     valid_new = hstoreValidNewFormat(hs);
     valid_old = hstoreValidOldFormat(hs);
-    /* Do we have a writable copy? */
-    writable = ((void *) hs != (void *) DatumGetPointer(orig));
 
     if (!valid_old || hs->size_ == 0)
     {
         if (valid_new)
         {
             /*
-             * force the "new version" flag and the correct varlena length,
-             * but only if we have a writable copy already (which we almost
-             * always will, since short new-format values won't come through
-             * here)
+             * force the "new version" flag and the correct varlena length.
              */
-            if (writable)
-            {
-                HS_SETCOUNT(hs, HS_COUNT(hs));
-                HS_FIXSIZE(hs, HS_COUNT(hs));
-            }
+            HS_SETCOUNT(hs, HS_COUNT(hs));
+            HS_FIXSIZE(hs, HS_COUNT(hs));
             return hs;
         }
         else
@@ -302,15 +303,10 @@ hstoreUpgrade(Datum orig)
         elog(WARNING, "ambiguous hstore value resolved as hstore-new");
 
         /*
-         * force the "new version" flag and the correct varlena length, but
-         * only if we have a writable copy already (which we almost always
-         * will, since short new-format values won't come through here)
+         * force the "new version" flag and the correct varlena length.
          */
-        if (writable)
-        {
-            HS_SETCOUNT(hs, HS_COUNT(hs));
-            HS_FIXSIZE(hs, HS_COUNT(hs));
-        }
+        HS_SETCOUNT(hs, HS_COUNT(hs));
+        HS_FIXSIZE(hs, HS_COUNT(hs));
         return hs;
 #else
         elog(WARNING, "ambiguous hstore value resolved as hstore-old");
@@ -318,13 +314,8 @@ hstoreUpgrade(Datum orig)
     }
 
     /*
-     * must have an old-style value. Overwrite it in place as a new-style one,
-     * making sure we have a writable copy first.
+     * must have an old-style value. Overwrite it in place as a new-style one.
      */
-
-    if (!writable)
-        hs = (HStore *) PG_DETOAST_DATUM_COPY(orig);
-
     {
         int            count = hs->size_;
         HEntry       *new_entries = ARRPTR(hs);
-- 
2.11.1


pgsql-hackers by date:

Previous
From: Surafel Temesgen
Date:
Subject: Re: FETCH FIRST clause WITH TIES option
Next
From: Michael Paquier
Date:
Subject: Centralize use of PG_INTXX_MIN/MAX for integer limits