Thread: BUG #17746: Partitioning by hash of a text depends on icu version when text collation is not deterministic

BUG #17746: Partitioning by hash of a text depends on icu version when text collation is not deterministic

From

PG Bug reporting form

Date:

11 January 2023, 15:09:21

The following bug has been logged on the website:

Bug reference:      17746
Logged by:          Andrew Bille
Email address:      andrewbille@gmail.com
PostgreSQL version: 15.1
Operating system:   Ubuntu 20.04
Description:

Hello
I build two postgres (REL_15_STABLE) with different ICU versions (60.3 and
66.1)

In "old-icu-server" I do (exerpt from regression test collate.icu.utf8):

CREATE COLLATION coll (provider = icu, deterministic = false, locale =
'@colStrength=secondary');
CREATE TABLE test (
    a integer,
    b text COLLATE coll
)
PARTITION BY HASH (b);
CREATE TABLE test_0 (
    a integer,
    b text COLLATE coll
);
CREATE TABLE test_1 (
    a integer,
    b text COLLATE coll
);
ALTER TABLE ONLY test ATTACH PARTITION test_0 FOR VALUES WITH (modulus 2,
remainder 0);
ALTER TABLE ONLY test ATTACH PARTITION test_1 FOR VALUES WITH (modulus 2,
remainder 1);
INSERT INTO test VALUES (1, 'def');
SELECT tableoid::regclass::text as part, hashtext(b) as hash, * FROM test;
EXPLAIN SELECT * FROM test WHERE b='def';
SELECT * FROM test WHERE b='def';

and recieve:

CREATE COLLATION
CREATE TABLE
CREATE TABLE
CREATE TABLE
ALTER TABLE
ALTER TABLE
INSERT 0 1
  part  |    hash    | a |  b  
--------+------------+---+-----
 test_0 | -493883174 | 1 | def
(1 row)

                         QUERY PLAN                          
-------------------------------------------------------------
 Seq Scan on test_0 test  (cost=0.00..25.88 rows=6 width=36)
   Filter: (b = 'def'::text)
(2 rows)
 a |  b  
---+-----
 1 | def
(1 row)

Then I stop the server and start this cluster with "new-icu-binaries"
(simulating OS upgrade) and do:
SELECT tableoid::regclass::text as part, hashtext(b) as hash, * FROM test;
EXPLAIN SELECT * FROM test WHERE b='def';
SELECT * FROM test WHERE b='def';

I got the following error and incorrect SELECT result:

WARNING:  collation "coll" has version mismatch
DETAIL:  The collation in the database was created using version 153.80, but
the operating system provides version 153.14.
HINT:  Rebuild all objects affected by this collation and run ALTER
COLLATION public.coll REFRESH VERSION, or build PostgreSQL with the right
library version.
  part  |    hash    | a |  b  
--------+------------+---+-----
 test_0 | 2087909916 | 1 | def
(1 row)

                        QUERY PLAN                          
-------------------------------------------------------------
 Seq Scan on test_1 test  (cost=0.00..25.88 rows=6 width=36)
   Filter: (b = 'def'::text)
(2 rows)

 a | b 
---+---
(0 rows)

---
If I make a dump in the "old-icu-server" and try to upload to the
"new-icu-server", then I get another failure:

psql:dump.sql:89: ERROR:  new row for relation "test_0" violates partition
constraint
DETAIL:  Failing row contains (1, def).
CONTEXT:  COPY test_0, line 1: "1       def"
COPY 0

This case discovered when trying to restore on OracleLinux 8 a dump produced
on Ubuntu 20.04 by the modified sql/collate.icu.utf8.sql

What bothers me is that partitioning depends on the hash that can be
computed differently with the OS upgrade/migration. Also I was surprised not
to find a description of the hashtext()/hashtextextended() in the docs.

Thanks!

Re: BUG #17746: Partitioning by hash of a text depends on icu version when text collation is not deterministic

From

Tom Lane

Date:

11 January 2023, 17:39:06

PG Bug reporting form <noreply@postgresql.org> writes:
> What bothers me is that partitioning depends on the hash that can be
> computed differently with the OS upgrade/migration.

There's basically no way to avoid such problems with a non-deterministic
collation.  The hash function is required to compute the same hash for
all values that compare equal, and that set can change if the collation
does.  Even if the collation hasn't changed in any user-visible way,
what we are hashing for such cases is the result of ucol_getSortKey(),
and the new collation version might well produce a different answer.

Personally, I think hash partitioning is an anti-pattern that ought
to come with bright red warning flags in the docs.  If you think you
want it, you're generally wrong, for a number of reasons beyond this.

(Admittedly, range partitioning can also get broken by collation
updates, but at least that doesn't happen without user-visible
behavioral changes in the collation.)

            regards, tom lane