Thread: BUG #17746: Partitioning by hash of a text depends on icu version when text collation is not deterministic
BUG #17746: Partitioning by hash of a text depends on icu version when text collation is not deterministic
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 17746 Logged by: Andrew Bille Email address: andrewbille@gmail.com PostgreSQL version: 15.1 Operating system: Ubuntu 20.04 Description: Hello I build two postgres (REL_15_STABLE) with different ICU versions (60.3 and 66.1) In "old-icu-server" I do (exerpt from regression test collate.icu.utf8): CREATE COLLATION coll (provider = icu, deterministic = false, locale = '@colStrength=secondary'); CREATE TABLE test ( a integer, b text COLLATE coll ) PARTITION BY HASH (b); CREATE TABLE test_0 ( a integer, b text COLLATE coll ); CREATE TABLE test_1 ( a integer, b text COLLATE coll ); ALTER TABLE ONLY test ATTACH PARTITION test_0 FOR VALUES WITH (modulus 2, remainder 0); ALTER TABLE ONLY test ATTACH PARTITION test_1 FOR VALUES WITH (modulus 2, remainder 1); INSERT INTO test VALUES (1, 'def'); SELECT tableoid::regclass::text as part, hashtext(b) as hash, * FROM test; EXPLAIN SELECT * FROM test WHERE b='def'; SELECT * FROM test WHERE b='def'; and recieve: CREATE COLLATION CREATE TABLE CREATE TABLE CREATE TABLE ALTER TABLE ALTER TABLE INSERT 0 1 part | hash | a | b --------+------------+---+----- test_0 | -493883174 | 1 | def (1 row) QUERY PLAN ------------------------------------------------------------- Seq Scan on test_0 test (cost=0.00..25.88 rows=6 width=36) Filter: (b = 'def'::text) (2 rows) a | b ---+----- 1 | def (1 row) Then I stop the server and start this cluster with "new-icu-binaries" (simulating OS upgrade) and do: SELECT tableoid::regclass::text as part, hashtext(b) as hash, * FROM test; EXPLAIN SELECT * FROM test WHERE b='def'; SELECT * FROM test WHERE b='def'; I got the following error and incorrect SELECT result: WARNING: collation "coll" has version mismatch DETAIL: The collation in the database was created using version 153.80, but the operating system provides version 153.14. HINT: Rebuild all objects affected by this collation and run ALTER COLLATION public.coll REFRESH VERSION, or build PostgreSQL with the right library version. part | hash | a | b --------+------------+---+----- test_0 | 2087909916 | 1 | def (1 row) QUERY PLAN ------------------------------------------------------------- Seq Scan on test_1 test (cost=0.00..25.88 rows=6 width=36) Filter: (b = 'def'::text) (2 rows) a | b ---+--- (0 rows) --- If I make a dump in the "old-icu-server" and try to upload to the "new-icu-server", then I get another failure: psql:dump.sql:89: ERROR: new row for relation "test_0" violates partition constraint DETAIL: Failing row contains (1, def). CONTEXT: COPY test_0, line 1: "1 def" COPY 0 This case discovered when trying to restore on OracleLinux 8 a dump produced on Ubuntu 20.04 by the modified sql/collate.icu.utf8.sql What bothers me is that partitioning depends on the hash that can be computed differently with the OS upgrade/migration. Also I was surprised not to find a description of the hashtext()/hashtextextended() in the docs. Thanks!
Re: BUG #17746: Partitioning by hash of a text depends on icu version when text collation is not deterministic
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes: > What bothers me is that partitioning depends on the hash that can be > computed differently with the OS upgrade/migration. There's basically no way to avoid such problems with a non-deterministic collation. The hash function is required to compute the same hash for all values that compare equal, and that set can change if the collation does. Even if the collation hasn't changed in any user-visible way, what we are hashing for such cases is the result of ucol_getSortKey(), and the new collation version might well produce a different answer. Personally, I think hash partitioning is an anti-pattern that ought to come with bright red warning flags in the docs. If you think you want it, you're generally wrong, for a number of reasons beyond this. (Admittedly, range partitioning can also get broken by collation updates, but at least that doesn't happen without user-visible behavioral changes in the collation.) regards, tom lane