Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation - Mailing list pgsql-bugs

From Heikki Linnakangas
Subject Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation
Date
Msg-id 6387cb3e-aec8-41a0-acef-bacdbfb435db@iki.fi
In response to Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: BUG #19341: REPLACE() fails to match final character when using nondeterministic ICU collation
List pgsql-bugs
On 02/12/2025 18:36, Heikki Linnakangas wrote:
> On 02/12/2025 18:24, Laurenz Albe wrote:
>> On Tue, 2025-12-02 at 10:03 +0000, PG Bug reporting form wrote:
>>> PostgreSQL version: 18.1
>>>
>>> When using a nondeterministic ICU collation, the replace() function 
>>> fails to
>>> replace a substring when that substring appears at the end of the input
>>> string.
>>>
>>> Occurrences of the same substring earlier in the string are replaced
>>> normally.
>>>
>>> Specific collation used:
>>> create collation test_nondeterministic (
>>>      provider = icu,
>>>      locale = 'und-u-ks-level2',
>>>      deterministic = false
>>> )
>>>
>>> -- Replace final character under nondeterministic collation
>>> SELECT replace(
>>>      'testx' COLLATE "test_nondeterministic",
>>>      'x'     COLLATE "test_nondeterministic",
>>>      'y') AS res1;
>>
>> I can reproduce the problem, and the attached patch fixes it for me.
> 
> +1, looks good to me. Let's also add a regression test for this.

I added a simple test for this, and I think the patched behavior is still
not quite right. I added the following to the collate.icu.utf8 test:

  CREATE TABLE test4nfd (a int, b text);
  INSERT INTO test4nfd VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
  UPDATE test4nfd SET b = normalize(b, nfd);
  -- This shows why replace should be greedy.  Otherwise, in the NFD
  -- case, the match would stop before the decomposed accents, which
  -- would leave the accents in the results.
  SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4;
   a |  b   | replace
  ---+------+---------
   1 | cote | mate
   2 | côte | mate
   3 | coté | maté
   4 | côté | maté
  (4 rows)

  SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4nfd;
   a |  b   | replace
  ---+------+---------
   1 | cote | mate
   2 | côte | mate
   3 | coté | maté
   4 | côté | maté
  (4 rows)

+-- Test for match at the end of the string.  (We had a bug on that
+-- once)
+SELECT a, b, replace(b COLLATE ignore_accents, 'te', 'ma') FROM test4nfd;
+ a |  b   | replace
+---+------+---------
+ 1 | cote | coma
+ 2 | côte | coma
+ 3 | coté | coma
+ 4 | côté | coma
+(4 rows)
+

In the added test query, the accent on the 'o' is stripped in rows 2 and 4
('côte' and 'côté' both come out as 'coma' rather than 'côma'), which
doesn't look correct.
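
To illustrate how a match on 'te' could end up swallowing the accent in NFD
data, here is a rough sketch in Python. It is only an illustration, not
PostgreSQL's or ICU's actual matching code: strip_marks and candidate_spans
are hypothetical helpers, and dropping combining marks is a crude stand-in
for the ignore_accents collation. It lists every slice of NFD 'côte' that
compares equal to 'te' once accents are ignored, and what each replacement
would produce.

```python
import unicodedata


def strip_marks(s):
    # Assumption: approximate an accent-insensitive comparison by
    # dropping all combining marks. Real ICU collation is more involved.
    return "".join(ch for ch in s if not unicodedata.combining(ch))


def candidate_spans(haystack, needle):
    # Every (start, end) slice of haystack that equals needle once
    # combining marks are ignored on both sides.
    key = strip_marks(needle)
    n = len(haystack)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n + 1)
        if strip_marks(haystack[i:j]) == key
    ]


# NFD 'côte' is: c, o, U+0302 (combining circumflex), t, e
nfd = unicodedata.normalize("NFD", "côte")

for start, end in candidate_spans(nfd, "te"):
    replaced = nfd[:start] + "ma" + nfd[end:]
    # Recompose to NFC only to make the printed result easy to read.
    print(start, end, unicodedata.normalize("NFC", replaced))
```

Under these assumptions it prints "2 5 coma" and "3 5 côma": the span that
starts at the combining circumflex gives the result seen in the new test,
while the span that starts at the 't' keeps the accent on the 'o'.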

- Heikki


