On 02/12/2025 18:36, Heikki Linnakangas wrote:
> On 02/12/2025 18:24, Laurenz Albe wrote:
>> On Tue, 2025-12-02 at 10:03 +0000, PG Bug reporting form wrote:
>>> PostgreSQL version: 18.1
>>>
>>> When using a nondeterministic ICU collation, the replace() function
>>> fails to replace a substring when that substring appears at the end
>>> of the input string.
>>>
>>> Occurrences of the same substring earlier in the string are replaced
>>> normally.
>>>
>>> Specific collation used:
>>> create collation test_nondeterministic (
>>>     provider = icu,
>>>     locale = 'und-u-ks-level2',
>>>     deterministic = false
>>> );
>>>
>>> -- Replace final character under nondeterministic collation
>>> SELECT replace(
>>> 'testx' COLLATE "test_nondeterministic",
>>> 'x' COLLATE "test_nondeterministic",
>>> 'y') AS res1;
>>
>> I can reproduce the problem, and the attached patch fixes it for me.
>
> +1, looks good to me. Let's also add a regression test for this.

I added a simple test for this, and I think the fix is still not quite
right. I added the following to the collate.icu.utf8 test:

CREATE TABLE test4nfd (a int, b text);
INSERT INTO test4nfd VALUES (1, 'cote'), (2, 'côte'), (3, 'coté'), (4, 'côté');
UPDATE test4nfd SET b = normalize(b, nfd);
-- This shows why replace should be greedy. Otherwise, in the NFD
-- case, the match would stop before the decomposed accents, which
-- would leave the accents in the results.
SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4;
 a |  b   | replace
---+------+---------
 1 | cote | mate
 2 | côte | mate
 3 | coté | maté
 4 | côté | maté
(4 rows)
SELECT a, b, replace(b COLLATE ignore_accents, 'co', 'ma') FROM test4nfd;
 a |  b   | replace
---+------+---------
 1 | cote | mate
 2 | côte | mate
 3 | coté | maté
 4 | côté | maté
(4 rows)
+-- Test for match at the end of the string. (We had a bug on that
+-- once)
+SELECT a, b, replace(b COLLATE ignore_accents, 'te', 'ma') FROM test4nfd;
+ a |  b   | replace
+---+------+---------
+ 1 | cote | coma
+ 2 | côte | coma
+ 3 | coté | coma
+ 4 | côté | coma
+(4 rows)
+
In the added test query, the accent on the 'o' is stripped: rows 2 and 4
come out as 'coma' rather than 'côma'. That doesn't look correct.
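To make that concrete, here is a small Python sketch of what the NFD input looks like. This is not PostgreSQL code and says nothing about how replace() is implemented; it only uses the standard unicodedata module, and the helper names are made up for illustration. It shows that the circumflex is a separate combining code point sitting between the 'o' and the 't', so replacing only the trailing 'te' ought to leave it in place:

```python
import unicodedata

def nfd_codepoints(s):
    """Return the code points of s after NFD normalization, as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", s)]

def strip_marks(s):
    """Drop all combining marks (illustration only, not what ICU does internally)."""
    return "".join(ch for ch in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(ch))

cote_nfd = unicodedata.normalize("NFD", "côte")

# c, o, combining circumflex (U+0302), t, e
print(nfd_codepoints("côte"))

# Replacing only the trailing "te" keeps "c", "o" and U+0302 as they were.
expected = cote_nfd[:3] + "ma"
print(unicodedata.normalize("NFC", expected))

# Dropping the combining mark as well gives the accent-free string
# that the test output above shows.
print(strip_marks(expected))
```

The last line is what the test query currently returns for rows 2 and 4, which suggests the replacement is consuming the combining mark that precedes the match rather than only the marks that follow it.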
- Heikki