Thread: patch suggestion: Fix citext_utf8 test's "Turkish I" with ICU collation provider

From: Anton Voloshin
Date: 21 October 2022, 17:23:33

Hello, hackers.

In current master, as well as in REL_15_STABLE, installcheck in
contrib/citext fails in most locales when ICU is the locale provider:

$ rm -fr data; initdb --locale-provider icu --icu-locale en-US -D data \
    && pg_ctl -D data -l logfile start \
    && make -C contrib/citext installcheck; \
  pg_ctl -D data stop; cat contrib/citext/regression.diffs
...
test citext                       ... ok          457 ms
test citext_utf8                  ... FAILED       21 ms
...
diff -u /home/ashutosh/pg/REL_15_STABLE/contrib/citext/expected/citext_utf8.out /home/ashutosh/pg/REL_15_STABLE/contrib/citext/results/citext_utf8.out
--- /home/ashutosh/pg/REL_15_STABLE/contrib/citext/expected/citext_utf8.out	2022-07-14 17:45:31.747259743 +0300
+++ /home/ashutosh/pg/REL_15_STABLE/contrib/citext/results/citext_utf8.out	2022-10-21 19:43:21.146044062 +0300
@@ -54,7 +54,7 @@
  SELECT 'i'::citext = 'İ'::citext AS t;
   t
  ---
- t
+ f
  (1 row)

The reason is that, in ICU, lowercasing the Unicode character "İ" (U+0130
"LATIN CAPITAL LETTER I WITH DOT ABOVE") can give two valid results:
- "i", i.e. "U+0069 LATIN SMALL LETTER I", in the "tr" and "az" locales;
- "i̇", i.e. "U+0069 LATIN SMALL LETTER I" followed by "U+0307 COMBINING
   DOT ABOVE", in all other locales I've tried (including "en-US", "de",
   "ru", etc.).
The test as currently written accepts only the plain Latin "i", which
might be what glibc returns, but is not what ICU returns. I verified this
on ICU 70.1, and I've seen it on a few other ICU versions as well, so I
think this is probably an ICU feature, not a bug(?).
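
For anyone who wants to see the two-code-point result without an ICU build
at hand, here is a minimal Python sketch. It does not call ICU: it relies on
Python's built-in str.lower(), which applies the locale-independent Unicode
default mapping, and that mapping gives the same answer as the one reported
above for the non-Turkic ICU locales. The variable names are illustrative
only.

```python
import unicodedata

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTTED_CAPITAL_I = "\u0130"

# str.lower() uses the default (locale-independent) Unicode case mapping.
lowered = DOTTED_CAPITAL_I.lower()

# List the code points that make up the lowercased string.
for ch in lowered:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# A comparison against plain "i" fails, which mirrors the 'f' in the diff.
print(lowered == "i")
```

The loop prints U+0069 followed by U+0307, and the final comparison prints
False.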

Since we probably want installcheck in contrib/citext to pass on
databases with various locales, including reasonable ICU-based ones,
I suggest fixing this test by accepting either output as valid.

I can see two ways of doing that:
1. change SQL in the test to use "IN" instead of "=";
2. add an alternative output.

I think in this case "IN" is better, because it allows a single
comment to address both possible outputs and avoids unnecessary
duplication.
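
To illustrate why an "either of" check gives a stable expected output, here
is a rough Python model. It assumes that citext equality behaves like
comparing the lowercased forms of both operands; lower_default, lower_turkic
and citext_eq are hypothetical stand-ins written for this sketch, not
PostgreSQL or ICU functions, and the candidate list is not necessarily how
the attached patch spells its "IN" expression.

```python
def lower_default(s: str) -> str:
    # Locale-independent Unicode lowercasing: U+0130 -> U+0069 U+0307.
    return s.lower()


def lower_turkic(s: str) -> str:
    # Hypothetical stand-in for "tr"/"az" behaviour; it models only the one
    # mapping relevant here (U+0130 -> plain "i") before default lowercasing.
    return s.replace("\u0130", "i").lower()


def citext_eq(a: str, b: str, lower) -> bool:
    # Assumed model of case-insensitive equality: compare lowercased forms.
    return lower(a) == lower(b)


# The two lowercase forms that the test should accept.
CANDIDATES = ("i", "i\u0307")

results = {}
for name, lower in (("default", lower_default), ("turkic", lower_turkic)):
    strict = citext_eq("i", "\u0130", lower)
    either = any(citext_eq("\u0130", c, lower) for c in CANDIDATES)
    results[name] = (strict, either)
    print(f"{name}: '=' gives {strict}, 'IN' gives {either}")
```

In this model the "=" form flips between the two lowercasing rules, while
the "IN" form is true under both, so a single expected output suffices.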

I've attached a patch authored mostly by my colleague, Roman Zharkov, as 
one possible fix.

Only versions 15+ are affected.

Any comments?

-- 
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru

Attachment

0001-Fix-citext_utf8-test-s-Turkish-I-with-ICU-collation-.patch