Thread: patch suggestion: Fix citext_utf8 test's "Turkish I" with ICU collation provider
From
Anton Voloshin
Date:
Hello, hackers.

In current master, as well as in REL_15_STABLE, installcheck in contrib/citext fails in most locales if ICU is used as the locale provider:

$ rm -fr data; initdb --locale-provider icu --icu-locale en-US -D data && pg_ctl -D data -l logfile start && make -C contrib/citext installcheck; pg_ctl -D data stop; cat contrib/citext/regression.diffs
...
test citext ... ok 457 ms
test citext_utf8 ... FAILED 21 ms
...
diff -u /home/ashutosh/pg/REL_15_STABLE/contrib/citext/expected/citext_utf8.out /home/ashutosh/pg/REL_15_STABLE/contrib/citext/results/citext_utf8.out
--- /home/ashutosh/pg/REL_15_STABLE/contrib/citext/expected/citext_utf8.out 2022-07-14 17:45:31.747259743 +0300
+++ /home/ashutosh/pg/REL_15_STABLE/contrib/citext/results/citext_utf8.out 2022-10-21 19:43:21.146044062 +0300
@@ -54,7 +54,7 @@
 SELECT 'i'::citext = 'İ'::citext AS t;
  t
 ---
- t
+ f
 (1 row)

The reason is that in ICU, lowercasing the Unicode character "İ" (U+0130 "LATIN CAPITAL LETTER I WITH DOT ABOVE") can give two valid results, depending on the locale:
- "i", i.e. "U+0069 LATIN SMALL LETTER I", in the "tr" and "az" locales.
- "i̇", i.e. "U+0069 LATIN SMALL LETTER I" followed by "U+0307 COMBINING DOT ABOVE", in all other locales I've tried (including "en-US", "de", "ru", etc).

The test as currently written accepts only the plain Latin "i", which might be what glibc produces, but is not what ICU produces outside "tr" and "az".

Verified on ICU 70.1, but I've seen this on a few other ICU versions as well, so I think this is probably an ICU feature, not a bug(?).

Since we probably want installcheck in contrib/citext to pass on databases with various locales, including reasonable ICU-based ones, I suggest fixing this test by accepting either output as valid.

I can see two ways of doing that:
1. change the SQL in the test to use "IN" instead of "=";
2. add an alternative expected output file.

I think in this case "IN" is better, because it lets a single comment address both possible outputs and avoids unnecessary duplication.
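To make the two results easier to see outside PostgreSQL, here is a minimal Python sketch. It does not use ICU: it relies on Python's built-in str.lower(), which applies the untailored Unicode case mapping, and that mapping happens to give the same result as the non-Turkic ICU locales described above. The helper lower_dotted_capital_i and the set ACCEPTED are hypothetical names made up for this illustration; the "tr"/"az" branch simply hard-codes the tailored result reported above rather than calling any locale library.

```python
import unicodedata

CAPITAL_I_WITH_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Hypothetical: the two lowercase forms a relaxed test would accept.
ACCEPTED = {"i", "i\u0307"}


def lower_dotted_capital_i(language: str) -> str:
    """Hypothetical helper: lowercase U+0130 for the given language tag.

    Only models this one character. "tr" and "az" are hard-coded to the
    tailored result; everything else uses the default Unicode mapping.
    """
    if language in ("tr", "az"):
        return "i"
    return CAPITAL_I_WITH_DOT.lower()


if __name__ == "__main__":
    for language in ("tr", "az", "en-US", "de", "ru"):
        lowered = lower_dotted_capital_i(language)
        code_points = [unicodedata.name(ch) for ch in lowered]
        # Strict check mirrors "=", relaxed check mirrors "IN".
        print(language, code_points,
              "strict:", lowered == "i",
              "relaxed:", lowered in ACCEPTED)
```

The strict comparison succeeds only for "tr" and "az", while the relaxed membership check succeeds for every language in the loop, which is the same idea as replacing "=" with "IN" in the test.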
I've attached a patch authored mostly by my colleague, Roman Zharkov, as one possible fix.

Only versions 15+ are affected.

Any comments?

--
Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru