Thread: case insensitive collation of Greek's sigma
Hello,
during our tests of Postgres with ICU we found an issue with ILIKE of upper and lowercase sigma (Σ). The letter has two lowercase variants σ and ς (at the end of a word). I'm working with en_US and en-US-x-icu collations and results are a bit unexpected - they are inverted:
postgres=# SELECT
postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US",
postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US"
postgres-# ;
?column? | ?column?
----------+----------
t | f
(1 row)
postgres=# SELECT
postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu",
postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu";
?column? | ?column?
----------+----------
f | t
(1 row)
postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US",
postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US"
postgres-# ;
?column? | ?column?
----------+----------
t | f
(1 row)
postgres=# SELECT
postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu",
postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu";
?column? | ?column?
----------+----------
f | t
(1 row)
I run those commands on the latest (14.1) official docker image.
Is it possible to unify the behaviour?And which one is correct from the community point of view?
If I could start, I think both results are wrong as both should return True. If I got it right, in the background there is a lower() function running to compare strings, which is not enough for such cases (until the left side isn't taken as a standalone word).
Thanks,
- jj
On 26/11/21 9:37 π.μ., Jakub Jedelsky wrote: > Hello, Thank you for dealing with Greek! > > during our tests of Postgres with ICU we found an issue with ILIKE of upper and lowercase sigma (Σ). The letter has twolowercase variants σ and ς (at the end of a word). I'm working with en_US and > en-US-x-icu collations and results are a bit unexpected - they are inverted: > > postgres=# SELECT > postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US", > postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US" > postgres-# ; > ?column? | ?column? > ----------+---------- > t | f > (1 row) > > postgres=# SELECT > postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu", > postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu"; > ?column? | ?column? > ----------+---------- > f | t > (1 row) > > I run those commands on the latest (14.1) official docker image. > > Is it possible to unify the behaviour?And which one is correct from the community point of view? IMHO all those letters are wrong, the correct S used to be the letter C (equal to the Cyrillic S). Σ is a new invention, same like σ, the final ς looks a lot like the calligraphic version of c . > > If I could start, I think both results are wrong as both should return True. If I got it right, in the background thereis a lower() function running to compare strings, which is not enough for such > cases (until the left side isn't taken as a standalone word). I agree with you all of them should be deducted to a single letter s. Firefox's find recognizes all three (Σ,σ,ς) as thesame letter. > > Thanks, > > - jj -- Achilleas Mantzios DBA, Analyst, IT Lead IT DEPT Dynacom Tankers Mgmt
Jakub Jedelsky <jakub.jedelsky@gooddata.com> writes: > during our tests of Postgres with ICU we found an issue with ILIKE of upper > and lowercase sigma (Σ). The letter has two lowercase variants σ and ς (at > the end of a word). I'm working with en_US and en-US-x-icu collations and > results are a bit unexpected - they are inverted: Not sure why you would expect en_US collation to give appropriate results for Greek? I think you'd need el_GR. Be that as it may, Postgres doesn't define anything about this --- we just rely on the collations supplied by the respective system libraries (libc or ICU). You'd need to file bugs against those libraries if you think their behavior should change. regards, tom lane
Am 26.11.21 um 08:37 schrieb Jakub Jedelsky: > Hello, > > during our tests of Postgres with ICU we found an issue with ILIKE of > upper and lowercase sigma (Σ). The letter has two lowercase variants σ > and ς (at the end of a word). I'm working with en_US and en-US-x-icu > collations and results are a bit unexpected - they are inverted: > > postgres=# SELECT > postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US", > postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US" > postgres-# ; > ?column? | ?column? > ----------+---------- > t | f > (1 row) > > postgres=# SELECT > postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu", > postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu"; > ?column? | ?column? > ----------+---------- > f | t > (1 row) > > I run those commands on the latest (14.1) official docker image. > > Is it possible to unify the behaviour?And which one is correct from > the community point of view? > > If I could start, I think both results are wrong as both should return > True. If I got it right, in the background there is a lower() function > running to compare strings, which is not enough for such cases (until > the left side isn't taken as a standalone word). > > Thanks, > > - jj Have you seen the subtle intricacies in this example? => SELECT 'ΣΣ Μ' ILIKE 'σσ Μ' COLLATE "en_US" AS c0, 'ΣΣ Μ' ILIKE 'σς Μ' COLLATE "en_US" AS c1, 'ΣΣ Μ' ~* 'σσ Μ' COLLATE "el-GR-x-icu" AS c2, 'ΣΣ Μ' ~* 'σς Μ' COLLATE "el-GR-x-icu" AS c3, 'ΣΣ Μ' ILIKE 'σσ Μ' COLLATE "el-GR-x-icu" AS c4, 'ΣΣ Μ' ILIKE 'σς Μ' COLLATE "el-GR-x-icu" AS c5, 'ΣΣ Μ' ~* 'σσ Μ' COLLATE "en-US-x-icu" AS c6, 'ΣΣ Μ' ~* 'σς Μ' COLLATE "en-US-x-icu" AS c7, 'ΣΣ Μ' ILIKE 'σσ Μ' COLLATE "en-US-x-icu" AS c8, 'ΣΣ Μ' ILIKE 'σς Μ' COLLATE "en-US-x-icu" AS c9; c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9 ----+----+----+----+----+----+----+----+----+---- t | f | t | t | f | t | t | t | f | t (1 row) Obviously, the ILIKE operator is really strict regarding to the correct letter at the end of the word. The regular expression operator works as you expected. Happy computing... Frank
On 26.11.21 08:37, Jakub Jedelsky wrote: > postgres=# SELECT > postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US", > postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US" > postgres-# ; > ?column? | ?column? > ----------+---------- > t | f > (1 row) > > postgres=# SELECT > postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu", > postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu"; > ?column? | ?column? > ----------+---------- > f | t > (1 row) > > If I could start, I think both results are wrong as both should return > True. If I got it right, in the background there is a lower() function > running to compare strings, which is not enough for such cases (until > the left side isn't taken as a standalone word). The reason for these results is that for multibyte encodings, a ILIKE b basically does lower(a) LIKE lower(b), and select lower('ΣΣ' COLLATE "en_US"), lower('ΣΣ' COLLATE "en-US-x-icu"); lower | lower -------+------- σσ | σς Running lower() like this is really the wrong thing to do. We should be doing "case folding" instead, which normalizes these differences for the purpose of case-insensitive comparisons.
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > Running lower() like this is really the wrong thing to do. We should be > doing "case folding" instead, which normalizes these differences for the > purpose of case-insensitive comparisons. That just begs the question: if tolower (or towlower) isn't the appropriate API, what is? Perhaps ICU has something for a more generalized notion of case-similarity, but I'm not aware of any such thing in the POSIX API. BTW, I think it's only accidental that the regex example shown upthread gets the right answer. In that example, what's happening is that we consider a letter in a case-insensitive regex to match itself, or tolower() of itself, or toupper() of itself. Both σ and ς have Σ as toupper() so they both work. But if you'd written Σ in the regex, only one of σ and ς would match that as a data character. (Haven't actually tested this, but given the way the code works I'm pretty sure it's so.) Again, it's hard to see how to do better atop a POSIX locale library. regards, tom lane
On Wed, Dec 1, 2021 at 8:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> Running lower() like this is really the wrong thing to do. We should be
> doing "case folding" instead, which normalizes these differences for the
> purpose of case-insensitive comparisons.
That just begs the question: if tolower (or towlower) isn't the
appropriate API, what is? Perhaps ICU has something for a more
generalized notion of case-similarity, but I'm not aware of any such
thing in the POSIX API.
BTW, I think it's only accidental that the regex example shown upthread
gets the right answer. In that example, what's happening is that we
consider a letter in a case-insensitive regex to match itself, or
tolower() of itself, or toupper() of itself. Both σ and ς have Σ
as toupper() so they both work. But if you'd written Σ in the regex,
only one of σ and ς would match that as a data character. (Haven't
actually tested this, but given the way the code works I'm pretty
sure it's so.) Again, it's hard to see how to do better atop a POSIX
locale library.
Thanks for digging into the issue.
Based on GNU docs [1] the POSIX APIs are not ready for that. Anyway, is it possible to keep current behaviour with lowercase in POSIX as a fallback and have the correct solution for ICU? I think (not an expert though) there should be already working code for case folding for some time already.
Text files are nowadays usually encoded in Unicode, and may consist of very different scripts – from Latin letters to Chinese Hanzi –, with many kinds of special characters – accents, right-to-left writing marks, hyphens, Roman numbers, and much more. But the POSIX platform APIs for text do not contain adequate functions for dealing with particular properties of many Unicode characters. In fact, the POSIX APIs for text have several assumptions at their base which don't hold for Unicode text.
"""I realise this may not be applicable to the original problem, but non-deterministic collations seems to offer a solution:: dakkar@[local] dakkar=> create collation "en-US-ins-icu" ( provider=icu, locale='en-US-u-ks-level2', deterministic=false ); dakkar@[local] dakkar=> select 'ΣΣ' = 'σσ' collate "en-US-ins-icu"; ┌──────────┐ │ ?column? │ ├──────────┤ │ t │ └──────────┘ (1 row) dakkar@[local] dakkar=> select 'ΣΣ' = 'σς' collate "en-US-ins-icu"; ┌──────────┐ │ ?column? │ ├──────────┤ │ t │ └──────────┘ (1 row) dakkar@[local] dakkar=> select 'ΣΣ' = 'α' collate "en-US-ins-icu"; ┌──────────┐ │ ?column? │ ├──────────┤ │ f │ └──────────┘ (1 row) Notice, though: * I don't understand what that ``-u-`` is doing in ``locale``, but it's necessary * as the docs https://www.postgresql.org/docs/13/collation.html#COLLATION-NONDETERMINISTIC say: - B-tree cannot use deduplication with indexes that use a nondeterministic collation - certain operations are not possible with nondeterministic collations, such as pattern matching operations (this means you can't use ``LIKE``) -- Dakkar - <Mobilis in mobile> GPG public key fingerprint = A071 E618 DD2C 5901 9574 6FE2 40EA 9883 7519 3F88 key id = 0x75193F88