Thread: case insensitive collation of Greek's sigma

case insensitive collation of Greek's sigma

From
Jakub Jedelsky
Date:
Hello,

during our tests of Postgres with ICU we found an issue with ILIKE of upper and lowercase sigma (Σ). The letter has two lowercase variants σ and ς (at the end of a word). I'm working with en_US and en-US-x-icu collations and results are a bit unexpected - they are inverted:

postgres=# SELECT
postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US",
postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US"
postgres-# ;
 ?column? | ?column?
----------+----------
 t        | f
(1 row)

postgres=# SELECT
postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu",
postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu";
 ?column? | ?column?
----------+----------
 f        | t
(1 row)

I run those commands on the latest (14.1) official docker image.

Is it possible to unify the behaviour?And which one is correct from the community point of view?

If I could start, I think both results are wrong as both should return True. If I got it right, in the background there is a lower() function running to compare strings, which is not enough for such cases (until the left side isn't taken as a standalone word).

Thanks,

- jj

Re: case insensitive collation of Greek's sigma

From
Achilleas Mantzios
Date:
On 26/11/21 9:37 π.μ., Jakub Jedelsky wrote:
> Hello,
Thank you for dealing with Greek!
>
> during our tests of Postgres with ICU we found an issue with ILIKE of upper and lowercase sigma (Σ). The letter has
twolowercase variants σ and ς (at the end of a word). I'm working with en_US and 
 
> en-US-x-icu collations and results are a bit unexpected - they are inverted:
>
> postgres=# SELECT
> postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US",
> postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US"
> postgres-# ;
>  ?column? | ?column?
> ----------+----------
>  t        | f
> (1 row)
>
> postgres=# SELECT
> postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu",
> postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu";
>  ?column? | ?column?
> ----------+----------
>  f        | t
> (1 row)
>
> I run those commands on the latest (14.1) official docker image.
>
> Is it possible to unify the behaviour?And which one is correct from the community point of view?
IMHO all those letters are wrong, the correct S used to be the letter C (equal to the Cyrillic S).
Σ is a new invention, same like σ, the final ς looks a lot like the calligraphic version of c .
>
> If I could start, I think both results are wrong as both should return True. If I got it right, in the background
thereis a lower() function running to compare strings, which is not enough for such 
 
> cases (until the left side isn't taken as a standalone word).
I agree with you all of them should be deducted to a single letter s. Firefox's find recognizes all three (Σ,σ,ς) as
thesame letter.
 
>
> Thanks,
>
> - jj


-- 
Achilleas Mantzios
DBA, Analyst, IT Lead
IT DEPT
Dynacom Tankers Mgmt




Re: case insensitive collation of Greek's sigma

From
Tom Lane
Date:
Jakub Jedelsky <jakub.jedelsky@gooddata.com> writes:
> during our tests of Postgres with ICU we found an issue with ILIKE of upper
> and lowercase sigma (Σ). The letter has two lowercase variants σ and ς (at
> the end of a word). I'm working with en_US and en-US-x-icu collations and
> results are a bit unexpected - they are inverted:

Not sure why you would expect en_US collation to give appropriate results
for Greek?  I think you'd need el_GR.

Be that as it may, Postgres doesn't define anything about this --- we
just rely on the collations supplied by the respective system libraries
(libc or ICU).  You'd need to file bugs against those libraries if you
think their behavior should change.

            regards, tom lane



Re: case insensitive collation of Greek's sigma

From
Frank Limpert
Date:
Am 26.11.21 um 08:37 schrieb Jakub Jedelsky:
> Hello,
>
> during our tests of Postgres with ICU we found an issue with ILIKE of 
> upper and lowercase sigma (Σ). The letter has two lowercase variants σ 
> and ς (at the end of a word). I'm working with en_US and en-US-x-icu 
> collations and results are a bit unexpected - they are inverted:
>
> postgres=# SELECT
> postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US",
> postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US"
> postgres-# ;
>  ?column? | ?column?
> ----------+----------
>  t        | f
> (1 row)
>
> postgres=# SELECT
> postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu",
> postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu";
>  ?column? | ?column?
> ----------+----------
>  f        | t
> (1 row)
>
> I run those commands on the latest (14.1) official docker image.
>
> Is it possible to unify the behaviour?And which one is correct from 
> the community point of view?
>
> If I could start, I think both results are wrong as both should return 
> True. If I got it right, in the background there is a lower() function 
> running to compare strings, which is not enough for such cases (until 
> the left side isn't taken as a standalone word).
>
> Thanks,
>
> - jj

Have you seen the subtle intricacies in this example?

=> SELECT 'ΣΣ Μ' ILIKE 'σσ Μ' COLLATE "en_US" AS c0,
           'ΣΣ Μ' ILIKE 'σς Μ' COLLATE "en_US" AS c1,
           'ΣΣ Μ' ~*    'σσ Μ' COLLATE "el-GR-x-icu" AS c2,
           'ΣΣ Μ' ~*    'σς Μ' COLLATE "el-GR-x-icu" AS c3,
           'ΣΣ Μ' ILIKE 'σσ Μ' COLLATE "el-GR-x-icu" AS c4,
           'ΣΣ Μ' ILIKE 'σς Μ' COLLATE "el-GR-x-icu" AS c5,
           'ΣΣ Μ' ~*    'σσ Μ' COLLATE "en-US-x-icu" AS c6,
           'ΣΣ Μ' ~*    'σς Μ' COLLATE "en-US-x-icu" AS c7,
           'ΣΣ Μ' ILIKE 'σσ Μ' COLLATE "en-US-x-icu" AS c8,
           'ΣΣ Μ' ILIKE 'σς Μ' COLLATE "en-US-x-icu" AS c9;
  c0 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | c9
----+----+----+----+----+----+----+----+----+----
  t  | f  | t  | t  | f  | t  | t  | t  | f  | t
(1 row)

Obviously, the ILIKE operator is really strict regarding to the correct 
letter at the end of the word. The regular expression operator works as 
you expected.

Happy computing...
Frank



Re: case insensitive collation of Greek's sigma

From
Peter Eisentraut
Date:
On 26.11.21 08:37, Jakub Jedelsky wrote:
> postgres=# SELECT
> postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en_US",
> postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en_US"
> postgres-# ;
>   ?column? | ?column?
> ----------+----------
>   t        | f
> (1 row)
> 
> postgres=# SELECT
> postgres-# 'ΣΣ' ILIKE 'σσ' COLLATE "en-US-x-icu",
> postgres-# 'ΣΣ' ILIKE 'σς' COLLATE "en-US-x-icu";
>   ?column? | ?column?
> ----------+----------
>   f        | t
> (1 row)
> 
> If I could start, I think both results are wrong as both should return 
> True. If I got it right, in the background there is a lower() function 
> running to compare strings, which is not enough for such cases (until 
> the left side isn't taken as a standalone word).

The reason for these results is that for multibyte encodings, a ILIKE b 
basically does lower(a) LIKE lower(b), and

select lower('ΣΣ' COLLATE "en_US"), lower('ΣΣ' COLLATE "en-US-x-icu");
  lower | lower
-------+-------
  σσ    | σς

Running lower() like this is really the wrong thing to do.  We should be 
doing "case folding" instead, which normalizes these differences for the 
purpose of case-insensitive comparisons.



Re: case insensitive collation of Greek's sigma

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> Running lower() like this is really the wrong thing to do.  We should be 
> doing "case folding" instead, which normalizes these differences for the 
> purpose of case-insensitive comparisons.

That just begs the question: if tolower (or towlower) isn't the
appropriate API, what is?  Perhaps ICU has something for a more
generalized notion of case-similarity, but I'm not aware of any such
thing in the POSIX API.

BTW, I think it's only accidental that the regex example shown upthread
gets the right answer.  In that example, what's happening is that we
consider a letter in a case-insensitive regex to match itself, or
tolower() of itself, or toupper() of itself.  Both σ and ς have Σ
as toupper() so they both work.  But if you'd written Σ in the regex,
only one of σ and ς would match that as a data character.  (Haven't
actually tested this, but given the way the code works I'm pretty
sure it's so.)  Again, it's hard to see how to do better atop a POSIX
locale library.

            regards, tom lane



Re: case insensitive collation of Greek's sigma

From
Jakub Jedelsky
Date:
On Wed, Dec 1, 2021 at 8:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> Running lower() like this is really the wrong thing to do.  We should be
> doing "case folding" instead, which normalizes these differences for the
> purpose of case-insensitive comparisons.

That just begs the question: if tolower (or towlower) isn't the
appropriate API, what is?  Perhaps ICU has something for a more
generalized notion of case-similarity, but I'm not aware of any such
thing in the POSIX API.

BTW, I think it's only accidental that the regex example shown upthread
gets the right answer.  In that example, what's happening is that we
consider a letter in a case-insensitive regex to match itself, or
tolower() of itself, or toupper() of itself.  Both σ and ς have Σ
as toupper() so they both work.  But if you'd written Σ in the regex,
only one of σ and ς would match that as a data character.  (Haven't
actually tested this, but given the way the code works I'm pretty
sure it's so.)  Again, it's hard to see how to do better atop a POSIX
locale library.

Thanks for digging into the issue.
 
Based on GNU docs [1] the POSIX APIs are not ready for that. Anyway, is it possible to keep current behaviour with lowercase in POSIX as a fallback and have the correct solution for ICU? I think (not an expert though) there should be already working code for case folding for some time already.

Text files are nowadays usually encoded in Unicode, and may consist of very different scripts – from Latin letters to Chinese Hanzi –, with many kinds of special characters – accents, right-to-left writing marks, hyphens, Roman numbers, and much more. But the POSIX platform APIs for text do not contain adequate functions for dealing with particular properties of many Unicode characters. In fact, the POSIX APIs for text have several assumptions at their base which don't hold for Unicode text.
"""

Re: case insensitive collation of Greek's sigma

From
Gianni Ceccarelli
Date:
I realise this may not be applicable to the original problem, but
non-deterministic collations seems to offer a solution::


  dakkar@[local] dakkar=> create collation "en-US-ins-icu" (
         provider=icu,
         locale='en-US-u-ks-level2',
         deterministic=false
         );

  dakkar@[local] dakkar=> select 'ΣΣ' = 'σσ' collate "en-US-ins-icu";
  ┌──────────┐
  │ ?column? │
  ├──────────┤
  │ t        │
  └──────────┘
  (1 row)

  dakkar@[local] dakkar=> select 'ΣΣ' = 'σς' collate "en-US-ins-icu";
  ┌──────────┐
  │ ?column? │
  ├──────────┤
  │ t        │
  └──────────┘
  (1 row)

  dakkar@[local] dakkar=> select 'ΣΣ' = 'α' collate "en-US-ins-icu";
  ┌──────────┐
  │ ?column? │
  ├──────────┤
  │ f        │
  └──────────┘
  (1 row)

Notice, though:

* I don't understand what that ``-u-`` is doing in ``locale``, but
  it's necessary
* as the docs
  https://www.postgresql.org/docs/13/collation.html#COLLATION-NONDETERMINISTIC
  say:

  - B-tree cannot use deduplication with indexes that use a
    nondeterministic collation
  - certain operations are not possible with nondeterministic
    collations, such as pattern matching operations (this means you
    can't use ``LIKE``)

--
    Dakkar - <Mobilis in mobile>
    GPG public key fingerprint = A071 E618 DD2C 5901 9574
                                 6FE2 40EA 9883 7519 3F88
                        key id = 0x75193F88