Thread: Case Insensitive Comparison with Postgres 12

Case Insensitive Comparison with Postgres 12

From

Igal Sapir

Date:

08 October 2019, 22:51:52

I am trying to test a simple case insensitive comparison. Most likely the collation that I chose is wrong, but I'm not sure how to choose the correct one (for English/US?). Here is my snippet:

create collation case_insensitive(

provider=icu, locale='en-US-x-icu', deterministic=false

);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Thanks,

Igal

Re: Case Insensitive Comparison with Postgres 12

From

Morris de Oryx

Date:

09 October 2019, 02:55:53

As I understand it, custom collation are not applied globally. Meaning, you have to associate a collation with a column or en expression with COLLATE.

Re: Case Insensitive Comparison with Postgres 12

From

Pavel Křehula

Date:

09 October 2019, 07:10:36

Hello,
use correct locale identifier, in your case it should be:
create collation "case_insensitive" (provider=icu, locale="en-US-u-ks-level2", deterministic = false);

See http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
for available options.

--
Pavel

Dne 09.10.2019 0:51:52, "Igal Sapir" <igal@lucee.org> napsal:

I am trying to test a simple case insensitive comparison. Most likely the collation that I chose is wrong, but I'm not sure how to choose the correct one (for English/US?). Here is my snippet:

create collation case_insensitive(
provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Thanks,

Igal

Re: Case Insensitive Comparison with Postgres 12

From

Wim Bertels

Date:

09 October 2019, 07:19:29

Using the datatype citext might be an alternative solution

Igal Sapir <igal@lucee.org> schreef op October 8, 2019 10:51:52 PM UTC:

I am trying to test a simple case insensitive comparison. Most likely the collation that I chose is wrong, but I'm not sure how to choose the correct one (for English/US?). Here is my snippet:

create collation case_insensitive(
provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;

I expected true but am getting false.

Any thoughts?

Thanks,

Igal

--
Verstuurd vanaf een fairphone met K-9 Mail. Excuseer mijn beknoptheid.

Re: Case Insensitive Comparison with Postgres 12

From

Laurenz Albe

Date:

09 October 2019, 07:34:31

Igal Sapir wrote:
> I am trying to test a simple case insensitive comparison.  Most likely the
> collation that I chose is wrong, but I'm not sure how to choose the correct
> one (for English/US?).  Here is my snippet:
> 
> create collation case_insensitive(
>     provider=icu, locale='en-US-x-icu', deterministic=false
> );
> select 'Abc' = 'abc' collate case_insensitive;
> 
> I expected true but am getting false.
> 
> Any thoughts?

Yes, the LOCALE is wrong. Use

create collation case_insensitive (
   provider=icu, locale='en-US-u-ks-level2', deterministic=false
);

The name of the locale defines it.

My blog post can give a simple introduction:
https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com

Re: Case Insensitive Comparison with Postgres 12

From

"Igal @ Lucee.org"

Date:

10 October 2019, 12:41:47

On 10/9/2019 12:34 AM, Laurenz Albe wrote:
> Igal Sapir wrote:
>> I am trying to test a simple case insensitive comparison.  Most likely the
>> collation that I chose is wrong, but I'm not sure how to choose the correct
>> one (for English/US?).  Here is my snippet:
>>
>> create collation case_insensitive(
>>      provider=icu, locale='en-US-x-icu', deterministic=false
>> );
>> select 'Abc' = 'abc' collate case_insensitive;
>>
>> I expected true but am getting false.
>>
>> Any thoughts?
> Yes, the LOCALE is wrong. Use
>
> create collation case_insensitive (
>     provider=icu, locale='en-US-u-ks-level2', deterministic=false
> );
>
> The name of the locale defines it.
>
> My blog post can give a simple introduction:
> https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/

Thank you all for replying.  I tried to use the locale suggested by both 
Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting false for a 
simple comparison of 'Abc' = 'abc'.  I tried the locale both as a 
'string' and as an "identifier":

 > select version();

version |
-------------------------------------------------------------------------------------------------------|
PostgreSQL 12.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 
20150623 (Red Hat 4.8.5-39), 64-bit|

 > drop collation if exists case_insensitive;

 > create collation case_insensitive (
    provider=icu, locale="en-US-u-ks-level2", deterministic=false
);

 > select 'Abc' = 'abc' collate case_insensitive as is_equal;

is_equal|
--------|
false   |

What am I doing wrong here?

Thanks,

Igal

Re: Case Insensitive Comparison with Postgres 12

From

Thomas Kellerer

Date:

10 October 2019, 13:22:41

Igal @ Lucee.org schrieb am 10.10.2019 um 14:41:
> Thank you all for replying.  I tried to use the locale suggested by
> both Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting
> false for a simple comparison of 'Abc' = 'abc'.  I tried the locale
> both as a 'string' and as an "identifier":
> 
>> drop collation if exists case_insensitive;
> 
>> create collation case_insensitive (
>    provider=icu, locale="en-US-u-ks-level2", deterministic=false
> );
> 
>> select 'Abc' = 'abc' collate case_insensitive as is_equal;
> 
> is_equal|
> --------|
> false   |
> 
> What am I doing wrong here?

Check the version of libicu that your Linux is using. 
That locale format requires version 54 or later.
(My up-to-date CentOS 7.7 for example is still using version 50 and the EDB Windows binaries include version 53). 

In another thread about ICU problems, Daniel Verite explained that in more detail:

> With ICU 53 or older, instead of the locale above, we must use the old-style syntax:
> 
>  locale = 'de-DE@colStrength=secondary'

In your case I guess, it should be 

   locale = 'en-US@colStrength=secondary'

Thomas

Re: Case Insensitive Comparison with Postgres 12

From

"Igal @ Lucee.org"

Date:

10 October 2019, 15:34:26

Thomas,

On 10/10/2019 6:22 AM, Thomas Kellerer wrote:

> Igal @ Lucee.org schrieb am 10.10.2019 um 14:41:
>> Thank you all for replying.  I tried to use the locale suggested by
>> both Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting
>> false for a simple comparison of 'Abc' = 'abc'.  I tried the locale
>> both as a 'string' and as an "identifier":
>>
>>> drop collation if exists case_insensitive;
>>> create collation case_insensitive (
>>     provider=icu, locale="en-US-u-ks-level2", deterministic=false
>> );
>>
>>> select 'Abc' = 'abc' collate case_insensitive as is_equal;
>> is_equal|
>> --------|
>> false   |
>>
>> What am I doing wrong here?
> Check the version of libicu that your Linux is using.
> That locale format requires version 54 or later.
> (My up-to-date CentOS 7.7 for example is still using version 50 and the EDB Windows binaries include version 53).
>
> In another thread about ICU problems, Daniel Verite explained that in more detail:
>
>> With ICU 53 or older, instead of the locale above, we must use the old-style syntax:
>>
>>   locale = 'de-DE@colStrength=secondary'
> In your case I guess, it should be
>
>     locale = 'en-US@colStrength=secondary'

That works, thank you!

I also have CentOS installed on that machine: CentOS Linux release 
7.7.1908 (Core), showing libicu Version 50.2 via `yum info libicu`.

Best,

Igal

Re: Case Insensitive Comparison with Postgres 12

From

Igal Sapir

Date:

11 October 2019, 16:10:52

On Fri, Oct 11, 2019 at 1:09 AM stan <stanb@panix.com> wrote:

On Thu, Oct 10, 2019 at 05:41:47AM -0700, Igal @ Lucee.org wrote:
> On 10/9/2019 12:34 AM, Laurenz Albe wrote:
> > Igal Sapir wrote:
> > > I am trying to test a simple case insensitive comparison. Most likely the
> > > collation that I chose is wrong, but I'm not sure how to choose the correct
> > > one (for English/US?). Here is my snippet:
> > >
> > > create collation case_insensitive(
> > > provider=icu, locale='en-US-x-icu', deterministic=false
> > > );
> > > select 'Abc' = 'abc' collate case_insensitive;
> > >
> > > I expected true but am getting false.
> > >
> > > Any thoughts?
> > Yes, the LOCALE is wrong. Use
> >
> > create collation case_insensitive (
> > provider=icu, locale='en-US-u-ks-level2', deterministic=false
> > );
> >
> > The name of the locale defines it.
> >
> > My blog post can give a simple introduction:
> > https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/
>
> Thank you all for replying.?? I tried to use the locale suggested by both
> Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting false for a
> simple comparison of 'Abc' = 'abc'.?? I tried the locale both as a 'string'
> and as an "identifier":
>
> > select version();
>
> version |
> -------------------------------------------------------------------------------------------------------|
> PostgreSQL 12.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623
> (Red Hat 4.8.5-39), 64-bit|
>
> > drop collation if exists case_insensitive;
>
> > create collation case_insensitive (
> ???? provider=icu, locale="en-US-u-ks-level2", deterministic=false
> );
>
> > select 'Abc' = 'abc' collate case_insensitive as is_equal;
>
> is_equal|
> --------|
> false???? |
>
> What am I doing wrong here?
>
Out of curiosity is there a eason not to use the citext type for th?

Using the collation seems like a much cleaner approach, and I trust ICU to do a better job at comparing strings according to language rules etc.

Igal

Re: Case Insensitive Comparison with Postgres 12

From

"Daniel Verite"

Date:

12 October 2019, 13:17:55

    Igal Sapir wrote:

> > Out of curiosity is there a eason not to use the citext type for th?
> >
> >
> Using the collation seems like a much cleaner approach, and I trust ICU to
> do a better job at comparing strings according to language rules etc.

One notable difference between citext and case-insensitive collations
by ICU is that the latter recognizes canonically equivalent sequences
of codepoints [1] as equal, while the former does not.

For instance:

=# CREATE COLLATION ci (locale='und@colStrength=secondary',
      provider='icu', deterministic=false);

=# SELECT E'E\u0302TES'::citext = 'Êtes'::citext AS "citext-equal",
      E'E\u0302TES' = 'Êtes' collate "ci" AS "ci-equal";
 citext-equal | ci-equal
--------------+----------
 f          | t

Another significant difference is that building or rebuilding an index on a
text column with a CI collation appears to be way faster than with citext
(I've seen 10:1 ratios, but do your own tests).

On the minus side, substring matching with LIKE or other methods
is not possible with CI collations whereas it does work with citext.

[1] https://en.wikipedia.org/wiki/Unicode_equivalence

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite