Thread: Case Insensitive Comparison with Postgres 12
I am trying to test a simple case insensitive comparison. Most likely the collation that I chose is wrong, but I'm not sure how to choose the correct one (for English/US?). Here is my snippet:
create collation case_insensitive(
provider=icu, locale='en-US-x-icu', deterministic=false
);
select 'Abc' = 'abc' collate case_insensitive;
select 'Abc' = 'abc' collate case_insensitive;
I expected true but am getting false.
Any thoughts?
Thanks,
Igal
As I understand it, custom collation are not applied globally. Meaning, you have to associate a collation with a column or en expression with COLLATE.
Hello,
use correct locale identifier, in your case it should be:
create collation "case_insensitive" (provider=icu, locale="en-US-u-ks-level2", deterministic = false);
See http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
for available options.
--
Pavel
use correct locale identifier, in your case it should be:
create collation "case_insensitive" (provider=icu, locale="en-US-u-ks-level2", deterministic = false);
See http://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options
for available options.
--
Pavel
Dne 09.10.2019 0:51:52, "Igal Sapir" <igal@lucee.org> napsal:
I am trying to test a simple case insensitive comparison. Most likely the collation that I chose is wrong, but I'm not sure how to choose the correct one (for English/US?). Here is my snippet:create collation case_insensitive(provider=icu, locale='en-US-x-icu', deterministic=false);
select 'Abc' = 'abc' collate case_insensitive;I expected true but am getting false.Any thoughts?Thanks,Igal
Using the datatype citext might be an alternative solution
--
Verstuurd vanaf een fairphone met K-9 Mail. Excuseer mijn beknoptheid.
Igal Sapir <igal@lucee.org> schreef op October 8, 2019 10:51:52 PM UTC:
I am trying to test a simple case insensitive comparison. Most likely the collation that I chose is wrong, but I'm not sure how to choose the correct one (for English/US?). Here is my snippet:create collation case_insensitive(provider=icu, locale='en-US-x-icu', deterministic=false);
select 'Abc' = 'abc' collate case_insensitive;I expected true but am getting false.Any thoughts?Thanks,Igal
--
Verstuurd vanaf een fairphone met K-9 Mail. Excuseer mijn beknoptheid.
Igal Sapir wrote: > I am trying to test a simple case insensitive comparison. Most likely the > collation that I chose is wrong, but I'm not sure how to choose the correct > one (for English/US?). Here is my snippet: > > create collation case_insensitive( > provider=icu, locale='en-US-x-icu', deterministic=false > ); > select 'Abc' = 'abc' collate case_insensitive; > > I expected true but am getting false. > > Any thoughts? Yes, the LOCALE is wrong. Use create collation case_insensitive ( provider=icu, locale='en-US-u-ks-level2', deterministic=false ); The name of the locale defines it. My blog post can give a simple introduction: https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/ Yours, Laurenz Albe -- Cybertec | https://www.cybertec-postgresql.com
On 10/9/2019 12:34 AM, Laurenz Albe wrote: > Igal Sapir wrote: >> I am trying to test a simple case insensitive comparison. Most likely the >> collation that I chose is wrong, but I'm not sure how to choose the correct >> one (for English/US?). Here is my snippet: >> >> create collation case_insensitive( >> provider=icu, locale='en-US-x-icu', deterministic=false >> ); >> select 'Abc' = 'abc' collate case_insensitive; >> >> I expected true but am getting false. >> >> Any thoughts? > Yes, the LOCALE is wrong. Use > > create collation case_insensitive ( > provider=icu, locale='en-US-u-ks-level2', deterministic=false > ); > > The name of the locale defines it. > > My blog post can give a simple introduction: > https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/ Thank you all for replying. I tried to use the locale suggested by both Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting false for a simple comparison of 'Abc' = 'abc'. I tried the locale both as a 'string' and as an "identifier": > select version(); version | -------------------------------------------------------------------------------------------------------| PostgreSQL 12.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit| > drop collation if exists case_insensitive; > create collation case_insensitive ( provider=icu, locale="en-US-u-ks-level2", deterministic=false ); > select 'Abc' = 'abc' collate case_insensitive as is_equal; is_equal| --------| false | What am I doing wrong here? Thanks, Igal
Igal @ Lucee.org schrieb am 10.10.2019 um 14:41: > Thank you all for replying. I tried to use the locale suggested by > both Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting > false for a simple comparison of 'Abc' = 'abc'. I tried the locale > both as a 'string' and as an "identifier": > >> drop collation if exists case_insensitive; > >> create collation case_insensitive ( > provider=icu, locale="en-US-u-ks-level2", deterministic=false > ); > >> select 'Abc' = 'abc' collate case_insensitive as is_equal; > > is_equal| > --------| > false | > > What am I doing wrong here? Check the version of libicu that your Linux is using. That locale format requires version 54 or later. (My up-to-date CentOS 7.7 for example is still using version 50 and the EDB Windows binaries include version 53). In another thread about ICU problems, Daniel Verite explained that in more detail: > With ICU 53 or older, instead of the locale above, we must use the old-style syntax: > > locale = 'de-DE@colStrength=secondary' In your case I guess, it should be locale = 'en-US@colStrength=secondary' Thomas
Thomas, On 10/10/2019 6:22 AM, Thomas Kellerer wrote: > Igal @ Lucee.org schrieb am 10.10.2019 um 14:41: >> Thank you all for replying. I tried to use the locale suggested by >> both Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting >> false for a simple comparison of 'Abc' = 'abc'. I tried the locale >> both as a 'string' and as an "identifier": >> >>> drop collation if exists case_insensitive; >>> create collation case_insensitive ( >> provider=icu, locale="en-US-u-ks-level2", deterministic=false >> ); >> >>> select 'Abc' = 'abc' collate case_insensitive as is_equal; >> is_equal| >> --------| >> false | >> >> What am I doing wrong here? > Check the version of libicu that your Linux is using. > That locale format requires version 54 or later. > (My up-to-date CentOS 7.7 for example is still using version 50 and the EDB Windows binaries include version 53). > > In another thread about ICU problems, Daniel Verite explained that in more detail: > >> With ICU 53 or older, instead of the locale above, we must use the old-style syntax: >> >> locale = 'de-DE@colStrength=secondary' > In your case I guess, it should be > > locale = 'en-US@colStrength=secondary' That works, thank you! I also have CentOS installed on that machine: CentOS Linux release 7.7.1908 (Core), showing libicu Version 50.2 via `yum info libicu`. Best, Igal
On Fri, Oct 11, 2019 at 1:09 AM stan <stanb@panix.com> wrote:
On Thu, Oct 10, 2019 at 05:41:47AM -0700, Igal @ Lucee.org wrote:
> On 10/9/2019 12:34 AM, Laurenz Albe wrote:
> > Igal Sapir wrote:
> > > I am trying to test a simple case insensitive comparison. Most likely the
> > > collation that I chose is wrong, but I'm not sure how to choose the correct
> > > one (for English/US?). Here is my snippet:
> > >
> > > create collation case_insensitive(
> > > provider=icu, locale='en-US-x-icu', deterministic=false
> > > );
> > > select 'Abc' = 'abc' collate case_insensitive;
> > >
> > > I expected true but am getting false.
> > >
> > > Any thoughts?
> > Yes, the LOCALE is wrong. Use
> >
> > create collation case_insensitive (
> > provider=icu, locale='en-US-u-ks-level2', deterministic=false
> > );
> >
> > The name of the locale defines it.
> >
> > My blog post can give a simple introduction:
> > https://www.cybertec-postgresql.com/en/icu-collations-against-glibc-2-28-data-corruption/
>
> Thank you all for replying.?? I tried to use the locale suggested by both
> Laurenz and Pavel, en-US-u-ks-level2, but I'm still getting false for a
> simple comparison of 'Abc' = 'abc'.?? I tried the locale both as a 'string'
> and as an "identifier":
>
> > select version();
>
> version |
> -------------------------------------------------------------------------------------------------------|
> PostgreSQL 12.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623
> (Red Hat 4.8.5-39), 64-bit|
>
> > drop collation if exists case_insensitive;
>
> > create collation case_insensitive (
> ???? provider=icu, locale="en-US-u-ks-level2", deterministic=false
> );
>
> > select 'Abc' = 'abc' collate case_insensitive as is_equal;
>
> is_equal|
> --------|
> false???? |
>
> What am I doing wrong here?
>
Out of curiosity is there a eason not to use the citext type for th?
Using the collation seems like a much cleaner approach, and I trust ICU to do a better job at comparing strings according to language rules etc.
Igal
Igal Sapir wrote: > > Out of curiosity is there a eason not to use the citext type for th? > > > > > Using the collation seems like a much cleaner approach, and I trust ICU to > do a better job at comparing strings according to language rules etc. One notable difference between citext and case-insensitive collations by ICU is that the latter recognizes canonically equivalent sequences of codepoints [1] as equal, while the former does not. For instance: =# CREATE COLLATION ci (locale='und@colStrength=secondary', provider='icu', deterministic=false); =# SELECT E'E\u0302TES'::citext = 'Êtes'::citext AS "citext-equal", E'E\u0302TES' = 'Êtes' collate "ci" AS "ci-equal"; citext-equal | ci-equal --------------+---------- f | t Another significant difference is that building or rebuilding an index on a text column with a CI collation appears to be way faster than with citext (I've seen 10:1 ratios, but do your own tests). On the minus side, substring matching with LIKE or other methods is not possible with CI collations whereas it does work with citext. [1] https://en.wikipedia.org/wiki/Unicode_equivalence Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite