Re: insensitive collations - Mailing list pgsql-hackers

From Daniel Verite
Subject Re: insensitive collations
Date
Msg-id ef84c67b-cfa9-4a3f-b0ae-e9ff81e9d948@manitou-mail.org
In response to Re: insensitive collations (Andreas Karlsson <andreas@proxel.se>)
List pgsql-hackers
    Andreas Karlsson wrote:

> > Nondeterministic collations do address this by allowing canonically
> > equivalent code point sequences to compare as equal.  You still need a
> > collation implementation that actually does compare them as equal; ICU
> > does this, glibc does not AFAICT.
>
> Ah, right! You could use -ks-identic[1] for this.

Strings that differ only in their normalization form (precomposed vs.
decomposed) are considered equal even at the "identic" level:

postgres=# create collation identic  (locale='und-u-ks-identic',
    provider='icu', deterministic=false);
CREATE COLLATION

postgres=# select 'é' = E'e\u0301' collate "identic";
 ?column?
----------
 t
(1 row)
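
For anyone who wants to check the two literals outside the database,
here is a small Python sketch using only the standard unicodedata
module. It is an illustration of canonical equivalence, not of what ICU
does internally; the helper name canonically_equivalent is made up for
this example.

```python
import unicodedata

composed = "\u00e9"      # 'é' as one precomposed code point (NFC form)
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT (NFD form)

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# A plain code-point comparison sees two different sequences.
print(composed == decomposed)                        # False
# After normalization the two spellings are the same text.
print(canonically_equivalent(composed, decomposed))  # True
```

The first comparison corresponds to what a deterministic collation ends
up doing (bytewise tie-break), the second to what the nondeterministic
ICU collation reports above.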


There's a separate setting, "colNormalization" ("kk" in BCP 47 syntax).

From
http://www.unicode.org/reports/tr35/tr35-collation.html#Normalization_Setting

  "The UCA always normalizes input strings into NFD form before the
  rest of the algorithm. However, this results in poor performance.
  With normalization=off, strings that are in [FCD] and do not contain
  Tibetan precomposed vowels (U+0F73, U+0F75, U+0F81) should sort
  correctly. With normalization=on, an implementation that does not
  normalize to NFD must at least perform an incremental FCD check and
  normalize substrings as necessary"

But even setting this to false does not mean that NFD and NFC forms
of the same text compare as different:

postgres=# create collation identickk  (locale='und-u-ks-identic-kk-false',
    provider='icu', deterministic=false);
CREATE COLLATION

postgres=# select 'é' = E'e\u0301' collate "identickk";
 ?column?
----------
 t
(1 row)

AFAIU, with normalization turned off, such strings may only compare as
different when they're not in FCD form (http://unicode.org/notes/tn5/#FCD).
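
To make the FCD condition a bit more concrete, here is a rough Python
sketch of my reading of it: for every pair of adjacent code points, the
combining class of the last character in the first one's canonical
decomposition must not exceed the combining class of the first character
in the second one's decomposition, unless the latter is 0. The function
names are hypothetical and this is not how ICU implements its check; it
only uses the standard unicodedata module.

```python
import unicodedata

def lead_trail_ccc(ch: str) -> tuple[int, int]:
    """Combining classes of the first and last code points of ch's
    full canonical decomposition (hypothetical helper)."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])

def is_fcd(s: str) -> bool:
    """Return True if s satisfies the pairwise FCD condition."""
    prev_trail = 0
    for ch in s:
        lead, trail = lead_trail_ccc(ch)
        # A non-starter that sorts before the preceding mark means the
        # decomposed text would need canonical reordering: not FCD.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

print(is_fcd("\u00e9"))          # precomposed é
print(is_fcd("e\u0301"))         # e + combining acute
print(is_fcd("a\u0301\u0323"))   # acute (ccc 230) before dot below (ccc 220)
```

Both spellings of 'é' pass the check, which matches the query result
above. The third string has its combining marks out of canonical order,
so it is the kind of input where turning normalization off could matter.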

There are also ICU-specific explanations about FCD here:
http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm#Normalization

It looks like setting colNormalization to false might provide a
performance benefit when you know your contents are in FCD
form, which is mostly the case according to ICU:

  "Note that all NFD strings are in FCD, and in practice most NFC
  strings will also be in FCD; for that matter most strings (of whatever
  ilk) will be in FCD.
  We guarantee that if any input strings are in FCD, that we will get
  the right results in collation without having to normalize".


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

