I wrote:
> Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we
> only match ASCII characters 0-9, or 10 codepoints. With
> "en-US-u-va-posix-x-icu" we match 660 codepoints comprising all the
> digit characters in all languages, plus a bunch of variants for
> mathematical symbols.
BTW, this is not specifically a C.UTF-8 versus "en-US-u-va-posix-x-icu"
difference.
I think that any glibc-based locale will consider that \d
in a regexp means [0-9], whereas any ICU locale
will make \d match a much larger variety of characters.
When moving to ICU by default, we should expect differences
like this to affect applications in ways that might be
more or less disruptive.
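For instance, here is a minimal sketch of that difference, assuming
the "C.utf8" and "en-x-icu" collations exist and the server uses the
collation's character classification for regexps; '٤٢' consists of
Arabic-Indic digits:

-- should match only under the ICU collation
SELECT ('٤٢' COLLATE "C.utf8")   ~ '^\d+$' AS "libc \d",
       ('٤٢' COLLATE "en-x-icu") ~ '^\d+$' AS "icu \d";

The first expression should return false and the second true, since
glibc restricts \d to [0-9] while ICU accepts any Unicode digit.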
Another known difference is that upper() with ICU does not do a
character-by-character conversion; for instance:
WITH words(w) AS (VALUES ('muß'), ('ﬁnal'))
SELECT
  w,
  length(w),
  upper(w COLLATE "C.utf8") AS "upper (libc)",
  length(upper(w COLLATE "C.utf8")),
  upper(w COLLATE "en-x-icu") AS "upper (ICU)",
  length(upper(w COLLATE "en-x-icu"))
FROM words;
   w   | length | upper (libc) | length | upper (ICU) | length
-------+--------+--------------+--------+-------------+--------
 muß   |      3 | MUß          |      3 | MUSS        |      4
 ﬁnal  |      4 | ﬁNAL         |      4 | FINAL       |      5
The fact that the resulting string is longer than the original
might cause problems.
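As a hypothetical illustration (again assuming the "en-x-icu"
collation exists), writing the uppercased value back into a
length-constrained column can fail:

CREATE TEMP TABLE t (s varchar(3));
INSERT INTO t VALUES ('muß');
-- should fail with "value too long for type character varying(3)",
-- since upper('muß') is 'MUSS' (4 characters) under ICU
UPDATE t SET s = upper(s COLLATE "en-x-icu");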
In general, we can't ignore the fact that ICU semantics
are different.
Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite