Home > mailing lists

Re: Upgrading locale issues - Mailing list pgsql-general

From	Daniel Verite
Subject	Re: Upgrading locale issues
Date	May 3, 2019 13:35:47
Msg-id	fce33d4f-458a-4208-8a4e-4207de86eb4d@manitou-mail.org Whole thread Raw
In response to	Re: Upgrading locale issues (rihad <rihad@mail.ru>)
Responses	Re: Upgrading locale issues Re: Upgrading locale issues Re: Upgrading locale issues
List	pgsql-general

Tree view

    rihad wrote:

> Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
> things: for speed and for skipping the ORDER BY step (since btree
> indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
> still work correctly for table lookups?

If the lookup is based on a equality test or a pattern match with LIKE
or a regexp, it makes no difference.  But the locale makes a
difference with inequality tests, such as < > or BETWEEN.

Around version 9.1 and in the pre-ICU days, Robert Haas wrote
this post that explained it pretty well, I think:
http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html

Quote:

  If you happen to need the particular sorting behavior that
  collation-aware sorting and comparisons provide, then you may find
  this price worth paying, but I suspect there are a lot of people out
  there who are paying it more or less accidentally and don't really
  care very much about the underlying sorting behavior.  If, for
  example, all of your queries are based on equality, and you don't
  use greater-than or less-than tests, then it doesn't matter what
  collation is in use.    You might as well use "C" instead of whatever
  your local default may be, because it's faster.

For non-English text, I would recommend C.UTF-8 over "C" because of
its better handling of Unicode characters. For instance:

=# select upper('été' collate "C"), upper('été' collate "C.UTF-8");
 upper | upper
-------+-------
 éTé   | ÉTÉ

The "price" of linguistic comparisons that Robert mentioned was about
performance, but the troubles we have with the lack of long-term
immutability of these locales are worth being added to that.

> And can the existing en_US.UTF-8 fields' definition be altered in
> place, without a dump+restore?

Changing the COLLATE clause of a column with ALTER TABLE does
implicitly rebuild an index on this column if there is one,
A dump+restore is not needed, nor an explicit REINDEX.

The dump+restore is needed in another scenario, where you would
decide to change the LC_COLLATE and LC_CTYPE of the database,
instead of doing it only for some columns.
This scenario makes perfect sense if the locale of the database
has been set implicitly and it uses linguistic sorts without
really needing them ("accidentally" as said in the post).

> en_US.UTF-8 is the default encoding+locale+collation, it
> isn't set explicitly for any of our string columns. I assume there's
> some "catch-all" ordering taking place even for the C locale, so there
> won't be any bizarre things like b coming before a, or generally for any
> language, the second letter of its alphabet coming before the first?

'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is
true for some locales such as C or C.UTF-8.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

pgsql-general by date:

From: Adrian Klaver
Date: 03 May 2019, 13:33:58
Subject: Re: Migrating database(s) from Sybase ASE 15.7 to PostgreSQL 10.6 onLinux

From: Adrian Klaver
Date: 03 May 2019, 13:36:51
Subject: Re: Migrating database(s) from Sybase ASE 15.7 to PostgreSQL 10.6 onLinux

Re: Upgrading locale issues - Mailing list pgsql-general

Previous

Next