Re: Upgrading locale issues - Mailing list pgsql-general

From rihad
Subject Re: Upgrading locale issues
Date
Msg-id 7833dd2f-633f-d41d-0a93-26155979f584@mail.ru
Whole thread Raw
In response to Re: Upgrading locale issues  ("Daniel Verite" <daniel@manitou-mail.org>)
List pgsql-general
On 05/03/2019 05:35 PM, Daniel Verite wrote:
>     rihad wrote:
>
>> Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
>> things: for speed and for skipping the ORDER BY step (since btree
>> indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
>> still work correctly for table lookups?
> If the lookup is based on a equality test or a pattern match with LIKE
> or a regexp, it makes no difference.  But the locale makes a
> difference with inequality tests, such as < > or BETWEEN.
>
> Around version 9.1 and in the pre-ICU days, Robert Haas wrote
> this post that explained it pretty well, I think:
> http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html
>
> Quote:
>
>    If you happen to need the particular sorting behavior that
>    collation-aware sorting and comparisons provide, then you may find
>    this price worth paying, but I suspect there are a lot of people out
>    there who are paying it more or less accidentally and don't really
>    care very much about the underlying sorting behavior.  If, for
>    example, all of your queries are based on equality, and you don't
>    use greater-than or less-than tests, then it doesn't matter what
>    collation is in use.    You might as well use "C" instead of whatever
>    your local default may be, because it's faster.
>
> For non-English text, I would recommend C.UTF-8 over "C" because of
> its better handling of Unicode characters. For instance:
>
> =# select upper('été' collate "C"), upper('été' collate "C.UTF-8");
>   upper | upper
> -------+-------
>   éTé   | ÉTÉ
>
> The "price" of linguistic comparisons that Robert mentioned was about
> performance, but the troubles we have with the lack of long-term
> immutability of these locales are worth being added to that.
>
>> And can the existing en_US.UTF-8 fields' definition be altered in
>> place, without a dump+restore?
> Changing the COLLATE clause of a column with ALTER TABLE does
> implicitly rebuild an index on this column if there is one,
> A dump+restore is not needed, nor an explicit REINDEX.
>
> The dump+restore is needed in another scenario, where you would
> decide to change the LC_COLLATE and LC_CTYPE of the database,
> instead of doing it only for some columns.
> This scenario makes perfect sense if the locale of the database
> has been set implicitly and it uses linguistic sorts without
> really needing them ("accidentally" as said in the post).
>
>
>> en_US.UTF-8 is the default encoding+locale+collation, it
>> isn't set explicitly for any of our string columns. I assume there's
>> some "catch-all" ordering taking place even for the C locale, so there
>> won't be any bizarre things like b coming before a, or generally for any
>> language, the second letter of its alphabet coming before the first?
> 'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is
> true for some locales such as C or C.UTF-8.
>
>
> Best regards,


Thanks a lot for sharing your insights!




pgsql-general by date:

Previous
From: Ravi Krishna
Date:
Subject: Re: Back Slash \ issue
Next
From: Adrian Klaver
Date:
Subject: Re: Back Slash \ issue