Re: Upgrading locale issues - Mailing list pgsql-general
From | rihad |
---|---|
Subject | Re: Upgrading locale issues |
Date | |
Msg-id | 7833dd2f-633f-d41d-0a93-26155979f584@mail.ru Whole thread Raw |
In response to | Re: Upgrading locale issues ("Daniel Verite" <daniel@manitou-mail.org>) |
List | pgsql-general |
On 05/03/2019 05:35 PM, Daniel Verite wrote: > rihad wrote: > >> Thanks, I'm a bit confused here. AFAIK indexes are used for at least two >> things: for speed and for skipping the ORDER BY step (since btree >> indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index >> still work correctly for table lookups? > If the lookup is based on a equality test or a pattern match with LIKE > or a regexp, it makes no difference. But the locale makes a > difference with inequality tests, such as < > or BETWEEN. > > Around version 9.1 and in the pre-ICU days, Robert Haas wrote > this post that explained it pretty well, I think: > http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html > > Quote: > > If you happen to need the particular sorting behavior that > collation-aware sorting and comparisons provide, then you may find > this price worth paying, but I suspect there are a lot of people out > there who are paying it more or less accidentally and don't really > care very much about the underlying sorting behavior. If, for > example, all of your queries are based on equality, and you don't > use greater-than or less-than tests, then it doesn't matter what > collation is in use. You might as well use "C" instead of whatever > your local default may be, because it's faster. > > For non-English text, I would recommend C.UTF-8 over "C" because of > its better handling of Unicode characters. For instance: > > =# select upper('été' collate "C"), upper('été' collate "C.UTF-8"); > upper | upper > -------+------- > éTé | ÉTÉ > > The "price" of linguistic comparisons that Robert mentioned was about > performance, but the troubles we have with the lack of long-term > immutability of these locales are worth being added to that. > >> And can the existing en_US.UTF-8 fields' definition be altered in >> place, without a dump+restore? > Changing the COLLATE clause of a column with ALTER TABLE does > implicitly rebuild an index on this column if there is one, > A dump+restore is not needed, nor an explicit REINDEX. > > The dump+restore is needed in another scenario, where you would > decide to change the LC_COLLATE and LC_CTYPE of the database, > instead of doing it only for some columns. > This scenario makes perfect sense if the locale of the database > has been set implicitly and it uses linguistic sorts without > really needing them ("accidentally" as said in the post). > > >> en_US.UTF-8 is the default encoding+locale+collation, it >> isn't set explicitly for any of our string columns. I assume there's >> some "catch-all" ordering taking place even for the C locale, so there >> won't be any bizarre things like b coming before a, or generally for any >> language, the second letter of its alphabet coming before the first? > 'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is > true for some locales such as C or C.UTF-8. > > > Best regards, Thanks a lot for sharing your insights!
pgsql-general by date: