Thread: Upgrading locale issues
Hi. Today we ran pg_ctl promote on a slave server (10.7) and started using it as a master. The OS also got upgraded, FreeBSD 10.4 -> FreeBSD 11.2. And, you guessed it, most varchar indexes got corrupted because the system locale changed in subtle ways. So I created the amcheck extension and reindexed all the bad indexes one by one. Is there any way to prevent such things in the future? Will switching to ICU fix all such issues?

The problem with it is that ICU collations are absent from pg_collation; initdb would have to be run to create them, but pg_basebackup only runs on an empty data directory, so I couldn't run initdb + pg_basebackup to prepare the replica server. I believe I can run the CREATE COLLATION command manually, but what would it look like for en-x-icu?

CREATE COLLATION "en-x-icu" (provider = icu, locale = 'en-x-icu');

Is that it? But what about the version etc.?

 collname | collnamespace | collowner | collprovider | collencoding | collcollate | collctype | collversion
----------+---------------+-----------+--------------+--------------+-------------+-----------+-------------
 en-x-icu |            11 |        10 | i            |           -1 | en          | en        | 153.88

Thanks.
On Mon, Apr 29, 2019 at 7:45 AM rihad <rihad@mail.ru> wrote: > Hi. Today we run pg_ctl promote on a slave server (10.7) and started > using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD > 11.2. And you guessed it, most varchar indexes got corrupted because > system local changed in subtle ways. So I created the extension amcheck > and reindexed all bad indexes one by one. Is there any way to prevent > such things in the future? Will switching to ICU fix all such issues? Not necessarily, but it will detect the incompatibility more or less automatically, making it far more likely that the problem will be caught before it does any harm. ICU versions collations, giving Postgres a way to reason about their compatibility over time. The libc collations are not versioned, though (at least not in any standard way that Postgres can take advantage of). > The problem with it is that ICU collations are absent in pg_collation, > initdb should be run to create them, but pg_basebackup only runs on an > empty base directory, so I couldn't run initdb + pg_basebackup to > prepare the replica server. I believe I can run the create collation > command manually, but what would it look like for en-x-icu? It is safe to call pg_import_system_collations() directly, which is all that initdb does. This is documented, so you wouldn't be relying on a hack. -- Peter Geoghegan
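Peter's suggestion could look like this when preparing the promoted server; a minimal sketch, assuming a superuser session on the target database:

```sql
-- Import all collations the server's ICU and libc know about into pg_catalog,
-- exactly as initdb would. Returns the number of collations created; it skips
-- ones that already exist, so re-running it is harmless.
SELECT pg_import_system_collations('pg_catalog');

-- Confirm the ICU collations showed up, version strings included.
SELECT collname, collprovider, collversion
FROM pg_collation
WHERE collprovider = 'i' AND collname LIKE 'en%';
```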
On Thu, May 2, 2019 at 8:26 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, Apr 29, 2019 at 7:45 AM rihad <rihad@mail.ru> wrote: > > Hi. Today we run pg_ctl promote on a slave server (10.7) and started > > using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD > > 11.2. And you guessed it, most varchar indexes got corrupted because > > system local changed in subtle ways. So I created the extension amcheck > > and reindexed all bad indexes one by one. Is there any way to prevent > > such things in the future? Will switching to ICU fix all such issues? > > Not necessarily, but it will detect the incompatibility more or less > automatically, making it far more likely that the problem will be > caught before it does any harm. ICU versions collations, giving > Postgres a way to reason about their compatibility over time. The libc > collations are not versioned, though (at least not in any standard way > that Postgres can take advantage of). As discussed over on -hackers[1], I think it's worth pursuing that though. FWIW I've proposed locale versioning for FreeBSD's libc[2]. The reason I haven't gone further with that yet even though the code change has been accepted in principle by FreeBSD reviewers is because I got stuck on the question of how exactly to model the versions. If, say, just Turkish changes, I don't want to be rebuilding my French indexes, which means that I don't think you can use the CLDR version string. Frustratingly, you could probably do a good job of that by just checksumming the collation definition files, but that seems a bit too crude. There is also the question of how PostgreSQL should model versions, and as I've argued in [1], I think we should track them at the level of database object dependencies. I'm hoping to reopen this can of worms for PostgreSQL 13 (and the corresponding support could in theory be in FreeBSD 13... coincidence, or a sign!?) 
> > The problem with it is that ICU collations are absent in pg_collation, > > initdb should be run to create them, but pg_basebackup only runs on an > > empty base directory, so I couldn't run initdb + pg_basebackup to > > prepare the replica server. I believe I can run the create collation > > command manually, but what would it look like for en-x-icu? > > It is safe to call pg_import_system_collations() directly, which is > all that initdb does. This is documented, so you wouldn't be relying > on a hack. Unfortunately you can't use ICU collations as a database default yet (though there was some WIP code[3]), so ICU only saves you from versioning problems if you explicitly set collations for columns or expressions, and even then the version tracking is currently just a warning that you clear manually with a command, not a mechanism that really tracks which database objects were last rebuilt/validated with a given version. [1] https://www.postgresql.org/message-id/flat/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com [2] https://reviews.freebsd.org/D17166 [3] https://www.postgresql.org/message-id/flat/3366.1498183854%40sss.pgh.pa.us -- Thomas Munro https://enterprisedb.com
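The manually cleared warning Thomas mentions is driven by the version string recorded in pg_collation; a hedged sketch of how one might inspect and refresh it (pg_collation_actual_version() and ALTER COLLATION ... REFRESH VERSION both exist as of PostgreSQL 10):

```sql
-- Find ICU collations whose recorded version no longer matches what the
-- linked ICU library reports now, i.e. candidates for reindexing.
SELECT collname,
       collversion                      AS recorded_version,
       pg_collation_actual_version(oid) AS current_version
FROM pg_collation
WHERE collprovider = 'i'
  AND collversion IS DISTINCT FROM pg_collation_actual_version(oid);

-- After reindexing the affected objects, clear the mismatch warning by hand:
ALTER COLLATION "en-x-icu" REFRESH VERSION;
```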
On 05/02/2019 12:26 AM, Peter Geoghegan wrote:
> On Mon, Apr 29, 2019 at 7:45 AM rihad <rihad@mail.ru> wrote:
>> Hi. Today we run pg_ctl promote on a slave server (10.7) and started
>> using it as a master. The OS also got upgraded FreeBSD 10.4 -> FreeBSD
>> 11.2. And you guessed it, most varchar indexes got corrupted because
>> system local changed in subtle ways. So I created the extension amcheck
>> and reindexed all bad indexes one by one. Is there any way to prevent
>> such things in the future? Will switching to ICU fix all such issues?
> Not necessarily, but it will detect the incompatibility more or less
> automatically, making it far more likely that the problem will be
> caught before it does any harm. ICU versions collations, giving
> Postgres a way to reason about their compatibility over time. The libc
> collations are not versioned, though (at least not in any standard way
> that Postgres can take advantage of).
>
>> The problem with it is that ICU collations are absent in pg_collation,
>> initdb should be run to create them, but pg_basebackup only runs on an
>> empty base directory, so I couldn't run initdb + pg_basebackup to
>> prepare the replica server. I believe I can run the create collation
>> command manually, but what would it look like for en-x-icu?
> It is safe to call pg_import_system_collations() directly, which is
> all that initdb does. This is documented, so you wouldn't be relying
> on a hack.

Thanks for the reply. Do you know what a "decent" ICU collation would be to bind to a field's schema definition so that it would mimic a UTF-8 encoding for a multilingual column? Maybe und-x-icu? We aren't as concerned about sortability in most cases; we just want indexes to better handle future PG/ICU upgrades. But what does und(efined) even mean with respect to collations? With UTF-8 at least some default collation is specified, like en_US.UTF-8. Will results come back in a completely undefined order from ORDER BY "icu_und_column"?
rihad wrote:
> Thanks for the reply. Do you know what would a "decent" ICU collation be
> to bind to a field's schema definition so it would mimic a UTF-8
> encoding for a multilingual column? Maybe und-x-icu? We aren't as much
> concerned about their sortability in most cases, we just want indexes to
> better handle future PG/ICU upgrades. But what does und(efined) even
> mean with respect to collations?

"Undefined" in this context means unspecified language and unspecified country or region. It implies that no language-specific or regional rules will be applied to compare strings.

Using C.UTF-8 as the collation for text fields to index may be the best trade-off in your case. It should be immune to libc and ICU upgrades.

With C.UTF-8, a string like 'BC' will sort before 'ab', and punctuation and accents will also sort differently than with a language-aware collation. If your applications care about that, it can be fixed by simply adding COLLATE "default" to the ORDER BY clause of the queries that are meant to present data to users. COLLATE "default" means the collation of the database, which presumably would be something like "language_REGION.UTF-8" in your case. If you never specified it explicitly, it came from initdb, which itself got it from the environment of the server.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
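Put together, Daniel's advice might look like this; a sketch with made-up table and column names, and assuming a "C.UTF-8" collation exists in pg_collation on your platform:

```sql
-- Hypothetical multilingual column tied to a locale-immune collation,
-- so the btree index survives libc/ICU upgrades.
CREATE TABLE articles (
    id    bigserial PRIMARY KEY,
    title text COLLATE "C.UTF-8"
);
CREATE INDEX articles_title_idx ON articles (title);

-- Present results to users in the database's default linguistic order.
SELECT title FROM articles ORDER BY title COLLATE "default";
```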
On 05/02/2019 05:36 PM, Daniel Verite wrote: > rihad wrote: > >> Thanks for the reply. Do you know what would a "decent" ICU collation be >> to bind to a field's schema definition so it would mimic a UTF-8 >> encoding for a multilingual column? Maybe und-x-icu? We aren't as much >> concerned about their sortability in most cases, we just want indexes to >> better handle future PG/ICU upgrades. But what does und(efined) even >> mean with respect to collations? > "undefined" in this context means unspecified language and > unspecified country or region. It implies that no language-specific > nor regional rule will be applied to compare strings. > > Using C.UTF-8 as the collation for text fields to index may be the > best trade-off in your case. It should be immune to libc and ICU > upgrades. > > With C.UTF-8, a string like 'BC' will sort before 'ab', and punctuation > and accents will also sort differently than with a linguistic-aware > collation. Thanks, I'm a bit confused here. AFAIK indexes are used for at least two things: for speed and for skipping the ORDER BY step (since btree indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index still work correctly for table lookups? And can the existing en_US.UTF-8 fields' definition be altered in place, without a dump+restore? en_US.UTF-8 is the default encoding+locale+collation, it isn't set explicitly for any of our string columns. I assume there's some "catch-all" ordering taking place even for the C locale, so there won't be any bizarre things like b coming before a, or generally for any language, the second letter of its alphabet coming before the first? > If your applications care about that, it can be fixed by simply > adding COLLATE "default" to the ORDER BY clause of the queries that > are meant to present data to users. > COLLATE "default" means the collation of the database, which > presumably would be something like "language_REGION.UTF-8" in your > case. 
> If you never specified it explicitly, it came from initdb which
> itself got it from the environment of the server.
>
> Best regards,
rihad wrote:
> Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
> things: for speed and for skipping the ORDER BY step (since btree
> indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
> still work correctly for table lookups?

If the lookup is based on an equality test or a pattern match with LIKE or a regexp, it makes no difference. But the locale makes a difference with inequality tests, such as < > or BETWEEN.

Around version 9.1, in the pre-ICU days, Robert Haas wrote this post that explained it pretty well, I think:
http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html

Quote:

  If you happen to need the particular sorting behavior that
  collation-aware sorting and comparisons provide, then you may find
  this price worth paying, but I suspect there are a lot of people out
  there who are paying it more or less accidentally and don't really
  care very much about the underlying sorting behavior. If, for
  example, all of your queries are based on equality, and you don't
  use greater-than or less-than tests, then it doesn't matter what
  collation is in use. You might as well use "C" instead of whatever
  your local default may be, because it's faster.

For non-English text, I would recommend C.UTF-8 over "C" because of its better handling of Unicode characters. For instance:

=# select upper('été' collate "C"), upper('été' collate "C.UTF-8");
 upper | upper
-------+-------
 éTé   | ÉTÉ

The "price" of linguistic comparisons that Robert mentioned was about performance, but the trouble we have with the lack of long-term immutability of these locales deserves to be added to it.

> And can the existing en_US.UTF-8 fields' definition be altered in
> place, without a dump+restore?

Changing the COLLATE clause of a column with ALTER TABLE does implicitly rebuild an index on this column if there is one. A dump+restore is not needed, nor is an explicit REINDEX.
The dump+restore is needed in another scenario, where you would decide to change the LC_COLLATE and LC_CTYPE of the database, instead of doing it only for some columns. This scenario makes perfect sense if the locale of the database has been set implicitly and it uses linguistic sorts without really needing them ("accidentally" as said in the post). > en_US.UTF-8 is the default encoding+locale+collation, it > isn't set explicitly for any of our string columns. I assume there's > some "catch-all" ordering taking place even for the C locale, so there > won't be any bizarre things like b coming before a, or generally for any > language, the second letter of its alphabet coming before the first? 'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is true for some locales such as C or C.UTF-8. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
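The in-place change Daniel describes could be sketched as follows (hypothetical table and column names; the command implicitly rebuilds any index on the column, so no dump+restore or explicit REINDEX is needed):

```sql
-- Re-collate one column in place; a btree index on "title" is rebuilt
-- automatically as part of the ALTER.
ALTER TABLE articles
    ALTER COLUMN title TYPE text COLLATE "C.UTF-8";
```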
On 05/03/2019 05:35 PM, Daniel Verite wrote: > rihad wrote: > >> Thanks, I'm a bit confused here. AFAIK indexes are used for at least two >> things: for speed and for skipping the ORDER BY step (since btree >> indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index >> still work correctly for table lookups? > If the lookup is based on a equality test or a pattern match with LIKE > or a regexp, it makes no difference. But the locale makes a > difference with inequality tests, such as < > or BETWEEN. > > Around version 9.1 and in the pre-ICU days, Robert Haas wrote > this post that explained it pretty well, I think: > http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html > > Quote: > > If you happen to need the particular sorting behavior that > collation-aware sorting and comparisons provide, then you may find > this price worth paying, but I suspect there are a lot of people out > there who are paying it more or less accidentally and don't really > care very much about the underlying sorting behavior. If, for > example, all of your queries are based on equality, and you don't > use greater-than or less-than tests, then it doesn't matter what > collation is in use. You might as well use "C" instead of whatever > your local default may be, because it's faster. > > For non-English text, I would recommend C.UTF-8 over "C" because of > its better handling of Unicode characters. For instance: > > =# select upper('été' collate "C"), upper('été' collate "C.UTF-8"); > upper | upper > -------+------- > éTé | ÉTÉ > > The "price" of linguistic comparisons that Robert mentioned was about > performance, but the troubles we have with the lack of long-term > immutability of these locales are worth being added to that. > >> And can the existing en_US.UTF-8 fields' definition be altered in >> place, without a dump+restore? 
> Changing the COLLATE clause of a column with ALTER TABLE does > implicitly rebuild an index on this column if there is one, > A dump+restore is not needed, nor an explicit REINDEX. > > The dump+restore is needed in another scenario, where you would > decide to change the LC_COLLATE and LC_CTYPE of the database, > instead of doing it only for some columns. > This scenario makes perfect sense if the locale of the database > has been set implicitly and it uses linguistic sorts without > really needing them ("accidentally" as said in the post). > > >> en_US.UTF-8 is the default encoding+locale+collation, it >> isn't set explicitly for any of our string columns. I assume there's >> some "catch-all" ordering taking place even for the C locale, so there >> won't be any bizarre things like b coming before a, or generally for any >> language, the second letter of its alphabet coming before the first? > 'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is > true for some locales such as C or C.UTF-8. > > > Best regards, Thanks a lot for sharing your insights!
On 05/03/2019 05:35 PM, Daniel Verite wrote: > For non-English text, I would recommend C.UTF-8 over "C" because of BTW, there's no C.UTF-8 inside pg_collation, and running select pg_import_system_collations('pg_catalog') doesn't bring it in, at least not on FreeBSD 11.2.
rihad wrote: > On 05/03/2019 05:35 PM, Daniel Verite wrote: > > For non-English text, I would recommend C.UTF-8 over "C" because of > > BTW, there's no C.UTF-8 inside pg_collation, and running select > pg_import_system_collations('pg_catalog') doesn't bring it in, at least > not on FreeBSD 11.2. Yes, aside from "C", locales are quite system-dependent, unfortunately. It looks like FreeBSD 13 does provide C.UTF-8: https://unix.stackexchange.com/questions/485073/how-to-backport-freebsd-13-current-c-utf-8-locale-to-11-2-release/485077#485077 Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
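A quick way to see which libc-provided collations the import actually found on a given platform (a sketch; available locale names vary by OS):

```sql
-- libc collations carry collprovider = 'c'; look for any UTF-8 variants.
SELECT collname, collcollate, collctype
FROM pg_collation
WHERE collprovider = 'c' AND collcollate ILIKE '%utf%'
ORDER BY collname;
```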
On Wed, May 1, 2019 at 3:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > As discussed over on -hackers[1], I think it's worth pursuing that > though. FWIW I've proposed locale versioning for FreeBSD's libc[2]. > The reason I haven't gone further with that yet even though the code > change has been accepted in principle by FreeBSD reviewers is because > I got stuck on the question of how exactly to model the versions. If, > say, just Turkish changes, I don't want to be rebuilding my French > indexes, which means that I don't think you can use the CLDR version > string. The ICU versions can handle that, though. Importantly, ICU decouples implementation details from actual versioning. I must say that I am not enthused about the idea of trying to get libc people of any variety on board. I don't have an objection to it if it can work for FreeBSD, but I don't think it can scale. ICU is built around a culture that takes our concerns seriously already, which is what it boils down to. Also, we can imagine a package manager taking it upon themselves to vendor their own ICU, with platform-specific guarantees around stability. That seems like a nice option to have, at least. > There is also the question of how PostgreSQL should model versions, > and as I've argued in [1], I think we should track them at the level > of database object dependencies. I think you're probably right about that. > I'm hoping to reopen this can of worms for PostgreSQL 13 (and the > corresponding support could in theory be in FreeBSD 13... coincidence, > or a sign!?) Maybe we should do what Oracle did, and call it PostgreSQL 18c instead. Actually, any number that isn't of interest to numerologists will do. 
> Unfortunately you can't use ICU collations as a database default yet > (though there was some WIP code[3]), so ICU only saves you from > versioning problems if you explicitly set collations for columns or > expressions, and even then the version tracking is currently just a > warning that you clear manually with a command, not a mechanism that > really tracks which database objects were last rebuilt/validated with > a given version. Yes, that does seem like a big remaining weakness. -- Peter Geoghegan
On 2019-05-03 15:35, Daniel Verite wrote: > 'b' < 'a' is never true for any locale AFAIK, But there is 'bb' < 'aa' in Danish. :-) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
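Peter's Danish counter-example can be reproduced directly; a sketch assuming the ICU collation "da-x-icu" has been imported (Danish traditionally collates 'aa' as 'å', the final letter of its alphabet):

```sql
-- True under Danish rules ('aa' sorts like 'å', after every Latin letter),
-- false under the byte-order "C" collation.
SELECT 'bb' < 'aa' COLLATE "da-x-icu" AS danish_order,
       'bb' < 'aa' COLLATE "C"        AS byte_order;
```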