Thread: Issues with german locale on CentOS 5,6,7
The last day we've encountered an issue what i think is somewhat severe if you want to do either OS upgrades with CentOS or even binary upgrades with an existing PostgreSQL instance to a new machine with locale de_DE.UTF-8 and thus i'd like to share here. Here are the details: Originally a Postgres 9.4 was running on CentOS 5.11/x86_64. The database in question was initialized with locale de_DE.UTF-8 and previously upgraded via pg_upgrade from 9.2 and then running without any issues for a while. After that the customer migrated to new hardware with an OS upgrade to CentOS 6.6/x86_64. This was done by just remounting the SAN LUN on the new machine. So far so good, no issues. However, after a while developers realized duplicate values in unique keys with certain types of string values (the format is described in the examples below). So the suspicion was that this has to do with locales. And yes, the german locale collation order changed: CentOS 5.11 has: echo -e '156\n1-5-6\n110\n1-1-0' | LANG=de_DE.UTF-8 sort 110 1-1-0 156 1-5-6 CentOS 6.6 does: echo -e '159\n1-5-9\n110\n1-1-0' | LANG=de_DE.UTF-8 sort 1-1-0 110 1-5-9 159 Interestingly CentOS 7.1 restores the behavior from CentOS 5.11 echo -e '159\n1-5-9\n110\n1-1-0' | LANG=de_DE.UTF-8 sort 110 1-1-0 159 1-5-9 There are entries in the CentOS bugtracker regarding other locales: https://bugs.centos.org/view.php?id=7009 https://bugs.centos.org/view.php?id=6210 So users are encouraged to carefully test their platforms when upgrading. Checks show that at least RHEL6 and RHEL7 have the same issue, too. -- Thanks Bernd
On Wed, Oct 7, 2015 at 2:20 AM, Bernd Helmle <mailings@oopsware.de> wrote: > The last day we've encountered an issue what i think is somewhat severe if > you want to do either OS upgrades with CentOS or even binary upgrades with > an existing PostgreSQL instance to a new machine with locale de_DE.UTF-8 > and thus i'd like to share here. At the risk of getting flamed: I think that this is a bug in PostgreSQL, not CentOS. I've said why I think that is at least once already [1]. Simply put, there is no justification for the belief that some people have that collations should be immutable, and there is some justification for OS vendors updating them in a way that disregards our use of the collations. Glibc actually versions collations, although that isn't in any standard, and it isn't clear that we can benefit from that beyond refusing to start the server when a collation was superseded by a new version. [1] http://www.postgresql.org/message-id/CAEYLb_UTMgM2V_pP7qnuKZYmTYXoym-zNYVbwoU79=TuP8HE3A@mail.gmail.com -- Regards, Peter Geoghegan
--On 7. Oktober 2015 02:33:59 -0700 Peter Geoghegan <peter.geoghegan86@gmail.com> wrote: > > At the risk of getting flamed: I think that this is a bug in > PostgreSQL, not CentOS. I've said why I think that is at least once > already [1]. Simply put, there is no justification for the belief that > some people have that collations should be immutable, and there is > some justification for OS vendors updating them in a way that > disregards our use of the collations. Glibc actually versions > collations, although that isn't in any standard, and it isn't clear > that we can benefit from that beyond refusing to start the server when > a collation was superseded by a new version. > We had exactly the same discussion here. I tried it to express carefully since i just wanted to hint any german users, but given that many database vendors out there have their own collation definitions shipped shows clearly that this is something we cannot ignore anymore. > [1] > http://www.postgresql.org/message-id/CAEYLb_UTMgM2V_pP7qnuKZYmTYXoym-zNYV > bwoU79=TuP8HE3A@mail.gmail.com Yeah, i know this discussion and SR clearly widens the window for such things to happen. -- Thanks Bernd
Peter Geoghegan schrieb am 07.10.2015 um 11:33: > On Wed, Oct 7, 2015 at 2:20 AM, Bernd Helmle <mailings@oopsware.de> wrote: >> The last day we've encountered an issue what i think is somewhat severe if >> you want to do either OS upgrades with CentOS or even binary upgrades with >> an existing PostgreSQL instance to a new machine with locale de_DE.UTF-8 >> and thus i'd like to share here. > > At the risk of getting flamed: I think that this is a bug in > PostgreSQL, not CentOS. I've said why I think that is at least once > already [1]. Simply put, there is no justification for the belief that > some people have that collations should be immutable, and there is > some justification for OS vendors updating them in a way that > disregards our use of the collations. Glibc actually versions > collations, although that isn't in any standard, and it isn't clear > that we can benefit from that beyond refusing to start the server when > a collation was superseded by a new version. I for one am not going to flame you ;) I always thought that this is a major shortcoming (if not a bug) in Postgres that the collation support is left to the OS. Because it essentially means that that exactly the same query with exactly the same data might return a different resultif run on different OS Thomas
Peter Geoghegan <peter.geoghegan86@gmail.com> writes: > At the risk of getting flamed: I think that this is a bug in > PostgreSQL, not CentOS. I've said why I think that is at least once > already [1]. Simply put, there is no justification for the belief that > some people have that collations should be immutable, and there is > some justification for OS vendors updating them in a way that > disregards our use of the collations. Glibc actually versions > collations, although that isn't in any standard, and it isn't clear > that we can benefit from that beyond refusing to start the server when > a collation was superseded by a new version. I can think of less helpful responses to such a situation, but not easily. The only real way out of such a situation is to REINDEX affected indexes. Refusing to start the server not only doesn't contribute to a solution, but makes it impossible to fix manually. regards, tom lane
On Wed, Oct 7, 2015 at 6:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The only real way out of such a situation is to REINDEX affected indexes. > Refusing to start the server not only doesn't contribute to a solution, > but makes it impossible to fix manually. I agree that that would be almost as bad as carrying on, because there is no reason to think that the locale thing can easily be rolled back. That was my point, in fact. -- Regards, Peter Geoghegan
On Thu, Oct 8, 2015 at 5:52 AM, Peter Geoghegan <peter.geoghegan86@gmail.com> wrote: > On Wed, Oct 7, 2015 at 6:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The only real way out of such a situation is to REINDEX affected indexes. >> Refusing to start the server not only doesn't contribute to a solution, >> but makes it impossible to fix manually. > > I agree that that would be almost as bad as carrying on, because there > is no reason to think that the locale thing can easily be rolled back. > That was my point, in fact. I have contemplated a maintenance script that would track either the md5 checksums of the /usr/lib/locale/*/LC_COLLATE files or the version of installed locale packages and automatically reindex things when they change (I guess after restarting the cluster to clear any glibc caches that might be lurking in long running backends). Or at least tell me that's needed. Obviously completely OS-specific... -- Thomas Munro http://www.enterprisedb.com
On Wed, Oct 7, 2015 at 4:49 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> I agree that that would be almost as bad as carrying on, because there >> is no reason to think that the locale thing can easily be rolled back. >> That was my point, in fact. > > I have contemplated a maintenance script that would track either the > md5 checksums of the /usr/lib/locale/*/LC_COLLATE files or the version > of installed locale packages and automatically reindex things when > they change (I guess after restarting the cluster to clear any glibc > caches that might be lurking in long running backends). Or at least > tell me that's needed. Obviously completely OS-specific... I think we should bite the bullet and adopt ICU, without abandoning support for OS locales for users that really need it (certainly, many will need it initially when using pg_upgrade to get on to the first version that happens to have ICU support). I don't like suggesting a solution that I myself am unlikely to find the time to work on, but in the long run that's the only sensible approach IMV. -- Regards, Peter Geoghegan
On Thu, Oct 8, 2015 at 1:16 PM, Peter Geoghegan <peter.geoghegan86@gmail.com> wrote: > On Wed, Oct 7, 2015 at 4:49 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >>> I agree that that would be almost as bad as carrying on, because there >>> is no reason to think that the locale thing can easily be rolled back. >>> That was my point, in fact. >> >> I have contemplated a maintenance script that would track either the >> md5 checksums of the /usr/lib/locale/*/LC_COLLATE files or the version >> of installed locale packages and automatically reindex things when >> they change (I guess after restarting the cluster to clear any glibc >> caches that might be lurking in long running backends). Or at least >> tell me that's needed. Obviously completely OS-specific... > > I think we should bite the bullet and adopt ICU, without abandoning > support for OS locales for users that really need it (certainly, many > will need it initially when using pg_upgrade to get on to the first > version that happens to have ICU support). I don't like suggesting a > solution that I myself am unlikely to find the time to work on, but in > the long run that's the only sensible approach IMV. How would you handle changes in ICU's collation definitions? -- Thomas Munro http://www.enterprisedb.com
Peter Geoghegan <peter.geoghegan86@gmail.com> writes: > I think we should bite the bullet and adopt ICU, I see absolutely nothing to recommend that course of action. Reasons not to: 1. Being compatible with the operating system's collation behavior is a feature, not a bug. If nothing else, it allows us to tell people that if we sort data the same way that sort(1) does, then it's not a bug that we're not sorting the way they think we should. But quite aside from that, there are practical uses to being compatible with other tools. 2. Last I checked, ICU *only* supports Unicode, and not only that, but only UTF16. This is a non-starter; not only for our Far Eastern users, but also those who find various LatinX encodings sufficient. ICU would be a functional fail for the former and a performance fail for the latter. 3. As Thomas Munro already noted, whatcha gonna do when ICU changes their collations? Or are their collations graven on stone tablets, unlike anyone else's? We certainly could stand to put some work into the problem of coping with collation changes. But claiming that ICU is the solution, or even a solution, seems obviously wrong. regards, tom lane
On Wed, Oct 7, 2015 at 8:06 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> I think we should bite the bullet and adopt ICU, without abandoning >> support for OS locales for users that really need it (certainly, many >> will need it initially when using pg_upgrade to get on to the first >> version that happens to have ICU support). I don't like suggesting a >> solution that I myself am unlikely to find the time to work on, but in >> the long run that's the only sensible approach IMV. > > How would you handle changes in ICU's collation definitions? ICU provides an API for collation versioning because of these kinds of issues with indexes: http://userguide.icu-project.org/collation/architecture#TOC-Versioning There are specifications of collations used by ICU that originate from the Unicode CLDR Project: http://cldr.unicode.org/ Basically, you prevent this kind of thing from ever happening in the first place by making versioning explicit, and putting it under the direct control of Postgres. I think a bunch of well regarded database systems have used ICU for many years, including DB2, for example. I think it's possible to arrange it so that the collations simply never go away, but if that does happen (or if you decide that the changes to a collation matter for cultural or correctness reasons) then you can at least detect the change and recover from it reliably. ICU has some other really nice features, too, but that's another discussion. -- Regards, Peter Geoghegan
On Wed, Oct 7, 2015 at 8:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > 1. Being compatible with the operating system's collation behavior is a > feature, not a bug. If nothing else, it allows us to tell people that > if we sort data the same way that sort(1) does, then it's not a bug that > we're not sorting the way they think we should. But quite aside from > that, there are practical uses to being compatible with other tools. I am not proposing to make that impossible. > 2. Last I checked, ICU *only* supports Unicode, and not only that, but > only UTF16. This is a non-starter; not only for our Far Eastern users, > but also those who find various LatinX encodings sufficient. ICU would be > a functional fail for the former and a performance fail for the latter. UTF-16 is more efficient for representing East Asian languages, so not sure what you mean about that. I realize that using UTF-16 is a non-starter, though. I guess you were talking about people who don't use Unicode due to the Han Unification controversy. Again, I'm not proposing to only support Unicode, but realistically the vast majority of users want Unicode, even in East Asia. Yes, ICU only supports Unicode, but it has supported UTF-8 for years now, and not as a second class citizen. See http://userguide.icu-project.org/strings/utf-8 . As it says there: """ If it is known that the default charset is always UTF-8 on the target platform, then you should #define U_CHARSET_IS_UTF8 1 in or before unicode/utypes.h. (For example, modify the default value there or pass -DU_CHARSET_IS_UTF8=1 as a compiler flag.) This will change most of the implementation code to use dedicated (simpler, faster) UTF-8 code paths and avoid dependencies on the conversion framework. (Avoiding such dependencies helps with statically linked libraries and may allow the use of UCONFIG_NO_LEGACY_CONVERSION or even UCONFIG_NO_CONVERSION [see unicode/uconfig.h].) """ > 3. As Thomas Munro already noted, whatcha gonna do when ICU changes their > collations? Or are their collations graven on stone tablets, unlike > anyone else's? See my response to Thomas. -- Regards, Peter Geoghegan
Thomas Kellerer wrote: > I always thought that this is a major shortcoming (if not a bug) in Postgres > that the collation support is left to the OS. > > Because it essentially means that that exactly the same query with exactly > the same data might return a different result if run on different OS Could an extension help with this? CREATE EXTENSION pg_locales; The extension would provide its own strcoll/strxfrm/etc. and a set of locale definitions, presumably slurped from Linux or BSD. It would probably be preloaded in postgresql.conf: shared_preload_libraries = pg_locales A collation would be created with SELECT create_pg_collation(name,locale,lc_collate,lc_ctype); which would be a kind of CREATE COLLATION but relying on the extension code/data rather than the libc. Then tables and columns could be set to use that collation, and the core postgres code should divert the corresponding string comparisons to the extension functions. Benefits: - being insulated from unexpected OS changes. - portability across OS, assuming that the extension would be itself portable. - the core postgres would not be responsible of maintaining these locales. Would something like that make sense? Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
On Thu, Oct 8, 2015 at 12:49 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
I got around to trying this on a recent rainy day:
https://github.com/macdice/check_pg_collations/blob/master/check_pg_collations
It needs Python and psycopg2. It notices when LC_COLLATE files' checksums change, and generates REINDEX commands on stdout. It seems to work OK on Debian, but I haven't tested much. Pull requests welcome :-) But I guess it would be much cooler as a pure plpgsql extension.
-- On Thu, Oct 8, 2015 at 5:52 AM, Peter Geoghegan
<peter.geoghegan86@gmail.com> wrote:
> On Wed, Oct 7, 2015 at 6:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The only real way out of such a situation is to REINDEX affected indexes.
>> Refusing to start the server not only doesn't contribute to a solution,
>> but makes it impossible to fix manually.
>
> I agree that that would be almost as bad as carrying on, because there
> is no reason to think that the locale thing can easily be rolled back.
> That was my point, in fact.
I have contemplated a maintenance script that would track either the
md5 checksums of the /usr/lib/locale/*/LC_COLLATE files or the version
of installed locale packages and automatically reindex things when
they change (I guess after restarting the cluster to clear any glibc
caches that might be lurking in long running backends). Or at least
tell me that's needed. Obviously completely OS-specific...
I got around to trying this on a recent rainy day:
https://github.com/macdice/check_pg_collations/blob/master/check_pg_collations
It needs Python and psycopg2. It notices when LC_COLLATE files' checksums change, and generates REINDEX commands on stdout. It seems to work OK on Debian, but I haven't tested much. Pull requests welcome :-) But I guess it would be much cooler as a pure plpgsql extension.
Thomas Munro
http://www.enterprisedb.com
http://www.enterprisedb.com