Thread: Issues with german locale on CentOS 5,6,7

Issues with german locale on CentOS 5,6,7

From
Bernd Helmle
Date:
The last day we've encountered an issue what i think is somewhat severe if
you want to do either OS upgrades with CentOS or even binary upgrades with
an existing PostgreSQL instance to a new machine with locale de_DE.UTF-8
and thus i'd like to share here.

Here are the details:

Originally a Postgres 9.4 was running on CentOS 5.11/x86_64. The database
in question was initialized with locale de_DE.UTF-8 and previously upgraded
via pg_upgrade from 9.2 and then running without any issues for a while.

After that the customer migrated to new hardware with an OS upgrade to
CentOS 6.6/x86_64. This was done by just remounting the SAN LUN on the new
machine. So far so good, no issues.

However, after a while developers realized duplicate values in unique keys
with certain types of string values (the format is described in the
examples below). So the suspicion was that this has to do with locales. And
yes, the german locale collation order changed:

CentOS 5.11 has:

echo -e '156\n1-5-6\n110\n1-1-0' | LANG=de_DE.UTF-8 sort
110
1-1-0
156
1-5-6

CentOS 6.6 does:

echo -e '159\n1-5-9\n110\n1-1-0' | LANG=de_DE.UTF-8 sort
1-1-0
110
1-5-9
159

Interestingly CentOS 7.1 restores the behavior from CentOS 5.11

echo -e '159\n1-5-9\n110\n1-1-0' | LANG=de_DE.UTF-8 sort
110
1-1-0
159
1-5-9

There are entries in the CentOS bugtracker regarding other locales:

https://bugs.centos.org/view.php?id=7009
https://bugs.centos.org/view.php?id=6210

So users are encouraged to carefully test their platforms when upgrading.
Checks show that at least RHEL6 and RHEL7 have the same issue, too.

--
Thanks

    Bernd



Re: Issues with german locale on CentOS 5,6,7

From
Peter Geoghegan
Date:
On Wed, Oct 7, 2015 at 2:20 AM, Bernd Helmle <mailings@oopsware.de> wrote:
> The last day we've encountered an issue what i think is somewhat severe if
> you want to do either OS upgrades with CentOS or even binary upgrades with
> an existing PostgreSQL instance to a new machine with locale de_DE.UTF-8
> and thus i'd like to share here.

At the risk of getting flamed: I think that this is a bug in
PostgreSQL, not CentOS. I've said why I think that is at least once
already [1]. Simply put, there is no justification for the belief that
some people have that collations should be immutable, and there is
some justification for OS vendors updating them in a way that
disregards our use of the collations. Glibc actually versions
collations, although that isn't in any standard, and it isn't clear
that we can benefit from that beyond refusing to start the server when
a collation was superseded by a new version.

[1] http://www.postgresql.org/message-id/CAEYLb_UTMgM2V_pP7qnuKZYmTYXoym-zNYVbwoU79=TuP8HE3A@mail.gmail.com
--
Regards,
Peter Geoghegan


Re: Issues with german locale on CentOS 5,6,7

From
Bernd Helmle
Date:
--On 7. Oktober 2015 02:33:59 -0700 Peter Geoghegan
<peter.geoghegan86@gmail.com> wrote:

>
> At the risk of getting flamed: I think that this is a bug in
> PostgreSQL, not CentOS. I've said why I think that is at least once
> already [1]. Simply put, there is no justification for the belief that
> some people have that collations should be immutable, and there is
> some justification for OS vendors updating them in a way that
> disregards our use of the collations. Glibc actually versions
> collations, although that isn't in any standard, and it isn't clear
> that we can benefit from that beyond refusing to start the server when
> a collation was superseded by a new version.
>

We had exactly the same discussion here. I tried it to express carefully
since i just wanted to hint any german users, but given that many database
vendors out there have their own collation definitions shipped shows
clearly that this is something we cannot ignore anymore.

> [1]
> http://www.postgresql.org/message-id/CAEYLb_UTMgM2V_pP7qnuKZYmTYXoym-zNYV
> bwoU79=TuP8HE3A@mail.gmail.com

Yeah, i know this discussion and SR clearly widens the window for such
things to happen.

--
Thanks

    Bernd


Re: Issues with german locale on CentOS 5,6,7

From
Thomas Kellerer
Date:
Peter Geoghegan schrieb am 07.10.2015 um 11:33:
> On Wed, Oct 7, 2015 at 2:20 AM, Bernd Helmle <mailings@oopsware.de> wrote:
>> The last day we've encountered an issue what i think is somewhat severe if
>> you want to do either OS upgrades with CentOS or even binary upgrades with
>> an existing PostgreSQL instance to a new machine with locale de_DE.UTF-8
>> and thus i'd like to share here.
>
> At the risk of getting flamed: I think that this is a bug in
> PostgreSQL, not CentOS. I've said why I think that is at least once
> already [1]. Simply put, there is no justification for the belief that
> some people have that collations should be immutable, and there is
> some justification for OS vendors updating them in a way that
> disregards our use of the collations. Glibc actually versions
> collations, although that isn't in any standard, and it isn't clear
> that we can benefit from that beyond refusing to start the server when
> a collation was superseded by a new version.

I for one am not going to flame you ;)

I always thought that this is a major shortcoming (if not a bug) in Postgres that the collation support is left to the
OS. 

Because it essentially means that that exactly the same query with exactly the same data might return a different
resultif run on different OS 

Thomas



Re: Issues with german locale on CentOS 5,6,7

From
Tom Lane
Date:
Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
> At the risk of getting flamed: I think that this is a bug in
> PostgreSQL, not CentOS. I've said why I think that is at least once
> already [1]. Simply put, there is no justification for the belief that
> some people have that collations should be immutable, and there is
> some justification for OS vendors updating them in a way that
> disregards our use of the collations. Glibc actually versions
> collations, although that isn't in any standard, and it isn't clear
> that we can benefit from that beyond refusing to start the server when
> a collation was superseded by a new version.

I can think of less helpful responses to such a situation, but not easily.

The only real way out of such a situation is to REINDEX affected indexes.
Refusing to start the server not only doesn't contribute to a solution,
but makes it impossible to fix manually.

            regards, tom lane


Re: Issues with german locale on CentOS 5,6,7

From
Peter Geoghegan
Date:
On Wed, Oct 7, 2015 at 6:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The only real way out of such a situation is to REINDEX affected indexes.
> Refusing to start the server not only doesn't contribute to a solution,
> but makes it impossible to fix manually.

I agree that that would be almost as bad as carrying on, because there
is no reason to think that the locale thing can easily be rolled back.
That was my point, in fact.

--
Regards,
Peter Geoghegan


Re: Issues with german locale on CentOS 5,6,7

From
Thomas Munro
Date:
On Thu, Oct 8, 2015 at 5:52 AM, Peter Geoghegan
<peter.geoghegan86@gmail.com> wrote:
> On Wed, Oct 7, 2015 at 6:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The only real way out of such a situation is to REINDEX affected indexes.
>> Refusing to start the server not only doesn't contribute to a solution,
>> but makes it impossible to fix manually.
>
> I agree that that would be almost as bad as carrying on, because there
> is no reason to think that the locale thing can easily be rolled back.
> That was my point, in fact.

I have contemplated a maintenance script that would track either the
md5 checksums of the /usr/lib/locale/*/LC_COLLATE files or the version
of installed locale packages and automatically reindex things when
they change (I guess after restarting the cluster to clear any glibc
caches that might be lurking in long running backends).  Or at least
tell me that's needed.  Obviously completely OS-specific...

--
Thomas Munro
http://www.enterprisedb.com


Re: Issues with german locale on CentOS 5,6,7

From
Peter Geoghegan
Date:
On Wed, Oct 7, 2015 at 4:49 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>> I agree that that would be almost as bad as carrying on, because there
>> is no reason to think that the locale thing can easily be rolled back.
>> That was my point, in fact.
>
> I have contemplated a maintenance script that would track either the
> md5 checksums of the /usr/lib/locale/*/LC_COLLATE files or the version
> of installed locale packages and automatically reindex things when
> they change (I guess after restarting the cluster to clear any glibc
> caches that might be lurking in long running backends).  Or at least
> tell me that's needed.  Obviously completely OS-specific...

I think we should bite the bullet and adopt ICU, without abandoning
support for OS locales for users that really need it (certainly, many
will need it initially when using pg_upgrade to get on to the first
version that happens to have ICU support). I don't like suggesting a
solution that I myself am unlikely to find the time to work on, but in
the long run that's the only sensible approach IMV.

--
Regards,
Peter Geoghegan


Re: Issues with german locale on CentOS 5,6,7

From
Thomas Munro
Date:
On Thu, Oct 8, 2015 at 1:16 PM, Peter Geoghegan
<peter.geoghegan86@gmail.com> wrote:
> On Wed, Oct 7, 2015 at 4:49 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>>> I agree that that would be almost as bad as carrying on, because there
>>> is no reason to think that the locale thing can easily be rolled back.
>>> That was my point, in fact.
>>
>> I have contemplated a maintenance script that would track either the
>> md5 checksums of the /usr/lib/locale/*/LC_COLLATE files or the version
>> of installed locale packages and automatically reindex things when
>> they change (I guess after restarting the cluster to clear any glibc
>> caches that might be lurking in long running backends).  Or at least
>> tell me that's needed.  Obviously completely OS-specific...
>
> I think we should bite the bullet and adopt ICU, without abandoning
> support for OS locales for users that really need it (certainly, many
> will need it initially when using pg_upgrade to get on to the first
> version that happens to have ICU support). I don't like suggesting a
> solution that I myself am unlikely to find the time to work on, but in
> the long run that's the only sensible approach IMV.

How would you handle changes in ICU's collation definitions?

--
Thomas Munro
http://www.enterprisedb.com


Re: Issues with german locale on CentOS 5,6,7

From
Tom Lane
Date:
Peter Geoghegan <peter.geoghegan86@gmail.com> writes:
> I think we should bite the bullet and adopt ICU,

I see absolutely nothing to recommend that course of action.  Reasons not
to:

1. Being compatible with the operating system's collation behavior is a
feature, not a bug.  If nothing else, it allows us to tell people that
if we sort data the same way that sort(1) does, then it's not a bug that
we're not sorting the way they think we should.  But quite aside from
that, there are practical uses to being compatible with other tools.

2. Last I checked, ICU *only* supports Unicode, and not only that, but
only UTF16.  This is a non-starter; not only for our Far Eastern users,
but also those who find various LatinX encodings sufficient.  ICU would be
a functional fail for the former and a performance fail for the latter.

3. As Thomas Munro already noted, whatcha gonna do when ICU changes their
collations?  Or are their collations graven on stone tablets, unlike
anyone else's?


We certainly could stand to put some work into the problem of coping with
collation changes.  But claiming that ICU is the solution, or even a
solution, seems obviously wrong.

            regards, tom lane


Re: Issues with german locale on CentOS 5,6,7

From
Peter Geoghegan
Date:
On Wed, Oct 7, 2015 at 8:06 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>> I think we should bite the bullet and adopt ICU, without abandoning
>> support for OS locales for users that really need it (certainly, many
>> will need it initially when using pg_upgrade to get on to the first
>> version that happens to have ICU support). I don't like suggesting a
>> solution that I myself am unlikely to find the time to work on, but in
>> the long run that's the only sensible approach IMV.
>
> How would you handle changes in ICU's collation definitions?

ICU provides an API for collation versioning because of these kinds of
issues with indexes:
http://userguide.icu-project.org/collation/architecture#TOC-Versioning

There are specifications of collations used by ICU that originate from
the Unicode CLDR Project: http://cldr.unicode.org/

Basically, you prevent this kind of thing from ever happening in the
first place by making versioning explicit, and putting it under the
direct control of Postgres. I think a bunch of well regarded database
systems have used ICU for many years, including DB2, for example. I
think it's possible to arrange it so that the collations simply never
go away, but if that does happen (or if you decide that the changes to
a collation matter for cultural or correctness reasons) then you can
at least detect the change and recover from it reliably.

ICU has some other really nice features, too, but that's another discussion.

--
Regards,
Peter Geoghegan


Re: Issues with german locale on CentOS 5,6,7

From
Peter Geoghegan
Date:
On Wed, Oct 7, 2015 at 8:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 1. Being compatible with the operating system's collation behavior is a
> feature, not a bug.  If nothing else, it allows us to tell people that
> if we sort data the same way that sort(1) does, then it's not a bug that
> we're not sorting the way they think we should.  But quite aside from
> that, there are practical uses to being compatible with other tools.

I am not proposing to make that impossible.

> 2. Last I checked, ICU *only* supports Unicode, and not only that, but
> only UTF16.  This is a non-starter; not only for our Far Eastern users,
> but also those who find various LatinX encodings sufficient.  ICU would be
> a functional fail for the former and a performance fail for the latter.

UTF-16 is more efficient for representing East Asian languages, so not
sure what you mean about that. I realize that using UTF-16 is a
non-starter, though.

I guess you were talking about people who don't use Unicode due to the
Han Unification controversy. Again, I'm not proposing to only support
Unicode, but realistically the vast majority of users want Unicode,
even in East Asia.

Yes, ICU only supports Unicode, but it has supported UTF-8 for years
now, and not as a second class citizen. See
http://userguide.icu-project.org/strings/utf-8 . As it says there:

"""
If it is known that the default charset is always UTF-8 on the target
platform, then you should #define U_CHARSET_IS_UTF8 1 in or before
unicode/utypes.h. (For example, modify the default value there or pass
-DU_CHARSET_IS_UTF8=1 as a compiler flag.) This will change most of
the implementation code to use dedicated (simpler, faster) UTF-8 code
paths and avoid dependencies on the conversion framework. (Avoiding
such dependencies helps with statically linked libraries and may allow
the use of UCONFIG_NO_LEGACY_CONVERSION or even UCONFIG_NO_CONVERSION
[see unicode/uconfig.h].)

"""

> 3. As Thomas Munro already noted, whatcha gonna do when ICU changes their
> collations?  Or are their collations graven on stone tablets, unlike
> anyone else's?

See my response to Thomas.

--
Regards,
Peter Geoghegan


Re: Issues with german locale on CentOS 5,6,7

From
"Daniel Verite"
Date:
    Thomas Kellerer wrote:

> I always thought that this is a major shortcoming (if not a bug) in Postgres
> that the collation support is left to the OS.
>
> Because it essentially means that that exactly the same query with exactly
> the same data might return a different result if run on different OS

Could an extension help with this?

CREATE EXTENSION pg_locales;

The extension would provide its own strcoll/strxfrm/etc. and a set
of locale definitions, presumably slurped from Linux or BSD.

It would probably be preloaded in postgresql.conf:
shared_preload_libraries = pg_locales

A collation would be created with
SELECT create_pg_collation(name,locale,lc_collate,lc_ctype);
which would be a kind of CREATE COLLATION but relying on the
extension code/data rather than the libc.

Then tables and columns could be set to use that collation, and the
core postgres code should divert the corresponding string comparisons
to the extension functions.

Benefits:
- being insulated from unexpected OS changes.
- portability across OS, assuming that the extension would be itself
portable.
- the core postgres would not be responsible of maintaining these locales.

Would something like that make sense?

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


Re: Issues with german locale on CentOS 5,6,7

From
Thomas Munro
Date:


On Thu, Oct 8, 2015 at 12:49 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Thu, Oct 8, 2015 at 5:52 AM, Peter Geoghegan
<peter.geoghegan86@gmail.com> wrote:
> On Wed, Oct 7, 2015 at 6:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The only real way out of such a situation is to REINDEX affected indexes.
>> Refusing to start the server not only doesn't contribute to a solution,
>> but makes it impossible to fix manually.
>
> I agree that that would be almost as bad as carrying on, because there
> is no reason to think that the locale thing can easily be rolled back.
> That was my point, in fact.

I have contemplated a maintenance script that would track either the
md5 checksums of the /usr/lib/locale/*/LC_COLLATE files or the version
of installed locale packages and automatically reindex things when
they change (I guess after restarting the cluster to clear any glibc
caches that might be lurking in long running backends).  Or at least
tell me that's needed.  Obviously completely OS-specific...

I got around to trying this on a recent rainy day:

https://github.com/macdice/check_pg_collations/blob/master/check_pg_collations

It needs Python and psycopg2.  It notices when LC_COLLATE files' checksums change, and generates REINDEX commands on stdout.  It seems to work OK on Debian, but I haven't tested much.  Pull requests welcome :-)  But I guess it would be much cooler as a pure plpgsql extension.
 
--