Re: The dangers of streaming across versions of glibc: A cautionary tale - Mailing list pgsql-general

From Peter Geoghegan
Subject Re: The dangers of streaming across versions of glibc: A cautionary tale
Msg-id CAEYLb_UTMgM2V_pP7qnuKZYmTYXoym-zNYVbwoU79=TuP8HE3A@mail.gmail.com
In response to Re: The dangers of streaming across versions of glibc: A cautionary tale  (Bruce Momjian <bruce@momjian.us>)
Responses Re: The dangers of streaming across versions of glibc: A cautionary tale  (Tatsuo Ishii <ishii@postgresql.org>)
List pgsql-general
On Wed, Aug 6, 2014 at 5:11 PM, Bruce Momjian <bruce@momjian.us> wrote:
> No surprise;  I have been expecting to hear about such breakage, and am
> surprised we hear about it so rarely.  We really have no way of testing
> for breakage either.  :-(

I guess that Trip Advisor were using some particular collation that
had a chance of changing. Sorting rules for English text (so, say,
en_US.UTF-8) are highly unlikely to change. That might be much less
true for other locales.

Unicode Technical Standard #10 states:

"""
Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as
more information becomes available about languages; there may be new
government or industry standards for the language that require
changes; and finally, new characters added to the Unicode Standard
will interleave with the previously-defined ones. This means that
collations must be carefully versioned.
"""

So, the reality is that we only have ourselves to blame.  :-(

LC_IDENTIFICATION serves this versioning purpose on glibc. Here is
what en_US looks like on my machine:

"""
escape_char /
comment_char %
% Locale for English locale in the USA
% Contributed by Ulrich Drepper <drepper@redhat.com>, 2000

LC_IDENTIFICATION
title      "English locale for the USA"
source     "Free Software Foundation, Inc."
address    "59 Temple Place - Suite 330, Boston, MA 02111-1307, USA"
contact    ""
email      "bug-glibc-locales@gnu.org"
tel        ""
fax        ""
language   "English"
territory  "USA"
revision   "1.0"
date       "2000-06-24"
%
category  "en_US:2000";LC_IDENTIFICATION
category  "en_US:2000";LC_CTYPE
category  "en_US:2000";LC_COLLATE
category  "en_US:2000";LC_TIME
category  "en_US:2000";LC_NUMERIC
category  "en_US:2000";LC_MONETARY
category  "en_US:2000";LC_MESSAGES
category  "en_US:2000";LC_PAPER
category  "en_US:2000";LC_NAME
category  "en_US:2000";LC_ADDRESS
category  "en_US:2000";LC_TELEPHONE
*** SNIP ***
"""

This is a GNU extension [1]. If the OS ships a new version of a
collation, things probably keep working by accident a lot of the time,
because the collation rule that was added or removed is fairly
esoteric anyway; such is the nature of these things. If it were
something that came up a lot, it would surely have been settled by
standardization years ago.
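
To make the LC_IDENTIFICATION idea concrete, here is a minimal Python
sketch that pulls the revision and date fields out of a locale source
excerpt like the one quoted above and joins them into a version tag.
The helper names (parse_lc_identification, version_tag) are
hypothetical, and the parsing assumes the '%' comment character and the
quoted-value layout shown in the excerpt. Nothing here is an existing
PostgreSQL or glibc API, and it assumes the fields are actually updated
when collation rules change, which this thread does not establish.

```python
import re

# Excerpt in the same layout as the en_US locale source quoted above.
SAMPLE = '''
escape_char /
comment_char %
LC_IDENTIFICATION
title      "English locale for the USA"
language   "English"
territory  "USA"
revision   "1.0"
date       "2000-06-24"
%
category  "en_US:2000";LC_IDENTIFICATION
category  "en_US:2000";LC_COLLATE
'''

# Matches lines of the form: keyword   "value"
FIELD = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_lc_identification(text):
    """Return keyword -> value pairs from the LC_IDENTIFICATION section.

    Hypothetical helper: assumes '%' starts a comment, as in the excerpt.
    """
    fields = {}
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "LC_IDENTIFICATION":
            in_section = True
            continue
        if not in_section or not line or line.startswith("%"):
            continue
        if line == "END LC_IDENTIFICATION":
            break
        match = FIELD.match(line)
        # "category" repeats once per locale category, so it is skipped.
        if match and match.group(1) != "category":
            fields[match.group(1)] = match.group(2)
    return fields


def version_tag(text):
    """Combine revision and date into one string to store and compare."""
    fields = parse_lc_identification(text)
    return "{} ({})".format(fields.get("revision", "?"), fields.get("date", "?"))


if __name__ == "__main__":
    print(version_tag(SAMPLE))
```

A tag like this could be recorded when an index is built and compared
again later; a mismatch would only signal that the locale definition
was revised, not which sort orders changed.
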

If OS vendors are not going to give us a standard API for versioning,
we're hosed. For about 2 minutes I thought about suggesting that we
hash a strxfrm() blob, before realizing that that's a stupid idea.
Supporting the glibc mechanism would be a good start.

[1] https://www.gnu.org/software/autoconf/manual/autoconf-2.63/html_node/Special-Shell-Variables.html
--
Regards,
Peter Geoghegan

