Re: Catastrophic changes to PostgreSQL 8.4 - Mailing list pgsql-general

From Greg Stark
Subject Re: Catastrophic changes to PostgreSQL 8.4
Msg-id 407d949e0912030541r261895f3j2e2430e7978617c5@mail.gmail.com
In response to Re: Catastrophic changes to PostgreSQL 8.4  (Craig Ringer <craig@postnewspapers.com.au>)
List pgsql-general
On Thu, Dec 3, 2009 at 8:33 AM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> While true in theory, in practice it's pretty unusual to have filenames
> encoded with an encoding other than the system LC_CTYPE on a modern
> UNIX/Linux/BSD machine.
>
> I'd _very_ much prefer to have Bacula back my machines up by respecting
> LC_CTYPE and applying appropriate conversions at the fd if LC_CTYPE on
> the fd's host is not utf-8 and the database is.

a) It doesn't really matter how uncommon it is. Backup software is
like a database: it's supposed to always work, not just usually work.

b) LC_CTYPE is an environment variable, so it can differ between
users on the same machine.

c) Backup software that tries to fix up the data it's backing up to
match what it thinks the data should look like is bogus. If I can't
trust my backup software to restore exactly the same data with exactly
the same filenames then it's useless. The last thing I want to be doing
when recovering from a disaster is trying to debug some difference of
opinion between some third party commercial software and
postgres/bacula about unicode encodings.

> (3) As (2), but add a `bytea' column to `path' and `filename' tables
>    that's null if the fd was able to convert the filename from the
>    system LC_CTYPE to utf-8. In the rare cases it couldn't (due to
>    reasons like users running with different LC_CTYPE, nfs volumes
>    exported to systems with different LC_CTYPE, tarballs from
>    systems with different charsets, etc) the raw unconverted bytes
>    of the filename get stored in the bytea field, and a mangled
>    form of the name gets stored in the text field for user display
>    purposes only.

That's an interesting thought. I think it's not quite right -- you
want to always store the raw filename in the bytea and then also store
a text field with the visual representation. That way you can also
deal with broken encodings in some application specific way too,
perhaps by trying to guess a reasonable encoding.
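
To make the "always store both" idea concrete, here is a minimal sketch
in Python. It is not Bacula code. The function name filename_record and
the list of encodings to guess are assumptions made up for illustration.
The raw bytes are always kept untouched, and the text form is only a
best-effort rendering for display.

```python
def filename_record(raw: bytes, guesses=("utf-8",)):
    """Return (raw, display) for one filename.

    raw is stored unchanged and is what a restore would use.
    display is a best-effort text form for humans only.
    Both the function name and the guesses tuple are hypothetical.
    """
    for encoding in guesses:
        try:
            # First encoding that decodes cleanly wins.
            return raw, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: substitute U+FFFD for undecodable bytes.
    return raw, raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for name in (b"caf\xc3\xa9.txt", b"caf\xe9.txt"):
        raw, display = filename_record(name)
        print(repr(raw), "->", display)
```

Passing guesses=("utf-8", "latin-1") would be one application-specific
way to guess at a broken encoding. Note that latin-1 decodes any byte
sequence, so it only makes sense as the last guess.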

An alternative would be to just store them in byteas and then handle
sorting and displaying by calling the conversion procedure on the fly.
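
A rough sketch of that bytea-only alternative, again in Python and again
with hypothetical names (to_display, sorted_for_display): only the raw
bytes are stored, and the text form is computed each time it is needed.
The sort here is by code point, not by any locale's collation, which is
an assumption of the sketch rather than a recommendation.

```python
def to_display(raw: bytes, encoding: str = "utf-8") -> str:
    """Convert stored bytes to text on the fly; never stored anywhere."""
    return raw.decode(encoding, errors="replace")


def sorted_for_display(raw_names, encoding: str = "utf-8"):
    """Sort raw filenames by their display form.

    The raw bytes are the tie-breaker so that two different names which
    render identically still sort deterministically.
    """
    return sorted(raw_names, key=lambda raw: (to_display(raw, encoding), raw))


if __name__ == "__main__":
    names = [b"b.txt", b"\xc3\xa9.txt", b"a.txt"]
    for raw in sorted_for_display(names):
        print(to_display(raw))
```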

--
greg
