Re: Catastrophic changes to PostgreSQL 8.4 - Mailing list pgsql-general

From Craig Ringer
Subject Re: Catastrophic changes to PostgreSQL 8.4
Date
Msg-id 4B17284F.3090401@postnewspapers.com.au
In response to Catastrophic changes to PostgreSQL 8.4  (Kern Sibbald <kern@sibbald.com>)
Responses Re: Catastrophic changes to PostgreSQL 8.4  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4  (Jerome Alet <jerome.alet@univ-nc.nc>)
Re: Catastrophic changes to PostgreSQL 8.4  (Kern Sibbald <kern@sibbald.com>)
List pgsql-general
On 2/12/2009 9:18 PM, Kern Sibbald wrote:
> Hello,
>
> I am the project manager of Bacula.  One of the database backends that Bacula
> uses is PostgreSQL.

As a Bacula user (though I'm not on the Bacula lists), first - thanks
for all your work. It's practically eliminated all human intervention
from something that used to be a major pain. Configuring it to handle
the different backup frequencies, retention periods and diff/inc/full
needs of the different data sets was a nightmare, but once set up it's
been bliss. The 3.x `Accurate' mode is particularly nice.

> Bacula sets the database encoding to SQL_ASCII, because although
> Bacula "supports" UTF-8 character encoding, it cannot enforce it.  Certain
> operating systems such as Unix, Linux and MacOS can have filenames that are
> not in UTF-8 format.  Since Bacula stores filenames in PostgreSQL tables, we
> use SQL_ASCII.

I noticed that while doing some work on the Bacula database a while ago.

I was puzzled at the time about why Bacula does not translate file names
from the source system's encoding to utf-8 for storage in the database,
so all file names are known to be sane and are in a known encoding.

Because Bacula does not store the encoding or seem to transcode the file
name to a single known encoding, it does not seem to be possible to
retrieve files by name if the Bacula console is run on a machine with a
different text encoding to the machine the files came from. After all,
café in utf-8 is a different byte sequence to café in iso-8859-1, and
won't match in equality tests under SQL_ASCII.
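
To make the mismatch concrete, here is a small Python sketch (Python is
only used for illustration; it is not part of Bacula) that encodes the
same name both ways and compares the raw bytes, which is effectively
what an equality test on an SQL_ASCII column does:

```python
# "caf\u00e9" is the string café, written with an escape so the example
# does not depend on the encoding of this file.
name = "caf\u00e9"

utf8_bytes = name.encode("utf-8")         # é becomes two bytes: 0xC3 0xA9
latin1_bytes = name.encode("iso-8859-1")  # é becomes one byte: 0xE9

print(utf8_bytes)                  # b'caf\xc3\xa9'
print(latin1_bytes)                # b'caf\xe9'
print(utf8_bytes == latin1_bytes)  # False
```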

Additionally, I'm worried that restoring to a different machine with a
different encoding may fail, and if it doesn't fail, will result in
hopelessly mangled file names. This wouldn't be fun to deal with during
disaster recovery. (I don't yet know if there are provisions within
Bacula itself to deal with this; I need to do some testing.)

Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
file daemon, using the encoding of the client, for storage in a utf-8
database.

Mac OS X (HFS Plus) and Windows (NTFS) systems store file names as
Unicode (UTF-16 IIRC). Unix systems increasingly use utf-8, but may use
other encodings. If a unix system does use another encoding, this may be
determined from the locale in the environment and used to convert file
names to utf-8.
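
As a rough sketch of that conversion, assuming the client's encoding can
be taken from its locale: the helper name `filename_to_utf8` is
hypothetical and not an existing Bacula function, and a real file daemon
is written in C rather than Python.

```python
import locale


def filename_to_utf8(raw_name: bytes, source_encoding: str) -> bytes:
    """Re-encode a raw file name from source_encoding to UTF-8.

    Hypothetical helper for illustration. Raises UnicodeDecodeError if
    raw_name is not valid text in source_encoding.
    """
    text = raw_name.decode(source_encoding)
    return text.encode("utf-8")


# The encoding implied by the current locale; the value depends on the
# machine this runs on, so it is only looked up here, not relied upon.
client_encoding = locale.getpreferredencoding(False)

# A name as it would be stored on an iso-8859-1 client.
latin1_name = b"caf\xe9"
utf8_name = filename_to_utf8(latin1_name, "iso-8859-1")
```

Names that are already utf-8 pass through unchanged when the source
encoding is utf-8, so the same code path could serve every client.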

Windows systems using FAT32 and Mac OS 9 machines on plain old HFS will
have file names in the locale's encoding, like UNIX systems, and are
fairly easily handled.

About the only issue I see is that systems may have file names that are
not valid text strings in the current locale, usually due to buggy
software butchering text encodings. I guess a *nix system _might_ have
different users running with different locales and encodings, too. The
latter case doesn't seem easy to handle cleanly, as file names on unix
systems carry no indication of the encoding they were written in. I'm
not really sure these cases actually show up in practice, though.

Personally, I'd like to see Bacula capable of using a utf-8 database,
with proper encoding conversion at the fd for non-utf-8 encoded client
systems. It'd really simplify managing backups for systems with a
variety of different encodings.

( BTW, one way to handle incorrectly encoded filenames and paths might
be to have a `bytea' field that's generally null to store such mangled
file names. Personally though I'd favour just rejecting them. )
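
A minimal sketch of that fallback, again in Python for illustration
only: the function name `split_filename` and the two-value layout (a
text name plus a usually-null raw value) are assumptions made for this
example, not Bacula's schema.

```python
from typing import Optional, Tuple


def split_filename(
    raw_name: bytes, client_encoding: str
) -> Tuple[Optional[str], Optional[bytes]]:
    """Return (text_name, raw_bytes) for one file name.

    Hypothetical helper: text_name would go in the normal utf-8 text
    column; raw_bytes would go in a bytea column and is None unless the
    name is not valid text in client_encoding.
    """
    try:
        return raw_name.decode(client_encoding), None
    except UnicodeDecodeError:
        # Mangled name: keep the original bytes instead of guessing.
        return None, raw_name


good = split_filename(b"caf\xc3\xa9", "utf-8")  # valid utf-8
bad = split_filename(b"caf\xe9", "utf-8")       # latin-1 bytes, not utf-8
```

Rejecting such names outright would amount to raising an error in the
`except` branch instead of returning the raw bytes.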

> We set SQL_ASCII by default when creating the database via the command
> recommended in recent versions of PostgreSQL (e.g. 8.1), with:
>
> CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>
> However, with PostgreSQL 8.4, the above command is ignored because the default
> table copied is not template0.

It's a pity that attempting to specify an encoding other than the safe
one when the template database is not template0 doesn't cause the
CREATE DATABASE command to fail with an error.

--
Craig Ringer
