Thread: Catastrophic changes to PostgreSQL 8.4

Catastrophic changes to PostgreSQL 8.4

From
Kern Sibbald
Date:
Hello,

I am the project manager of Bacula.  One of the database backends that Bacula
uses is PostgreSQL.

This email is to notify you that a change you made to setting database
character codes has created havoc with certain unfortunate Bacula users.

Bacula sets the database encoding to SQL_ASCII, because although
Bacula "supports" UTF-8 character encoding, it cannot enforce it.  Certain
operating systems such as Unix, Linux and MacOS can have filenames that are
not in UTF-8 format.  Since Bacula stores filenames in PostgreSQL tables, we
use SQL_ASCII.

We set SQL_ASCII by default when creating the database via the command
recommended in recent versions of PostgreSQL (e.g. 8.1), with:

CREATE DATABASE bacula ENCODING 'SQL_ASCII';

However, with PostgreSQL 8.4, the above command is ignored because the default
table copied is not template0.  This means that some Bacula users who have
created PostgreSQL databases with version 8.4 typically find them created
with UTF8 encoding, which results in serious errors when doing backups for
certain machines.
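
As a quick check (assuming the database is named bacula, as in our creation
script), affected users should be able to see what encoding their catalog
actually got with:

SELECT datname, pg_encoding_to_char(encoding)
  FROM pg_database WHERE datname = 'bacula';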

Apparently, for PostgreSQL 8.4, one must specify:

CREATE DATABASE bacula ENCODING 'SQL_ASCII' TEMPLATE=template0;

Rather than making this incompatible modification, it would have been
advisable to make the default equivalent to the above.  Of course I imagine
that there are lots of reasons why that could not be programmed as such.

I also notice the following comment in your 8.4 documentation:

  "PostgreSQL will allow superusers to create databases with SQL_ASCII
encoding even when LC_CTYPE is not C or POSIX. As noted above, SQL_ASCII does
not enforce that the data stored in the database has any particular encoding,
and so this choice poses risks of locale-dependent misbehavior. Using this
combination of settings is deprecated and may someday be forbidden
altogether."

If you do eliminate SQL_ASCII at some point, it would be a real pity and
create a lot of inconvenience for your users.

Regards,

Kern

Re: Catastrophic changes to PostgreSQL 8.4

From
Tom Lane
Date:
Kern Sibbald <kern@sibbald.com> writes:
> Bacula sets the database encoding to SQL_ASCII, because although
> Bacula "supports" UTF-8 character encoding, it cannot enforce it.

Okay ...

> CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>
> However, with PostgreSQL 8.4, the above command is ignored because the default
> table copied is not template0.

... so copy template0.  I think you probably want to force C locale
(lc_ctype/lc_collate) as well, which was not possible for an individual
database before 8.4.  So on the whole I think 8.4 ought to be more
amenable to what you want than prior versions were.  No, it is not
exactly the same as prior versions were, but that's the price of
progress.
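
That is, something along these lines (untested here, but that's the intended
8.4 syntax):

CREATE DATABASE bacula ENCODING 'SQL_ASCII'
    LC_COLLATE 'C' LC_CTYPE 'C' TEMPLATE template0;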

> I also notice the following comment in your 8.4 documentation:

>   "PostgreSQL will allow superusers to create databases with SQL_ASCII
> encoding even when LC_CTYPE is not C or POSIX. As noted above, SQL_ASCII does
> not enforce that the data stored in the database has any particular encoding,
> and so this choice poses risks of locale-dependent misbehavior. Using this
> combination of settings is deprecated and may someday be forbidden
> altogether."

> If you do eliminate SQL_ASCII at some point, it would be a real pity and
> create a lot of inconvenience for your users.

You misread it.  We are not talking about eliminating SQL_ASCII --- as
you say, that's useful.  What is deprecated is trying to use SQL_ASCII
with a non-C locale, which is dangerous, and always has been.  If you've
been putting non-UTF8 data into a database that could be running under a
UTF8-dependent locale, I'm surprised you haven't noticed problems already.

            regards, tom lane

Re: Catastrophic changes to PostgreSQL 8.4

From
Adrian Klaver
Date:
On Wednesday 02 December 2009 5:18:52 am Kern Sibbald wrote:
> Hello,
>
> I am the project manager of Bacula.  One of the database backends that
> Bacula uses is PostgreSQL.
>
> This email is to notify you that a change you made to setting database
> character codes has created havoc with certain unfortunate Bacula users.
>
> Bacula sets the database encoding to SQL_ASCII, because although
> Bacula "supports" UTF-8 character encoding, it cannot enforce it.  Certain
> operating systems such as Unix, Linux and MacOS can have filenames that are
> not in UTF-8 format.  Since Bacula stores filenames in PostgreSQL tables,
> we use SQL_ASCII.
>
> We set SQL_ASCII by default when creating the database via the command
> recommended in recent versions of PostgreSQL (e.g. 8.1), with:
>
> CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>
> However, with PostgreSQL 8.4, the above command is ignored because the
> default table copied is not template0.  This means that some Bacula users
> who have created PostgreSQL databases with version 8.4, typically find them
> created with UTF8 encoding, which results in serious errors when doing
> backups for certain machines.


When I tried the above, Postgres did not ignore the command; instead it gave me
the following error and did not create the database:
CREATE DATABASE bacula ENCODING 'SQL_ASCII';
ERROR:  new encoding (SQL_ASCII) is incompatible with the encoding of the
template database (UTF8)
HINT:  Use the same encoding as in the template database, or use template0 as
template.



>
> Regards,
>
> Kern



--
Adrian Klaver
aklaver@comcast.net

Re: Catastrophic changes to PostgreSQL 8.4

From
Craig Ringer
Date:
On 2/12/2009 9:18 PM, Kern Sibbald wrote:
> Hello,
>
> I am the project manager of Bacula.  One of the database backends that Bacula
> uses is PostgreSQL.

As a Bacula user (though I'm not on the Bacula lists), first - thanks
for all your work. It's practically eliminated all human intervention
from something that used to be a major pain. Configuring it to handle
the different backup frequencies, retention periods and diff/inc/full
needs of the different data sets was a nightmare, but once set up it's
been bliss. The 3.x `Accurate' mode is particularly nice.

> Bacula sets the database encoding to SQL_ASCII, because although
> Bacula "supports" UTF-8 character encoding, it cannot enforce it.  Certain
> operating systems such as Unix, Linux and MacOS can have filenames that are
> not in UTF-8 format.  Since Bacula stores filenames in PostgreSQL tables, we
> use SQL_ASCII.

I noticed that while doing some work on the Bacula database a while ago.

I was puzzled at the time about why Bacula does not translate file names
from the source system's encoding to utf-8 for storage in the database,
so all file names are known to be sane and are in a known encoding.

Because Bacula does not store the encoding or seem to transcode the file
name to a single known encoding, it does not seem to be possible to
retrieve files by name if the bacula console is run on a machine with a
different text encoding to the machine the files came from. After all,
café in utf-8 is a different byte sequence to café in iso-8859-1, and
won't match in equality tests under SQL_ASCII.

Additionally, I'm worried that restoring to a different machine with a
different encoding may fail, and if it doesn't will result in hopelessly
mangled file names. This wouldn't be fun to deal with during disaster
recovery. (I don't yet know if there are provisions within Bacula itself
to deal with this and need to do some testing).

Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
file daemon, using the encoding of the client, for storage in a utf-8
database.

Mac OS X (HFS Plus) and Windows (NTFS) systems store file names as
Unicode (UTF-16 IIRC). Unix systems increasingly use utf-8, but may use
other encodings. If a unix system does use another encoding, this may be
determined from the locale in the environment and used to convert file
names to utf-8.

Windows systems using FAT32 and Mac OS 9 machines on plain old HFS will
have file names in the locale's encoding, like UNIX systems, and are
fairly easily handled.

About the only issue I see is that systems may have file names that are
not valid text strings in the current locale, usually due to buggy
software butchering text encodings. I guess a *nix system _might_ have
different users running with different locales and encodings, too. The
latter case doesn't seem easy to handle cleanly as file names on unix
systems don't have any indication of what encoding they're in stored
with them. I'm not really sure these cases actually show up in practice,
though.

Personally, I'd like to see Bacula capable of using a utf-8 database,
with proper encoding conversion at the fd for non-utf-8 encoded client
systems. It'd really simplify managing backups for systems with a
variety of different encodings.

( BTW, one way to handle incorrectly encoded filenames and paths might
be to have a `bytea' field that's generally null to store such mangled
file names. Personally though I'd favour just rejecting them. )

> We set SQL_ASCII by default when creating the database via the command
> recommended in recent versions of PostgreSQL (e.g. 8.1), with:
>
> CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>
> However, with PostgreSQL 8.4, the above command is ignored because the default
> table copied is not template0.

It's a pity that attempting to specify an encoding other than the safe
one when using a non-template0 database doesn't cause the CREATE
DATABASE command to fail with an error.

--
Craig Ringer

Re: Catastrophic changes to PostgreSQL 8.4

From
Tom Lane
Date:
Craig Ringer <craig@postnewspapers.com.au> writes:
> It's a pity that attempting to specify an encoding other than the safe
> one when using a non-template0 database doesn't cause the CREATE
> DATABASE command to fail with an error.

Huh?

regression=# create database foo lc_ctype = 'en_US.utf8' encoding = 'latin1';
ERROR:  encoding LATIN1 does not match locale en_US.utf8
DETAIL:  The chosen LC_CTYPE setting requires encoding UTF8.

As best I can tell, the OP is complaining exactly because it did fail,
in contrast to pre-8.4 versions that let him create unsafe combinations.

            regards, tom lane

Re: Catastrophic changes to PostgreSQL 8.4

From
Craig Ringer
Date:
On 3/12/2009 11:03 AM, Tom Lane wrote:
> Craig Ringer<craig@postnewspapers.com.au>  writes:
>> It's a pity that attempting to specify an encoding other than the safe
>> one when using a non-template0 database doesn't cause the CREATE
>> DATABASE command to fail with an error.
>
> Huh?
>
> regression=# create database foo lc_ctype = 'en_US.utf8' encoding = 'latin1';
> ERROR:  encoding LATIN1 does not match locale en_US.utf8
> DETAIL:  The chosen LC_CTYPE setting requires encoding UTF8.
>
> As best I can tell, the OP is complaining exactly because it did fail,
> in contrast to pre-8.4 versions that let him create unsafe combinations.

Teach me not to check it myself before saying anything. I read the OP's
post as saying that they were having issues with creation _succeeding_
but disregarding the encoding specification:

> CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>
> However, with PostgreSQL 8.4, the above command is ignored because the default
> table copied is not template0.  This means that some Bacula users who have
> created PostgreSQL databases with version 8.4, typically find them created
> with UTF8 encoding, which results in serious errors when doing backups for
> certain machines.

... which would indeed be nasty. That's what I was commenting on.
Failing to create an unsafe database, on the other hand, is just what I
want to happen, and it's good to see that's what happens on my 8.4
systems. I'm confused by the above, though.

Kern: Is the issue that the CREATE DATABASE is failing, or that it's
succeeding incorrectly? If it's succeeding incorrectly, it'd really help
to know more about the problem systems.

The creation script shipped in 3.0.2 was:

#
# KES: Note: the CREATE DATABASE, probably should be
#   CREATE DATABASE ${db_name} $ENCODING TEMPLATE template0
#
if $bindir/psql -f - -d template1 $* <<END-OF-DATA
CREATE DATABASE ${db_name} $ENCODING;
ALTER DATABASE ${db_name} SET datestyle TO 'ISO, YMD';
END-OF-DATA
then
    echo "Creation of ${db_name} database succeeded."
else
    echo "Creation of ${db_name} database failed."
fi
exit 0

... which doesn't look like it'd do anything unexpected to template1 if
the CREATE DATABASE failed. You might want to set ON_ERROR_STOP in psql
anyway, though.
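
Something like this at the top of the SQL the script feeds to psql would make
it stop at the first error and hand a nonzero exit status back to the shell
(just a sketch, reusing the variables the existing script already defines):

\set ON_ERROR_STOP on
CREATE DATABASE ${db_name} $ENCODING TEMPLATE template0;
ALTER DATABASE ${db_name} SET datestyle TO 'ISO, YMD';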

--
Craig Ringer

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Craig Ringer
Date:
On 3/12/2009 11:09 AM, Jerome Alet wrote:
> On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:
>>
>> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
>> file daemon, using the encoding of the client, for storage in a utf-8
>> database.
>
> +1 for me.
>
> this is the way to go.
>
> I understand people with an existing backup history won't be very happy
> with this unless you provide them the appropriate tools or instructions
> to convert their database's content, though.

I just noticed, while reading src/cats/create_postgresql_database:

# use SQL_ASCII to be able to put any filename into
#  the database even those created with unusual character sets
ENCODING="ENCODING 'SQL_ASCII'"

# use UTF8 if you are using standard Unix/Linux LANG specifications
#  that use UTF8 -- this is normally the default and *should* be
#  your standard.  Bacula works correctly *only* with correct UTF8.
#
#  Note, with this encoding, if you have any "weird" filenames on
#  your system (names generated from Win32 or Mac OS), you may
#  get Bacula batch insert failures.
#
#ENCODING="ENCODING 'UTF8'"



... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your
systems are all in a utf-8 locale. Assuming there's some way for the
filed to find out the encoding of the director's database, it probably
wouldn't be too tricky to convert non-matching file names to the
director's encoding in the fd (when the director's encoding isn't
SQL_ASCII, of course).

This also makes me wonder how filenames on Mac OS X and Windows are
handled. I didn't see any use of the unicode-form APIs or any UTF-16 to
UTF-8 conversion in an admittedly _very_ quick glance at the filed/
sources. How does bacula handle file names on those platforms? Read them
with the non-unicode APIs and hope they fit into the current non-unicode
encoding? Or am I missing something?

--
Craig Ringer

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Stephen Frost
Date:
* Craig Ringer (craig@postnewspapers.com.au) wrote:
> ... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your
> systems are all in a utf-8 locale. Assuming there's some way for the
> filed to find out the encoding of the director's database, it probably
> wouldn't be too tricky to convert non-matching file names to the
> director's encoding in the fd (when the director's encoding isn't
> SQL_ASCII, of course).

I'm not sure which piece of bacula connects to PostgreSQL, but whatever
it is, it could just send a 'set client_encoding' to the PG backend and
all the conversion will be done by PG.

> This also makes me wonder how filenames on Mac OS X and Windows are
> handled. I didn't see any use of the unicode-form APIs or any UTF-16 to
> UTF-8 conversion in an admittedly _very_ quick glance at the filed/
> sources. How does bacula handle file names on those platforms? Read them
> with the non-unicode APIs and hope they fit into the current non-unicode
> encoding? Or am I missing something?

Good question.

    Stephen


Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Jerome Alet
Date:
On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:
>
> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
> file daemon, using the encoding of the client, for storage in a utf-8
> database.

+1 for me.

this is the way to go.

I understand people with an existing backup history won't be very happy
with this unless you provide them the appropriate tools or instructions
to convert their database's content, though.

bye

--
Jérôme Alet - <jerome.alet@univ-nc.nc> - Centre de Ressources Informatiques
      Université de la Nouvelle-Calédonie - BPR4 - 98851 NOUMEA CEDEX
   Tél : +687 266754                                  Fax : +687 254829

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Jose Ildefonso Camargo Tolosa
Date:
Hi!

On Thu, Dec 3, 2009 at 10:39 PM, Jerome Alet <jerome.alet@univ-nc.nc> wrote:
> On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:
>>
>> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
>> file daemon, using the encoding of the client, for storage in a utf-8
>> database.
>
> +1 for me.

+1 here: it does, in fact, have problems when restoring to a server with
a different code page than the original one.

>
> this is the way to go.
>
> I understand people with an existing backup history won't be very happy
> with this unless you provide them the appropriate tools or instructions
> to convert their database's content, though.
>
> bye
>
> --
> Jérôme Alet - <jerome.alet@univ-nc.nc> - Centre de Ressources Informatiques
>      Université de la Nouvelle-Calédonie - BPR4 - 98851 NOUMEA CEDEX
>   Tél : +687 266754                                  Fax : +687 254829

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Craig Ringer
Date:
Stephen Frost wrote:
> * Craig Ringer (craig@postnewspapers.com.au) wrote:
>> ... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your
>> systems are all in a utf-8 locale. Assuming there's some way for the
>> filed to find out the encoding of the director's database, it probably
>> wouldn't be too tricky to convert non-matching file names to the
>> director's encoding in the fd (when the director's encoding isn't
>> SQL_ASCII, of course).
>
> I'm not sure which piece of bacula connects to PostgreSQL, but whatever
> it is, it could just send a 'set client_encoding' to the PG backend and
> all the conversion will be done by PG.

The director is responsible for managing all the metadata, and it's the
component that connects to Pg.

If the fd sent the system charset along with the bundle of filenames etc
that it sends to the director, then I don't see why the director
couldn't `SET client_encoding' appropriately before inserting data from
that fd, then `RESET client_encoding' once the batch insert was done.
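
Roughly this, per batch (only a sketch -- 'LATIN1' stands in for whatever
charset that particular fd reports):

SET client_encoding TO 'LATIN1';
-- ... batch insert of the filenames sent by that fd ...
RESET client_encoding;

with Pg transcoding everything to the database encoding on the way in.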

The only downside is that if even one file has invalidly encoded data,
the whole batch insert fails and is rolled back. For that reason, I'd
personally prefer that the fd handle conversion so that it can exclude
such files (with a loud complaint in the error log) or munge the file
name into something that _can_ be stored.

Come to think of it, if the fd and database are both on a utf-8
encoding, the fd should *still* validate the utf-8 filenames it reads.
There's no guarantee that just because the system thinks the filename
should be utf-8, it's actually valid utf-8, and it'd be good to catch
this at the fd rather than messing up the batch insert by the director,
thus making it much safer than it presently is to use Bacula with a
utf-8 database.

--
Craig Ringer

Re: Catastrophic changes to PostgreSQL 8.4

From
Kern Sibbald
Date:
Hello,

Thanks for all the answers; I am a bit overwhelmed by the number, so I am
going to try to answer everyone in one email.

The first thing to understand is that it is *impossible* to know what the
encoding is on the client machine (FD -- or File daemon).  On say a
Unix/Linux system, the user could create filenames with non-UTF-8 then switch
to UTF-8, or restore files that were tarred on Windows or on Mac, or simply
copy a Mac directory.  Finally, using system calls to create a file, you can
put *any* character into a filename.

So, rather than trying to figure everything out (impossible, I think) and
rather than failing to backup files, Bacula gets the "raw" filename from the
OS and stores it on the Volume then puts it in the database.  We treat the
filename as if it is UTF-8 for display purposes, but in all other cases, what
we want is for the filename to go into the database and come back out
unchanged.

On MySQL we use BLOBS.  On PostgreSQL, we use TEXT and set the encoding to
SQL_ASCII so that PostgreSQL will not attempt to do any translation.  This
works well, and I hope that PostgreSQL will continue to support letting
Bacula insert text characters in the database with no character encoding
checks in the future.

See more notes below ...


On Thursday 03 December 2009 03:54:07 Craig Ringer wrote:
> On 2/12/2009 9:18 PM, Kern Sibbald wrote:
> > Hello,
> >
> > I am the project manager of Bacula.  One of the database backends that
> > Bacula uses is PostgreSQL.
>
> As a Bacula user (though I'm not on the Bacula lists), first - thanks
> for all your work. It's practically eliminated all human intervention
> from something that used to be a major pain. Configuring it to handle
> the different backup frequencies, retention periods and diff/inc/full
> needs of the different data sets was a nightmare, but once set up it's
> been bliss. The 3.x `Accurate' mode is particularly nice.
>
> > Bacula sets the database encoding to SQL_ASCII, because although
> > Bacula "supports" UTF-8 character encoding, it cannot enforce it.
> > Certain operating systems such as Unix, Linux and MacOS can have
> > filenames that are not in UTF-8 format.  Since Bacula stores filenames in
> > PostgreSQL tables, we use SQL_ASCII.
>
> I noticed that while doing some work on the Bacula database a while ago.
>
> I was puzzled at the time about why Bacula does not translate file names
> from the source system's encoding to utf-8 for storage in the database,
> so all file names are known to be sane and are in a known encoding.

We don't and cannot know the encoding scheme on Unix/Linux systems (see
above), so attempting to convert them to UTF-8 would just consume more CPU
time and result in errors at some point.

>
> Because Bacula does not store the encoding or seem to transcode the file
> name to a single known encoding, it does not seem to be possible to
> retrieve files by name if the bacula console is run on a machine with a
> different text encoding to the machine the files came from. After all,
> café in utf-8 is a different byte sequence to café in iso-8859-1, and
> won't match in equality tests under SQL_ASCII.

If all the filenames go in in "binary" or "literal" form, then any tests will
work fine.  The only test Bacula does is "equality".  Bacula doesn't worry
about sorting.  Users may care, but for backup and restore the only test
Bacula needs is equality, and as long as you are working with unchanged byte
streams everything works on every system.

The one place where we do convert filenames is on Windows. We convert UCS to
UTF-8.

>
> Additionally, I'm worried that restoring to a different machine with a
> different encoding may fail, and if it doesn't will result in hopelessly
> mangled file names. This wouldn't be fun to deal with during disaster
> recovery. (I don't yet know if there are provisions within Bacula its
> self to deal with this and need to do some testing).

Yes, if you restore on a different system with a different encoding, you will
end up with the same binary string at the OS level, but when you see the
filenames they may look different.

>
> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
> file daemon, using the encoding of the client, for storage in a utf-8
> database.

As I mentioned, this is not possible since Unix/Linux stores "binary" strings.
They can be in any format.

>
> Mac OS X (HFS Plus) and Windows (NTFS) systems store file names as
> Unicode (UTF-16 IIRC). Unix systems increasingly use utf-8, but may use
> other encodings. If a unix system does use another encoding, this may be
> determined from the locale in the environment and used to convert file
> names to utf-8.

As I mentioned above, on Linux/Unix systems, the user is free to change the
encoding at will, and when he does so, existing filenames remain unchanged,
so it is not possible to choose a particular encoding and backup and restore
files without changing the filenames.  Bacula backs them up and restores them
using "binary" strings so we don't have problems of changing encoding.

>
> Windows systems using FAT32 and Mac OS 9 machines on plain old HFS will
> have file names in the locale's encoding, like UNIX systems, and are
> fairly easily handled.
>
> About the only issue I see is that systems may have file names that are
> not valid text strings in the current locale, usually due to buggy
> software butchering text encodings. I guess a *nix system _might_ have
> different users running with different locales and encodings, too. The
> latter case doesn't seem easy to handle cleanly as file names on unix
> systems don't have any indication of what encoding they're in stored
> with them. I'm not really sure these cases actually show up in practice,
> though.
>
> Personally, I'd like to see Bacula capable of using a utf-8 database,
> with proper encoding conversion at the fd for non-utf-8 encoded client
> systems. It'd really simplify managing backups for systems with a
> variety of different encodings.

Maybe this will happen someday, but first we will need Linux/Unix systems that
*force* the encoding to be in some particular standard.

>
> ( BTW, one way to handle incorrectly encoded filenames and paths might
> be to have a `bytea' field that's generally null to store such mangled
> file names. Personally though I'd favour just rejecting them. )
>
> > We set SQL_ASCII by default when creating the database via the command
> > recommended in recent versions of PostgreSQL (e.g. 8.1), with:
> >
> > CREATE DATABASE bacula ENCODING 'SQL_ASCII';
> >
> > However, with PostgreSQL 8.4, the above command is ignored because the
> > default table copied is not template0.
>
> It's a pity that attempting to specify an encoding other than the safe
> one when using a non-template0 database doesn't cause the CREATE
> DATABASE command to fail with an error.

I didn't actually run it myself, so it is possible that it produced an error
message, but it did apparently create the database but with UTF-8 encoding.
Most of these things are done in script files, so certain non-fatal errors
may be overlooked.

As far as I can tell, it took the above encoding command, and perhaps printed
an error message but went ahead and created the database with an encoding
that was not correct.  If that is indeed the case, then it is in my opinion,
a bad design policy.  I would much prefer that either Postgres accept the
command or that it not create the database.  This way, either the database
would work as the user expects or there would be no database, and the problem
would be resolved before it creates databases that cannot be read.

In any case we have corrected the command to include the TEMPLATE, but this
won't help people with older Bacula's.

The other point I wanted to emphasize is that the documentation implied that
future versions of Postgres may eliminate the feature of having SQL_ASCII
(i.e. the ability to input arbitrary binary strings).  As I said, that would
be a pity -- I suppose we could switch to using LOs or whatever they are
called in Postgres, but that would be rather inconvenient.

Thanks for all the responses,

Best regards,

Kern



Re: Catastrophic changes to PostgreSQL 8.4

From
Pavel Stehule
Date:
2009/12/3 Kern Sibbald <kern@sibbald.com>:
> Hello,
>
> Thanks for all the answers; I am a bit overwhelmed by the number, so I am
> going to try to answer everyone in one email.
>
> The first thing to understand is that it is *impossible* to know what the
> encoding is on the client machine (FD -- or File daemon).  On say a
> Unix/Linux system, the user could create filenames with non-UTF-8 then switch
> to UTF-8, or restore files that were tarred on Windows or on Mac, or simply
> copy a Mac directory.  Finally, using system calls to create a file, you can
> put *any* character into a filename.
>
> So, rather than trying to figure everything out (impossible, I think) and
> rather than failing to backup files, Bacula gets the "raw" filename from the
> OS and stores it on the Volume then puts it in the database.  We treat the
> filename as if it is UTF-8 for display purposes, but in all other cases, what
> we want is for the filename to go into the database and come back out
> unchanged.
>
> On MySQL we use BLOBS.  On PostgreSQL, we use TEXT and set the encoding to
> SQL_ASCII so that PostgreSQL will not attempt to do any translation.  This
> works well, and I hope that PostgreSQL will continue to support letting
> Bacula insert text characters in the database with no character encoding
> checks in the future.

Hello

just use the bytea datatype instead of text.

http://www.postgresql.org/docs/8.4/interactive/datatype-binary.html

it is exactly what you want.
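
small example (only a sketch -- table name is just for demonstration, and this
uses the 8.4 escape format for bytea literals):

CREATE TABLE fname_demo (name bytea);
INSERT INTO fname_demo VALUES (E'caf\\351');        -- latin-1 bytes for café
INSERT INTO fname_demo VALUES (E'caf\\303\\251');   -- utf-8 bytes for café
SELECT name, octet_length(name) FROM fname_demo;

the bytes go in and come out unchanged, with no encoding check at all.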

Regards

Pavel Stehule

Re: Catastrophic changes to PostgreSQL 8.4

From
Craig Ringer
Date:
Kern Sibbald wrote:
> Hello,
>
> Thanks for all the answers; I am a bit overwhelmed by the number, so I am
> going to try to answer everyone in one email.
>
> The first thing to understand is that it is *impossible* to know what the
> encoding is on the client machine (FD -- or File daemon).  On say a
> Unix/Linux system, the user could create filenames with non-UTF-8 then switch
> to UTF-8, or restore files that were tarred on Windows or on Mac, or simply
> copy a Mac directory.  Finally, using system calls to create a file, you can
> put *any* character into a filename.

While true in theory, in practice it's pretty unusual to have filenames
encoded with an encoding other than the system LC_CTYPE on a modern
UNIX/Linux/BSD machine.

I'd _very_ much prefer to have Bacula back my machines up by respecting
LC_CTYPE and applying appropriate conversions at the fd if LC_CTYPE on
the fd's host is not utf-8 and the database is.

If the database was SQL_ASCII, it could retain its existing behaviour.

That way, people get to pick between the two rational behaviours:

(1) Store file names as raw byte strings (SQL_ASCII). Guaranteed to work
    even on garbage file names that aren't valid in the current system
    encoding, but has issues with searches, filename matches, restoring
    to another system, etc.

(2) Store file names as UTF-8, performing any required translation from
    the system charset at the file daemon. File names that are nonsense
    in the system encoding are either (a) rejected with an error in the
    fd logs, or (b) backed up with some form of name mangling.


I *strongly* suspect most people will pick the second option.

There's also a third possibility:

(3) As (2), but add a `bytea' column to `path' and `filename' tables
    that's null if the fd was able to convert the filename from the
    system LC_CTYPE to utf-8. In the rare cases it couldn't (due to
    reasons like users running with different LC_CTYPE, nfs volumes
    exported to systems with different LC_CTYPE, tarballs from
    systems with different charsets, etc) the raw unconverted bytes
    of the filename get stored in the bytea field, and a mangled
    form of the name gets stored in the text field for user display
    purposes only.

I don't know if that'd be worth the hassle, though. I'd just want to use
(2) and I suspect so would a majority of users.

> On MySQL we use BLOBS.  On PostgreSQL, we use TEXT and set the encoding to
> SQL_ASCII so that PostgreSQL will not attempt to do any translation.  This
> works well, and I hope that PostgreSQL will continue to support letting
> Bacula insert text characters in the database with no character encoding
> checks in the future.

Even if that was removed (which I can't see happening) you could use the
bytea type that's designed for exactly that purpose. Pity it's a bit of
a pain to work with :-(

>> Because Bacula does not store the encoding or seem to transcode the file
>> name to a single known encoding, it does not seem to be possible to
>> retrieve files by name if the bacula console is run on a machine with a
>> different text encoding to the machine the files came from. After all,
> café in utf-8 is a different byte sequence to café in iso-8859-1, and
>> won't match in equality tests under SQL_ASCII.
>
> If all the filenames go in in "binary" or "literal" form, then any tests will
> work fine.  The only test Bacula does is "equality".

Byte strings for the same sequence of characters in different encodings
are not equal. If you're looking for a file called "café.txt" that was
created on a machine with a latin-1 encoding, you cannot find it by
searching for 'café' in bconsole because bconsole will search for the
utf-8 byte sequence for 'café' not the latin-1 byte sequence for 'café'.

$ python
>>> x = u"café"
>>> x.encode("utf-8")
'caf\xc3\xa9'
>>> x.encode("latin-1")
'caf\xe9'
>>> x.encode("utf-8") == x.encode("latin-1")
False


or in Pg:


craig=> SHOW client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

craig=> CREATE TABLE test (x text);
CREATE TABLE

craig=> INSERT INTO test(x) VALUES ('café');
INSERT 0 1

craig=> SELECT x, x::bytea FROM test;
  x   |      x
------+-------------
 café | caf\303\251
(1 row)

craig=> SELECT convert_to(x, 'latin-1') from test;
 convert_to
------------
 caf\351
(1 row)

craig=> SELECT convert_to(x, 'utf-8') = x::bytea,
        convert_to(x, 'utf-8') = convert_to(x, 'latin-1')
        FROM test;
 ?column? | ?column?
----------+----------
 t        | f
(1 row)

> The one place where we do convert filenames is on Windows. We convert UCS to
> UTF-8.

Cool. I thought that must be the case, but it's good to know.

What about Mac OS X? It stores file names in UTF-16 normalized form, but
has a variety of ways to access those files, including POSIX interfaces.

Hmm. A quick test suggests that, irrespective of LC_CTYPE, LANG and
LC_ALL, Mac OS X converts file names to UTF-8 for use with POSIX APIs. A
bit of digging helps confirm that:

http://lists.apple.com/archives/applescript-users/2002/Sep/msg00319.html

... so Mac OS X is no problem. It's just traditional UNIXes which refuse
to standardize on a unicode encoding at the file system level that're a
problem.

(It drives me NUTS that Linux file systems don't do everything in
unicode, with the VFS layer or even glibc converting between unicode and
LC_CTYPE for programs that want another encoding. Even Microsoft figured
out the need for this one in 1998!)

> Yes, if you restore on a different system with a different encoding, you will
> end up with the same binary string at the OS level, but when you see the
> filenames they may look different.

... which to most users equates to "argh, Bacula corrupted my backup!"
or "argh, all my filenames are garbage!". Much as they'll say "I changed
my system language and now my filenames are all garbage" if they change
LC_CTYPE...


>> Personally, I'd like to see Bacula capable of using a utf-8 database,
>> with proper encoding conversion at the fd for non-utf-8 encoded client
>> systems. It'd really simplify managing backups for systems with a
>> variety of different encodings.
>
> Maybe this will happen someday, but first we will need Linux/Unix systems that
> *force* the encoding to be in some particular standard.

That would make me a happy, happy fellow.

I don't really see why it can't be done in glibc now, in truth. Convert
everything that goes through a system call to/from utf-8.

> I didn't actually run it myself, so it is possible that it produced an error
> message, but it did apparently create the database but with UTF-8 encoding.
> Most of these things are done in script files, so certain non-fatal errors
> may be overlooked.
>
> As far as I can tell, it took the above encoding command, and perhaps printed
> an error message but went ahead and created the database with an encoding
> that was not correct.

If that's the case, it'd be REALLY good to find out more about the
systems it happened on. Affected Bacula users, please speak up now. Tom
Lane pointed out and demonstrated that it should report an error and
fail to create the DB in these cases, so if it's not that's something
that needs investigation.

Are you sure it wasn't users (or "helpful" distro packagers) altering
create_postgresql_database to issue a CREATE DATABASE statement with
utf-8 ... perhaps after getting errors they didn't understand when using
SQL_ASCII ?

--
Craig Ringer

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Avi Rozen
Date:
Craig Ringer wrote:
> Kern Sibbald wrote:
>
>> Hello,
>>
>> Thanks for all the answers; I am a bit overwhelmed by the number, so I am
>> going to try to answer everyone in one email.
>>
>> The first thing to understand is that it is *impossible* to know what the
>> encoding is on the client machine (FD -- or File daemon).  On say a
>> Unix/Linux system, the user could create filenames with non-UTF-8 then switch
>> to UTF-8, or restore files that were tarred on Windows or on Mac, or simply
>> copy a Mac directory.  Finally, using system calls to create a file, you can
>> put *any* character into a filename.
>>
>
> While true in theory, in practice it's pretty unusual to have filenames
> encoded with an encoding other than the system LC_CTYPE on a modern
> UNIX/Linux/BSD machine.
>

In my case garbage filenames are all too common. It's the sad
*reality*, when you're mixing languages (Hebrew and English in my case)
and operating systems. Garbage filenames are everywhere: directories and
files shared between different operating systems and file systems, mail
attachments, mp3 file names based on garbage id3 tags, files in zip
archives (which seem to not handle filename encoding at all), etc.

When I first tried Bacula (version 1.38), I expected to have trouble
with filenames, since this is what I'm used to. I was rather pleased to
find out that it could both backup and restore files, regardless of
origin and destination filename encoding.

I like Bacula because, among other things, it can take the punishment
and chug along, without me even noticing that there was supposed to be a
problem (a recent example: backup/restore files with a negative mtime ...)

My 2c
Avi

Re: Catastrophic changes to PostgreSQL 8.4

From
"Daniel Verite"
Date:
    Craig Ringer wrote:

> While true in theory, in practice it's pretty unusual to have filenames
> encoded with an encoding other than the system LC_CTYPE on a modern
> UNIX/Linux/BSD machine.

It depends. In western Europe, where iso-8859-1[5] and utf8 are evenly used,
it's not unusual at all. You just have to extract an archive created by
someone who uses a different encoding than you. Since tar files don't carry
any information about the encoding of the filenames they contain, they come
out as they are, whatever LC_CTYPE is. The same problem exists for zip files.

Best regards,
--
Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
"Kern Sibbald"
Date:
> Craig Ringer wrote:
>> Kern Sibbald wrote:
>>
>>> Hello,
>>>
>>> Thanks for all the answers; I am a bit overwhelmed by the number, so I
>>> am
>>> going to try to answer everyone in one email.
>>>
>>> The first thing to understand is that it is *impossible* to know what
>>> the
>>> encoding is on the client machine (FD -- or File daemon).  On say a
>>> Unix/Linux system, the user could create filenames with non-UTF-8 then
>>> switch
>>> to UTF-8, or restore files that were tarred on Windows or on Mac, or
>>> simply
>>> copy a Mac directory.  Finally, using system calls to create a file,
>>> you can
>>> put *any* character into a filename.
>>>
>>
>> While true in theory, in practice it's pretty unusual to have filenames
>> encoded with an encoding other than the system LC_CTYPE on a modern
>> UNIX/Linux/BSD machine.
>>
>
> In my case garbage filenames are all too common. It's the sad
> *reality*, when you're mixing languages (Hebrew and English in my case)
> and operating systems. Garbage filenames are everywhere: directories and
> files shared between different operating systems and file systems, mail
> attachments, mp3 file names based on garbage id3 tags, files in zip
> archives (which seem to not handle filename encoding at all), etc.

Yes, that is my experience too.  I understand Craig's comments, but I
would much prefer that Bacula just back up and restore and leave the
checking of filename consistency to other programs.  At least for the
moment, that seems to work quite well.  Obviously if users mix character
sets, sometimes the display of filenames in Bacula will be weird, but
nevertheless Bacula will back up and restore them so that what was on the
system before the backup is what is restored.

>
> When I first tried Bacula (version 1.38), I expected to have trouble
> with filenames, since this is what I'm used to. I was rather pleased to
> find out that it could both backup and restore files, regardless of
> origin and destination filename encoding.
>
> I like Bacula because, among other things, it can take the punishment
> and chug along, without me even noticing that there was supposed to be a
> problem (a recent example: backup/restore files with a negative mtime ...)
>

Thanks.  Thanks also for using Bacula :-)

Best regards,

Kern



Re: Catastrophic changes to PostgreSQL 8.4

From
Sam Mason
Date:
On Thu, Dec 03, 2009 at 08:33:38AM +0100, Kern Sibbald wrote:
> Bacula gets the "raw" filename from the OS and stores it on the Volume
> then puts it in the database.  We treat the filename as if it is UTF-8
> for display purposes, but in all other cases, what we want is for the
> filename to go into the database and come back out unchanged.

How about also storing the encoding of the path/filename as well?  This
would allow the restore to do the right thing for display purposes and
also when going to a system that uses a different encoding.  Obviously
you wouldn't know this for Unix derivatives, but for most other systems
this would seem to help.
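
An extra column would probably be enough (a sketch; the table and column names
here are made up rather than Bacula's actual schema):

ALTER TABLE filename ADD COLUMN name_encoding text;
-- e.g. 'UTF8', 'LATIN1', or NULL when the client couldn't tell

filled in from the client's locale whenever the fd can report it.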

> On MySQL we use BLOBS.  On PostgreSQL, we use TEXT and set the encoding to
> SQL_ASCII so that PostgreSQL will not attempt to do any translation.
> This works well, and I hope that PostgreSQL will continue to support
> letting Bacula insert text characters in the database with no
> character encoding checks in the future.

As others have said; BYTEA is probably the best datatype for you to
use.  The encoding of BYTEA literals is a bit of a fiddle and may need
some changes, but it's going to be much more faithful to your needs of
treating the filename as an opaque blob of data.

--
  Sam  http://samason.me.uk/

Re: Catastrophic changes to PostgreSQL 8.4

From
Bruce Momjian
Date:
Kern Sibbald wrote:
> Hello,
>
> Thanks for all the answers; I am a bit overwhelmed by the number, so I am
> going to try to answer everyone in one email.

We aim to please, and overwhelm.  :-)

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Catastrophic changes to PostgreSQL 8.4

From
Greg Stark
Date:
On Thu, Dec 3, 2009 at 8:33 AM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> While true in theory, in practice it's pretty unusual to have filenames
> encoded with an encoding other than the system LC_CTYPE on a modern
> UNIX/Linux/BSD machine.
>
> I'd _very_ much prefer to have Bacula back my machines up by respecting
> LC_CTYPE and applying appropriate conversions at the fd if LC_CTYPE on
> the fd's host is not utf-8 and the database is.

a) it doesn't really matter how uncommon it is, backup software is
like databases, it's supposed to always work, not just usually work.

b) LC_CTYPE is an environment variable, it can be different for different users.

c) backup software that tries to fix up the data it's backing up to
what it thinks it should look like is bogus. If I can't trust my
backup software to restore exactly the same data with exactly the same
filenames then it's useless. The last thing I want to be doing when
recovering from a disaster is trying to debug some difference of
opinion between some third party commercial software and
postgres/bacula about unicode encodings.

> (3) As (2), but add a `bytea' column to `path' and `filename' tables
>    that's null if the fd was able to convert the filename from the
>    system LC_CTYPE to utf-8. In the rare cases it couldn't (due to
>    reasons like users running with different LC_CTYPE, nfs volumes
>    exported to systems with different LC_CTYPE, tarballs from
>    systems with different charsets, etc) the raw unconverted bytes
>    of the filename get stored in the bytea field, and a mangled
>    form of the name gets stored in the text field for user display
>    purposes only.

That's an interesting thought. I think it's not quite right -- you
want to always store the raw filename in the bytea and then also store
a text field with the visual representation. That way you can also
deal with broken encodings in some application specific way too,
perhaps by trying to guess a reasonable encoding.

An alternative would be to just store them in byteas and then handle
sorting and displaying by calling the conversion procedure on the fly.
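
Something like this, roughly (column names invented for the sketch):

CREATE TABLE filename (
    filenameid serial PRIMARY KEY,
    name_raw   bytea NOT NULL,  -- exact bytes as read from the client filesystem
    name_text  text             -- best-effort display form, possibly guessed or mangled
);

with name_text filled in however the application sees fit.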

--
greg

Re: Catastrophic changes to PostgreSQL 8.4

From
Adrian Klaver
Date:
On Wednesday 02 December 2009 11:33:38 pm Kern Sibbald wrote:

> > ( BTW, one way to handle incorrectly encoded filenames and paths might
> > be to have a `bytea' field that's generally null to store such mangled
> > file names. Personally though I'd favour just rejecting them. )
> >
> > > We set SQL_ASCII by default when creating the database via the command
> > > recommended in recent versions of PostgreSQL (e.g. 8.1), with:
> > >
> > > CREATE DATABASE bacula ENCODING 'SQL_ASCII';
> > >
> > > However, with PostgreSQL 8.4, the above command is ignored because the
> > > default table copied is not template0.
> >
> > It's a pity that attempting to specify an encoding other than the safe
> > one when using a non-template0 database doesn't cause the CREATE
> > DATABASE command to fail with an error.
>
> I didn't actually run it myself, so it is possible that it produced an
> error message, but it did apparently create the database but with UTF-8
> encoding. Most of these things are done in script files, so certain
> non-fatal errors may be overlooked.
>
> As far as I can tell, it took the above encoding command, and perhaps
> printed an error message but went ahead and created the database with an
> encoding that was not correct.  If that is indeed the case, then it is in
> my opinion, a bad design policy.  I would much prefer that either Postgres
> accept the command or that it not create the database.  This way, either
> the database would work as the user expects or there would be no database,
> and the problem would be resolved before it creates databases that cannot
> be read.

It does not CREATE the database. If the users are seeing that happen, then as
others have suggested it is a bug. The other option is that they are
un-commenting the #ENCODING="ENCODING 'UTF8'" line in the
create_postgresql_database.in script to get it to run. The interesting part in
that script is the Note:

# KES: Note: the CREATE DATABASE, probably should be
#   CREATE DATABASE ${db_name} $ENCODING TEMPLATE template0

According to the git repository this showed up in July of this year:


http://bacula.git.sourceforge.net/git/gitweb.cgi?p=bacula/bacula;a=blob;f=bacula/src/cats/create_postgresql_database.in;hb=6e024d0fe47ea0d9e6d3fbec52c4165caa44967f

>
> In any case we have corrected the command to include the TEMPLATE, but this
> won't help people with older Bacula's.

Could they not just get the corrected version of create_postgresql_database.in?
It would run on the old versions as well.

>
> The other point I wanted to emphasize is that the documentation implied
> that future versions of Postgres may eliminate the feature of having
> SQL_ASCII (i.e. the ability to input arbitrary binary strings).  As I
> said, that would be a pity -- I suppose we could switch to using LOs or
> whatever they are called in Postgres, but that would be rather
> inconvenient.

Per Tom's previous post:
"You misread it.  We are not talking about eliminating SQL_ASCII --- as
you say, that's useful.  What is deprecated is trying to use SQL_ASCII
with a non-C locale, which is dangerous, and always has been.  If you've
been putting non-UTF8 data into a database that could be running under a
UTF8-dependent locale, I'm surprised you haven't noticed problems already."

>
> Thanks for all the responses,
>
> Best regards,
>
> Kern



--
Adrian Klaver
aklaver@comcast.net

Re: Catastrophic changes to PostgreSQL 8.4

From
Tom Lane
Date:
Sam Mason <sam@samason.me.uk> writes:
> As others have said; BYTEA is probably the best datatype for you to
> use.  The encoding of BYTEA literals is a bit of a fiddle and may need
> some changes, but it's going to be much more faithful to your needs of
> treating the filename as an opaque blob of data.

bytea might be theoretically the best choice, but the fact remains that
99% of the entries will be text that's readable in the user's encoding
(whatever that is).  bytea will just be a serious PITA because of its
escaping issues.  Also, the fact that 8.5 may change to hex display by
default will make bytea even more of a PITA for mostly-text data.
So I think Bacula's choice to use sql_ascii with text columns is
entirely defensible.

What concerns me is the claim that PG made a database with some
arbitrary parameters after having rejected a now-considered-invalid
command.   I frankly do not believe that, but if it did happen it's
a *serious* bug that requires investigation.

            regards, tom lane

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Craig Ringer
Date:
Frank Sweetser wrote:

> Unless, of course, you're at a good sized school with lots of
> international students, and have fileservers holding filenames created
> on desktops running in Chinese, Turkish, Russian, and other locales.

What I struggle with here is why they're not using ru_RU.UTF-8,
zh_CN.UTF-8, etc as their locales. Why mix charsets?

I don't think that these people should be forced to use a utf-8 database
and encoding conversion if they want to do things like mix-and-match
charsets for file name chaos on their machines, though. I'd just like to
be able to back up systems that _do_ have consistent charsets in ways
that permit me to later reliably search for files by name, restore to
any host, etc.

Perhaps I'm strange in thinking that all this mix-and-match encodings
stuff is bizarre and backward. The Mac OS X and Windows folks seem to
agree, though. Let the file system store unicode data, and translate at
the file system or libc layer for applications that insist on using
other encodings.

I do take Greg Stark's point (a) though. As *nix systems stand,
solutions will only ever be mostly-works, not always-works, which I
agree isn't good enough. Since there's no sane agreement about encodings
on *nix systems and everything is just byte strings that different apps
can interpret in different ways under different environmental
conditions, we may as well throw up our hands in disgust and give up
trying to do anything sensible. The alternative is saying that files the
file system considers legal can't be backed up because of file naming,
which I do agree isn't ok.

The system shouldn't permit those files to exist, either, but I suspect
we'll still have borked encoding-agnostic wackiness as long as we have
*nix systems at all since nobody will ever agree on anything for long
enough to change it.

Sigh. I think this is about the only time I've ever wished I was using
Windows (or Mac OS X).

Also: Greg, your point (c) goes two ways. If I can't trust my backup
software to restore my filenames from one host exactly correctly to
another host that may have configuration differences not reflected in
the backup metadata, a different OS revision, etc, then what good is it
for disaster recovery? How do I even know what those byte strings
*mean*? Bacula doesn't even record the default system encoding with
backup jobs so there's no way for even the end user to try to fix up the
file names for a different encoding. You're faced with some byte strings
in wtf-is-this-anyway encoding and guesswork. Even recording lc_ctype in
the backup job metadata and offering the _option_ to convert encoding on
restore would be a big step, (though it wouldn't fix the breakage with
searches by filename not matching due to encoding mismatches).

Personally, I'm just going to stick to a utf-8 only policy for all my
hosts, working around the limitation that way. It's worked ok thus far,
though I don't much like the way that different normalizations of
unicode won't match equal under SQL_ASCII so I can't reliably search for
file names.

--
Craig Ringer

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Ivan Sergio Borgonovo
Date:
On Thu, 3 Dec 2009 12:22:50 +0100 (CET)
"Kern Sibbald" <kern@sibbald.com> wrote:

> Yes, that is my experience too.  I understand Craig's comments,
> but I would much prefer that Bacula just backup and restore and
> leave the checking of filename consistencies to other programs.
> At least for the moment, that seems to work quite well.  Obviously
> if users mix character sets, sometimes the display of filenames in
> Bacula will be weird, but nevertheless Bacula will back up and
> restore them so that what was on the system before the backup is
> what is restored.

I expect backup software to have predictable, reversible behaviour
and to warn me if I'm shooting myself in the foot.

It should be the responsibility of the admin to restore files in a
proper place knowing that locales may be a problem.

I think Bacula is taking the right approach.

Still I'd surely appreciate as a feature a "tool" that will help me
to restore files in a system with a different locale than the
original one or warn me if the locale is different or it can't be
sure it is the same.
That's exactly what Postgresql is doing: at least warning you.
Even Postgresql is taking the right approach.

An additional "guessed original locale" field and a tool/option to
convert/restore with selected locale could be an interesting feature.

What is Bacula going to do with xattr on different systems?

Postgresql seems to offer a good choice of tools to convert between
encodings and deal with bytea.
Formally I'd prefer bytea but in real use it may just be an
additional pain and other DB may not offer the same tools for
encoding/bytea conversions.
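
For instance convert_from() and encode() cover most of it (a sketch, reusing
the hypothetical name_raw bytea column sketched earlier in the thread):

SELECT convert_from(name_raw, 'LATIN1') FROM filename;  -- reinterpret the bytes as latin-1 text
SELECT encode(name_raw, 'escape') FROM filename;        -- or just inspect the raw bytes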

Is it possible to search for a file in a backup set?
What is it going to happen if I'm searching from a system that has a
different locale from the one the backup was made on?
Can I use regexp? Can accents be ignored during searches?


--
Ivan Sergio Borgonovo
http://www.webthatworks.it


Re: Catastrophic changes to PostgreSQL 8.4

From
Sam Mason
Date:
On Thu, Dec 03, 2009 at 10:46:54AM -0500, Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
> > As others have said; BYTEA is probably the best datatype for you to
> > use.  The encoding of BYTEA literals is a bit of a fiddle and may need
> > some changes, but it's going to be much more faithful to your needs of
> > treating the filename as an opaque blob of data.
>
> bytea might be theoretically the best choice, but the fact remains that
> 99% of the entries will be text that's readable in the user's encoding
> (whatever that is).

I agree it'll be fine most of the time and the more important thing is
normally the data rather than the filename.  Still, for non-english
speaking people I'd guess there are many more encodings floating around
than I'd ever expect to see on a daily basis.  Us English/US speakers
really do have a very easy life.

There's also the issue that the user's encoding doesn't necessarily
match the system's encoding.  Thus within an account everything may be
easy, but when a system daemon comes in and looks at things it's going
to be somewhat messy.

No hard numbers either way; I just know I see a very biased sample of
systems and would not like to make generalizations.
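
To make the literal-escaping "fiddle" mentioned above concrete, here's a
minimal sketch of the two ways of getting raw bytes into a bytea column
(hypothetical table, not Bacula's schema):

    -- Hypothetical table for raw filename bytes:
    CREATE TABLE file_demo (name_raw bytea);
    -- Octal escape form (doubled backslashes inside an E'' string):
    INSERT INTO file_demo (name_raw) VALUES (E'\\320\\237'::bytea);
    -- The same two bytes via the hex codec:
    INSERT INTO file_demo (name_raw) VALUES (decode('d09f', 'hex'));
    -- Inspect what was stored, as hex rather than escaped text:
    SELECT encode(name_raw, 'hex') FROM file_demo;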

> What concerns me is the claim that PG made a database with some
> arbitrary parameters after having rejected a now-considered-invalid
> command.   I frankly do not believe that, but if it did happen it's
> a *serious* bug that requires investigation.

Yup, it'll be interesting to hear more details from the OP about this.

--
  Sam  http://samason.me.uk/

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Frank Sweetser
Date:
On 12/03/2009 10:54 AM, Craig Ringer wrote:
> Frank Sweetser wrote:
>
>> Unless, of course, you're at a good sized school with lots of
>> international students, and have fileservers holding filenames created
>> on desktops running in Chinese, Turkish, Russian, and other locales.
>
> What I struggle with here is why they're not using ru_RU.UTF-8,
> cn_CN.UTF-8, etc as their locales. Why mix charsets?

The problem isn't so much what they're using on their unmanaged desktops.  The
problem is that the server, which is the one getting backed up, holds an
aggregation of files created by an unknown collection of applications running
on a mish-mash of operating systems (every large edu has its horror story of
the 15+ year old, unpatched, mission critical machine that no one dares touch)
with wildly varying charset configurations, no doubt including horribly broken
and pre-UTF ones.

The end result is a fileset full of filenames created on a hacked Chinese copy
of XP, a Russian copy of WinME, a Romanian RedHat 4.0 box, and Mac OS 8.

This kind of junk is, sadly, not uncommon in academic environments, where IT
is often required to support stuff that they don't get to manage.

--
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
     GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Eitan Talmi
Date:
Hi Avi

Please have a look at this link; it shows how to install Bacula with a MySQL database with Hebrew support.

Eitan


On Thu, Dec 3, 2009 at 12:35 PM, Avi Rozen <avi.rozen@gmail.com> wrote:
Craig Ringer wrote:
> Kern Sibbald wrote:
>
>> Hello,
>>
>> Thanks for all the answers; I am a bit overwhelmed by the number, so I am
>> going to try to answer everyone in one email.
>>
>> The first thing to understand is that it is *impossible* to know what the
>> encoding is on the client machine (FD -- or File daemon).  On say a
>> Unix/Linux system, the user could create filenames with non-UTF-8 then switch
>> to UTF-8, or restore files that were tarred on Windows or on Mac, or simply
>> copy a Mac directory.  Finally, using system calls to create a file, you can
>> put *any* character into a filename.
>>
>
> While true in theory, in practice it's pretty unusual to have filenames
> encoded with an encoding other than the system LC_CTYPE on a modern
> UNIX/Linux/BSD machine.
>

In my case garbage filenames are all too common. It's the sad
*reality* when you're mixing languages (Hebrew and English in my case)
and operating systems. Garbage filenames are everywhere: directories and
files shared between different operating systems and file systems, mail
attachments, mp3 file names based on garbage id3 tags, files in zip
archives (which seem to not handle filename encoding at all), etc.

When I first tried Bacula (version 1.38), I expected to have trouble
with filenames, since this is what I'm used to. I was rather pleased to
find out that it could both back up and restore files, regardless of
origin and destination filename encoding.

I like Bacula because, among other things, it can take the punishment
and chug along, without me even noticing that there was supposed to be a
problem (a recent example: backing up and restoring files with a negative mtime ...)

My 2c
Avi

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Frank Sweetser
Date:
On 12/3/2009 3:33 AM, Craig Ringer wrote:
> Kern Sibbald wrote:
>> Hello,
>>
>> Thanks for all the answers; I am a bit overwhelmed by the number, so I am
>> going to try to answer everyone in one email.
>>
>> The first thing to understand is that it is *impossible* to know what the
>> encoding is on the client machine (FD -- or File daemon).  On say a

Or, even worse, which encoding the user or application was thinking of when it
wrote a particular filename out.  There's no guarantee that any two files on a system
were intended to be looked at with the same encoding.

>> Unix/Linux system, the user could create filenames with non-UTF-8 then switch
>> to UTF-8, or restore files that were tarred on Windows or on Mac, or simply
>> copy a Mac directory.  Finally, using system calls to create a file, you can
>> put *any* character into a filename.
>
> While true in theory, in practice it's pretty unusual to have filenames
> encoded with an encoding other than the system LC_CTYPE on a modern
> UNIX/Linux/BSD machine.

Unless, of course, you're at a good sized school with lots of international
students, and have fileservers holding filenames created on desktops running
in Chinese, Turkish, Russian, and other locales.

In the end, a filename is (under Linux, at least) just a string of arbitrary
bytes containing anything except / and NULL.  If Bacula tries to get too
clever and munges or misinterprets those byte strings - or, worse yet, if
the database does it behind your back - then stuff _will_ end up breaking.

(A few years back, someone heavily involved in Linux kernel filesystem work
was talking about this exact issue, and remarked that many people doing
internationalization work secretly feel it would be easier to just teach
everyone English.  Impossible as that may be, I have since come to understand
what they were talking about...)

--
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
      GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC

Re: [Bacula-users] Catastrophic changes to PostgreSQL 8.4

From
Alvaro Herrera
Date:
Craig Ringer wrote:
> Frank Sweetser wrote:
>
> > Unless, of course, you're at a good sized school with lots of
> > international students, and have fileservers holding filenames created
> > on desktops running in Chinese, Turkish, Russian, and other locales.
>
> What I struggle with here is why they're not using ru_RU.UTF-8,
> cn_CN.UTF-8, etc as their locales. Why mix charsets?

On my own desktop computer, I switched from Latin1 to UTF8 some two
years ago, and I still have a mixture of file name encodings.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Catastrophic changes to PostgreSQL 8.4

From
Tom Lane
Date:
Kern Sibbald <kern@sibbald.com> writes:
> Concerning the future: I am still a bit concerned about the mention in the
> documentation of possible future changes concerning SQL_ASCII with an
> LC_CTYPE that is not C or POSIX ("Using this combination of settings is
> deprecated and may someday be forbidden altogether."), so if you have any
> suggestions for a CREATE DATABASE command that we can use now to ensure
> that future versions of Bacula + PostgreSQL end up with our bacula
> database created with SQL_ASCII, I would much appreciate it.

What you need to do is add
    LC_COLLATE = 'C' LC_CTYPE = 'C'
to the command.  This is already important in 8.4 to ensure the command
won't fail if the installation default locale is not C.
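
Putting that together with the template0 point already raised in this
thread, the whole command would look something like this (illustrative
only, using the database name from your scripts):

    CREATE DATABASE bacula ENCODING 'SQL_ASCII'
        LC_COLLATE = 'C' LC_CTYPE = 'C' TEMPLATE template0;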

The difficulty of course is that this is only for 8.4 and up, older
versions won't take it.  I don't know the context well enough to suggest
how you might want to deal with version-sensitive commands (probably you
have some plan for that already).

            regards, tom lane

Re: Catastrophic changes to PostgreSQL 8.4

From
Kern Sibbald
Date:
On Thursday 03 December 2009 16:42:58 Adrian Klaver wrote:
> On Wednesday 02 December 2009 11:33:38 pm Kern Sibbald wrote:
> > > ( BTW, one way to handle incorrectly encoded filenames and paths might
> > > be to have a `bytea' field that's generally null to store such mangled
> > > file names. Personally though I'd favour just rejecting them. )
> > >
> > > > We set SQL_ASCII by default when creating the database via the
> > > > command recommended in recent versions of PostgreSQL (e.g. 8.1),
> > > > with:
> > > >
> > > > CREATE DATABASE bacula ENCODING 'SQL_ASCII';
> > > >
> > > > However, with PostgreSQL 8.4, the above command is ignored because
> > > > the default table copied is not template0.
> > >
> > > It's a pity that attempting to specify an encoding other than the safe
> > > one when using a non-template0 database doesn't cause the CREATE
> > > DATABASE command to fail with an error.
> >
> > I didn't actually run it myself, so it is possible that it produced an
> > error message, but it did apparently create the database but with UTF-8
> > encoding. Most of these things are done in script files, so certain
> > non-fatal errors may be overlooked.
> >
> > As far as I can tell, it took the above encoding command, and perhaps
> > printed an error message but went ahead and created the database with an
> > encoding that was not correct.  If that is indeed the case, then it is in
> > my opinion, a bad design policy.  I would much prefer that either
> > Postgres accept the command or that it not create the database.  This
> > way, either the database would work as the user expects or there would be
> > no database, and the problem would be resolved before it creates
> > databases that cannot be read.
>
> It does not CREATE the database. If the users are seeing that happen, then
> as others have suggested it is a bug. The other option is that they are
> un-commenting the #ENCODING="ENCODING 'UTF8'" line in the
> create_postgresql_database.in script to get it to run. The interesting part
> in that script is the Note:
>
> # KES: Note: the CREATE DATABASE, probably should be
> #   CREATE DATABASE ${db_name} $ENCODING TEMPLATE template0
>
> According to the git repository this showed up in July of this year;

I had forgotten about that, but it does show that I do go and read the
documentation from time to time :-)

>
> http://bacula.git.sourceforge.net/git/gitweb.cgi?p=bacula/bacula;a=blob;f=b
>acula/src/cats/create_postgresql_database.in;hb=6e024d0fe47ea0d9e6d3fbec52c4
>165caa44967f
>
> > In any case we have corrected the command to include the TEMPLATE, but
> > this won't help people with older Bacula's.
>
> Could they not just get the corrected version of
> create_postgresql_database.in. It would run on the old versions as well.

Have you ever tried to push a fix for postgres downstream?  It usually is not
easy and takes a long time.

Anyway, now that we understand the problem, we will be able to warn users,
which is what happened when we copied the bacula-users email list, and those
who miss the warning we can help rather quickly.

>
> > The other point I wanted to emphasize is that the documentation implied
> > that future versions of Postgres may eliminate the feature of having
> SQL_ASCII (i.e. the ability to input arbitrary binary strings).  As I
> > said, that would be a pity -- I suppose we could switch to using LOs or
> > whatever they are called in Postgres, but that would be rather
> > inconvenient.
>
> Per Tom's previous post:
> "You misread it.  We are not talking about eliminating SQL_ASCII --- as
> you say, that's useful.  What is deprecated is trying to use SQL_ASCII
> with a non-C locale, which is dangerous, and always has been.  If you've
> been putting non-UTF8 data into a database that could be running under a
> UTF8-dependent locale, I'm surprised you haven't noticed problems already."
>

Duh, yes, I misunderstood it. Another comment by Tom Lane clarified the
situation for me. I've now ensured that for future Bacula versions the
LC_COLLATE and LC_CTYPE are set to 'C' when we set SQL_ASCII.

Thanks for the help and polite comments from everyone :-)

Kern

Re: Catastrophic changes to PostgreSQL 8.4

From
Kern Sibbald
Date:
Hello,

Thanks for your response.

On Thursday 03 December 2009 16:46:54 Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
> > As others have said; BYTEA is probably the best datatype for you to
> > use.  The encoding of BYTEA literals is a bit of a fiddle and may need
> > some changes, but it's going to be much more faithful to your needs of
> > treating the filename as an opaque blob of data.
>
> bytea might be theoretically the best choice, but the fact remains that
> 99% of the entries will be text that's readable in the user's encoding
> (whatever that is).  bytea will just be a serious PITA because of its
> escaping issues.  Also, the fact that 8.5 may change to hex display by
> default will make bytea even more of a PITA for mostly-text data.
> So I think Bacula's choice to use sql_ascii with text columns is
> entirely defensible.

Thank you for confirming this.  I do appreciate the SQL_ASCII feature that
PostgreSQL has.  It is far preferable for our users to the MySQL BLOB
solution because, as you say, most of the filenames do go in as valid UTF-8,
and those that don't are backed up and restored correctly, but may have a few
problems when their names are displayed, which is rather rare anyway.

>
> What concerns me is the claim that PG made a database with some
> arbitrary parameters after having rejected a now-considered-invalid
> command.   I frankly do not believe that, but if it did happen it's
> a *serious* bug that requires investigation.

Concerning the problem with specifying ENCODING 'SQL_ASCII' but not TEMPLATE
template0: it appears that PostgreSQL *did* reject the command and did not
create a database with SQL_UTF8 as I had supposed, so, as you say, it is not
a PostgreSQL bug.  As best I can tell, the command failed, so the user
manually resubmitted it, probably without the ENCODING clause, and so the
database ended up with the wrong encoding.

I just fixed our Bacula create_postgresql_database script so that it clearly
detects any problem creating the database (ON_ERROR_STOP), which was not
previously the case.
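
For anyone fixing an older installation by hand, the idea is roughly this
(a sketch of a psql script, not the actual contents of our script):

    \set ON_ERROR_STOP on
    -- 8.4-and-later syntax; older servers do not accept the LC_* options
    CREATE DATABASE bacula ENCODING 'SQL_ASCII'
        LC_COLLATE = 'C' LC_CTYPE = 'C' TEMPLATE template0;

The same effect can be had by invoking psql with -v ON_ERROR_STOP=1, so a
failed CREATE DATABASE stops the script instead of being silently ignored.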

Concerning the future: I am still a bit concerned about the mention in the
documentation of possible future changes concerning SQL_ASCII with an
LC_CTYPE that is not C or POSIX ("Using this combination of settings is
deprecated and may someday be forbidden altogether."), so if you have any
suggestions for a CREATE DATABASE command that we can use now to ensure that
future versions of Bacula + PostgreSQL end up with our bacula database
created with SQL_ASCII, I would much appreciate it.

Best regards,

Kern