Thread: Catastrophic changes to PostgreSQL 8.4
Hello,

I am the project manager of Bacula. One of the database backends that Bacula uses is PostgreSQL.

This email is to notify you that a change you made to setting database character codes has created havoc for certain unfortunate Bacula users.

Bacula sets the database encoding to SQL_ASCII because, although Bacula "supports" UTF-8 character encoding, it cannot enforce it. Certain operating systems such as Unix, Linux and MacOS can have filenames that are not in UTF-8 format. Since Bacula stores filenames in PostgreSQL tables, we use SQL_ASCII.

We set SQL_ASCII by default when creating the database via the command recommended in recent versions of PostgreSQL (e.g. 8.1), with:

   CREATE DATABASE bacula ENCODING 'SQL_ASCII';

However, with PostgreSQL 8.4, the above command is ignored because the database copied by default is not template0. This means that some Bacula users who have created PostgreSQL databases with version 8.4 typically find them created with UTF8 encoding, which results in serious errors when doing backups for certain machines. Apparently, for PostgreSQL 8.4, one must specify:

   CREATE DATABASE bacula ENCODING 'SQL_ASCII' TEMPLATE=template0;

Rather than making this incompatible modification, it would have been advisable to make the default equivalent to the above. Of course, I imagine that there are lots of reasons why it could not be programmed that way.

I also notice the following comment in your 8.4 documentation:

"PostgreSQL will allow superusers to create databases with SQL_ASCII encoding even when LC_CTYPE is not C or POSIX. As noted above, SQL_ASCII does not enforce that the data stored in the database has any particular encoding, and so this choice poses risks of locale-dependent misbehavior. Using this combination of settings is deprecated and may someday be forbidden altogether."

If you do eliminate SQL_ASCII at some point, it would be a real pity and create a lot of inconvenience for your users.

Regards,

Kern
Kern Sibbald <kern@sibbald.com> writes:
> Bacula sets the database encoding to SQL_ASCII, because although
> Bacula "supports" UTF-8 character encoding, it cannot enforce it.

Okay ...

> CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>
> However, with PostgreSQL 8.4, the above command is ignored because the
> default table copied is not template0.

... so copy template0. I think you probably want to force C locale (lc_ctype/lc_collate) as well, which was not possible for an individual database before 8.4. So on the whole I think 8.4 ought to be more amenable to what you want than prior versions were. No, it is not exactly the same as prior versions were, but that's the price of progress.

> I also notice the following comment in your 8.4 documentation:
>
> "PostgreSQL will allow superusers to create databases with SQL_ASCII
> encoding even when LC_CTYPE is not C or POSIX. As noted above, SQL_ASCII
> does not enforce that the data stored in the database has any particular
> encoding, and so this choice poses risks of locale-dependent misbehavior.
> Using this combination of settings is deprecated and may someday be
> forbidden altogether."
>
> If you do eliminate SQL_ASCII at some point, it would be a real pity and
> create a lot of inconvenience for your users.

You misread it. We are not talking about eliminating SQL_ASCII --- as you say, that's useful. What is deprecated is trying to use SQL_ASCII with a non-C locale, which is dangerous, and always has been. If you've been putting non-UTF8 data into a database that could be running under a UTF8-dependent locale, I'm surprised you haven't noticed problems already.

			regards, tom lane
On Wednesday 02 December 2009 5:18:52 am Kern Sibbald wrote:
> Hello,
>
> I am the project manager of Bacula. One of the database backends that
> Bacula uses is PostgreSQL.
>
> This email is to notify you that a change you made to setting database
> character codes has created havoc with certain unfortunate Bacula users.
>
> Bacula sets the database encoding to SQL_ASCII, because although
> Bacula "supports" UTF-8 character encoding, it cannot enforce it. Certain
> operating systems such as Unix, Linux and MacOS can have filenames that
> are not in UTF-8 format. Since Bacula stores filenames in PostgreSQL
> tables, we use SQL_ASCII.
>
> We set SQL_ASCII by default when creating the database via the command
> recommended in recent versions of PostgreSQL (e.g. 8.1), with:
>
> CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>
> However, with PostgreSQL 8.4, the above command is ignored because the
> default table copied is not template0. This means that some Bacula users
> who have created PostgreSQL databases with version 8.4, typically find them
> created with SQL_UTF8 format, which results in serious errors when doing
> backups for certain machines.

When I tried the above, Postgres did not ignore the command; instead it gave me the following error and did not create the database:

CREATE DATABASE bacula ENCODING 'SQL_ASCII';
ERROR:  new encoding (SQL_ASCII) is incompatible with the encoding of the template database (UTF8)
HINT:  Use the same encoding as in the template database, or use template0 as template.

> Regards,
>
> Kern

-- 
Adrian Klaver
aklaver@comcast.net
On 2/12/2009 9:18 PM, Kern Sibbald wrote:
> Hello,
>
> I am the project manager of Bacula. One of the database backends that
> Bacula uses is PostgreSQL.

As a Bacula user (though I'm not on the Bacula lists), first - thanks for all your work. It's practically eliminated all human intervention from something that used to be a major pain. Configuring it to handle the different backup frequencies, retention periods and diff/inc/full needs of the different data sets was a nightmare, but once set up it's been bliss. The 3.x `Accurate' mode is particularly nice.

> Bacula sets the database encoding to SQL_ASCII, because although
> Bacula "supports" UTF-8 character encoding, it cannot enforce it. Certain
> operating systems such as Unix, Linux and MacOS can have filenames that
> are not in UTF-8 format. Since Bacula stores filenames in PostgreSQL
> tables, we use SQL_ASCII.

I noticed that while doing some work on the Bacula database a while ago.

I was puzzled at the time about why Bacula does not translate file names from the source system's encoding to utf-8 for storage in the database, so all file names are known to be sane and are in a known encoding.

Because Bacula does not store the encoding or seem to transcode the file name to a single known encoding, it does not seem to be possible to retrieve files by name if the bacula console is run on a machine with a different text encoding to the machine the files came from. After all, café in utf-8 is a different byte sequence to café in iso-8859-1, and won't match in equality tests under SQL_ASCII.

Additionally, I'm worried that restoring to a different machine with a different encoding may fail, and if it doesn't, will result in hopelessly mangled file names. This wouldn't be fun to deal with during disaster recovery. (I don't yet know if there are provisions within Bacula itself to deal with this, and need to do some testing.)
Anyway, it'd be nice if Bacula would convert file names to utf-8 at the file daemon, using the encoding of the client, for storage in a utf-8 database.

Mac OS X (HFS Plus) and Windows (NTFS) systems store file names as Unicode (UTF-16 IIRC). Unix systems increasingly use utf-8, but may use other encodings. If a unix system does use another encoding, this may be determined from the locale in the environment and used to convert file names to utf-8. Windows systems using FAT32 and Mac OS 9 machines on plain old HFS will have file names in the locale's encoding, like UNIX systems, and are fairly easily handled.

About the only issue I see is that systems may have file names that are not valid text strings in the current locale, usually due to buggy software butchering text encodings. I guess a *nix system _might_ have different users running with different locales and encodings, too. The latter case doesn't seem easy to handle cleanly, as file names on unix systems don't have any indication of what encoding they're in stored with them. I'm not really sure these cases actually show up in practice, though.

Personally, I'd like to see Bacula capable of using a utf-8 database, with proper encoding conversion at the fd for non-utf-8 encoded client systems. It'd really simplify managing backups for systems with a variety of different encodings.

( BTW, one way to handle incorrectly encoded filenames and paths might be to have a `bytea' field that's generally null to store such mangled file names. Personally though I'd favour just rejecting them. )

> We set SQL_ASCII by default when creating the database via the command
> recommended in recent versions of PostgreSQL (e.g. 8.1), with:
>
> CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>
> However, with PostgreSQL 8.4, the above command is ignored because the
> default table copied is not template0.
It's a pity that attempting to specify an encoding other than the safe one when using a non-template0 database doesn't cause the CREATE DATABASE command to fail with an error.

-- 
Craig Ringer
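The fd-side transcoding proposed above (convert each client filename from the client's locale encoding to UTF-8 before it reaches the database) can be sketched in a few lines. This is a minimal illustration, not Bacula code; the helper name `filename_to_utf8` and the latin-1 client encoding are assumptions for the example:

```python
def filename_to_utf8(raw: bytes, client_encoding: str) -> bytes:
    # Decode the raw filename bytes using the client's locale encoding,
    # then re-encode as UTF-8 for storage in a UTF-8 database.
    # Raises UnicodeDecodeError for names that are not valid text in
    # the client encoding (the "buggy software" case discussed above),
    # which is where a reject-or-mangle policy would kick in.
    return raw.decode(client_encoding).encode("utf-8")

# A latin-1 client's café becomes the UTF-8 byte sequence:
print(filename_to_utf8(b"caf\xe9", "latin-1"))  # b'caf\xc3\xa9'
```

A name that is not valid in the claimed client encoding (e.g. a bare 0xE9 byte on a client whose locale says UTF-8) raises an error instead of being silently stored, which is exactly the failure the fd would have to log or work around.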
Craig Ringer <craig@postnewspapers.com.au> writes:
> It's a pity that attempting to specify an encoding other than the safe
> one when using a non-template0 database doesn't cause the CREATE
> DATABASE command to fail with an error.

Huh?

regression=# create database foo lc_ctype = 'en_US.utf8' encoding = 'latin1';
ERROR:  encoding LATIN1 does not match locale en_US.utf8
DETAIL:  The chosen LC_CTYPE setting requires encoding UTF8.

As best I can tell, the OP is complaining exactly because it did fail, in contrast to pre-8.4 versions that let him create unsafe combinations.

			regards, tom lane
On 3/12/2009 11:03 AM, Tom Lane wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> It's a pity that attempting to specify an encoding other than the safe
>> one when using a non-template0 database doesn't cause the CREATE
>> DATABASE command to fail with an error.
>
> Huh?
>
> regression=# create database foo lc_ctype = 'en_US.utf8' encoding = 'latin1';
> ERROR:  encoding LATIN1 does not match locale en_US.utf8
> DETAIL:  The chosen LC_CTYPE setting requires encoding UTF8.
>
> As best I can tell, the OP is complaining exactly because it did fail,
> in contrast to pre-8.4 versions that let him create unsafe combinations.

That will teach me not to comment before checking it myself. I read the OP's post as saying that they were having issues with creation _succeeding_ but disregarding the encoding specification:

> CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>
> However, with PostgreSQL 8.4, the above command is ignored because the
> default table copied is not template0. This means that some Bacula users
> who have created PostgreSQL databases with version 8.4, typically find
> them created with SQL_UTF8 format, which results in serious errors when
> doing backups for certain machines.

... which would indeed be nasty. That's what I was commenting on. Failing to create an unsafe database, on the other hand, is just what I want to happen, and it's good to see that's what happens on my 8.4 systems.

I'm confused by the above, though. Kern: is the issue that the CREATE DATABASE is failing, or that it's succeeding incorrectly? If it's succeeding incorrectly, it'd really help to know more about the problem systems.
The creation script shipped in 3.0.2 was:

#
# KES: Note: the CREATE DATABASE, probably should be
#  CREATE DATABASE ${db_name} $ENCODING TEMPLATE template0
#
if $bindir/psql -f - -d template1 $* <<END-OF-DATA
CREATE DATABASE ${db_name} $ENCODING;
ALTER DATABASE ${db_name} SET datestyle TO 'ISO, YMD';
END-OF-DATA
then
   echo "Creation of ${db_name} database succeeded."
else
   echo "Creation of ${db_name} database failed."
fi
exit 0

... which doesn't look like it'd do anything unexpected to template1 if the CREATE DATABASE failed. You might want to set ON_ERROR_STOP in psql anyway, though.

-- 
Craig Ringer
On 3/12/2009 11:09 AM, Jerome Alet wrote:
> On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:
>> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
>> file daemon, using the encoding of the client, for storage in a utf-8
>> database.
>
> +1 for me.
>
> this is the way to go.
>
> I understand people with an existing backup history won't be very happy
> with this unless you provide them the appropriate tools or instructions
> to convert their database's content, though.

I just noticed, while reading src/cats/create_postgresql_database:

# use SQL_ASCII to be able to put any filename into
# the database even those created with unusual character sets
ENCODING="ENCODING 'SQL_ASCII'"

# use UTF8 if you are using standard Unix/Linux LANG specifications
# that use UTF8 -- this is normally the default and *should* be
# your standard.  Bacula works correctly *only* with correct UTF8.
#
# Note, with this encoding, if you have any "weird" filenames on
# your system (names generated from Win32 or Mac OS), you may
# get Bacula batch insert failures.
#
#ENCODING="ENCODING 'UTF8'"

... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your systems are all in a utf-8 locale. Assuming there's some way for the filed to find out the encoding of the director's database, it probably wouldn't be too tricky to convert non-matching file names to the director's encoding in the fd (when the director's encoding isn't SQL_ASCII, of course).

This also makes me wonder how filenames on Mac OS X and Windows are handled. I didn't see any use of the unicode-form APIs or any UTF-16 to UTF-8 conversion in an admittedly _very_ quick glance at the filed/ sources. How does bacula handle file names on those platforms? Read them with the non-unicode APIs and hope they fit into the current non-unicode encoding? Or am I missing something?

-- 
Craig Ringer
* Craig Ringer (craig@postnewspapers.com.au) wrote:
> ... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your
> systems are all in a utf-8 locale. Assuming there's some way for the
> filed to find out the encoding of the director's database, it probably
> wouldn't be too tricky to convert non-matching file names to the
> director's encoding in the fd (when the director's encoding isn't
> SQL_ASCII, of course).

I'm not sure which piece of bacula connects to PostgreSQL, but whatever it is, it could just send a 'set client_encoding' to the PG backend and all the conversion will be done by PG.

> This also makes me wonder how filenames on Mac OS X and Windows are
> handled. I didn't see any use of the unicode-form APIs or any UTF-16 to
> UTF-8 conversion in an admittedly _very_ quick glance at the filed/
> sources. How does bacula handle file names on those platforms? Read them
> with the non-unicode APIs and hope they fit into the current non-unicode
> encoding? Or am I missing something?

Good question.

	Stephen
On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:
> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
> file daemon, using the encoding of the client, for storage in a utf-8
> database.

+1 for me.

this is the way to go.

I understand people with an existing backup history won't be very happy with this unless you provide them the appropriate tools or instructions to convert their database's content, though.

bye

-- 
Jérôme Alet - <jerome.alet@univ-nc.nc> - Centre de Ressources Informatiques
Université de la Nouvelle-Calédonie - BPR4 - 98851 NOUMEA CEDEX
Tél : +687 266754  Fax : +687 254829
Hi!

On Thu, Dec 3, 2009 at 10:39 PM, Jerome Alet <jerome.alet@univ-nc.nc> wrote:
> On Thu, Dec 03, 2009 at 10:54:07AM +0800, Craig Ringer wrote:
>> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
>> file daemon, using the encoding of the client, for storage in a utf-8
>> database.
>
> +1 for me.

+1 here: we have, in fact, had problems when restoring to a server with a different code page than the original one.

> this is the way to go.
>
> I understand people with an existing backup history won't be very happy
> with this unless you provide them the appropriate tools or instructions
> to convert their database's content, though.
Stephen Frost wrote:
> * Craig Ringer (craig@postnewspapers.com.au) wrote:
>> ... so it's defaulting to SQL_ASCII, but actually supports utf-8 if your
>> systems are all in a utf-8 locale. Assuming there's some way for the
>> filed to find out the encoding of the director's database, it probably
>> wouldn't be too tricky to convert non-matching file names to the
>> director's encoding in the fd (when the director's encoding isn't
>> SQL_ASCII, of course).
>
> I'm not sure which piece of bacula connects to PostgreSQL, but whatever
> it is, it could just send a 'set client_encoding' to the PG backend and
> all the conversion will be done by PG.

The director is responsible for managing all the metadata, and it's the component that connects to Pg. If the fd sent the system charset along with the bundle of filenames etc. that it sends to the director, then I don't see why the director couldn't `SET client_encoding' appropriately before inserting data from that fd, then `RESET client_encoding' once the batch insert was done.

The only downside is that if even one file has invalidly encoded data, the whole batch insert fails and is rolled back. For that reason, I'd personally prefer that the fd handle conversion, so that it can exclude such files (with a loud complaint in the error log) or munge the file name into something that _can_ be stored.

Come to think of it, if the fd and database are both on a utf-8 encoding, the fd should *still* validate the utf-8 filenames it reads. There's no guarantee that just because the system thinks the filename should be utf-8, it's actually valid utf-8, and it'd be good to catch this at the fd rather than messing up the batch insert by the director, thus making it much safer than it presently is to use Bacula with a utf-8 database.

-- 
Craig Ringer
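The fd-side validation suggested above amounts to a single decode check: a filename is safe to hand to a UTF-8 database only if it decodes cleanly. A minimal sketch (the function name is illustrative, not Bacula's):

```python
def is_valid_utf8(raw: bytes) -> bool:
    # Anything that fails this check would otherwise abort the
    # director's whole batch insert, so catching it at the fd lets
    # the bad name be excluded or munged with just a log complaint.
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"caf\xc3\xa9"))  # True  (well-formed UTF-8)
print(is_valid_utf8(b"caf\xe9"))      # False (bare latin-1 byte)
```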
Hello,

Thanks for all the answers; I am a bit overwhelmed by the number, so I am going to try to answer everyone in one email.

The first thing to understand is that it is *impossible* to know what the encoding is on the client machine (FD -- or File daemon). On, say, a Unix/Linux system, the user could create filenames with non-UTF-8 then switch to UTF-8, or restore files that were tarred on Windows or on Mac, or simply copy a Mac directory. Finally, using system calls to create a file, you can put *any* character into a filename.

So, rather than trying to figure everything out (impossible, I think) and rather than failing to back up files, Bacula gets the "raw" filename from the OS and stores it on the Volume, then puts it in the database. We treat the filename as if it is UTF-8 for display purposes, but in all other cases, what we want is for the filename to go into the database and come back out unchanged.

On MySQL we use BLOBs. On PostgreSQL, we use TEXT and set the encoding to SQL_ASCII so that PostgreSQL will not attempt to do any translation. This works well, and I hope that PostgreSQL will continue to support letting Bacula insert text characters in the database with no character encoding checks in the future.

See more notes below ...

On Thursday 03 December 2009 03:54:07 Craig Ringer wrote:
> On 2/12/2009 9:18 PM, Kern Sibbald wrote:
>> Hello,
>>
>> I am the project manager of Bacula. One of the database backends that
>> Bacula uses is PostgreSQL.
>
> As a Bacula user (though I'm not on the Bacula lists), first - thanks
> for all your work. It's practically eliminated all human intervention
> from something that used to be a major pain. Configuring it to handle
> the different backup frequencies, retention periods and diff/inc/full
> needs of the different data sets was a nightmare, but once set up it's
> been bliss. The 3.x `Accurate' mode is particularly nice.
>> Bacula sets the database encoding to SQL_ASCII, because although
>> Bacula "supports" UTF-8 character encoding, it cannot enforce it.
>> Certain operating systems such as Unix, Linux and MacOS can have
>> filenames that are not in UTF-8 format. Since Bacula stores filenames
>> in PostgreSQL tables, we use SQL_ASCII.
>
> I noticed that while doing some work on the Bacula database a while ago.
>
> I was puzzled at the time about why Bacula does not translate file names
> from the source system's encoding to utf-8 for storage in the database,
> so all file names are known to be sane and are in a known encoding.

We don't and cannot know the encoding scheme on Unix/Linux systems (see above), so attempting to convert them to UTF-8 would just consume more CPU time and result in errors at some point.

> Because Bacula does not store the encoding or seem to transcode the file
> name to a single known encoding, it does not seem to be possible to
> retrieve files by name if the bacula console is run on a machine with a
> different text encoding to the machine the files came from. After all,
> café in utf-8 is a different byte sequence to café in iso-8859-1, and
> won't match in equality tests under SQL_ASCII.

If all the filenames go in in "binary" or "literal" form, then any tests will work fine. The only test Bacula does is "equality". Bacula doesn't worry about sorting. Users may care, but for backup and restore the only test Bacula needs is equality, and as long as you are working with unchanged byte streams, everything works on every system.

The one place where we do convert filenames is on Windows. We convert UCS to UTF-8.

> Additionally, I'm worried that restoring to a different machine with a
> different encoding may fail, and if it doesn't will result in hopelessly
> mangled file names. This wouldn't be fun to deal with during disaster
> recovery.
> (I don't yet know if there are provisions within Bacula itself
> to deal with this and need to do some testing).

Yes, if you restore on a different system with a different encoding, you will end up with the same binary string at the OS level, but when you see the filenames they may look different.

> Anyway, it'd be nice if Bacula would convert file names to utf-8 at the
> file daemon, using the encoding of the client, for storage in a utf-8
> database.

As I mentioned, this is not possible since Unix/Linux stores "binary" strings. They can be in any format.

> Mac OS X (HFS Plus) and Windows (NTFS) systems store file names as
> Unicode (UTF-16 IIRC). Unix systems increasingly use utf-8, but may use
> other encodings. If a unix system does use another encoding, this may be
> determined from the locale in the environment and used to convert file
> names to utf-8.

As I mentioned above, on Linux/Unix systems the user is free to change the encoding at will, and when he does so, existing filenames remain unchanged, so it is not possible to choose a particular encoding and back up and restore files without changing the filenames. Bacula backs them up and restores them using "binary" strings, so we don't have problems of changing encodings.

> Windows systems using FAT32 and Mac OS 9 machines on plain old HFS will
> have file names in the locale's encoding, like UNIX systems, and are
> fairly easily handled.
>
> About the only issue I see is that systems may have file names that are
> not valid text strings in the current locale, usually due to buggy
> software butchering text encodings. I guess a *nix system _might_ have
> different users running with different locales and encodings, too. The
> latter case doesn't seem easy to handle cleanly, as file names on unix
> systems don't have any indication of what encoding they're in stored
> with them. I'm not really sure these cases actually show up in practice,
> though.
> Personally, I'd like to see Bacula capable of using a utf-8 database,
> with proper encoding conversion at the fd for non-utf-8 encoded client
> systems. It'd really simplify managing backups for systems with a
> variety of different encodings.

Maybe this will happen someday, but first we will need Linux/Unix systems that *force* the encoding to be in some particular standard.

> ( BTW, one way to handle incorrectly encoded filenames and paths might
> be to have a `bytea' field that's generally null to store such mangled
> file names. Personally though I'd favour just rejecting them. )
>
>> We set SQL_ASCII by default when creating the database via the command
>> recommended in recent versions of PostgreSQL (e.g. 8.1), with:
>>
>> CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>>
>> However, with PostgreSQL 8.4, the above command is ignored because the
>> default table copied is not template0.
>
> It's a pity that attempting to specify an encoding other than the safe
> one when using a non-template0 database doesn't cause the CREATE
> DATABASE command to fail with an error.

I didn't actually run it myself, so it is possible that it produced an error message, but it did apparently create the database, with UTF-8 encoding. Most of these things are done in script files, so certain non-fatal errors may be overlooked.

As far as I can tell, it took the above encoding command, perhaps printed an error message, but went ahead and created the database with an encoding that was not correct. If that is indeed the case, then it is, in my opinion, a bad design policy. I would much prefer that Postgres either accept the command or not create the database. That way, either the database would work as the user expects or there would be no database, and the problem would be resolved before it creates databases that cannot be read.

In any case, we have corrected the command to include the TEMPLATE, but this won't help people with older Baculas.
The other point I wanted to emphasize is that the documentation implied that future versions of Postgres may eliminate the feature of having SQL_ASCII (i.e. the ability to input arbitrary binary strings). As I said, that would be a pity -- I suppose we could switch to using LOs or whatever they are called in Postgres, but that would be rather inconvenient.

Thanks for all the responses,

Best regards,

Kern
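Kern's requirement -- a raw filename goes into the database and comes back out byte-for-byte unchanged -- is the same problem Python 3 later solved for its own OS interfaces with the `surrogateescape` error handler (PEP 383): undecodable bytes become lone surrogates on decode and are restored exactly on encode. A minimal sketch of the round trip, with illustrative sample bytes:

```python
# An arbitrary non-UTF-8 filename as readdir() might return it:
raw = b"caf\xe9-\xff-report.txt"

# Decode with surrogateescape: invalid bytes map to U+DCxx surrogates
# instead of raising, so no filename is ever rejected.
name = raw.decode("utf-8", "surrogateescape")

# Encoding the same way restores the original bytes exactly --
# the "unchanged byte stream" property Kern relies on.
assert name.encode("utf-8", "surrogateescape") == raw
```

This preserves SQL_ASCII's "store anything" behaviour while keeping a displayable string form, though (like SQL_ASCII) it does nothing to make names from different client encodings compare equal.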
2009/12/3 Kern Sibbald <kern@sibbald.com>:
> Hello,
>
> Thanks for all the answers; I am a bit overwhelmed by the number, so I am
> going to try to answer everyone in one email.
>
> The first thing to understand is that it is *impossible* to know what the
> encoding is on the client machine (FD -- or File daemon). On say a
> Unix/Linux system, the user could create filenames with non-UTF-8 then
> switch to UTF-8, or restore files that were tarred on Windows or on Mac,
> or simply copy a Mac directory. Finally, using system calls to create a
> file, you can put *any* character into a filename.
>
> So, rather than trying to figure everything out (impossible, I think) and
> rather than failing to backup files, Bacula gets the "raw" filename from
> the OS and stores it on the Volume then puts it in the database. We treat
> the filename as if it is UTF-8 for display purposes, but in all other
> cases, what we want is for the filename to go into the database and come
> back out unchanged.
>
> On MySQL we use BLOBs. On PostgreSQL, we use TEXT and set the encoding to
> SQL_ASCII so that PostgreSQL will not attempt to do any translation. This
> works well, and I hope that PostgreSQL will continue to support letting
> Bacula insert text characters in the database with no character encoding
> checks in the future.

Hello

just use the bytea datatype instead of text.

http://www.postgresql.org/docs/8.4/interactive/datatype-binary.html

it is exactly what you want.

Regards
Pavel Stehule
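bytea carries arbitrary bytes regardless of the database encoding, which is exactly the property being asked for. In practice a driver would bind the raw bytes as a parameter, but the literal forms are easy to show; a minimal sketch (the helper name is illustrative, and note the hex input format shown here was only added in PostgreSQL 9.0 -- after the 8.4 in this thread -- while 8.x uses the octal escape format):

```python
def to_bytea_hex(raw: bytes) -> str:
    # PostgreSQL 9.0+ hex input format for bytea: '\x' followed by
    # two hex digits per byte; every byte value round-trips.
    return "\\x" + raw.hex()

# A latin-1 café filename, as it would appear inside a quoted literal:
print(to_bytea_hex(b"caf\xe9"))  # \x636166e9
```

With a real driver (psycopg2 and the like) you would pass the `bytes` object as a query parameter instead of building literals by hand, and the driver handles the escaping.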
Kern Sibbald wrote:
> Hello,
>
> Thanks for all the answers; I am a bit overwhelmed by the number, so I am
> going to try to answer everyone in one email.
>
> The first thing to understand is that it is *impossible* to know what the
> encoding is on the client machine (FD -- or File daemon). On say a
> Unix/Linux system, the user could create filenames with non-UTF-8 then
> switch to UTF-8, or restore files that were tarred on Windows or on Mac,
> or simply copy a Mac directory. Finally, using system calls to create a
> file, you can put *any* character into a filename.

While true in theory, in practice it's pretty unusual to have filenames encoded with an encoding other than the system LC_CTYPE on a modern UNIX/Linux/BSD machine.

I'd _very_ much prefer to have Bacula back my machines up by respecting LC_CTYPE and applying appropriate conversions at the fd if LC_CTYPE on the fd's host is not utf-8 and the database is. If the database was SQL_ASCII, it could retain its existing behaviour. That way, people get to pick between the two rational behaviours:

(1) Store file names as raw byte strings (SQL_ASCII). Guaranteed to work even on garbage file names that aren't valid in the current system encoding, but has issues with searches, filename matches, restoring to another system, etc.

(2) Store file names as UTF-8, performing any required translation from the system charset at the file daemon. File names that are nonsense in the system encoding are either (a) rejected with an error in the fd logs, or (b) backed up with some form of name mangling.

I *strongly* suspect most people will pick the second option.

There's also a third possibility:

(3) As (2), but add a `bytea' column to the `path' and `filename' tables that's null if the fd was able to convert the filename from the system LC_CTYPE to utf-8.
In the rare cases it couldn't (due to reasons like users running with different LC_CTYPE, nfs volumes exported to systems with different LC_CTYPE, tarballs from systems with different charsets, etc) the raw unconverted bytes of the filename get stored in the bytea field, and a mangled form of the name gets stored in the text field for user display purposes only.

I don't know if that'd be worth the hassle, though. I'd just want to use (2), and I suspect so would a majority of users.

> On MySQL we use BLOBs. On PostgreSQL, we use TEXT and set the encoding to
> SQL_ASCII so that PostgreSQL will not attempt to do any translation. This
> works well, and I hope that PostgreSQL will continue to support letting
> Bacula insert text characters in the database with no character encoding
> checks in the future.

Even if that were removed (which I can't see happening), you could use the bytea type that's designed for exactly that purpose. Pity it's a bit of a pain to work with :-(

>> Because Bacula does not store the encoding or seem to transcode the file
>> name to a single known encoding, it does not seem to be possible to
>> retrieve files by name if the bacula console is run on a machine with a
>> different text encoding to the machine the files came from. After all,
>> café in utf-8 is a different byte sequence to café in iso-8859-1, and
>> won't match in equality tests under SQL_ASCII.
>
> If all the filenames go in in "binary" or "literal" form, then any tests
> will work fine. The only test Bacula does is "equality".

Byte strings for the same sequence of characters in different encodings are not equal. If you're looking for a file called "café.txt" that was created on a machine with a latin-1 encoding, you cannot find it by searching for 'café' in bconsole, because bconsole will search for the utf-8 byte sequence for 'café', not the latin-1 byte sequence for 'café'.
$ python
>>> x = u"café"
>>> x.encode("utf-8")
'caf\xc3\xa9'
>>> x.encode("latin-1")
'caf\xe9'
>>> x.encode("utf-8") == x.encode("latin-1")
False

or in Pg:

craig=> SHOW client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

craig=> CREATE TABLE test (x text);
CREATE TABLE
craig=> INSERT INTO test(x) VALUES ('café');
INSERT 0 1
craig=> SELECT x, x::bytea FROM test;
  x   |      x
------+-------------
 café | caf\303\251
(1 row)

craig=> SELECT convert_to(x, 'latin-1') from test;
 convert_to
------------
 caf\351
(1 row)

craig=> SELECT convert_to(x, 'utf-8') = x::bytea,
craig->        convert_to(x, 'utf-8') = convert_to(x, 'latin-1') FROM test;
 ?column? | ?column?
----------+----------
 t        | f
(1 row)

> The one place where we do convert filenames is on Windows. We convert UCS to UTF-8.

Cool. I thought that must be the case, but it's good to know.

What about Mac OS X? It stores file names in UTF-16 normalized form, but has a variety of ways to access those files, including POSIX interfaces.

Hmm. A quick test suggests that, irrespective of LC_CTYPE, LANG and LC_ALL, Mac OS X converts file names to UTF-8 for use with POSIX APIs. A bit of digging helps confirm that:

http://lists.apple.com/archives/applescript-users/2002/Sep/msg00319.html

... so Mac OS X is no problem. It's just traditional UNIXes which refuse to standardize on a unicode encoding at the file system level that're a problem.

(It drives me NUTS that Linux file systems don't do everything in unicode, with the VFS layer or even glibc converting between unicode and LC_CTYPE for programs that want another encoding. Even Microsoft figured out the need for this one in 1998!)

> Yes, if you restore on a different system with a different encoding, you will end up with the same binary string at the OS level, but when you see the filenames they may look different.

... which to most users equates to "argh, Bacula corrupted my backup!" or "argh, all my filenames are garbage!".
Much as they'll say "I changed my system language and now my filenames are all garbage" if they change LC_CTYPE...

>> Personally, I'd like to see Bacula capable of using a utf-8 database, with proper encoding conversion at the fd for non-utf-8 encoded client systems. It'd really simplify managing backups for systems with a variety of different encodings.
>
> Maybe this will happen someday, but first we will need Linux/Unix systems that *force* the encoding to be in some particular standard.

That would make me a happy, happy fellow. I don't really see why it can't be done in glibc now, in truth. Convert everything that goes through a system call to/from utf-8.

> I didn't actually run it myself, so it is possible that it produced an error message, but it did apparently create the database but with UTF-8 encoding. Most of these things are done in script files, so certain non-fatal errors may be overlooked.
>
> As far as I can tell, it took the above encoding command, and perhaps printed an error message but went ahead and created the database with an encoding that was not correct.

If that's the case, it'd be REALLY good to find out more about the systems it happened on. Affected Bacula users, please speak up now.

Tom Lane pointed out and demonstrated that it should report an error and fail to create the DB in these cases, so if it's not, that's something that needs investigation.

Are you sure it wasn't users (or "helpful" distro packagers) altering create_postgresql_database to issue a CREATE DATABASE statement with utf-8 ... perhaps after getting errors they didn't understand when using SQL_ASCII?

-- Craig Ringer
Craig Ringer wrote:
> Kern Sibbald wrote:
>> Hello,
>>
>> Thanks for all the answers; I am a bit overwhelmed by the number, so I am going to try to answer everyone in one email.
>>
>> The first thing to understand is that it is *impossible* to know what the encoding is on the client machine (FD -- or File daemon). On say a Unix/Linux system, the user could create filenames with non-UTF-8 then switch to UTF-8, or restore files that were tarred on Windows or on Mac, or simply copy a Mac directory. Finally, using system calls to create a file, you can put *any* character into a filename.
>
> While true in theory, in practice it's pretty unusual to have filenames encoded with an encoding other than the system LC_CTYPE on a modern UNIX/Linux/BSD machine.

In my case garbage filenames are all too common. It's the sad *reality* when you're mixing languages (Hebrew and English in my case) and operating systems. Garbage filenames are everywhere: directories and files shared between different operating systems and file systems, mail attachments, mp3 file names based on garbage id3 tags, files in zip archives (which seem to not handle filename encoding at all), etc.

When I first tried Bacula (version 1.38), I expected to have trouble with filenames, since this is what I'm used to. I was rather pleased to find out that it could both backup and restore files, regardless of origin and destination filename encoding.

I like Bacula because, among other things, it can take the punishment and chug along, without me even noticing that there was supposed to be a problem (a recent example: backup/restore files with a negative mtime ...)

My 2c
Avi
Craig Ringer wrote:
> While true in theory, in practice it's pretty unusual to have filenames encoded with an encoding other than the system LC_CTYPE on a modern UNIX/Linux/BSD machine.

It depends. In western Europe, where iso-8859-1[5] and utf8 are evenly used, it's not unusual at all. You just have to extract an archive created by someone who uses a different encoding than you. Since tar files don't carry any information about the encoding of the filenames they contain, they come out as they are, whatever LC_CTYPE is. The same problem exists for zip files.

Best regards,
-- Daniel
PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
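[Editor's note: the "come out as they are" behaviour Daniel describes is the same byte-preserving approach Python 3 models with its surrogateescape error handler, which is worth a quick illustration (unrelated to Bacula's actual code):]

```python
# A latin-1 file name as it might come out of a tar archive, seen by a
# tool on a UTF-8 system. surrogateescape maps the invalid byte 0xE9 to
# the lone surrogate U+DCE9 instead of raising an error, so the original
# bytes can be recovered exactly on re-encode.
raw = b"caf\xe9.txt"                                  # latin-1 for "café.txt"

name = raw.decode("utf-8", errors="surrogateescape")  # 'caf\udce9.txt'
assert name.encode("utf-8", errors="surrogateescape") == raw  # lossless round-trip
```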
> Craig Ringer wrote:
>> Kern Sibbald wrote:
>>> The first thing to understand is that it is *impossible* to know what the encoding is on the client machine (FD -- or File daemon). On say a Unix/Linux system, the user could create filenames with non-UTF-8 then switch to UTF-8, or restore files that were tarred on Windows or on Mac, or simply copy a Mac directory. Finally, using system calls to create a file, you can put *any* character into a filename.
>>
>> While true in theory, in practice it's pretty unusual to have filenames encoded with an encoding other than the system LC_CTYPE on a modern UNIX/Linux/BSD machine.
>
> In my case garbage filenames are all too common. It's the sad *reality* when you're mixing languages (Hebrew and English in my case) and operating systems. Garbage filenames are everywhere: directories and files shared between different operating systems and file systems, mail attachments, mp3 file names based on garbage id3 tags, files in zip archives (which seem to not handle filename encoding at all), etc.

Yes, that is my experience too. I understand Craig's comments, but I would much prefer that Bacula just backup and restore and leave the checking of filename consistencies to other programs. At least for the moment, that seems to work quite well. Obviously if users mix character sets, sometimes display of filenames in Bacula will be weird, but nevertheless Bacula will backup and restore them so that what was on the system before the backup is what is restored.

> When I first tried Bacula (version 1.38), I expected to have trouble with filenames, since this is what I'm used to. I was rather pleased to find out that it could both backup and restore files, regardless of origin and destination filename encoding.
> I like Bacula because, among other things, it can take the punishment and chug along, without me even noticing that there was supposed to be a problem (a recent example: backup/restore files with a negative mtime ...)

Thanks. Thanks also for using Bacula :-)

Best regards,
Kern
On Thu, Dec 03, 2009 at 08:33:38AM +0100, Kern Sibbald wrote:
> Bacula gets the "raw" filename from the OS and stores it on the Volume then puts it in the database. We treat the filename as if it is UTF-8 for display purposes, but in all other cases, what we want is for the filename to go into the database and come back out unchanged.

How about also storing the encoding of the path/filename as well? This would allow the restore to do the right thing for display purposes and also when going to a system that uses a different encoding. Obviously you wouldn't know this for Unix derivatives, but for most other systems this would seem to help.

> On MySQL we use BLOBs. On PostgreSQL, we use TEXT and set the encoding to SQL_ASCII so that PostgreSQL will not attempt to do any translation. This works well, and I hope that PostgreSQL will continue to support letting Bacula insert text characters in the database with no character encoding checks in the future.

As others have said, BYTEA is probably the best datatype for you to use. The encoding of BYTEA literals is a bit of a fiddle and may need some changes, but it's going to be much more faithful to your needs of treating the filename as an opaque blob of data.

-- Sam  http://samason.me.uk/
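[Editor's note: the "fiddle" Sam mentions is the traditional bytea escape format, in which non-printable bytes are written as octal escapes. A simplified sketch follows; it is illustrative only (the server's real escaping rules have more cases), and real client code should use a driver's parameter binding instead of hand-escaping.]

```python
# Simplified sketch of PostgreSQL's traditional bytea escape output:
# printable ASCII passes through, backslash is doubled, and every other
# byte becomes a three-digit octal escape such as \351.

def bytea_escape(raw: bytes) -> str:
    out = []
    for b in raw:
        if b == 0x5C:                     # backslash
            out.append("\\\\")
        elif 0x20 <= b <= 0x7E:           # printable ASCII
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)      # octal escape
    return "".join(out)

print(bytea_escape("café".encode("latin-1")))  # caf\351    (matches psql above)
print(bytea_escape("café".encode("utf-8")))    # caf\303\251
```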
Kern Sibbald wrote:
> Hello,
>
> Thanks for all the answers; I am a bit overwhelmed by the number, so I am going to try to answer everyone in one email.

We aim to please, and overwhelm. :-)

-- Bruce Momjian  <bruce@momjian.us>  http://momjian.us
   EnterpriseDB   http://enterprisedb.com
   + If your life is a hard drive, Christ can be your backup. +
On Thu, Dec 3, 2009 at 8:33 AM, Craig Ringer <craig@postnewspapers.com.au> wrote:
> While true in theory, in practice it's pretty unusual to have filenames encoded with an encoding other than the system LC_CTYPE on a modern UNIX/Linux/BSD machine.
>
> I'd _very_ much prefer to have Bacula back my machines up by respecting LC_CTYPE and applying appropriate conversions at the fd if LC_CTYPE on the fd's host is not utf-8 and the database is.

a) It doesn't really matter how uncommon it is; backup software is like databases, it's supposed to always work, not just usually work.

b) LC_CTYPE is an environment variable; it can be different for different users.

c) Backup software that tries to fix up the data it's backing up to what it thinks it should look like is bogus. If I can't trust my backup software to restore exactly the same data with exactly the same filenames then it's useless. The last thing I want to be doing when recovering from a disaster is trying to debug some difference of opinion between some third party commercial software and postgres/bacula about unicode encodings.

> (3) As (2), but add a `bytea' column to `path' and `filename' tables that's null if the fd was able to convert the filename from the system LC_CTYPE to utf-8. In the rare cases it couldn't (due to reasons like users running with different LC_CTYPE, nfs volumes exported to systems with different LC_CTYPE, tarballs from systems with different charsets, etc) the raw unconverted bytes of the filename get stored in the bytea field, and a mangled form of the name gets stored in the text field for user display purposes only.

That's an interesting thought. I think it's not quite right -- you want to always store the raw filename in the bytea and then also store a text field with the visual representation. That way you can also deal with broken encodings in some application specific way too, perhaps by trying to guess a reasonable encoding.
An alternative would be to just store them in byteas and then handle sorting and displaying by calling the conversion procedure on the fly. -- greg
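[Editor's note: Greg's "convert on the fly" alternative could look something like the sketch below; `display_name` and the candidate-encoding list are made up for illustration.]

```python
# Keep only the raw bytes authoritative; derive a display string on the
# fly, optionally guessing among a few likely charsets. Note latin-1
# never fails to decode, so with it in the list the final fallback is a
# belt-and-braces last resort.

def display_name(raw: bytes, candidates=("utf-8", "latin-1")) -> str:
    for enc in candidates:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")   # last-resort mangling

print(display_name(b"caf\xc3\xa9"))  # café (valid utf-8)
print(display_name(b"caf\xe9"))      # café (falls back to latin-1)
```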
On Wednesday 02 December 2009 11:33:38 pm Kern Sibbald wrote:
>> ( BTW, one way to handle incorrectly encoded filenames and paths might be to have a `bytea' field that's generally null to store such mangled file names. Personally though I'd favour just rejecting them. )
>>
>>> We set SQL_ASCII by default when creating the database via the command recommended in recent versions of PostgreSQL (e.g. 8.1), with:
>>>
>>>   CREATE DATABASE bacula ENCODING 'SQL_ASCII';
>>>
>>> However, with PostgreSQL 8.4, the above command is ignored because the default template copied is not template0.
>>
>> It's a pity that attempting to specify an encoding other than the safe one when using a non-template0 database doesn't cause the CREATE DATABASE command to fail with an error.
>
> I didn't actually run it myself, so it is possible that it produced an error message, but it did apparently create the database but with UTF-8 encoding. Most of these things are done in script files, so certain non-fatal errors may be overlooked.
>
> As far as I can tell, it took the above encoding command, and perhaps printed an error message but went ahead and created the database with an encoding that was not correct. If that is indeed the case, then it is, in my opinion, a bad design policy. I would much prefer that either Postgres accept the command or that it not create the database. This way, either the database would work as the user expects or there would be no database, and the problem would be resolved before it creates databases that cannot be read.

It does not CREATE the database. If the users are seeing that happen, then as others have suggested it is a bug. The other option is that they are un-commenting the #ENCODING="ENCODING 'UTF8'" line in the create_postgresql_database.in script to get it to run.
The interesting part in that script is the Note:

  # KES: Note: the CREATE DATABASE, probably should be
  # CREATE DATABASE ${db_name} $ENCODING TEMPLATE template0

According to the git repository this showed up in July of this year:

http://bacula.git.sourceforge.net/git/gitweb.cgi?p=bacula/bacula;a=blob;f=bacula/src/cats/create_postgresql_database.in;hb=6e024d0fe47ea0d9e6d3fbec52c4165caa44967f

> In any case we have corrected the command to include the TEMPLATE, but this won't help people with older Bacula's.

Could they not just get the corrected version of create_postgresql_database.in? It would run on the old versions as well.

> The other point I wanted to emphasize is that the documentation implied that future versions of Postgres may eliminate the feature of having SQL_ASCII (i.e. the ability to input arbitrary binary strings). As I said, that would be a pity -- I suppose we could switch to using LOs or whatever they are called in Postgres, but that would be rather inconvenient.

Per Tom's previous post:

"You misread it. We are not talking about eliminating SQL_ASCII --- as you say, that's useful. What is deprecated is trying to use SQL_ASCII with a non-C locale, which is dangerous, and always has been. If you've been putting non-UTF8 data into a database that could be running under a UTF8-dependent locale, I'm surprised you haven't noticed problems already."

> Thanks for all the responses,
>
> Best regards,
>
> Kern

-- Adrian Klaver
aklaver@comcast.net
Sam Mason <sam@samason.me.uk> writes:
> As others have said; BYTEA is probably the best datatype for you to use. The encoding of BYTEA literals is a bit of a fiddle and may need some changes, but it's going to be much more faithful to your needs of treating the filename as an opaque blob of data.

bytea might be theoretically the best choice, but the fact remains that 99% of the entries will be text that's readable in the user's encoding (whatever that is). bytea will just be a serious PITA because of its escaping issues. Also, the fact that 8.5 may change to hex display by default will make bytea even more of a PITA for mostly-text data. So I think Bacula's choice to use sql_ascii with text columns is entirely defensible.

What concerns me is the claim that PG made a database with some arbitrary parameters after having rejected a now-considered-invalid command. I frankly do not believe that, but if it did happen it's a *serious* bug that requires investigation.

    regards, tom lane
Frank Sweetser wrote:
> Unless, of course, you're at a good sized school with lots of international students, and have fileservers holding filenames created on desktops running in Chinese, Turkish, Russian, and other locales.

What I struggle with here is why they're not using ru_RU.UTF-8, cn_CN.UTF-8, etc as their locales. Why mix charsets?

I don't think that these people should be forced to use a utf-8 database and encoding conversion if they want to do things like mix-and-match charsets for file name chaos on their machines, though. I'd just like to be able to back up systems that _do_ have consistent charsets in ways that permit me to later reliably search for files by name, restore to any host, etc.

Perhaps I'm strange in thinking that all this mix-and-match encodings stuff is bizarre and backward. The Mac OS X and Windows folks seem to agree, though. Let the file system store unicode data, and translate at the file system or libc layer for applications that insist on using other encodings.

I do take Greg Stark's point (a) though. As *nix systems stand, solutions will only ever be mostly-works, not always-works, which I agree isn't good enough. Since there's no sane agreement about encodings on *nix systems and everything is just byte strings that different apps can interpret in different ways under different environmental conditions, we may as well throw up our hands in disgust and give up trying to do anything sensible. The alternative is saying that files the file system considers legal can't be backed up because of file naming, which I do agree isn't ok. The system shouldn't permit those files to exist, either, but I suspect we'll still have borked encoding-agnostic wackiness as long as we have *nix systems at all since nobody will ever agree on anything for long enough to change it.

Sigh. I think this is about the only time I've ever wished I was using Windows (or Mac OS X).

Also: Greg, your point (c) goes two ways.
If I can't trust my backup software to restore my filenames from one host exactly correctly to another host that may have configuration differences not reflected in the backup metadata, a different OS revision, etc, then what good is it for disaster recovery? How do I even know what those byte strings *mean*? Bacula doesn't even record the default system encoding with backup jobs, so there's no way for even the end user to try to fix up the file names for a different encoding. You're faced with some byte strings in wtf-is-this-anyway encoding and guesswork.

Even recording lc_ctype in the backup job metadata and offering the _option_ to convert encoding on restore would be a big step (though it wouldn't fix the breakage with searches by filename not matching due to encoding mismatches).

Personally, I'm just going to stick to a utf-8 only policy for all my hosts, working around the limitation that way. It's worked ok thus far, though I don't much like the way that different normalizations of unicode won't match equal under SQL_ASCII, so I can't reliably search for file names.

-- Craig Ringer
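[Editor's note: the normalization problem Craig mentions is easy to demonstrate: the composed (NFC) and decomposed (NFD) forms of the same visible name are different code point sequences, so byte-equality searches under SQL_ASCII cannot match one against the other.]

```python
import unicodedata

# Two spellings of the same visible name "café":
name_nfc = unicodedata.normalize("NFC", "caf\u00e9")   # é as one code point
name_nfd = unicodedata.normalize("NFD", "caf\u00e9")   # e + combining acute

print(name_nfc.encode("utf-8"))   # b'caf\xc3\xa9'
print(name_nfd.encode("utf-8"))   # b'cafe\xcc\x81'
print(name_nfc == name_nfd)       # False: a byte-wise equality test fails
```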
On Thu, 3 Dec 2009 12:22:50 +0100 (CET) "Kern Sibbald" <kern@sibbald.com> wrote:
> Yes, that is my experience too. I understand Craig's comments, but I would much prefer that Bacula just backup and restore and leave the checking of filename consistencies to other programs. At least for the moment, that seems to work quite well. Obviously if users mix character sets, sometimes display of filenames in Bacula will be weird, but nevertheless Bacula will backup and restore them so that what was on the system before the backup is what is restored.

I expect backup software to have predictable, reversible behaviour and to warn me if I'm shooting myself in the foot. It should be the responsibility of the admin to restore files in a proper place, knowing that locales may be a problem. I think Bacula is taking the right approach.

Still, I'd surely appreciate as a feature a "tool" that will help me restore files on a system with a different locale than the original one, or warn me if the locale is different or it can't be sure it is the same. That's exactly what PostgreSQL is doing: at least warning you. Even PostgreSQL is taking the right approach. An additional "guessed original locale" field and a tool/option to convert/restore with a selected locale could be an interesting feature.

What is Bacula going to do with xattr on different systems?

PostgreSQL seems to offer a good choice of tools to convert between encodings and deal with bytea. Formally I'd prefer bytea, but in real use it may just be an additional pain, and other DBs may not offer the same tools for encoding/bytea conversions.

Is it possible to search for a file in a backup set? What is going to happen if I'm searching from a system that has a different locale from the one the backup was made on? Can I use regexps? Can accents be ignored during searches?

-- Ivan Sergio Borgonovo
http://www.webthatworks.it
On Thu, Dec 03, 2009 at 10:46:54AM -0500, Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
>> As others have said; BYTEA is probably the best datatype for you to use. The encoding of BYTEA literals is a bit of a fiddle and may need some changes, but it's going to be much more faithful to your needs of treating the filename as an opaque blob of data.
>
> bytea might be theoretically the best choice, but the fact remains that 99% of the entries will be text that's readable in the user's encoding (whatever that is).

I agree it'll be fine most of the time, and the more important thing is normally the data rather than the filename. Still, for non-english speaking people I'd guess there are many more encodings floating around than I'd ever expect to see on a daily basis. Us English/US speakers really do have a very easy life.

There's also the issue that the user's encoding doesn't necessarily match the system's encoding. Thus within an account everything may be easy, but when a system daemon comes in and looks at things it's going to be somewhat messy.

No hard numbers either way; I just know I see a very biased sample of systems and would not like to make generalizations.

> What concerns me is the claim that PG made a database with some arbitrary parameters after having rejected a now-considered-invalid command. I frankly do not believe that, but if it did happen it's a *serious* bug that requires investigation.

Yup, be interesting to hear more details from the OP about this.

-- Sam  http://samason.me.uk/
On 12/03/2009 10:54 AM, Craig Ringer wrote:
> Frank Sweetser wrote:
>> Unless, of course, you're at a good sized school with lots of international students, and have fileservers holding filenames created on desktops running in Chinese, Turkish, Russian, and other locales.
>
> What I struggle with here is why they're not using ru_RU.UTF-8, cn_CN.UTF-8, etc as their locales. Why mix charsets?

The problem isn't so much what they're using on their unmanaged desktops. The problem is that the server, which is the one getting backed up, holds an aggregation of files created by an unknown collection of applications running on a mish-mash of operating systems (every large edu has its horror story of the 15+ year old, unpatched, mission critical machine that no one dares touch) with wildly varying charset configurations, no doubt including horribly broken and pre-UTF ones.

The end result is a fileset full of filenames created on a hacked Chinese copy of XP, a Russian copy of winME, romanian RedHat 4.0, and Mac OS 8. This kind of junk is, sadly, not uncommon in academic environments, where IT is often required to support stuff that they don't get to manage.

-- Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer    | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC
Hi Avi
Please have a look at this link; it shows how to install Bacula with a MySQL database with Hebrew support.
Eitan
On Thu, Dec 3, 2009 at 12:35 PM, Avi Rozen <avi.rozen@gmail.com> wrote:
In my case garbage filenames are all too common. It's a the sad
*reality*, when you're mixing languages (Hebrew and English in my case)
and operating systems. Garbage filenames are everywhere: directories and
files shared between different operating systems and file systems, mail
attachments, mp3 file names based on garbage id3 tags, files in zip
archives (which seem to not handle filename encoding at all), etc.
When I first tried Bacula (version 1.38), I expected to have trouble
with filenames, since this is what I'm used to. I was rather pleased to
find out that it could both backup and restore files, regardless of
origin and destination filename encoding.
I like Bacula because, among other things, it can take the punishment
and chug along, without me even noticing that there was supposed to be a
problem (a recent example: backup/restore files with a negative mtime ...)
My 2c
Avi
On 12/3/2009 3:33 AM, Craig Ringer wrote:
> Kern Sibbald wrote:
>> Hello,
>>
>> Thanks for all the answers; I am a bit overwhelmed by the number, so I am going to try to answer everyone in one email.
>>
>> The first thing to understand is that it is *impossible* to know what the encoding is on the client machine (FD -- or File daemon).

Or, even worse, which encoding the user or application was thinking of when it wrote a particular file out. There's no guarantee that any two files on a system were intended to be looked at with the same encoding.

>> On say a Unix/Linux system, the user could create filenames with non-UTF-8 then switch to UTF-8, or restore files that were tarred on Windows or on Mac, or simply copy a Mac directory. Finally, using system calls to create a file, you can put *any* character into a filename.
>
> While true in theory, in practice it's pretty unusual to have filenames encoded with an encoding other than the system LC_CTYPE on a modern UNIX/Linux/BSD machine.

Unless, of course, you're at a good sized school with lots of international students, and have fileservers holding filenames created on desktops running in Chinese, Turkish, Russian, and other locales.

In the end, a filename is (under linux, at least) just a string of arbitrary bytes containing anything except / and NULL. If bacula tries to get too clever, and munges or misinterprets those byte strings - or, worse yet, if the database does it behind your back - then stuff _will_ end up breaking.

(A few years back, someone heavily involved in linux kernel filesystem work was talking about this exact issue, and made the remark that many doing internationalization work secretly feel it would be easier to just teach everyone english. Impossible as this may be, I have since come to understand what they were talking about...)

-- Frank Sweetser fs at wpi.edu | For every problem, there is a solution that
WPI Senior Network Engineer    | is simple, elegant, and wrong. - HL Mencken
GPG fingerprint = 6174 1257 129E 0D21 D8D4 E8A3 8E39 29E3 E2E8 8CEC
Craig Ringer wrote:
> Frank Sweetser wrote:
>> Unless, of course, you're at a good sized school with lots of international students, and have fileservers holding filenames created on desktops running in Chinese, Turkish, Russian, and other locales.
>
> What I struggle with here is why they're not using ru_RU.UTF-8, cn_CN.UTF-8, etc as their locales. Why mix charsets?

On my own desktop computer, I switched from Latin1 to UTF8 some two years ago, and I still have a mixture of file name encodings.

-- Alvaro Herrera  http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Kern Sibbald <kern@sibbald.com> writes:
> Concerning the future: I am still a bit concerned about the mention in the document of possible future changes concerning SQL_ASCII and LC_CTYPE not C or POSIX ("Using this combination of settings is deprecated and may someday be forbidden altogether."), so if you have any suggestions for CREATE DATABASE that we can use now that will ensure that for future versions of Bacula + PostgreSQL we will end up with our bacula database created with SQL_ASCII, I would much appreciate it.

What you need to do is add

    LC_COLLATE = 'C' LC_CTYPE = 'C'

to the command. This is already important in 8.4 to ensure the command won't fail if the installation default locale is not C. The difficulty of course is that this is only for 8.4 and up; older versions won't take it. I don't know the context well enough to suggest how you might want to deal with version-sensitive commands (probably you have some plan for that already).

    regards, tom lane
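[Editor's note: combining Tom's locale options with the TEMPLATE fix discussed earlier in the thread, the 8.4-and-later form of the command would presumably be the following; older servers reject the LC_* options, so it needs to be version-gated.]

```sql
-- PostgreSQL 8.4+ only: pin both locale settings to C and copy template0
CREATE DATABASE bacula
    ENCODING 'SQL_ASCII'
    LC_COLLATE 'C'
    LC_CTYPE 'C'
    TEMPLATE template0;
```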
On Thursday 03 December 2009 16:42:58 Adrian Klaver wrote:
> On Wednesday 02 December 2009 11:33:38 pm Kern Sibbald wrote:
>> I didn't actually run it myself, so it is possible that it produced an error message, but it did apparently create the database but with UTF-8 encoding. Most of these things are done in script files, so certain non-fatal errors may be overlooked.
>>
>> As far as I can tell, it took the above encoding command, and perhaps printed an error message but went ahead and created the database with an encoding that was not correct. If that is indeed the case, then it is, in my opinion, a bad design policy. I would much prefer that either Postgres accept the command or that it not create the database. This way, either the database would work as the user expects or there would be no database, and the problem would be resolved before it creates databases that cannot be read.
>
> It does not CREATE the database. If the users are seeing that happen, then as others have suggested it is a bug. The other option is that they are un-commenting the #ENCODING="ENCODING 'UTF8'" line in the create_postgresql_database.in script to get it to run.
> The interesting part in that script is the Note:
>
>   # KES: Note: the CREATE DATABASE, probably should be
>   # CREATE DATABASE ${db_name} $ENCODING TEMPLATE template0
>
> According to the git repository this showed up in July of this year;

I had forgotten about that, but it does show that I do go and read the documentation from time to time :-)

> http://bacula.git.sourceforge.net/git/gitweb.cgi?p=bacula/bacula;a=blob;f=bacula/src/cats/create_postgresql_database.in;hb=6e024d0fe47ea0d9e6d3fbec52c4165caa44967f
>
> > In any case we have corrected the command to include the TEMPLATE, but
> > this won't help people with older Baculas.
>
> Could they not just get the corrected version of
> create_postgresql_database.in?  It would run on the old versions as well.

Have you ever tried to push a fix like that downstream?  It usually is not easy and takes a long time.  Anyway, now that we understand the problem, we will be able to warn users, which is what happened when we copied the bacula-users email list, and those we miss we can help rather quickly.

> > The other point I wanted to emphasize is that the documentation implied
> > that future versions of Postgres may eliminate the feature of having
> > SQL_ASCII (i.e. the ability to input arbitrary binary strings).  As I
> > said, that would be a pity -- I suppose we could switch to using LOs or
> > whatever they are called in Postgres, but that would be rather
> > inconvenient.
>
> Per Tom's previous post:
> "You misread it.  We are not talking about eliminating SQL_ASCII --- as
> you say, that's useful.  What is deprecated is trying to use SQL_ASCII
> with a non-C locale, which is dangerous, and always has been.  If you've
> been putting non-UTF8 data into a database that could be running under a
> UTF8-dependent locale, I'm surprised you haven't noticed problems already."

Duh, yes, I misunderstood it.  Another comment by Tom Lane clarified the situation for me.
I've now ensured that for future Bacula versions LC_COLLATE and LC_CTYPE are set to 'C' when we set SQL_ASCII.

Thanks for the help and the polite comments from everyone :-)

Kern
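Putting the two fixes together, the creation command we now use would look roughly like the following sketch (the database name is Bacula's default; the per-database LC_COLLATE/LC_CTYPE options require PostgreSQL 8.4 or later):

```sql
-- Copy from template0 so the requested encoding is honored, and pin the
-- locale to 'C' so that SQL_ASCII remains a non-deprecated combination.
CREATE DATABASE bacula
    ENCODING 'SQL_ASCII'
    LC_COLLATE 'C'
    LC_CTYPE 'C'
    TEMPLATE template0;
```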
Hello,

Thanks for your response.

On Thursday 03 December 2009 16:46:54 Tom Lane wrote:
> Sam Mason <sam@samason.me.uk> writes:
> > As others have said; BYTEA is probably the best datatype for you to
> > use.  The encoding of BYTEA literals is a bit of a fiddle and may need
> > some changes, but it's going to be much more faithful to your needs of
> > treating the filename as an opaque blob of data.
>
> bytea might be theoretically the best choice, but the fact remains that
> 99% of the entries will be text that's readable in the user's encoding
> (whatever that is).  bytea will just be a serious PITA because of its
> escaping issues.  Also, the fact that 8.5 may change to hex display by
> default will make bytea even more of a PITA for mostly-text data.
> So I think Bacula's choice to use sql_ascii with text columns is
> entirely defensible.

Thank you for confirming this.  I do appreciate the SQL_ASCII feature that PostgreSQL has.  It is far preferable for our users to the MySQL BLOB solution because, as you say, most of the filenames do go in as valid UTF-8, and those that don't are backed up and restored correctly but may have a few problems during display of their names, which is rather rare anyway.

> What concerns me is the claim that PG made a database with some
> arbitrary parameters after having rejected a now-considered-invalid
> command.  I frankly do not believe that, but if it did happen it's
> a *serious* bug that requires investigation.

Concerning the problem with specifying ENCODING 'SQL_ASCII' but not TEMPLATE template0: it appears that PostgreSQL *did* reject the command and did not create a database with SQL_UTF8 as I had supposed, so as you say, it is not a PostgreSQL bug.  The best I can tell is that the command failed, so the user resubmitted it manually, probably without the ENCODING clause, and so the database ended up with the wrong encoding.
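To illustrate Tom's point about bytea escaping (a sketch of my own, not from the thread): in the traditional escape format, non-printable bytes in a filename have to be spelled out as doubled-backslash octal escapes, whereas a text column under SQL_ASCII simply takes the raw bytes as-is.

```sql
-- Escape-format bytea literal: a NUL byte followed by "foo".
SELECT E'\\000foo'::bytea;
-- Newer servers also accept the hex input format for the same value:
SELECT '\x00666f6f'::bytea;
```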
I have just fixed our Bacula create_postgresql_database script so that it clearly detects any problem creating the database (via ON_ERROR_STOP), which was not previously the case.

Concerning the future: I am still a bit concerned about the mention in the documentation of possible future changes concerning SQL_ASCII with LC_CTYPE not C or POSIX ("Using this combination of settings is deprecated and may someday be forbidden altogether."), so if you have any suggestions for a CREATE DATABASE command that we can use now to ensure that future versions of Bacula + PostgreSQL will end up with our bacula database created with SQL_ASCII, I would much appreciate it.

Best regards,

Kern
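For reference, the kind of change involved (a sketch under my own naming, not the actual Bacula script): with ON_ERROR_STOP set, psql aborts at the first failing statement and returns a nonzero exit status, so the calling script can detect the failure instead of silently continuing past it.

```sql
-- create_bacula.sql (hypothetical file name), run via:
--   psql -v ON_ERROR_STOP=1 -f create_bacula.sql
\set ON_ERROR_STOP on
CREATE DATABASE bacula ENCODING 'SQL_ASCII' TEMPLATE template0;
```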