Thread: UTF8 problem
I'm using DBMail running against PostgreSQL as my mailstore for our company network. I recently converted our company database from SQL_ASCII to UTF8 as I thought this would be a *good thing*. The problem now is that I think I'm loosing emails because in my postgresql logs I get this: 2006-06-08 01:17:05 EDT LOG: unexpected EOF on client connection 2006-06-08 01:17:05 EDT ERROR: invalid byte sequence for encoding "UTF8": 0xe1202c This is by far the most common, but I'm getting a few others too such as, 0xae, 0x85, 0x92 and more... The basic setup is that Postfix hands the email to a program called dbmail-smtp which parses and insert the message into the database. DBMail doesn't know anything about encoding. I tried setting the enviornment variable PGCLIENTENCODING=SQL_ASCII in the Postfix startup script, but that doesn't seem to be making any difference. Any suggestions? Thanks, Matt
Well, to answer my own question, I hacked the source code of DBMail and had it set the client encoding to LATIN1 immediately after database connect, this seems to have fixed the problem. Sorry for the noise, Matt Matthew T. O'Connor wrote: > I'm using DBMail running against PostgreSQL as my mailstore for our > company network. I recently converted our company database from > SQL_ASCII to UTF8 as I thought this would be a *good thing*. > > The problem now is that I think I'm loosing emails because in my > postgresql logs I get this: > 2006-06-08 01:17:05 EDT LOG: unexpected EOF on client connection > 2006-06-08 01:17:05 EDT ERROR: invalid byte sequence for encoding > "UTF8": 0xe1202c > > This is by far the most common, but I'm getting a few others too such > as, 0xae, 0x85, 0x92 and more... > > The basic setup is that Postfix hands the email to a program called > dbmail-smtp which parses and insert the message into the database. > DBMail doesn't know anything about encoding. I tried setting the > enviornment variable PGCLIENTENCODING=SQL_ASCII in the Postfix startup > script, but that doesn't seem to be making any difference. > > Any suggestions? > > Thanks, > > Matt > > ---------------------------(end of broadcast)--------------------------- > TIP 4: Have you searched our list archives? > > http://archives.postgresql.org >
Matthew T. O'Connor schrieb: > Well, to answer my own question, I hacked the source code of DBMail and > had it set the client encoding to LATIN1 immediately after database > connect, this seems to have fixed the problem. > You could also just have set the client_encoding as an user-option in postgres so on login, the encoding is set to what you want automatically. (See ALTER USER ...) Regards Tino
Matthew T. O'Connor wrote: > Well, to answer my own question, I hacked the source code of DBMail and > had it set the client encoding to LATIN1 immediately after database > connect, this seems to have fixed the problem. > > Sorry for the noise, > > Matt I've seen this sort of problem asked about in the mailing lists often enough to think it merits a FAQ entry, so how about this text: <entry> Q. Why do I have problems inserting text into my database, with error messages like ERROR: invalid byte sequence for encoding "UTF8": 0xe1202c ? A. Almost certainly that byte sequence really is an invalid byte sequence for that encoding. The reason you are seeing the error is probably because you are providing text in some other encoding. You and the database need to agree between you what encoding you're using. PostgreSQL is fairly good at working with you, converting to and from whatever encoding you want to use, but you need to tell it what that encoding is, and then stick to that encoding consistently. If you don't set the client encoding, then PostgreSQL will use the default encoding for the database, which in modern times is often UTF8 (aka UNICODE), and is set at database creation time. However, many client apps still use other encodings, (eg Latin1, aka ISO-8859-1), so you need to either educate the client app to use UTF8, or get it to inform PostgreSQL what other encoding to use. The way to tell PostgreSQL what encoding you want to use is by use of the client_encoding GUC variable, eg set client_encoding to 'LATIN1'; One reason you may be seeing this problem now, after upgrading your version of PostgreSQL, is that recent versions have tighter validation of encoded text. Previously you may not have been conscious of what encoding you were actually using, especially if you're a speaker of a Western European language, and may have gotten away with writing incorrectly-encoded text without the database complaining. Now is the time to start getting it right. One thing to be wary of is the "SQL_ASCII" encoding. It appears to be commonly and incorrectly believed that this represents either some variant on latin1, or pure 7-bit ASCII. It is neither of those, but a completely unchecked encoding that really means whatever you want it to mean. This makes it not a very good encoding to use in practice, as it becomes prone to allowing a mixture of different encodings to be present in the same set of data, which will cause you headaches when you try to convert the whole lot to some consistent encoding in the future. See section 21.2 of the documentation for more complete information. </entry> Tim -- ----------------------------------------------- Tim Allen tim@proximity.com.au Proximity Pty Ltd http://www.proximity.com.au/
Tim Allen schrieb: > Matthew T. O'Connor wrote: > >> Well, to answer my own question, I hacked the source code of DBMail >> and had it set the client encoding to LATIN1 immediately after >> database connect, this seems to have fixed the problem. >> >> Sorry for the noise, >> >> Matt > > > I've seen this sort of problem asked about in the mailing lists often > enough to think it merits a FAQ entry, so how about this text: > > <entry> > Q. Why do I have problems inserting text into my database, with error > messages like > > ERROR: invalid byte sequence for encoding "UTF8": 0xe1202c ? > > A. Almost certainly that byte sequence really is an invalid byte > sequence for that encoding. The reason you are seeing the error is > probably because you are providing text in some other encoding. You and > the database need to agree between you what encoding you're using. > PostgreSQL is fairly good at working with you, converting to and from > whatever encoding you want to use, but you need to tell it what that > encoding is, and then stick to that encoding consistently. > > If you don't set the client encoding, then PostgreSQL will use the > default encoding for the database, which in modern times is often UTF8 > (aka UNICODE), and is set at database creation time. However, many > client apps still use other encodings, (eg Latin1, aka ISO-8859-1), so > you need to either educate the client app to use UTF8, or get it to > inform PostgreSQL what other encoding to use. > > The way to tell PostgreSQL what encoding you want to use is by use of > the client_encoding GUC variable, eg > > set client_encoding to 'LATIN1'; If you cant educate your client application to set this option on connect, you can set this per user: ALTER USER clientappuser SET client_encoding to 'what your app uses'; > > One reason you may be seeing this problem now, after upgrading your > version of PostgreSQL, is that recent versions have tighter validation > of encoded text. Previously you may not have been conscious of what > encoding you were actually using, especially if you're a speaker of a > Western European language, and may have gotten away with writing > incorrectly-encoded text without the database complaining. Now is the > time to start getting it right. > > One thing to be wary of is the "SQL_ASCII" encoding. It appears to be > commonly and incorrectly believed that this represents either some > variant on latin1, or pure 7-bit ASCII. It is neither of those, but a > completely unchecked encoding that really means whatever you want it to > mean. This makes it not a very good encoding to use in practice, as it > becomes prone to allowing a mixture of different encodings to be present > in the same set of data, which will cause you headaches when you try to > convert the whole lot to some consistent encoding in the future. > > See section 21.2 of the documentation for more complete information. > </entry> > > Tim >
Tino Wildenhain wrote: > Tim Allen schrieb: [snip] >> The way to tell PostgreSQL what encoding you want to use is by use of >> the client_encoding GUC variable, eg >> >> set client_encoding to 'LATIN1'; > > > If you cant educate your client application to set this option on connect, > you can set this per user: > > ALTER USER clientappuser SET client_encoding to 'what your app uses'; Good point. I guess it's worth mentioning this and some of the other ways you can set the encoding. Though there are actually quite a few different ways - that might make the entry overly long. Opinions? Tim -- ----------------------------------------------- Tim Allen tim@proximity.com.au Proximity Pty Ltd http://www.proximity.com.au/
Matthew T. O'Connor wrote: > Well, to answer my own question, I hacked the source code of DBMail and > had it set the client encoding to LATIN1 immediately after database > connect, this seems to have fixed the problem. LATIN1 != UTF-8. Your problem isn't solved yet. You should either tell your client to use UTF-8 or alter your database to use LATIN1. -- Alban Hertroys alban@magproductions.nl magproductions b.v. T: ++31(0)534346874 F: ++31(0)534346876 M: I: www.magproductions.nl A: Postbus 416 7500 AK Enschede // Integrate Your World //
Alban Hertroys schrieb: > Matthew T. O'Connor wrote: > >> Well, to answer my own question, I hacked the source code of DBMail >> and had it set the client encoding to LATIN1 immediately after >> database connect, this seems to have fixed the problem. > > > LATIN1 != UTF-8. Your problem isn't solved yet. > Well, this enables postgres to translate the encoding. However I would be unsure if dbmail always sends latin-1 anyway. Regards Tino
Matthew T. O'Connor wrote: > The basic setup is that Postfix hands the email to a program called > dbmail-smtp which parses and insert the message into the database. > DBMail doesn't know anything about encoding. That's precisely what SQL_ASCII is for. Why not stay with it? -- Daniel PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
Tino Wildenhain <tino@wildenhain.de> writes: > Alban Hertroys schrieb: >> Matthew T. O'Connor wrote: >> >>> Well, to answer my own question, I hacked the source code of DBMail >>> and had it set the client encoding to LATIN1 immediately after >>> database connect, this seems to have fixed the problem. >> LATIN1 != UTF-8. Your problem isn't solved yet. >> > > Well, this enables postgres to translate the encoding. > However I would be unsure if dbmail always sends latin-1 > anyway. I would think it would (at least potentially) vary with each message. The dbmail software should really set client_encoding based on the Content-Transfer-Encoding header in the message (or whatever it's called). LATIN-1 is one of the most common encodings for email but it's scarcely the only one... -Doug
Douglas McNaught wrote: > I would think it would (at least potentially) vary with each message. > The dbmail software should really set client_encoding based on the > Content-Transfer-Encoding header in the message (or whatever it's > called). That would be the "charset" parameter of the Content-Type header, Content-Transfer-Encoding having a different purpose. Anyway, doing this would be quite risky, just look for example at the security hole refered to as CVE-2006-2313. dbmail authors are aware of the issue, it's quite clearly explained here: http://mailman.fastxs.net/pipermail/dbmail-dev/2005-November/007656.html On the other hand they had a bug filed here: http://www.dbmail.org/mantis/view.php?id=218 where a user reports the same problem than the OP, and for which the analysis is pretty strange, pretending that UNICODE shouldn't be used with pg<8.1 :) IMHO they fail to draw the proper conclusion, which is that either the raw mail should be stored as either as a binary object, or as a text field in a database with SQL_ASCII encoding, in both cases providing the level of transparency that they need by design, their purpose being to store and retrieve the mail, not to check its its contents. -- Daniel PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
Tim Allen wrote: > Tino Wildenhain wrote: >> Tim Allen schrieb: > > [snip] > >>> The way to tell PostgreSQL what encoding you want to use is by use >>> of the client_encoding GUC variable, eg >>> >>> set client_encoding to 'LATIN1'; >> >> >> If you cant educate your client application to set this option on >> connect, >> you can set this per user: >> >> ALTER USER clientappuser SET client_encoding to 'what your app uses'; > > Good point. I guess it's worth mentioning this and some of the other > ways you can set the encoding. Though there are actually quite a few > different ways - that might make the entry overly long. Opinions? > > Tim You could always go with one solution in the faq, such as the one you've already given, and list and link to others near the end. In fact, this is probably best without at least mentioning the other solutions newbies (like me) might never look furthur to find, what for them, might be a better, or more fitting, solution...
Tino Wildenhain wrote: > Matthew T. O'Connor schrieb: >> Well, to answer my own question, I hacked the source code of DBMail >> and had it set the client encoding to LATIN1 immediately after >> database connect, this seems to have fixed the problem. >> > You could also just have set the client_encoding as an user-option > in postgres so on login, the encoding is set to what you want > automatically. (See ALTER USER ...) Well that's cool. I had no idea you could do this. Thanks for the tip. Matt
Daniel Verite wrote: > Matthew T. O'Connor wrote: > >> The basic setup is that Postfix hands the email to a program called >> dbmail-smtp which parses and insert the message into the database. >> DBMail doesn't know anything about encoding. >> > That's precisely what SQL_ASCII is for. Why not stay with it? Well for one thing, you can't set encoding on a per column, or even per table basis, you have to set it for the whole database. I have all the DBMail data inside of our company database that uses UTF8.
Daniel Verite wrote: > IMHO they fail to draw the proper conclusion, which is that > either the raw mail should be stored as either as a binary object, > or as a text field in a database with SQL_ASCII encoding, in both > cases providing the level of transparency that they need by design, > their purpose being to store and retrieve the mail, not to check its > its contents. They have talked about changing the messageblks to binary instead of text. They said that one of their main objections is that bytea data is not compressed. I'm not sure that's true, but I don't see anything in the docs about it. I think they would move to bytea if it does compress.
"Matthew T. O'Connor" <matthew@zeut.net> writes: > They have talked about changing the messageblks to binary instead of > text. They said that one of their main objections is that bytea data is > not compressed. I'm not sure that's true, but I don't see anything in > the docs about it. I think they would move to bytea if it does compress. These people really don't know what they're talking about, do they? bytea goes through exactly the same compression logic as text. But really it sounds to me like the easiest solution is just to run the database in SQL_ASCII encoding. regards, tom lane
Tom Lane wrote: > "Matthew T. O'Connor" <matthew@zeut.net> writes: >> They have talked about changing the messageblks to binary instead of >> text. They said that one of their main objections is that bytea data is >> not compressed. I'm not sure that's true, but I don't see anything in >> the docs about it. I think they would move to bytea if it does compress. > > These people really don't know what they're talking about, do they? Ok, this is what I thought, but I wasn't sure. We might want to add a note to the docs that detail this fact. If it's already there, it wasn't prominent enough for me to find with a cursory glance. > bytea goes through exactly the same compression logic as text. But > really it sounds to me like the easiest solution is just to run the > database in SQL_ASCII encoding. That is what they recommend, but I have the DBMail tables in the same database as the rest of my company database which I converted to UTF8 (perhaps I didn't need to do this, but that is another issue...). Thanks.
On Thu, Jun 08, 2006 at 07:25:35AM -0400, Douglas McNaught <doug@mcnaught.org> wrote a message of 29 lines which said: > I would think it would (at least potentially) vary with each > message. The dbmail software should really set client_encoding > based on the Content-Transfer-Encoding header in the message (or > whatever it's called). A *big* warning from someone who stores email in PostgreSQL: many email messages *lie*. They have a Content-transfer-encoding and then they actually use another encoding. If you blindly try to inject the body of the message into PostgreSQL, with the indicated encoding, you will sometimes fail, for instance if the message claim to be in UTF-8 but is not (something that PostgreSQL will detect). Either you: * "sanitize" all incoming data * or you accept to reject these invalid email * or you store them in a unstructured field (a blob)
Instead of adding an FAQ entry, which might not be found when the error is generated, I added a HINT for 8.2 that will appear with the error message: errmsg("invalid byte sequence for encoding \"%s\": 0x%s", pg_enc2name_tbl[encoding].name, buf), errhint("This failure can also happen if the byte sequence does not " "match the encoding expected by the server, which is controlled " "by \"client_encoding\"."))); Supplying information at the point of error is usually the best solution, if possible. Backpatched to 8.1.X as well. --------------------------------------------------------------------------- Tim Allen wrote: > Matthew T. O'Connor wrote: > > Well, to answer my own question, I hacked the source code of DBMail and > > had it set the client encoding to LATIN1 immediately after database > > connect, this seems to have fixed the problem. > > > > Sorry for the noise, > > > > Matt > > I've seen this sort of problem asked about in the mailing lists often > enough to think it merits a FAQ entry, so how about this text: > > <entry> > Q. Why do I have problems inserting text into my database, with error > messages like > > ERROR: invalid byte sequence for encoding "UTF8": 0xe1202c ? > > A. Almost certainly that byte sequence really is an invalid byte > sequence for that encoding. The reason you are seeing the error is > probably because you are providing text in some other encoding. You and > the database need to agree between you what encoding you're using. > PostgreSQL is fairly good at working with you, converting to and from > whatever encoding you want to use, but you need to tell it what that > encoding is, and then stick to that encoding consistently. > > If you don't set the client encoding, then PostgreSQL will use the > default encoding for the database, which in modern times is often UTF8 > (aka UNICODE), and is set at database creation time. However, many > client apps still use other encodings, (eg Latin1, aka ISO-8859-1), so > you need to either educate the client app to use UTF8, or get it to > inform PostgreSQL what other encoding to use. > > The way to tell PostgreSQL what encoding you want to use is by use of > the client_encoding GUC variable, eg > > set client_encoding to 'LATIN1'; > > One reason you may be seeing this problem now, after upgrading your > version of PostgreSQL, is that recent versions have tighter validation > of encoded text. Previously you may not have been conscious of what > encoding you were actually using, especially if you're a speaker of a > Western European language, and may have gotten away with writing > incorrectly-encoded text without the database complaining. Now is the > time to start getting it right. > > One thing to be wary of is the "SQL_ASCII" encoding. It appears to be > commonly and incorrectly believed that this represents either some > variant on latin1, or pure 7-bit ASCII. It is neither of those, but a > completely unchecked encoding that really means whatever you want it to > mean. This makes it not a very good encoding to use in practice, as it > becomes prone to allowing a mixture of different encodings to be present > in the same set of data, which will cause you headaches when you try to > convert the whole lot to some consistent encoding in the future. > > See section 21.2 of the documentation for more complete information. > </entry> > > Tim > > -- > ----------------------------------------------- > Tim Allen tim@proximity.com.au > Proximity Pty Ltd http://www.proximity.com.au/ > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +