Thread: UTF8 problem

UTF8 problem

From

"Matthew T. O'Connor"

Date:

08 June 2006, 02:23:12

I'm using DBMail running against PostgreSQL as my mailstore for our
company network.  I recently converted our company database from
SQL_ASCII to UTF8 as I thought this would be a *good thing*.

The problem now is that I think I'm loosing emails because in my
postgresql logs I get this:
2006-06-08 01:17:05 EDT LOG:  unexpected EOF on client connection
2006-06-08 01:17:05 EDT ERROR:  invalid byte sequence for encoding
"UTF8": 0xe1202c

This is by far the most common, but I'm getting a few others too such
as, 0xae, 0x85, 0x92 and more...

The basic setup is that Postfix hands the email to a program called
dbmail-smtp which parses and insert the message into the database.
DBMail doesn't know anything about encoding.  I tried setting the
enviornment variable PGCLIENTENCODING=SQL_ASCII in the Postfix startup
script, but that doesn't seem to be making any difference.

Any suggestions?

Thanks,

Matt

Re: UTF8 problem

From

"Matthew T. O'Connor"

Date:

08 June 2006, 03:05:58

Well, to answer my own question, I hacked the source code of DBMail and
had it set the client encoding to LATIN1 immediately after database
connect, this seems to have fixed the problem.

Sorry for the noise,

Matt



Matthew T. O'Connor wrote:
> I'm using DBMail running against PostgreSQL as my mailstore for our
> company network.  I recently converted our company database from
> SQL_ASCII to UTF8 as I thought this would be a *good thing*.
>
> The problem now is that I think I'm loosing emails because in my
> postgresql logs I get this:
> 2006-06-08 01:17:05 EDT LOG:  unexpected EOF on client connection
> 2006-06-08 01:17:05 EDT ERROR:  invalid byte sequence for encoding
> "UTF8": 0xe1202c
>
> This is by far the most common, but I'm getting a few others too such
> as, 0xae, 0x85, 0x92 and more...
>
> The basic setup is that Postfix hands the email to a program called
> dbmail-smtp which parses and insert the message into the database.
> DBMail doesn't know anything about encoding.  I tried setting the
> enviornment variable PGCLIENTENCODING=SQL_ASCII in the Postfix startup
> script, but that doesn't seem to be making any difference.
>
> Any suggestions?
>
> Thanks,
>
> Matt
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>               http://archives.postgresql.org
>

Re: UTF8 problem

From

Tino Wildenhain

Date:

08 June 2006, 03:38:10

Matthew T. O'Connor schrieb:
> Well, to answer my own question, I hacked the source code of DBMail and
> had it set the client encoding to LATIN1 immediately after database
> connect, this seems to have fixed the problem.
>

You could also just have set the client_encoding as an user-option
in postgres so on login, the encoding is set to what you want
automatically. (See ALTER USER ...)

Regards
Tino

new FAQ entry (was:Re: UTF8 problem)

From

Tim Allen

Date:

08 June 2006, 04:00:34

Matthew T. O'Connor wrote:
> Well, to answer my own question, I hacked the source code of DBMail and
> had it set the client encoding to LATIN1 immediately after database
> connect, this seems to have fixed the problem.
>
> Sorry for the noise,
>
> Matt

I've seen this sort of problem asked about in the mailing lists often
enough to think it merits a FAQ entry, so how about this text:

<entry>
Q. Why do I have problems inserting text into my database, with error
messages like

ERROR:  invalid byte sequence for encoding "UTF8": 0xe1202c ?

A. Almost certainly that byte sequence really is an invalid byte
sequence for that encoding. The reason you are seeing the error is
probably because you are providing text in some other encoding. You and
the database need to agree between you what encoding you're using.
PostgreSQL is fairly good at working with you, converting to and from
whatever encoding you want to use, but you need to tell it what that
encoding is, and then stick to that encoding consistently.

If you don't set the client encoding, then PostgreSQL will use the
default encoding for the database, which in modern times is often UTF8
(aka UNICODE), and is set at database creation time. However, many
client apps still use other encodings, (eg Latin1, aka ISO-8859-1), so
you need to either educate the client app to use UTF8, or get it to
inform PostgreSQL what other encoding to use.

The way to tell PostgreSQL what encoding you want to use is by use of
the client_encoding GUC variable, eg

set client_encoding to 'LATIN1';

One reason you may be seeing this problem now, after upgrading your
version of PostgreSQL, is that recent versions have tighter validation
of encoded text. Previously you may not have been conscious of what
encoding you were actually using, especially if you're a speaker of a
Western European language, and may have gotten away with writing
incorrectly-encoded text without the database complaining. Now is the
time to start getting it right.

One thing to be wary of is the "SQL_ASCII" encoding. It appears to be
commonly and incorrectly believed that this represents either some
variant on latin1, or pure 7-bit ASCII. It is neither of those, but a
completely unchecked encoding that really means whatever you want it to
mean. This makes it not a very good encoding to use in practice, as it
becomes prone to allowing a mixture of different encodings to be present
in the same set of data, which will cause you headaches when you try to
convert the whole lot to some consistent encoding in the future.

See section 21.2 of the documentation for more complete information.
</entry>

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/

Re: new FAQ entry

From

Tino Wildenhain

Date:

08 June 2006, 04:11:37

Tim Allen schrieb:
> Matthew T. O'Connor wrote:
>
>> Well, to answer my own question, I hacked the source code of DBMail
>> and had it set the client encoding to LATIN1 immediately after
>> database connect, this seems to have fixed the problem.
>>
>> Sorry for the noise,
>>
>> Matt
>
>
> I've seen this sort of problem asked about in the mailing lists often
> enough to think it merits a FAQ entry, so how about this text:
>
> <entry>
> Q. Why do I have problems inserting text into my database, with error
> messages like
>
> ERROR:  invalid byte sequence for encoding "UTF8": 0xe1202c ?
>
> A. Almost certainly that byte sequence really is an invalid byte
> sequence for that encoding. The reason you are seeing the error is
> probably because you are providing text in some other encoding. You and
> the database need to agree between you what encoding you're using.
> PostgreSQL is fairly good at working with you, converting to and from
> whatever encoding you want to use, but you need to tell it what that
> encoding is, and then stick to that encoding consistently.
>
> If you don't set the client encoding, then PostgreSQL will use the
> default encoding for the database, which in modern times is often UTF8
> (aka UNICODE), and is set at database creation time. However, many
> client apps still use other encodings, (eg Latin1, aka ISO-8859-1), so
> you need to either educate the client app to use UTF8, or get it to
> inform PostgreSQL what other encoding to use.
>
> The way to tell PostgreSQL what encoding you want to use is by use of
> the client_encoding GUC variable, eg
>
> set client_encoding to 'LATIN1';

If you cant educate your client application to set this option on connect,
you can set this per user:

ALTER USER clientappuser SET client_encoding to 'what your app uses';
>
> One reason you may be seeing this problem now, after upgrading your
> version of PostgreSQL, is that recent versions have tighter validation
> of encoded text. Previously you may not have been conscious of what
> encoding you were actually using, especially if you're a speaker of a
> Western European language, and may have gotten away with writing
> incorrectly-encoded text without the database complaining. Now is the
> time to start getting it right.
>
> One thing to be wary of is the "SQL_ASCII" encoding. It appears to be
> commonly and incorrectly believed that this represents either some
> variant on latin1, or pure 7-bit ASCII. It is neither of those, but a
> completely unchecked encoding that really means whatever you want it to
> mean. This makes it not a very good encoding to use in practice, as it
> becomes prone to allowing a mixture of different encodings to be present
> in the same set of data, which will cause you headaches when you try to
> convert the whole lot to some consistent encoding in the future.
>
> See section 21.2 of the documentation for more complete information.
> </entry>
>
> Tim
>

Re: new FAQ entry

From

Tim Allen

Date:

08 June 2006, 04:19:45

Tino Wildenhain wrote:
> Tim Allen schrieb:

[snip]

>> The way to tell PostgreSQL what encoding you want to use is by use of
>> the client_encoding GUC variable, eg
>>
>> set client_encoding to 'LATIN1';
>
>
> If you cant educate your client application to set this option on connect,
> you can set this per user:
>
> ALTER USER clientappuser SET client_encoding to 'what your app uses';

Good point. I guess it's worth mentioning this and some of the other
ways you can set the encoding. Though there are actually quite a few
different ways - that might make the entry overly long. Opinions?

Tim

--
-----------------------------------------------
Tim Allen          tim@proximity.com.au
Proximity Pty Ltd  http://www.proximity.com.au/

Re: UTF8 problem

From

Alban Hertroys

Date:

08 June 2006, 06:05:23

Matthew T. O'Connor wrote:
> Well, to answer my own question, I hacked the source code of DBMail and
> had it set the client encoding to LATIN1 immediately after database
> connect, this seems to have fixed the problem.

LATIN1 != UTF-8. Your problem isn't solved yet.

You should either tell your client to use UTF-8 or alter your database
to use LATIN1.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
    7500 AK Enschede

// Integrate Your World //

Re: UTF8 problem

From

Tino Wildenhain

Date:

08 June 2006, 06:15:33

Alban Hertroys schrieb:
> Matthew T. O'Connor wrote:
>
>> Well, to answer my own question, I hacked the source code of DBMail
>> and had it set the client encoding to LATIN1 immediately after
>> database connect, this seems to have fixed the problem.
>
>
> LATIN1 != UTF-8. Your problem isn't solved yet.
>

Well, this enables postgres to translate the encoding.
However I would be unsure if dbmail always sends latin-1
anyway.

Regards
Tino

Re: UTF8 problem

From

"Daniel Verite"

Date:

08 June 2006, 08:02:17

    Matthew T. O'Connor wrote:

> The basic setup is that Postfix hands the email to a program called
> dbmail-smtp which parses and insert the message into the database.
> DBMail doesn't know anything about encoding.

That's precisely what SQL_ASCII is for. Why not stay with it?

--
 Daniel
 PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org

Re: UTF8 problem

From

Douglas McNaught

Date:

08 June 2006, 08:25:56

Tino Wildenhain <tino@wildenhain.de> writes:

> Alban Hertroys schrieb:
>> Matthew T. O'Connor wrote:
>>
>>> Well, to answer my own question, I hacked the source code of DBMail
>>> and had it set the client encoding to LATIN1 immediately after
>>> database connect, this seems to have fixed the problem.
>> LATIN1 != UTF-8. Your problem isn't solved yet.
>>
>
> Well, this enables postgres to translate the encoding.
> However I would be unsure if dbmail always sends latin-1
> anyway.

I would think it would (at least potentially) vary with each message.
The dbmail software should really set client_encoding based on the
Content-Transfer-Encoding header in the message (or whatever it's
called).

LATIN-1 is one of the most common encodings for email but it's
scarcely the only one...

-Doug

Re: UTF8 problem

From

"Daniel Verite"

Date:

08 June 2006, 11:08:54

    Douglas McNaught wrote:

> I would think it would (at least potentially) vary with each message.
> The dbmail software should really set client_encoding based on the
> Content-Transfer-Encoding header in the message (or whatever it's
> called).

That would be the "charset" parameter of the Content-Type header,
Content-Transfer-Encoding having a different purpose.

Anyway, doing this would be quite risky, just look for example at
the security hole refered to as CVE-2006-2313.

dbmail authors are aware of the issue, it's quite clearly explained
here:
http://mailman.fastxs.net/pipermail/dbmail-dev/2005-November/007656.html

On the other hand they had a bug filed here:
http://www.dbmail.org/mantis/view.php?id=218
where a user reports the same problem than the OP,
and for which the analysis is pretty strange, pretending that
UNICODE shouldn't be used with pg<8.1 :)

IMHO they fail to draw the proper conclusion, which is that
either the raw mail should be stored as either as a binary object,
or as a text field in a database with SQL_ASCII encoding, in both
cases providing the level of transparency that they need by design,
their purpose being to store and retrieve the mail, not to check its
its contents.

--
 Daniel
 PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org

Re: new FAQ entry

From

Erik Jones

Date:

08 June 2006, 11:17:35

Tim Allen wrote:
> Tino Wildenhain wrote:
>> Tim Allen schrieb:
>
> [snip]
>
>>> The way to tell PostgreSQL what encoding you want to use is by use
>>> of the client_encoding GUC variable, eg
>>>
>>> set client_encoding to 'LATIN1';
>>
>>
>> If you cant educate your client application to set this option on
>> connect,
>> you can set this per user:
>>
>> ALTER USER clientappuser SET client_encoding to 'what your app uses';
>
> Good point. I guess it's worth mentioning this and some of the other
> ways you can set the encoding. Though there are actually quite a few
> different ways - that might make the entry overly long. Opinions?
>
> Tim

You could always go with one solution in the faq, such as the one you've
already given, and list and link to others near the end.  In fact, this
is probably best without at least mentioning the other solutions newbies
(like me) might never look furthur to find, what for them, might be a
better, or more fitting, solution...

Re: UTF8 problem

From

"Matthew T. O'Connor"

Date:

08 June 2006, 17:42:03

Tino Wildenhain wrote:
> Matthew T. O'Connor schrieb:
>> Well, to answer my own question, I hacked the source code of DBMail
>> and had it set the client encoding to LATIN1 immediately after
>> database connect, this seems to have fixed the problem.
>>
> You could also just have set the client_encoding as an user-option
> in postgres so on login, the encoding is set to what you want
> automatically. (See ALTER USER ...)

Well that's cool.  I had no idea you could do this.  Thanks for the tip.

Matt

Re: UTF8 problem

From

"Matthew T. O'Connor"

Date:

08 June 2006, 17:47:23

Daniel Verite wrote:
>     Matthew T. O'Connor wrote:
>
>> The basic setup is that Postfix hands the email to a program called
>> dbmail-smtp which parses and insert the message into the database.
>> DBMail doesn't know anything about encoding.
>>
> That's precisely what SQL_ASCII is for. Why not stay with it?

Well for one thing, you can't set encoding on a per column, or even per
table basis, you have to set it for the whole database.  I have all the
DBMail data inside of our company database that uses UTF8.

Re: UTF8 problem

From

"Matthew T. O'Connor"

Date:

08 June 2006, 17:54:34

Daniel Verite wrote:
> IMHO they fail to draw the proper conclusion, which is that
> either the raw mail should be stored as either as a binary object,
> or as a text field in a database with SQL_ASCII encoding, in both
> cases providing the level of transparency that they need by design,
> their purpose being to store and retrieve the mail, not to check its
> its contents.

They have talked about changing the messageblks to binary instead of
text.  They said that one of their main objections is that bytea data is
not compressed.  I'm not sure that's true, but I don't see anything in
the docs about it.   I think they would move to bytea if it does compress.

Re: UTF8 problem

From

Tom Lane

Date:

08 June 2006, 18:51:40

"Matthew T. O'Connor" <matthew@zeut.net> writes:
> They have talked about changing the messageblks to binary instead of
> text.  They said that one of their main objections is that bytea data is
> not compressed.  I'm not sure that's true, but I don't see anything in
> the docs about it.   I think they would move to bytea if it does compress.

These people really don't know what they're talking about, do they?

bytea goes through exactly the same compression logic as text.  But
really it sounds to me like the easiest solution is just to run the
database in SQL_ASCII encoding.

            regards, tom lane

Re: UTF8 problem

From

"Matthew T. O'Connor"

Date:

09 June 2006, 02:37:28

Tom Lane wrote:
> "Matthew T. O'Connor" <matthew@zeut.net> writes:
>> They have talked about changing the messageblks to binary instead of
>> text.  They said that one of their main objections is that bytea data is
>> not compressed.  I'm not sure that's true, but I don't see anything in
>> the docs about it.   I think they would move to bytea if it does compress.
>
> These people really don't know what they're talking about, do they?

Ok, this is what I thought, but I wasn't sure.  We might want to add a
note to the docs that detail this fact.  If it's already there, it
wasn't prominent enough for me to find with a cursory glance.

> bytea goes through exactly the same compression logic as text.  But
> really it sounds to me like the easiest solution is just to run the
> database in SQL_ASCII encoding.

That is what they recommend, but I have the DBMail tables in the same
database as the rest of my company database which I converted to UTF8
(perhaps I didn't need to do this, but that is another issue...).

Thanks.

Re: UTF8 problem

From

Stephane Bortzmeyer

Date:

15 June 2006, 11:34:44

On Thu, Jun 08, 2006 at 07:25:35AM -0400,
 Douglas McNaught <doug@mcnaught.org> wrote
 a message of 29 lines which said:

> I would think it would (at least potentially) vary with each
> message.  The dbmail software should really set client_encoding
> based on the Content-Transfer-Encoding header in the message (or
> whatever it's called).

A *big* warning from someone who stores email in PostgreSQL: many
email messages *lie*. They have a Content-transfer-encoding and then
they actually use another encoding.

If you blindly try to inject the body of the message into PostgreSQL,
with the indicated encoding, you will sometimes fail, for instance if
the message claim to be in UTF-8 but is not (something that PostgreSQL
will detect).

Either you:

* "sanitize" all incoming data
* or you accept to reject these invalid email
* or you store them in a unstructured field (a blob)

Re: new FAQ entry (was:Re: UTF8 problem)

From

Bruce Momjian

Date:

22 August 2006, 00:37:56

Instead of adding an FAQ entry, which might not be found when the error
is generated, I added a HINT for 8.2 that will appear with the error
message:

     errmsg("invalid byte sequence for encoding \"%s\": 0x%s",
            pg_enc2name_tbl[encoding].name,
            buf),
     errhint("This failure can also happen if the byte sequence does not "
              "match the encoding expected by the server, which is controlled "
             "by \"client_encoding\".")));

Supplying information at the point of error is usually the best
solution, if possible.

Backpatched to 8.1.X as well.

---------------------------------------------------------------------------

Tim Allen wrote:
> Matthew T. O'Connor wrote:
> > Well, to answer my own question, I hacked the source code of DBMail and
> > had it set the client encoding to LATIN1 immediately after database
> > connect, this seems to have fixed the problem.
> >
> > Sorry for the noise,
> >
> > Matt
>
> I've seen this sort of problem asked about in the mailing lists often
> enough to think it merits a FAQ entry, so how about this text:
>
> <entry>
> Q. Why do I have problems inserting text into my database, with error
> messages like
>
> ERROR:  invalid byte sequence for encoding "UTF8": 0xe1202c ?
>
> A. Almost certainly that byte sequence really is an invalid byte
> sequence for that encoding. The reason you are seeing the error is
> probably because you are providing text in some other encoding. You and
> the database need to agree between you what encoding you're using.
> PostgreSQL is fairly good at working with you, converting to and from
> whatever encoding you want to use, but you need to tell it what that
> encoding is, and then stick to that encoding consistently.
>
> If you don't set the client encoding, then PostgreSQL will use the
> default encoding for the database, which in modern times is often UTF8
> (aka UNICODE), and is set at database creation time. However, many
> client apps still use other encodings, (eg Latin1, aka ISO-8859-1), so
> you need to either educate the client app to use UTF8, or get it to
> inform PostgreSQL what other encoding to use.
>
> The way to tell PostgreSQL what encoding you want to use is by use of
> the client_encoding GUC variable, eg
>
> set client_encoding to 'LATIN1';
>
> One reason you may be seeing this problem now, after upgrading your
> version of PostgreSQL, is that recent versions have tighter validation
> of encoded text. Previously you may not have been conscious of what
> encoding you were actually using, especially if you're a speaker of a
> Western European language, and may have gotten away with writing
> incorrectly-encoded text without the database complaining. Now is the
> time to start getting it right.
>
> One thing to be wary of is the "SQL_ASCII" encoding. It appears to be
> commonly and incorrectly believed that this represents either some
> variant on latin1, or pure 7-bit ASCII. It is neither of those, but a
> completely unchecked encoding that really means whatever you want it to
> mean. This makes it not a very good encoding to use in practice, as it
> becomes prone to allowing a mixture of different encodings to be present
> in the same set of data, which will cause you headaches when you try to
> convert the whole lot to some consistent encoding in the future.
>
> See section 21.2 of the documentation for more complete information.
> </entry>
>
> Tim
>
> --
> -----------------------------------------------
> Tim Allen          tim@proximity.com.au
> Proximity Pty Ltd  http://www.proximity.com.au/
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

--
  Bruce Momjian   bruce@momjian.us
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +