Thread: 8.0, UTF8, and CLIENT_ENCODING

8.0, UTF8, and CLIENT_ENCODING

From

Paul Ramsey

Date:

17 May 2007, 17:57:28

I have a small database (PgSQL 8.0, database encoding UTF8) that folks
are inserting into via a web form. The form itself is declared
ISO-8859-1 and the prior to inserting any data, pg_client_encoding is
set to LATIN1.

Most of the high-bit characters are correctly translated from LATIN1 to
UTF8. So for e-accent-egu I see the two-byte UTF8 value in the database.

Sometimes, in their wisdom, people cut'n'paste information out of MSWord
and put that in the form. Instead of being mapped to 2-byte UTF8
high-bit equivalents, they are going into the database directly as
one-byte values > 127. That is, as illegal UTF8 values.

When I try to dump'n'restore this database into PgSQL 8.2, my data can't
made the transit.

Firstly, is this "kinda sorta" encoding handling expected in 8.0, or did
I do something wrong?

Secondly, anyone know any useful tools to pipe a stream through to strip
out illegal UTF8 bytes, so I can pipe my dump through that rather than
hand editing it?

Thanks,

Paul

--

   Paul Ramsey
   Refractions Research
   http://www.refractions.net
   pramsey@refractions.net
   Phone: 250-383-3022
   Cell: 250-885-0632

Re: 8.0, UTF8, and CLIENT_ENCODING

From

Hannes Dorbath

Date:

17 May 2007, 18:16:17

Paul Ramsey wrote:
> I have a small database (PgSQL 8.0, database encoding UTF8) that folks
> are inserting into via a web form. The form itself is declared
> ISO-8859-1 and the prior to inserting any data, pg_client_encoding is
> set to LATIN1.
>
> Most of the high-bit characters are correctly translated from LATIN1 to
> UTF8. So for e-accent-egu I see the two-byte UTF8 value in the database.
>
> Sometimes, in their wisdom, people cut'n'paste information out of MSWord
> and put that in the form. Instead of being mapped to 2-byte UTF8
> high-bit equivalents, they are going into the database directly as
> one-byte values > 127. That is, as illegal UTF8 values.
>
> When I try to dump'n'restore this database into PgSQL 8.2, my data can't
> made the transit.
>
> Firstly, is this "kinda sorta" encoding handling expected in 8.0, or did
> I do something wrong?
>
> Secondly, anyone know any useful tools to pipe a stream through to strip
> out illegal UTF8 bytes, so I can pipe my dump through that rather than
> hand editing it?

This is know issue, use

iconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql dumpfile.sql

to convert your dumps. I'm not sure if this is fixed in the 8.0 branch
at all.


--
Best regards,
Hannes Dorbath

Re: 8.0, UTF8, and CLIENT_ENCODING

From

PFC

Date:

17 May 2007, 18:45:45


> I have a small database (PgSQL 8.0, database encoding UTF8) that folks
> are inserting into via a web form. The form itself is declared
> ISO-8859-1 and the prior to inserting any data, pg_client_encoding is
> set to LATIN1.

    Wouldn't it be simpler to have the browser submit the form in utf8 ?

> Most of the high-bit characters are correctly translated from LATIN1 to
> UTF8. So for e-accent-egu I see the two-byte UTF8 value in the database.
>
> Sometimes, in their wisdom, people cut'n'paste information out of MSWord

    Argh.

> and put that in the form. Instead of being mapped to 2-byte UTF8
> high-bit equivalents, they are going into the database directly as
> one-byte values > 127. That is, as illegal UTF8 values.

    Sometimes you also get HTML entities in the mix. Who knows.
    All my web forms are UTF-8 back to back, it just works. Was I lucky ?
    Normally postgres rejects illegal UTF8 values, you wouldn't be able to
insert them...

> When I try to dump'n'restore this database into PgSQL 8.2, my data can't
> made the transit.
>
> Firstly, is this "kinda sorta" encoding handling expected in 8.0, or did
> I do something wrong?

    Duh ? pg isn't supposed to accept bad unicode data... something
suspicious is going on.
    Besides, if it was dumped, it should be reloadable... did pg_dump use a
funky encoding ?

> Secondly, anyone know any useful tools to pipe a stream through to strip
> out illegal UTF8 bytes, so I can pipe my dump through that rather than
> hand editing it?

    Yes, use iconv (see man page), it can do this for you quite easily. It's
probably already installed on your system.
    Be warned, though, that illegal multibyte characters eat quotes at night
while you aren't looking... unterminated strings are a pain.
    You could also load your database with C locale, and have a script select
 from the records you wish to convert, and update the rows.
    Python has very good Unicode support, should be easy to make such a
script.

Re: 8.0, UTF8, and CLIENT_ENCODING

From

Michael Glaesemann

Date:

17 May 2007, 19:25:29

On May 17, 2007, at 16:47 , PFC wrote:

>> and put that in the form. Instead of being mapped to 2-byte UTF8
>> high-bit equivalents, they are going into the database directly as
>> one-byte values > 127. That is, as illegal UTF8 values.
>
>     Sometimes you also get HTML entities in the mix. Who knows.
>     All my web forms are UTF-8 back to back, it just works. Was I lucky ?
>     Normally postgres rejects illegal UTF8 values, you wouldn't be
> able to insert them...

8.0 and earlier weren't quite as strict as it should have been. See
the note at the end of the migration instuctions in the release notes
for 8.1[1] That may have been part of the issue here.

Michael Glaesemann
grzm seespotcode net

[1](http://www.postgresql.org/docs/8.2/interactive/
release-8-1.html#AEN80196)

Re: 8.0, UTF8, and CLIENT_ENCODING

From

Paul Ramsey

Date:

17 May 2007, 20:56:03

Thanks all for the information. Summary is:

- 8.0 wasn't very strict, and allowed the illegal values in, instead
of mapping them over into UTF-8 space
- the values can be stripped with iconv -c
- 8.2 should be more strict

I'm in the midst of my upgrade to 8.2 now, hopefully the LATIN1->UTF8
conversion will now map the odd characters cleanly into UTF space.

On 17-May-07, at 3:25 PM, Michael Glaesemann wrote:

>
> On May 17, 2007, at 16:47 , PFC wrote:
>
>>> and put that in the form. Instead of being mapped to 2-byte UTF8
>>> high-bit equivalents, they are going into the database directly
>>> as one-byte values > 127. That is, as illegal UTF8 values.
>>
>>     Sometimes you also get HTML entities in the mix. Who knows.
>>     All my web forms are UTF-8 back to back, it just works. Was I
>> lucky ?
>>     Normally postgres rejects illegal UTF8 values, you wouldn't be
>> able to insert them...
>
> 8.0 and earlier weren't quite as strict as it should have been. See
> the note at the end of the migration instuctions in the release
> notes for 8.1[1] That may have been part of the issue here.
>
> Michael Glaesemann
> grzm seespotcode net
>
> [1](http://www.postgresql.org/docs/8.2/interactive/
> release-8-1.html#AEN80196)