Thread: BUG #5221: errors converting string to utf8

BUG #5221: errors converting string to utf8

From
"DM"
Date:
The following bug has been logged online:

Bug reference:      5221
Logged by:          DM
Email address:      dmedri@ciseonweb.it
PostgreSQL version: 8.2 / 8.4
Operating system:   linux ubuntu 9.10
Description:        errors converting string to utf8
Details:

I'm using 2 version of pg (8.2, 8.4).

On 8.2:

select convert('Vall\xe9e','LATIN1','UTF8');

convert this string without any kind of error.


On 8.4:

select convert('Vall\xe9e','LATIN1','UTF8');

i receive these messages:

WARNING:  nonstandard use of escape in a string literal
RIGA 1: select convert('Vall\xe9e','LATIN1','UTF8');
                       ^
HINT: Use the escape string syntax for escapes, e.g., E'\r\n'.
ERROR:  invalid byte sequence for encoding "UTF8": 0xe965
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".


client charset: utf8
server charset: utf8

Re: BUG #5221: errors converting string to utf8

From
Craig Ringer
Date:
On 1/12/2009 7:36 PM, DM wrote:

> select convert('Vall\xe9e','LATIN1','UTF8');

You expect the result 'Vallée', right? As in the utf-8 bytes

   Vall\xc3\xa9e

?

> ERROR:  invalid byte sequence for encoding "UTF8": 0xe965
> HINT: This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by "client_encoding".

The HINT tells you what's going on here.

The error isn't being raised by convert(...). You'll get the same error
if you just:

     select 'Vall\xe9e';

... because the problem is that the server is trying to treat that
literal as a string in your client_encoding, which is UTF-8, and convert
it to the server encoding.

You will note that the convert(...) function takes a 'bytea' argument
rather than 'text'. That's because PostgreSQL string literals are
expected to be valid text in the current encoding. Entering a string
literal with a byte sequence invalid in the current encoding will be
problematic. You need to supply a valid bytea value, which involves
juggling some interesting double-escaping and using octal escapes for
characters.

The octal escape for the byte 0xe9 is \351, so to correctly submit your
query to PostgreSQL you'd need to send:

   select convert(E'Vall\\351e','LATIN1','UTF8');

or if you really just want to convert to the server encoding:

   select convert_from(E'Vall\\351e','LATIN-1');

Here, E'' is used for the string to tell PostgreSQL to interpret
escapes, and \\351 is used as the octal escape for 0xe9. We double the
backslashes because the first backslash gets used up parsing the
literal, and we want the bytea value to actually be 'Val\351e' after
string literal parsing.

Instead, if standard_conforming_strings is on, you can omit the E'' and
use single backslashes.

Note that in 8.5 and above you could use E'Val\\xe9e' instead of the
octal form, as 8.5 understands hex escapes in bytea values.

I guess bytea is kind of ugly to use in a text-based protocol :-(
because the query string has to be valid text in the client_encoding, so
bytea values containing data not valid in the client encoding must be
escaped in the parsed query string. There's nothing much to be done
about it, though.

--
Craig Ringer

Re: BUG #5221: errors converting string to utf8

From
Tom Lane
Date:
Craig Ringer <craig@postnewspapers.com.au> writes:
> On 1/12/2009 7:36 PM, DM wrote:
>> select convert('Vall\xe9e','LATIN1','UTF8');

> You expect the result 'Vallée', right? As in the utf-8 bytes
>    Vall\xc3\xa9e
> ?

The whole thing seems like an exercise in unnecessary pain anyway.
Why not just set client_encoding to LATIN1 and send your strings
as-is, letting the server deal with the encoding conversion
automatically?

            regards, tom lane