Re: BUG #5221: errors converting string to utf8 - Mailing list pgsql-bugs

From Craig Ringer
Subject Re: BUG #5221: errors converting string to utf8
Date
Msg-id 4B15141B.5050809@postnewspapers.com.au
Whole thread Raw
In response to BUG #5221: errors converting string to utf8  ("DM" <dmedri@ciseonweb.it>)
Responses Re: BUG #5221: errors converting string to utf8  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
On 1/12/2009 7:36 PM, DM wrote:

> select convert('Vall\xe9e','LATIN1','UTF8');

You expect the result 'Vallée', right? As in the utf-8 bytes

   Vall\xc3\xa9e

?

> ERROR:  invalid byte sequence for encoding "UTF8": 0xe965
> HINT: This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by "client_encoding".

The HINT tells you what's going on here.

The error isn't being raised by convert(...). You'll get the same error
if you just:

     select 'Vall\xe9e';

... because the problem is that the server is trying to treat that
literal as a string in your client_encoding, which is UTF-8, and convert
it to the server encoding.

You will note that the convert(...) function takes a 'bytea' argument
rather than 'text'. That's because PostgreSQL string literals are
expected to be valid text in the current encoding. Entering a string
literal with a byte sequence invalid in the current encoding will be
problematic. You need to supply a valid bytea value, which involves
juggling some interesting double-escaping and using octal escapes for
characters.

The octal escape for the byte 0xe9 is \351, so to correctly submit your
query to PostgreSQL you'd need to send:

   select convert(E'Vall\\351e','LATIN1','UTF8');

or if you really just want to convert to the server encoding:

   select convert_from(E'Vall\\351e','LATIN-1');

Here, E'' is used for the string to tell PostgreSQL to interpret
escapes, and \\351 is used as the octal escape for 0xe9. We double the
backslashes because the first backslash gets used up parsing the
literal, and we want the bytea value to actually be 'Val\351e' after
string literal parsing.

Instead, if standard_conforming_strings is on, you can omit the E'' and
use single backslashes.

Note that in 8.5 and above you could use E'Val\\xe9e' instead of the
octal form, as 8.5 understands hex escapes in bytea values.

I guess bytea is kind of ugly to use in a text-based protocol :-(
because the query string has to be valid text in the client_encoding, so
bytea values containing data not valid in the client encoding must be
escaped in the parsed query string. There's nothing much to be done
about it, though.

--
Craig Ringer

pgsql-bugs by date:

Previous
From: "DM"
Date:
Subject: BUG #5221: errors converting string to utf8
Next
From: Tom Lane
Date:
Subject: Re: BUG #5221: errors converting string to utf8