Re: invalidly encoded strings - Mailing list pgsql-hackers

From	db@zigo.dhs.org
Subject	Re: invalidly encoded strings
Date	September 11, 2007 03:36:39
Msg-id	46828.192.121.104.48.1189492554.squirrel@zigo.dhs.org
In response to	Re: invalidly encoded strings (Tatsuo Ishii <ishii@postgresql.org>)
List	pgsql-hackers

>> Try the sequence below. Then, try to dump and then reload the database.
>> When you try to reload it, you will get an error:
>>
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xbd
>
> I know this could be a problem (like chr() with invalid byte pattern).

And that's enough of a problem already. We don't need more problems.

> What I really want to know is, read query something like this:
>
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);
>
> could be a problem (I assume we use C locale).

If convert() produces a sequence of bytes that can't be interpreted as a
string in the server encoding then it's broken. IMHO convert() should
return a bytea value. If we had good encoding/charset support we could do
better, but we can't today.

The above example would work fine if convert() returned a bytea. In the C
locale the string would be compared byte for byte and that's what you get
with bytea values as well.
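
To illustrate the point about byte-for-byte comparison, here is a minimal
sketch in Python rather than SQL. It assumes Python's bytes type is a fair
stand-in for bytea, and the sample words and the helper name
euc_jp_sort_key are made up for the example. The converted values are only
ever compared as raw bytes and are never treated as text again:

```python
# Hypothetical sample data; Python's bytes type stands in for bytea.
words = ["東京", "大阪", "名古屋"]


def euc_jp_sort_key(text: str) -> bytes:
    """Return the EUC-JP bytes of text, to be compared byte for byte."""
    return text.encode("euc_jp")


# bytes objects compare lexicographically by unsigned byte value,
# which is the same ordering a memcmp-style comparison gives.
ordered = sorted(words, key=euc_jp_sort_key)

for word in ordered:
    print(word, euc_jp_sort_key(word).hex())
```

The sort keys stay binary for their whole lifetime, so no invalidly encoded
string is ever created.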

Strings are not sequences of bytes that can be interpreted in different
ways. That's what bytea values are. A string is always in one specific
encoding, and in pg that encoding is fixed to a single one for a whole
cluster at initdb time. We should not confuse text with bytea.
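
The difference between text and bytea can be sketched outside the database
as well. The following Python snippet is only an illustration, not how the
server validates input; the helper name is_valid_in_encoding and the sample
string are assumptions made for the example. It shows that the same
characters converted to EUC-JP are perfectly good bytes, but are not a valid
string in a UTF8 database, and that a lone 0xbd byte (as in the error
message quoted at the top) is not valid UTF-8 either:

```python
def is_valid_in_encoding(raw: bytes, encoding: str = "utf-8") -> bool:
    """Return True if raw decodes cleanly in the given encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample text, encoded two different ways.
utf8_bytes = "日本語".encode("utf-8")
euc_jp_bytes = "日本語".encode("euc_jp")

print(is_valid_in_encoding(utf8_bytes))              # valid as UTF-8 text
print(is_valid_in_encoding(euc_jp_bytes))            # not valid as UTF-8 text
print(is_valid_in_encoding(euc_jp_bytes, "euc_jp"))  # valid as EUC-JP
print(is_valid_in_encoding(b"\xbd"))                 # lone 0xbd, not valid UTF-8
```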

/Dennis
