Re: invalidly encoded strings - Mailing list pgsql-hackers

From db@zigo.dhs.org
Subject Re: invalidly encoded strings
Msg-id 46828.192.121.104.48.1189492554.squirrel@zigo.dhs.org
In response to Re: invalidly encoded strings  (Tatsuo Ishii <ishii@postgresql.org>)
List pgsql-hackers
>> Try the sequence below. Then, try to dump and then reload the database.
>> When you try to reload it, you will get an error:
>>
>> ERROR:  invalid byte sequence for encoding "UTF8": 0xbd
>
> I know this could be a problem (like chr() with invalid byte pattern).

And that's enough of a problem already. We don't need more problems.

> What I really want to know is, read query something like this:
>
> SELECT * FROM japanese_table ORDER BY convert(japanese_text using
> utf8_to_euc_jp);
>
> could be a problem (I assume we use C locale).

If convert() produces a sequence of bytes that can't be interpreted as a
string in the server encoding, then it's broken. IMHO convert() should
return a bytea value. If we had good encoding/charset support we could do
better, but we can't today.
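
As a rough illustration outside PostgreSQL, the Python sketch below shows why
the result of such a conversion cannot be treated as text in a UTF8 database.
The helper is_valid_utf8 and the sample string are assumptions made up for
this example; nothing here is taken from the server code.

```python
# Illustration only: plain Python, not PostgreSQL internals.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte sequence decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample value standing in for a japanese_text column.
sample_text = "日本語"

# What a utf8_to_euc_jp conversion hands back: EUC-JP encoded bytes.
euc_bytes = sample_text.encode("euc_jp")

# The original text is valid in the "server encoding" (UTF-8 here) ...
utf8_ok = is_valid_utf8(sample_text.encode("utf-8"))

# ... but the converted bytes are not, so they cannot be a UTF-8 string.
euc_ok = is_valid_utf8(euc_bytes)

# The lone byte from the error message above is rejected the same way.
lone_ok = is_valid_utf8(b"\xbd")

print(utf8_ok, euc_ok, lone_ok)
```

The converted value is perfectly good data, just not text in the server
encoding, which is the case for handing it back as bytea.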

The above example would work fine if convert() returned a bytea. In the C
locale the string would be compared byte for byte and that's what you get
with bytea values as well.
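
A small Python sketch of the same idea, assuming a few made-up sample values:
sorting by the EUC-JP bytes gives a plain byte-for-byte ordering, and the text
itself never has to hold bytes that are invalid in its own encoding. The name
euc_jp_key is hypothetical.

```python
# Illustration only: Python bytes objects compare byte for byte,
# similar to how bytea values (or strings in the C locale) are compared.

def euc_jp_key(text: str) -> bytes:
    """Sort key: the EUC-JP byte representation of the text."""
    return text.encode("euc_jp")


# Hypothetical rows of a japanese_text column.
words = ["日本", "東京", "大阪"]

# Roughly the analogue of ORDER BY <converted bytes>: the key is a byte
# string, the rows that come back are still ordinary text.
ordered = sorted(words, key=euc_jp_key)

for word in ordered:
    print(word, euc_jp_key(word).hex())
```

The sort key lives only as bytes; the values that are returned stay valid
strings throughout.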

Strings are not sequences of bytes that can be interpreted in different
ways. That's what bytea values are. Strings are in a specific encoding
always, and in pg that encoding is fixed to a single one for a whole
cluster at initdb time. We should not confuse text with bytea.

/Dennis


