Thread: strange encoding behavior

strange encoding behavior

From
Jeff Davis
Date:
The following encoding behavior seems strange to me (v8.1.4). I have
read the docs, but I am still confused.

I have a UTF8 encoded database. I can do

=> SELECT '\xb9'::text;

But that seems to be the only way to get an invalid utf8 byte sequence
into a text type.

Even if I do PQexecParams and send the data as binary format (and type
text), I get a conversion error.

If I send the invalid character in a raw PQexec query, I assume that
postgres tries to convert it to cstring first, causing the conversion
error. That means it's impossible to send any character that's an
invalid UTF8 sequence in a raw query (as a value, anyway), as far as I
can tell. What motivates this question is that I have an application
inserting these invalid characters (using them in the raw query), and I
am finding it difficult to migrate to the UTF8-encoded database.

It seems strange that it's possible to put invalid utf8 byte sequences
in a text field, but only by using the E''-style escape sequences. The
only way I have found to do it using PQexecParams with the binary data
is something like:

=> SELECT textin(byteaout($1)); -- $1 is binary format, type bytea

So, if I were to sum this up in a single question, why does cstring not
accept invalid utf8 sequences? And if it doesn't, why are they allowed
in any text type?

Regards,
    Jeff Davis

PS: I posted a similar question yesterday that included a lot of useless
information. I'm not trying to repost, I'm trying to focus my question a
little more.

Re: strange encoding behavior

From
"Albe Laurenz"
Date:
Jeff Davis wrote:
> I have a UTF8 encoded database. I can do
>
> => SELECT '\xb9'::text;
>
> But that seems to be the only way to get an invalid utf8 byte sequence
> into a text type.
[...]
> So, if I were to sum this up in a single question, why does cstring
not
> accept invalid utf8 sequences? And if it doesn't, why are they allowed
> in any text type?

I would say that it should be impossible to get invalid UTF-8 bytes
into a text on an UTF-8 database, and my opinion is that it is a bug or
oversight if a typecast allows you to do so.

The program you are talking about that needs to be able to store
arbitrary bytes in a text column should be changed - maybe it is enough
to change the data type of the database column from 'text' to 'bytea'.

Yours,
Laurenz Albe

Re: strange encoding behavior

From
Jeff Davis
Date:
On Mon, 2006-10-23 at 10:26 +0200, Albe Laurenz wrote:
> Jeff Davis wrote:
> > I have a UTF8 encoded database. I can do
> >
> > => SELECT '\xb9'::text;
> >
> > But that seems to be the only way to get an invalid utf8 byte sequence
> > into a text type.
> [...]
> > So, if I were to sum this up in a single question, why does cstring
> not
> > accept invalid utf8 sequences? And if it doesn't, why are they allowed
> > in any text type?
>
> I would say that it should be impossible to get invalid UTF-8 bytes
> into a text on an UTF-8 database, and my opinion is that it is a bug or
> oversight if a typecast allows you to do so.

That wouldn't help me, but it seems like more consistent behavior.

> The program you are talking about that needs to be able to store
> arbitrary bytes in a text column should be changed - maybe it is enough
> to change the data type of the database column from 'text' to 'bytea'.
>

The problem is that all the bytes in the quoted string are converted to
a cstring first, which rejects invalid UTF8 sequences. So, even if it's
bytea type, the query itself can't contain the bytes I want to store.

The only way bytea would work is using PQexecParams and setting the type
to bytea and the format to binary. I agree that's the more robust way
for the application to be written, but unfortunately that's not how it
was written.

Regards,
    Jeff Davis