Thread: strange encoding behavior
The following encoding behavior seems strange to me (v8.1.4). I have read the docs, but I am still confused. I have a UTF8 encoded database. I can do => SELECT '\xb9'::text; But that seems to be the only way to get an invalid utf8 byte sequence into a text type. Even if I do PQexecParams and send the data as binary format (and type text), I get a conversion error. If I send the invalid character in a raw PQexec query, I assume that postgres tries to convert it to cstring first, causing the conversion error. That means it's impossible to send any character that's an invalid UTF8 sequence in a raw query (as a value, anyway), as far as I can tell. What motivates this question is that I have an application inserting these invalid characters (using them in the raw query), and I am finding it difficult to migrate to the UTF8-encoded database. It seems strange that it's possible to put invalid utf8 byte sequences in a text field, but only by using the E''-style escape sequences. The only way I have found to do it using PQexecParams with the binary data is something like: => SELECT textin(byteaout($1)); -- $1 is binary format, type bytea So, if I were to sum this up in a single question, why does cstring not accept invalid utf8 sequences? And if it doesn't, why are they allowed in any text type? Regards, Jeff Davis PS: I posted a similar question yesterday that included a lot of useless information. I'm not trying to repost, I'm trying to focus my question a little more.
Jeff Davis wrote: > I have a UTF8 encoded database. I can do > > => SELECT '\xb9'::text; > > But that seems to be the only way to get an invalid utf8 byte sequence > into a text type. [...] > So, if I were to sum this up in a single question, why does cstring not > accept invalid utf8 sequences? And if it doesn't, why are they allowed > in any text type? I would say that it should be impossible to get invalid UTF-8 bytes into a text on an UTF-8 database, and my opinion is that it is a bug or oversight if a typecast allows you to do so. The program you are talking about that needs to be able to store arbitrary bytes in a text column should be changed - maybe it is enough to change the data type of the database column from 'text' to 'bytea'. Yours, Laurenz Albe
On Mon, 2006-10-23 at 10:26 +0200, Albe Laurenz wrote: > Jeff Davis wrote: > > I have a UTF8 encoded database. I can do > > > > => SELECT '\xb9'::text; > > > > But that seems to be the only way to get an invalid utf8 byte sequence > > into a text type. > [...] > > So, if I were to sum this up in a single question, why does cstring > not > > accept invalid utf8 sequences? And if it doesn't, why are they allowed > > in any text type? > > I would say that it should be impossible to get invalid UTF-8 bytes > into a text on an UTF-8 database, and my opinion is that it is a bug or > oversight if a typecast allows you to do so. That wouldn't help me, but it seems like more consistent behavior. > The program you are talking about that needs to be able to store > arbitrary bytes in a text column should be changed - maybe it is enough > to change the data type of the database column from 'text' to 'bytea'. > The problem is that all the bytes in the quoted string are converted to a cstring first, which rejects invalid UTF8 sequences. So, even if it's bytea type, the query itself can't contain the bytes I want to store. The only way bytea would work is using PQexecParams and setting the type to bytea and the format to binary. I agree that's the more robust way for the application to be written, but unfortunately that's not how it was written. Regards, Jeff Davis