Thread: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"
Hello, Imagine a web application that process text search queries from clients. If one types a text search query in a browser it then sends proper UTF-8 characters and application after all needed processing (escaping, checks, etc) passes it to database. But if one modifies URL of the query adding some trash non-UTF-8 characters, database raises an error: invalid byte sequence for encoding "UTF8". What is the best practice to process such a broken strings before passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad characters? -- Regards, Ivan
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"
From
Martijn van Oosterhout
Date:
On Wed, Aug 15, 2007 at 03:41:30PM +0400, Ivan Zolotukhin wrote: > Hello, > > Imagine a web application that process text search queries from > clients. If one types a text search query in a browser it then sends > proper UTF-8 characters and application after all needed processing > (escaping, checks, etc) passes it to database. But if one modifies URL > of the query adding some trash non-UTF-8 characters, database raises > an error: invalid byte sequence for encoding "UTF8". > > What is the best practice to process such a broken strings before > passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad > characters? Well, the query as given by the user is invalid, so returning an error message complaining about the invalid byte sequence seems entirely reasonable. I don't see any reason to try and be smart. There's no way you can "fix" the query. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Attachment
On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote: > Hello, > > Imagine a web application that process text search queries from > clients. If one types a text search query in a browser it then sends > proper UTF-8 characters and application after all needed processing > (escaping, checks, etc) passes it to database. But if one modifies URL > of the query adding some trash non-UTF-8 characters, database raises > an error: invalid byte sequence for encoding "UTF8". > > What is the best practice to process such a broken strings before > passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad > characters? Could you do not do this from the application itself, before sending it to the DB server? mb_convert_encoding() in PHP for instance?
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"
From
"Ivan Zolotukhin"
Date:
Hello, Well, PostgreSQL is correct entirely, I would post this message to the -hackers list otherwise :) The question was rather about application processing of user input not about change of database reaction on broken UTF-8 string. But I am 100% sure one should fix the input in this case since web site user can see some bad error (even if application caught this SQL exception for instance) otherwise. -- Regards, Ivan On 8/15/07, Martijn van Oosterhout <kleptog@svana.org> wrote: > On Wed, Aug 15, 2007 at 03:41:30PM +0400, Ivan Zolotukhin wrote: > > Hello, > > > > Imagine a web application that process text search queries from > > clients. If one types a text search query in a browser it then sends > > proper UTF-8 characters and application after all needed processing > > (escaping, checks, etc) passes it to database. But if one modifies URL > > of the query adding some trash non-UTF-8 characters, database raises > > an error: invalid byte sequence for encoding "UTF8". > > > > What is the best practice to process such a broken strings before > > passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad > > characters? > > Well, the query as given by the user is invalid, so returning an error > message complaining about the invalid byte sequence seems entirely > reasonable. > > I don't see any reason to try and be smart. There's no way you can > "fix" the query. > > Have a nice day, > -- > Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > > From each according to his ability. To each according to his ability to litigate. > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.1 (GNU/Linux) > > iD8DBQFGwuvTIB7bNG8LQkwRAudJAJ9c8gvUQ25/S54gtJAPdqMOd81pNwCfUeLi > JoWU92WJKZ1qM3UMRG5Zn0Y= > =dPLv > -----END PGP SIGNATURE----- > >
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"
From
"Ivan Zolotukhin"
Date:
Hello, Actually I tried smth like $str = @iconv("UTF-8", "UTF-8//IGNORE", $str); when preparing string for SQL query and it worked. There's probably a better way in PHP to achieve this: simply change default values in php.ini for these parameters: mbstring.encoding_translation = On mbstring.substitute_character = none and broken symbols will be automatically stripped off from the input and output. But I am interested in general solution and better practices anyway... -- Regards, Ivan On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote: > On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote: > > Hello, > > > > Imagine a web application that process text search queries from > > clients. If one types a text search query in a browser it then sends > > proper UTF-8 characters and application after all needed processing > > (escaping, checks, etc) passes it to database. But if one modifies URL > > of the query adding some trash non-UTF-8 characters, database raises > > an error: invalid byte sequence for encoding "UTF8". > > > > What is the best practice to process such a broken strings before > > passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad > > characters? > > > > Could you do not do this from the application itself, before sending > it to the DB server? mb_convert_encoding() in PHP for instance? >
On Aug 15, 2007, at 7:41 AM, Ivan Zolotukhin wrote: > What is the best practice to process such a broken strings before > passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad > characters? This rings of GIGO... if your user enters garbage, how do you know what they wanted? You don't. You tell them they entered garbage, else the result will be garbage. Data validation... learn from microsoft's mistakes :-)
On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote: > Hello, > > Actually I tried smth like $str = @iconv("UTF-8", "UTF-8//IGNORE", > $str); when preparing string for SQL query and it worked. There's > probably a better way in PHP to achieve this: simply change default > values in php.ini for these parameters: > > mbstring.encoding_translation = On > mbstring.substitute_character = none > > and broken symbols will be automatically stripped off from the input > and output. Sadly, they don't always do that, not with Asian scripts. And I do not completely agree, like the other poster suggested, with the concept of GIGO. Sometimes you want the end-user's experience to be seamless. For example, in one of our web sites, we allow users to submit text through a bookmarklet, where the title of the webpage comes in rawurlencoded format. We try to rawurldecode() it on our end but most of the times the Asian interpretation is wrong. We have all the usual mbstring settings in php.ini. In this scenario, the user did not enter any garbage. Our application should have the ability to recognize the text. We do what we can with mb_convert...etc, but the database just throws an error. PGSQL really needs to get with the program when it comes to utf-8 input.
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote: > On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote: > > Hello, > > > > Actually I tried smth like $str = @iconv("UTF-8", "UTF-8//IGNORE", > > $str); when preparing string for SQL query and it worked. There's > > probably a better way in PHP to achieve this: simply change default > > values in php.ini for these parameters: > > > > mbstring.encoding_translation = On > > mbstring.substitute_character = none > > > > and broken symbols will be automatically stripped off from the input > > and output. > > > Sadly, they don't always do that, not with Asian scripts. > > And I do not completely agree, like the other poster suggested, with > the concept of GIGO. Sometimes you want the end-user's experience to > be seamless. For example, in one of our web sites, we allow users to > submit text through a bookmarklet, where the title of the webpage > comes in rawurlencoded format. We try to rawurldecode() it on our end > but most of the times the Asian interpretation is wrong. We have all > the usual mbstring settings in php.ini. In this scenario, the user did > not enter any garbage. Our application should have the ability to > recognize the text. We do what we can with mb_convert...etc, but the > database just throws an error. > > PGSQL really needs to get with the program when it comes to utf-8 input. What, exactly, does that mean? That PostgreSQL should take things in invalid utf-8 format and just store them? Or that PostgreSQL should autoconvert from invalid utf-8 to valid utf-8, guessing the proper codes? Seriously, what do you want pgsql to do with these invalid inputs?
> What, exactly, does that mean? > > That PostgreSQL should take things in invalid utf-8 format and just store them? > Or that PostgreSQL should autoconvert from invalid utf-8 to valid > utf-8, guessing the proper codes? > > Seriously, what do you want pgsql to do with these invalid inputs? PG should let me, as the administrator of the database, decide whether I mind my DB to have an option to: 1. Either allow the "invalid" input 2. Or to continue storing the other information in the table even if an exception was thrown for the utf-8 column (which may be an unrequired column, for instance, so I may want it not to block the storage of other valid input which is more important) I am not advocating what others should do. But I know what I need my DB to do. If I want it to store data that does not match puritanical standards of textual storage, then it should allow me to... It's just a wishlist item from me, but I realize my voice may be inconsequential.
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote: > > What, exactly, does that mean? > > > > That PostgreSQL should take things in invalid utf-8 format and just store them? > > Or that PostgreSQL should autoconvert from invalid utf-8 to valid > > utf-8, guessing the proper codes? > > > > Seriously, what do you want pgsql to do with these invalid inputs? > > > PG should let me, as the administrator of the database, decide whether > I mind my DB to have an option to: > > 1. Either allow the "invalid" input > > 2. Or to continue storing the other information in the table even if > an exception was thrown for the utf-8 column (which may be an > unrequired column, for instance, so I may want it not to block the > storage of other valid input which is more important) > > I am not advocating what others should do. But I know what I need my > DB to do. If I want it to store data that does not match puritanical > standards of textual storage, then it should allow me to... > > It's just a wishlist item from me, but I realize my voice may be > inconsequential. Can't you just create your db as SQL_ASCII and let it store anything then? I mean if you create a database in UTF-8 you're asking it to carefully check the input. If you set it to SQL_ASCII, you're telling it to store anything.
On Thu, 16 Aug 2007, Phoenix Kiula wrote: > I am not advocating what others should do. But I know what I need my > DB to do. If I want it to store data that does not match puritanical > standards of textual storage, then it should allow me to... It does allow that: store it as a BLOB, and then treat it as text in your app. Me personally, if I expect something to be text, I expect it to be valid text.
On 16/08/07, Ben <bench@silentmedia.com> wrote: > On Thu, 16 Aug 2007, Phoenix Kiula wrote: > > > I am not advocating what others should do. But I know what I need my > > DB to do. If I want it to store data that does not match puritanical > > standards of textual storage, then it should allow me to... > > It does allow that: store it as a BLOB, and then treat it as text in your > app. > > Me personally, if I expect something to be text, I expect it to be valid > text. > This is very useful, thanks. This would be "bytea"? Quick questions: 1. Even if it were bytea, would it work with regular SQL operators such as regexp and LIKE? 2. Would tsearch2 work with bytea in the future as long as the stuff in it was text?
On 16/08/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote: > On 16/08/07, Ben <bench@silentmedia.com> wrote: > > On Thu, 16 Aug 2007, Phoenix Kiula wrote: > > > > > I am not advocating what others should do. But I know what I need my > > > DB to do. If I want it to store data that does not match puritanical > > > standards of textual storage, then it should allow me to... > > > > It does allow that: store it as a BLOB, and then treat it as text in your > > app. > > > > Me personally, if I expect something to be text, I expect it to be valid > > text. > > > > > This is very useful, thanks. This would be "bytea"? Quick questions: > > 1. Even if it were bytea, would it work with regular SQL operators > such as regexp and LIKE? > > 2. Would tsearch2 work with bytea in the future as long as the stuff > in it was text? I get the following error when I try to change a column that was hitherto "varchar(500)" to bytea-- SQL error: ERROR: column "title" cannot be cast to type "bytea" In statement: ALTER TABLE "links" ALTER COLUMN "title" TYPE bytea Any thoughts on how I can overcome this limitation?
On Thu, 16 Aug 2007, Phoenix Kiula wrote: > 1. Even if it were bytea, would it work with regular SQL operators > such as regexp and LIKE? > > 2. Would tsearch2 work with bytea in the future as long as the stuff > in it was text? As far as I know, regexp, [i]like, tsearch2, etc. all require valid text encodings, and so only work with text, not blobs. But I might be wrong. I'm sure the manual would be enlightening. :)
Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"
From
Martijn van Oosterhout
Date:
On Thu, Aug 16, 2007 at 01:56:52AM +0800, Phoenix Kiula wrote: > This is very useful, thanks. This would be "bytea"? Quick questions: > > 1. Even if it were bytea, would it work with regular SQL operators > such as regexp and LIKE? bytea is specifically designed for binary data, as such it has all sorts of quoting rules for dealing with embedded nulls and such. It's not quite a drop in replacement. The earlier suggestion of SQL_ASCII is probably closer to what you want. It does to regexes and LIKE, however postgres will treat all your data as bytes. If you want you regexes to match Unicode character classes that's too bad; you can't have it both ways. Sorting it goes in byte order, you don't have a lot of choice there either. > 2. Would tsearch2 work with bytea in the future as long as the stuff > in it was text? Doubt it, SQL_ASCII would work though. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to litigate.
Attachment
Phoenix Kiula wrote: > I am not advocating what others should do. But I know what I need my > DB to do. If I want it to store data that does not match puritanical > standards of textual storage, then it should allow me to... Instead you want it to store tyrannically-chosen alternatives to the user's invalid input without regard for their intent. Calling rejection of invalid input "puritanical" is not an engineering evaluation. -- Lew