Thread: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
"Ivan Zolotukhin"
Date:
Hello,

Imagine a web application that process text search queries from
clients. If one types a text search query in a browser it then sends
proper UTF-8 characters and application after all needed processing
(escaping, checks, etc) passes it to database. But if one modifies URL
of the query adding some trash non-UTF-8 characters, database raises
an error: invalid byte sequence for encoding "UTF8".

What is the best practice to process such a broken strings before
passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
characters?

--
Regards,
 Ivan

Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
Martijn van Oosterhout
Date:
On Wed, Aug 15, 2007 at 03:41:30PM +0400, Ivan Zolotukhin wrote:
> Hello,
>
> Imagine a web application that process text search queries from
> clients. If one types a text search query in a browser it then sends
> proper UTF-8 characters and application after all needed processing
> (escaping, checks, etc) passes it to database. But if one modifies URL
> of the query adding some trash non-UTF-8 characters, database raises
> an error: invalid byte sequence for encoding "UTF8".
>
> What is the best practice to process such a broken strings before
> passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
> characters?

Well, the query as given by the user is invalid, so returning an error
message complaining about the invalid byte sequence seems entirely
reasonable.

I don't see any reason to try and be smart. There's no way you can
"fix" the query.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment

Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
"Phoenix Kiula"
Date:
On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote:
> Hello,
>
> Imagine a web application that process text search queries from
> clients. If one types a text search query in a browser it then sends
> proper UTF-8 characters and application after all needed processing
> (escaping, checks, etc) passes it to database. But if one modifies URL
> of the query adding some trash non-UTF-8 characters, database raises
> an error: invalid byte sequence for encoding "UTF8".
>
> What is the best practice to process such a broken strings before
> passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
> characters?



Could you do not do this from the application itself, before sending
it to the DB server? mb_convert_encoding() in PHP for instance?

Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
"Ivan Zolotukhin"
Date:
Hello,

Well, PostgreSQL is correct entirely, I would post this message to the
-hackers list otherwise :) The question was rather about application
processing of user input not about change of database reaction on
broken UTF-8 string. But I am 100% sure one should fix the input in
this case since web site user can see some bad error (even if
application caught this SQL exception for instance) otherwise.

--
Regards,
 Ivan


On 8/15/07, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Wed, Aug 15, 2007 at 03:41:30PM +0400, Ivan Zolotukhin wrote:
> > Hello,
> >
> > Imagine a web application that process text search queries from
> > clients. If one types a text search query in a browser it then sends
> > proper UTF-8 characters and application after all needed processing
> > (escaping, checks, etc) passes it to database. But if one modifies URL
> > of the query adding some trash non-UTF-8 characters, database raises
> > an error: invalid byte sequence for encoding "UTF8".
> >
> > What is the best practice to process such a broken strings before
> > passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
> > characters?
>
> Well, the query as given by the user is invalid, so returning an error
> message complaining about the invalid byte sequence seems entirely
> reasonable.
>
> I don't see any reason to try and be smart. There's no way you can
> "fix" the query.
>
> Have a nice day,
> --
> Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> > From each according to his ability. To each according to his ability to litigate.
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.1 (GNU/Linux)
>
> iD8DBQFGwuvTIB7bNG8LQkwRAudJAJ9c8gvUQ25/S54gtJAPdqMOd81pNwCfUeLi
> JoWU92WJKZ1qM3UMRG5Zn0Y=
> =dPLv
> -----END PGP SIGNATURE-----
>
>

Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
"Ivan Zolotukhin"
Date:
Hello,

Actually I tried smth like $str = @iconv("UTF-8", "UTF-8//IGNORE",
$str); when preparing string for SQL query and it worked. There's
probably a better way in PHP to achieve this: simply change default
values in php.ini for these parameters:

mbstring.encoding_translation = On
mbstring.substitute_character = none

and broken symbols will be automatically stripped off from the input
and output. But I am interested in general solution and better
practices anyway...

--
Regards,
 Ivan


On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote:
> > Hello,
> >
> > Imagine a web application that process text search queries from
> > clients. If one types a text search query in a browser it then sends
> > proper UTF-8 characters and application after all needed processing
> > (escaping, checks, etc) passes it to database. But if one modifies URL
> > of the query adding some trash non-UTF-8 characters, database raises
> > an error: invalid byte sequence for encoding "UTF8".
> >
> > What is the best practice to process such a broken strings before
> > passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
> > characters?
>
>
>
> Could you do not do this from the application itself, before sending
> it to the DB server? mb_convert_encoding() in PHP for instance?
>

Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
Vivek Khera
Date:
On Aug 15, 2007, at 7:41 AM, Ivan Zolotukhin wrote:

> What is the best practice to process such a broken strings before
> passing them to PostgreSQL? Iconv from utf-8 to utf-8 dropping bad
> characters?

This rings of GIGO... if your user enters garbage, how do you know
what they wanted?  You don't.  You tell them they entered garbage,
else the result will be garbage.  Data validation... learn from
microsoft's mistakes :-)


Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
"Phoenix Kiula"
Date:
On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote:
> Hello,
>
> Actually I tried smth like $str = @iconv("UTF-8", "UTF-8//IGNORE",
> $str); when preparing string for SQL query and it worked. There's
> probably a better way in PHP to achieve this: simply change default
> values in php.ini for these parameters:
>
> mbstring.encoding_translation = On
> mbstring.substitute_character = none
>
> and broken symbols will be automatically stripped off from the input
> and output.


Sadly, they don't always do that, not with Asian scripts.

And I do not completely agree, like the other poster suggested, with
the concept of GIGO. Sometimes you want the end-user's experience to
be seamless. For example, in one of our web sites, we allow users to
submit text through a bookmarklet, where the title of the webpage
comes in rawurlencoded format. We try to rawurldecode() it on our end
but most of the times the Asian interpretation is wrong. We have all
the usual mbstring settings in php.ini. In this scenario, the user did
not enter any garbage. Our application should have the ability to
recognize the text. We do what we can with mb_convert...etc, but the
database just throws an error.

PGSQL really needs to get with the program when it comes to utf-8 input.

Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
"Scott Marlowe"
Date:
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> On 15/08/07, Ivan Zolotukhin <ivan.zolotukhin@gmail.com> wrote:
> > Hello,
> >
> > Actually I tried smth like $str = @iconv("UTF-8", "UTF-8//IGNORE",
> > $str); when preparing string for SQL query and it worked. There's
> > probably a better way in PHP to achieve this: simply change default
> > values in php.ini for these parameters:
> >
> > mbstring.encoding_translation = On
> > mbstring.substitute_character = none
> >
> > and broken symbols will be automatically stripped off from the input
> > and output.
>
>
> Sadly, they don't always do that, not with Asian scripts.
>
> And I do not completely agree, like the other poster suggested, with
> the concept of GIGO. Sometimes you want the end-user's experience to
> be seamless. For example, in one of our web sites, we allow users to
> submit text through a bookmarklet, where the title of the webpage
> comes in rawurlencoded format. We try to rawurldecode() it on our end
> but most of the times the Asian interpretation is wrong. We have all
> the usual mbstring settings in php.ini. In this scenario, the user did
> not enter any garbage. Our application should have the ability to
> recognize the text. We do what we can with mb_convert...etc, but the
> database just throws an error.
>
> PGSQL really needs to get with the program when it comes to utf-8 input.

What, exactly, does that mean?

That PostgreSQL should take things in invalid utf-8 format and just store them?
Or that PostgreSQL should autoconvert from invalid utf-8 to valid
utf-8, guessing the proper codes?

Seriously, what do you want pgsql to do with these invalid inputs?

Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
"Phoenix Kiula"
Date:
> What, exactly, does that mean?
>
> That PostgreSQL should take things in invalid utf-8 format and just store them?
> Or that PostgreSQL should autoconvert from invalid utf-8 to valid
> utf-8, guessing the proper codes?
>
> Seriously, what do you want pgsql to do with these invalid inputs?


PG should let me, as the administrator of the database, decide whether
I mind my DB to have an option to:

1. Either allow the "invalid" input

2. Or to continue storing the other information in the table even if
an exception was thrown for the utf-8 column (which may be an
unrequired column, for instance, so I may want it not to block the
storage of other valid input which is more important)

I am not advocating what others should do. But I know what I need my
DB to do. If I want it to store data that does not match puritanical
standards of textual storage, then it should allow me to...

It's just a wishlist item from me, but I realize my voice may be
inconsequential.

Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
"Scott Marlowe"
Date:
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> > What, exactly, does that mean?
> >
> > That PostgreSQL should take things in invalid utf-8 format and just store them?
> > Or that PostgreSQL should autoconvert from invalid utf-8 to valid
> > utf-8, guessing the proper codes?
> >
> > Seriously, what do you want pgsql to do with these invalid inputs?
>
>
> PG should let me, as the administrator of the database, decide whether
> I mind my DB to have an option to:
>
> 1. Either allow the "invalid" input
>
> 2. Or to continue storing the other information in the table even if
> an exception was thrown for the utf-8 column (which may be an
> unrequired column, for instance, so I may want it not to block the
> storage of other valid input which is more important)
>
> I am not advocating what others should do. But I know what I need my
> DB to do. If I want it to store data that does not match puritanical
> standards of textual storage, then it should allow me to...
>
> It's just a wishlist item from me, but I realize my voice may be
> inconsequential.

Can't you just create your db as SQL_ASCII and let it store anything
then?  I mean if you create a database in UTF-8 you're asking it to
carefully check the input.  If you set it to SQL_ASCII, you're telling
it to store anything.

On Thu, 16 Aug 2007, Phoenix Kiula wrote:

> I am not advocating what others should do. But I know what I need my
> DB to do. If I want it to store data that does not match puritanical
> standards of textual storage, then it should allow me to...

It does allow that: store it as a BLOB, and then treat it as text in your
app.

Me personally, if I expect something to be text, I expect it to be valid
text.

Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
"Phoenix Kiula"
Date:
On 16/08/07, Ben <bench@silentmedia.com> wrote:
> On Thu, 16 Aug 2007, Phoenix Kiula wrote:
>
> > I am not advocating what others should do. But I know what I need my
> > DB to do. If I want it to store data that does not match puritanical
> > standards of textual storage, then it should allow me to...
>
> It does allow that: store it as a BLOB, and then treat it as text in your
> app.
>
> Me personally, if I expect something to be text, I expect it to be valid
> text.
>


This is very useful, thanks. This would be "bytea"? Quick questions:

1. Even if it were bytea, would it work with regular SQL operators
such as regexp and LIKE?

2. Would tsearch2 work with bytea in the future as long as the stuff
in it was text?

Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
"Phoenix Kiula"
Date:
On 16/08/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> On 16/08/07, Ben <bench@silentmedia.com> wrote:
> > On Thu, 16 Aug 2007, Phoenix Kiula wrote:
> >
> > > I am not advocating what others should do. But I know what I need my
> > > DB to do. If I want it to store data that does not match puritanical
> > > standards of textual storage, then it should allow me to...
> >
> > It does allow that: store it as a BLOB, and then treat it as text in your
> > app.
> >
> > Me personally, if I expect something to be text, I expect it to be valid
> > text.
> >
>
>
> This is very useful, thanks. This would be "bytea"? Quick questions:
>
> 1. Even if it were bytea, would it work with regular SQL operators
> such as regexp and LIKE?
>
> 2. Would tsearch2 work with bytea in the future as long as the stuff
> in it was text?


I get the following error when I try to change a column that was
hitherto "varchar(500)" to bytea--

SQL error:
ERROR:  column "title" cannot be cast to type "bytea"

In statement:
ALTER TABLE "links" ALTER COLUMN "title" TYPE bytea


Any thoughts on how I can overcome this limitation?

On Thu, 16 Aug 2007, Phoenix Kiula wrote:

> 1. Even if it were bytea, would it work with regular SQL operators
> such as regexp and LIKE?
>
> 2. Would tsearch2 work with bytea in the future as long as the stuff
> in it was text?

As far as I know, regexp, [i]like, tsearch2, etc. all require valid text
encodings, and so only work with text, not blobs. But I might be wrong.
I'm sure the manual would be enlightening. :)

Re: Best practice for: ERROR: invalid byte sequence for encoding "UTF8"

From
Martijn van Oosterhout
Date:
On Thu, Aug 16, 2007 at 01:56:52AM +0800, Phoenix Kiula wrote:
> This is very useful, thanks. This would be "bytea"? Quick questions:
>
> 1. Even if it were bytea, would it work with regular SQL operators
> such as regexp and LIKE?

bytea is specifically designed for binary data, as such it has all
sorts of quoting rules for dealing with embedded nulls and such. It's
not quite a drop in replacement.

The earlier suggestion of SQL_ASCII is probably closer to what you
want. It does to regexes and LIKE, however postgres will treat all your
data as bytes. If you want you regexes to match Unicode character
classes that's too bad; you can't have it both ways. Sorting it goes in
byte order, you don't have a lot of choice there either.

> 2. Would tsearch2 work with bytea in the future as long as the stuff
> in it was text?

Doubt it, SQL_ASCII would work though.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment
Phoenix Kiula wrote:
> I am not advocating what others should do. But I know what I need my
> DB to do. If I want it to store data that does not match puritanical
> standards of textual storage, then it should allow me to...

Instead you want it to store tyrannically-chosen alternatives to the user's
invalid input without regard for their intent.

Calling rejection of invalid input "puritanical" is not an engineering evaluation.

--
Lew