Re: BUG #5532: Valid UTF8 sequence errors as invalid - Mailing list pgsql-bugs

From	Tom Lane
Subject	Re: BUG #5532: Valid UTF8 sequence errors as invalid
Date	June 30, 2010 13:44:55
Msg-id	12210.1277916285@sss.pgh.pa.us
In response to	BUG #5532: Valid UTF8 sequence errors as invalid ("Michael Lewis" <mikelikespie@gmail.com>)
Responses	Re: BUG #5532: Valid UTF8 sequence errors as invalid
List	pgsql-bugs

"Michael Lewis" <mikelikespie@gmail.com> writes:
> I'm using Python to sanitize my logs from invalid UTF8 characters before
> COPYing them into postgres.  I came across this one sequence that seems to
> be valid UTF8 (in the extended range I believe).

It is not valid.  See http://tools.ietf.org/html/rfc3629 --- a sequence
beginning with the byte ED must have a second byte in the range 80-9F
(hex) to be legal, and this doesn't.  The example you give would decode
as U+DF2D, i.e. one half of a surrogate pair, which is specifically
disallowed in UTF8 --- you're supposed to encode the original character
directly, not via a surrogate pair.  The primary reason for this rule is
that otherwise there would be multiple ways to encode the same character,
which can be a security hazard.
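
The rule above can be checked mechanically.  What follows is a minimal
sketch, assuming Python 3, whose strict UTF-8 decoder rejects surrogate
code points.  The byte string ED BC AD is reconstructed here from U+DF2D
and is not quoted from the original report, and the helper names are
hypothetical.

```python
def encodes_surrogate(seq: bytes) -> bool:
    """True if seq is a 3-byte sequence that would decode to U+D800..U+DFFF.

    Per RFC 3629, a lead byte of ED must be followed by 80-9F; a second
    byte of A0-BF lands in the surrogate range and is therefore illegal.
    """
    return (
        len(seq) == 3
        and seq[0] == 0xED
        and 0xA0 <= seq[1] <= 0xBF
        and 0x80 <= seq[2] <= 0xBF
    )


def is_strict_utf8(data: bytes) -> bool:
    """True if data decodes under Python 3's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def sanitize(data: bytes) -> bytes:
    """Replace undecodable bytes with U+FFFD so the result is valid UTF-8."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


# Assumed example bytes: the 3-byte form of U+DF2D (not taken from the report).
sample = bytes([0xED, 0xBC, 0xAD])

print(encodes_surrogate(sample))   # True
print(is_strict_utf8(sample))      # False
print(is_strict_utf8(sanitize(b"log line " + sample)))  # True
```

Running data through something like sanitize() before COPY is one way to
avoid the error, at the cost of losing the offending bytes.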

> It goes through both Python's encoding as well as iconv without an error.

You should file bugs against those tools.

            regards, tom lane
