"Michael Lewis" <mikelikespie@gmail.com> writes:
> I'm using Python to sanitize my logs from invalid UTF8 characters before
> COPYing them into postgres. I came across this one sequence that seems to
> be valid UTF8 (in the extended range I believe).
It is not valid. See http://tools.ietf.org/html/rfc3629 --- a sequence
beginning with ED must have a second byte in the range 80-9F (hex) to be
legal, and this doesn't. The example you give would decode as U+DF2D,
i.e. half of a surrogate pair, which is specifically disallowed in UTF-8
--- you're supposed to encode the original character directly, not via a
surrogate pair. The primary reason for this rule is that otherwise
there are multiple ways to encode the same character, which can be a
security hazard.
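
To make the ED rule concrete, here is a minimal Python 3 sketch. The
bytes ED BC AD are an assumption: they are what U+DF2D comes out as if
you push it through the 3-byte pattern, not something copied from your
log. The helper names are made up for illustration.

```python
def decode_three_byte(seq: bytes) -> int:
    """Combine the payload bits of a 3-byte UTF-8 sequence, with no validity checks."""
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def encodes_surrogate(seq: bytes) -> bool:
    """True if seq has the shape ED A0..BF 80..BF, i.e. maps to U+D800..U+DFFF."""
    return (
        len(seq) == 3
        and seq[0] == 0xED
        and 0xA0 <= seq[1] <= 0xBF
        and 0x80 <= seq[2] <= 0xBF
    )


def strict_decode_ok(seq: bytes) -> bool:
    """True if Python 3's strict UTF-8 decoder accepts seq."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Assumed sample: the bytes U+DF2D would produce if encoded naively.
sample = bytes([0xED, 0xBC, 0xAD])

print(hex(decode_three_byte(sample)))  # 0xdf2d
print(encodes_surrogate(sample))       # True
print(strict_decode_ok(sample))        # False
```

The second byte, BC, is outside 80-9F, which is why the sequence fails
the RFC 3629 rule. A check along the lines of strict_decode_ok() could
serve as the filter before COPY, assuming the decoder in use enforces
the rule.
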
> It goes through both Python's encoding as well as iconv without an error.
You should file bugs against those tools.
regards, tom lane