On Tue, Jan 31, 2012 at 1:29 PM, Abhijit Menon-Sen <ams@toroid.org> wrote:
> At 2012-01-31 12:04:31 -0500, robertmhaas@gmail.com wrote:
>>
>> That fails to answer the question of what we ought to do if we get an
>> invalid sequence there.
>
> I think it's best to categorically reject invalid surrogates as early as
> possible, considering the number of bugs that are related to them (not
> in Postgres, just in general). I can't see anything good coming from
> letting them in and leaving them to surprise someone in future.
>
> -- ams

+1
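
For concreteness, here is a rough sketch of the kind of early check I
have in mind, written in Python rather than C just to keep it short.
The function name has_invalid_surrogates is made up for illustration
and is not anything in the patch. It assumes it is handed the
still-escaped body of a JSON string and only looks at \uXXXX escapes,
not at raw surrogate code points already present in the input.

```python
import re

# Match any backslash escape. Group 1 is set only for \uXXXX, so an escaped
# backslash followed by a literal "u" is not mistaken for a \u escape.
_ESCAPE = re.compile(r'\\(?:u([0-9a-fA-F]{4})|.)', re.DOTALL)


def has_invalid_surrogates(raw):
    # Offset just past an unmatched high-surrogate escape, or None.
    high_end = None
    for m in _ESCAPE.finditer(raw):
        code = int(m.group(1), 16) if m.group(1) else None
        if high_end is not None:
            # A high surrogate must be followed immediately by a low one.
            if (code is None or m.start() != high_end
                    or not 0xDC00 <= code <= 0xDFFF):
                return True
            high_end = None
            continue
        if code is None:
            continue
        if 0xD800 <= code <= 0xDBFF:
            high_end = m.end()
        elif 0xDC00 <= code <= 0xDFFF:
            # Low surrogate with no high surrogate before it.
            return True
    # A high surrogate at the very end (or followed only by plain text).
    return high_end is not None
```

A well-formed pair such as \ud83d\ude00 passes; a lone \ud83d, a lone
\ude00, or a pair split by other text is rejected.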

Another sequence to beware of is \u0000. While escaped NUL characters
are perfectly valid in JSON, NUL characters aren't allowed in TEXT
values. This means not all JSON strings can be converted to TEXT,
even when the database encoding is UTF-8. This may also complicate
collation, if comparison functions demand NUL-terminated strings.
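
To illustrate both problems, a small Python sketch (the helper names
fits_in_text and c_string_view are hypothetical, and this only
imitates what a NUL-terminated comparison would see; it is not how the
backend does it):

```python
import json


def fits_in_text(literal):
    # Decode a JSON string literal and report whether the result is free
    # of NUL characters, i.e. whether it could be stored as TEXT.
    value = json.loads(literal)
    if not isinstance(value, str):
        raise TypeError("expected a JSON string literal")
    return "\x00" not in value


def c_string_view(value):
    # The bytes a NUL-terminated comparison function would actually look
    # at: everything before the first NUL in the UTF-8 encoding.
    return value.encode("utf-8").split(b"\x00", 1)[0]


a = json.loads('"a\\u0000b"')
b = json.loads('"a\\u0000c"')
# a and b differ, but c_string_view(a) and c_string_view(b) are both b"a".
```

So two distinct JSON strings can look identical to anything that stops
at the first NUL.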

I'm mostly in favor of allowing \u0000. Banning \u0000 means users
can't use JSON strings to marshal binary blobs, e.g. by escaping
non-printable characters and only using U+0000..U+00FF. Instead, they
have to use base64 or similar.
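
Here is roughly what that marshalling trick looks like, sketched in
Python under the assumption that each byte is mapped to the code point
of the same value (which is what the latin-1 codec does). The function
names are made up for the example.

```python
import json


def blob_to_json(blob):
    # latin-1 maps bytes 0x00..0xFF one-to-one onto U+0000..U+00FF.
    # json.dumps then escapes control and non-ASCII characters as \uXXXX.
    return json.dumps(blob.decode("latin-1"))


def json_to_blob(literal):
    # Inverse mapping; encode() raises UnicodeEncodeError if the string
    # contains anything above U+00FF.
    return json.loads(literal).encode("latin-1")


blob = bytes(range(256))
encoded = blob_to_json(blob)
decoded = json_to_blob(encoded)
```

The round trip only works if \u0000 is accepted; any blob containing a
zero byte produces that escape.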

Banning \u0000 doesn't quite violate the RFC:

    An implementation may set limits on the length and character
    contents of strings.

-Joey