Re: JSON for PG 9.2 - Mailing list pgsql-hackers

From Robert Haas
Subject Re: JSON for PG 9.2
Date
Msg-id CA+TgmoYg_SdB70gxx2vFW3z+oB8K7aU8XnQwp+sB0_H7c2FehQ@mail.gmail.com
Whole thread Raw
In response to Re: JSON for PG 9.2  (Peter Eisentraut <peter_e@gmx.net>)
Responses Re: JSON for PG 9.2  (Abhijit Menon-Sen <ams@toroid.org>)
List pgsql-hackers
On Mon, Jan 23, 2012 at 3:20 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On sön, 2012-01-22 at 11:43 -0500, Andrew Dunstan wrote:
>> Actually, given recent discussion I think that test should just be
>> removed from json.c. We don't actually have any test that the code
>> point is valid (e.g. that it doesn't refer to an unallocated code
>> point). We don't do that elsewhere either - the unicode_to_utf8()
>> function the scanner uses to turn \unnnn escapes into utf8 doesn't
>> look for unallocated code points. I'm not sure how much other
>> validation we should do - for example on correct use of surrogate
>> pairs.
>
> We do check the correctness of surrogate pairs elsewhere.  Search for
> "surrogate" in scan.l; should be easy to copy.

I've committed a version of this that does NOT do surrogate pair
validation.  Per discussion elsewhere, I also removed the check for
\uXXXX with XXXX > 007F and database encoding != UTF8.  This will
complicate things somewhat when we get around to doing
canonicalization and comparison, but Tom seems confident that those
issues are manageble.  I did not commit Andrew's further changes,
either; I'm assuming he'll do that himself.

With respect to the issue of whether we ought to check surrogate
pairs, the JSON spec is not a whole lot of help.  RFC4627 says:
  To escape an extended character that is not in the Basic Multilingual  Plane, the character is represented as a
twelve-charactersequence,  encoding the UTF-16 surrogate pair.  So, for example, a string  containing only the G clef
character(U+1D11E) may be represented as  "\uD834\uDD1E". 

That fails to answer the question of what we ought to do if we get an
invalid sequence there.  You could make an argument that we ought to
just allow it; it doesn't particularly hinder our ability to
canonicalize or compare strings, because our notion of sort-ordering
for characters that may span multiple encodings is going to be pretty
funky anyway.  We can just leave those bits as \uXXXX sequences and
call it good.  However, it would hinder our ability to convert a JSON
string to a string in the database encoding: we could find an
invalidate surrogate pair that was allowable as JSON but
unrepresentable in the database encoding.  On the flip side, given our
decision to allow all \uXXXX sequences even when not using UTF-8, we
could also run across a perfectly valid UTF-8 sequence that's not
representable as a character in the server encoding, so it seems we
have that problem anyway, so maybe it's not much worse to have two
reasons why it can happen rather than one.  On the third hand, most
people are probably using UTF-8, and those people aren't going to have
any transcoding issues, so the invalid surrogate pair case may be the
only one they can hit (unless invalid code points are also an issue?),
so maybe it's worth avoiding on that basis.

Anyway, I defer to the wisdom of the collective on this one: how
should we handle this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: foreign key locks, 2nd attempt
Next
From: Lionel Elie Mamane
Date:
Subject: Re: information schema/aclexplode doesn't know about default privileges