Re: JSON for PG 9.2 - Mailing list pgsql-hackers
From | Robert Haas
Subject | Re: JSON for PG 9.2
Date |
Msg-id | CA+TgmoYg_SdB70gxx2vFW3z+oB8K7aU8XnQwp+sB0_H7c2FehQ@mail.gmail.com
In response to | Re: JSON for PG 9.2 (Peter Eisentraut <peter_e@gmx.net>)
Responses | Re: JSON for PG 9.2
List | pgsql-hackers
On Mon, Jan 23, 2012 at 3:20 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On sön, 2012-01-22 at 11:43 -0500, Andrew Dunstan wrote:
>> Actually, given recent discussion I think that test should just be
>> removed from json.c. We don't actually have any test that the code
>> point is valid (e.g. that it doesn't refer to an unallocated code
>> point). We don't do that elsewhere either - the unicode_to_utf8()
>> function the scanner uses to turn \unnnn escapes into utf8 doesn't
>> look for unallocated code points. I'm not sure how much other
>> validation we should do - for example on correct use of surrogate
>> pairs.
>
> We do check the correctness of surrogate pairs elsewhere. Search for
> "surrogate" in scan.l; should be easy to copy.

I've committed a version of this that does NOT do surrogate pair
validation. Per discussion elsewhere, I also removed the check for
\uXXXX with XXXX > 007F and database encoding != UTF8. This will
complicate things somewhat when we get around to doing canonicalization
and comparison, but Tom seems confident that those issues are
manageable. I did not commit Andrew's further changes, either; I'm
assuming he'll do that himself.

With respect to the issue of whether we ought to check surrogate pairs,
the JSON spec is not a whole lot of help. RFC 4627 says:

   To escape an extended character that is not in the Basic
   Multilingual Plane, the character is represented as a
   twelve-character sequence, encoding the UTF-16 surrogate pair. So,
   for example, a string containing only the G clef character (U+1D11E)
   may be represented as "\uD834\uDD1E".

That fails to answer the question of what we ought to do if we get an
invalid sequence there. You could make an argument that we ought to
just allow it; it doesn't particularly hinder our ability to
canonicalize or compare strings, because our notion of sort-ordering
for characters that may span multiple encodings is going to be pretty
funky anyway. We can just leave those bits as \uXXXX sequences and call
it good. However, it would hinder our ability to convert a JSON string
to a string in the database encoding: we could find an invalid
surrogate pair that was allowable as JSON but unrepresentable in the
database encoding.

On the flip side, given our decision to allow all \uXXXX sequences even
when not using UTF-8, we could also run across a perfectly valid UTF-8
sequence that's not representable as a character in the server
encoding; so we have that problem anyway, and maybe it's not much worse
to have two reasons why it can happen rather than one.

On the third hand, most people are probably using UTF-8, and those
people aren't going to have any transcoding issues, so the invalid
surrogate pair case may be the only one they can hit (unless invalid
code points are also an issue?), so maybe it's worth avoiding on that
basis.

Anyway, I defer to the wisdom of the collective on this one: how should
we handle this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
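For concreteness, here is a minimal, self-contained C sketch of the kind
of surrogate-pair check the thread is debating. This is not the json.c
or scan.l code; the function and variable names are illustrative. It
just applies the UTF-16 pairing rule to code units already parsed from
\uXXXX escapes: a high surrogate must be immediately followed by a low
surrogate, a lone low surrogate is invalid, and a valid pair combines
into a single code point outside the Basic Multilingual Plane.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define IS_HIGH_SURROGATE(cp) ((cp) >= 0xD800 && (cp) <= 0xDBFF)
#define IS_LOW_SURROGATE(cp)  ((cp) >= 0xDC00 && (cp) <= 0xDFFF)

/*
 * Combine a valid surrogate pair into a code point beyond the BMP,
 * per UTF-16: 10 bits from each half, plus the 0x10000 offset.
 */
static uint32_t
combine_surrogates(uint32_t hi, uint32_t lo)
{
    return ((hi - 0xD800) << 10) + (lo - 0xDC00) + 0x10000;
}

/*
 * Validate code units parsed from \uXXXX escapes.  Returns false on
 * the "invalid surrogate pair" inputs discussed above; on success,
 * writes the decoded code points to out[] and their count to *n_out.
 */
static bool
validate_escaped_units(const uint32_t *units, int n,
                       uint32_t *out, int *n_out)
{
    int j = 0;

    for (int i = 0; i < n; i++)
    {
        if (IS_HIGH_SURROGATE(units[i]))
        {
            if (i + 1 >= n || !IS_LOW_SURROGATE(units[i + 1]))
                return false;   /* unpaired high surrogate */
            out[j++] = combine_surrogates(units[i], units[i + 1]);
            i++;                /* consume the low half as well */
        }
        else if (IS_LOW_SURROGATE(units[i]))
            return false;       /* low surrogate with no preceding high */
        else
            out[j++] = units[i];
    }
    *n_out = j;
    return true;
}

int
main(void)
{
    /* RFC 4627's example: "\uD834\uDD1E" is the G clef, U+1D11E */
    uint32_t clef[] = {0xD834, 0xDD1E};
    uint32_t decoded[2];
    int      n;

    if (validate_escaped_units(clef, 2, decoded, &n))
        printf("U+%05X\n", (unsigned) decoded[0]);  /* U+1D11E */
    return 0;
}

Under this rule, RFC 4627's example decodes to U+1D11E, while a lone
\uD834, or a \uDD1E without a preceding high surrogate, is rejected:
exactly the invalid-pair case that would be allowable as JSON text but
unrepresentable when converting to the database encoding.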