Re: JSON for PG 9.2 - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: JSON for PG 9.2 |
Date | |
Msg-id | CA+TgmoaL9ftBac4xY_TfhU1WCQx59NP5YxiDRdR2U0Yt3zSzQw@mail.gmail.com Whole thread Raw |
In response to | Re: JSON for PG 9.2 (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: JSON for PG 9.2
Re: JSON for PG 9.2 |
List | pgsql-hackers |
On Fri, Jan 20, 2012 at 12:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 01/19/2012 04:12 PM, Robert Haas wrote: >>> On Thu, Jan 19, 2012 at 4:07 PM, Andrew Dunstan<andrew@dunslane.net> wrote: >>>> The spec only allows unescaped Unicode chars (and for our purposes that >>>> means UTF8). An unescaped non-ASCII character in, say, ISO-8859-1 will >>>> result in something that's not legal JSON. > >>> I understand. I'm proposing that we not care. In other words, if the >>> server encoding is UTF-8, it'll really be JSON. But if the server >>> encoding is something else, it'll be almost-JSON. > >> Of course, for data going to the client, if the client encoding is UTF8, >> they should get legal JSON, regardless of what the database encoding is, >> and conversely too, no? > > Yes. I think this argument has been mostly theologizing, along the > lines of how many JSON characters can dance on the head of a pin. > From a user's perspective, the database encoding is only a constraint on > which characters he can store. Bingo. > He does not know or care what the bit > representation is inside the server. As such, if we store a non-ASCII > character in a JSON string, it's valid JSON as far as the user is > concerned, so long as that character exists in the Unicode standard. > If his client encoding is UTF8, the value will be letter-perfect JSON > when it gets to him; and if his client encoding is not UTF8, then he's > already pretty much decided that he doesn't give a fig about the > Unicode-centricity of the JSON spec, no? Also agreed. Personally, I think it may not have been a great idea to tie the JSON spec so closely to Unicode, but I understand that it would have been difficult to define an encoding-agnostic equivalent of \uXXXX, since it's hard to know for sure whether an arbitrary encoding even has a (sensible?) definition of code points, and they probably wanted to avoid ambiguity. But, it's bound to cause problems for any system that runs in some other encoding, which, when so requested, we do. Even if we had the ability to support multiple encodings in the same database, I'm not sure I'd be very excited about insisting that JSON data always be stored in UTF-8, because that would introduce a lot of unnecessary transcoding for people using other encodings and basically unnecessarily handicap the functionality provided by the datatype. But at least if we had that, people would have the *option* to use JSON with UTF-8 and get the fully spec-compliant behavior. As it is, they don't; the system we have forces the database encoding on all datatypes whether they like it or not, and that ain't changing for 9.2. > So I'm with Robert: we should just plain not care. I would further > suggest that maybe what we should do with incoming JSON escape sequences > is convert them to Unicode code points and then to the equivalent > character in the database encoding (or throw error if there is none). The code I've written so far does no canonicalization of the input value of any kind, just as we do for XML. I'm inclined to leave it that way. Eventually, we might want to make the JSON datatype support equality comparisons and so on, and that will require the system to knowing that the letter r can be encoded as some \uXXXX sequence and that the escape \r is equivalent to some other escape \uXXXX, but right now all the code does is try to validate that the JSON is legal, NOT second-guess the user's choice about how to spell things or where to insert whitespace. I think that's a good choice because (1) AFAIK, there's no official canonicalization method for JSON, so whatever we pick will be something we think is best, not an official method sanction by the spec, (2) users might prefer the way they chose to represent a given value over the way we choose to represent it, and (3) by simply validating and storing the JSON object, rather than doing any canonicalization, the input function avoids the need to do any data copying, hopefully maximizing speed. Canonicalization can be added on top of what I've done here and people who want or need it can use it; I have some ideas around how to make that leverage the existing code that I intend to pursue for 9.3, but right now I'd rather not go there. So, given that framework, what the patch does is this: if you're using UTF-8, then \uXXXX is accepted, provided that XXXX is something that equates to a legal Unicode code point. It isn't converted to the corresponding character: it's just validated. If you're NOT using UTF-8, then it allows \uXXXX for code points up through 127 (which we assume are the same in all encodings) and anything higher than that is rejected. If someone knows an easy way to check whether a \uXXXX sequence for XXXX > 007F is a legal Unicode code point that has an equivalent in the current server encoding, then we can add logic to allow that case also, but personally I'm not that excited about it. Anyone who is using \uXXXX escapes with a non-Unicode coding is probably hoping that we can store arbitrary code points, not just the ones that happen to exist in the server encoding, so they're probably going to be disappointed whether or not we spend time eating away at the edges of the problem. It's awfully interesting to think about how we could actually make cross-encoding stuff work for real. Would we make it a separate property that can be associated with each column, like we did for collations? Or would we handle it more like range types - give me an encoding, and I'll give you a new type OID that represents text stored in that encoding? But I guess that's a question for another day. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: