Thread: Duplicate JSON Object Keys
This behavior surprised me a bit:

david=# select '{"foo": 1, "foo": 2}'::json;
         json
----------------------
 {"foo": 1, "foo": 2}

I had expected something more like this:

david=# select '{"foo": 1, "foo": 2}'::json;
    json
------------
 {"foo": 2}

This hasn’t been much of an issue before, but with Andrew’s JSON enhancements going in, it will start to cause problems:

david=# select json_get('{"foo": 1, "foo": 2}', 'foo');
ERROR:  field name is not unique in json object

Andrew tells me that the spec requires this. I think that’s fine, but I would rather that it never got that far. In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.

Thoughts?

Thanks,

David
On 03/07/2013 02:48 PM, David E. Wheeler wrote:
> This behavior surprised me a bit:
>
> david=# select '{"foo": 1, "foo": 2}'::json;
>          json
> ----------------------
>  {"foo": 1, "foo": 2}
>
> I had expected something more like this:
>
> david=# select '{"foo": 1, "foo": 2}'::json;
>     json
> ------------
>  {"foo": 2}
>
> This hasn’t been much of an issue before, but with Andrew’s JSON enhancements going in, it will start to cause problems:
>
> david=# select json_get('{"foo": 1, "foo": 2}', 'foo');
> ERROR:  field name is not unique in json object
>
> Andrew tells me that the spec requires this. I think that’s fine, but I would rather that it never got that far.

Specifically, rfc4627 says (note last sentence):

   2.2.  Objects

   An object structure is represented as a pair of curly brackets surrounding zero or more name/value pairs (or members). A name is a string. A single colon comes after each name, separating the name from the value. A single comma separates a value from a following name. The names within an object SHOULD be unique.

cheers

andrew
On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.

I'm with Andrew. That's a rathole I emphatically don't want to go down. I wrote this code originally, and I had the thought clearly in mind that I wanted to accept JSON that was syntactically well-formed, not JSON that met certain semantic constraints.

We could add functions like json_is_non_stupid(json) so that people can easily add a CHECK constraint that enforces this if they so desire. But enforcing it categorically seems like a bad plan, especially since at this point it would require a compatibility break with previous releases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
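For reference, the kind of CHECK-constraint helper Robert describes could be sketched roughly as follows. The name json_has_unique_keys is made up for illustration, the sketch relies on the json_each() accessor from the patch under discussion, and it only inspects the top-level keys of an object (json_each errors on arrays and scalars):

create or replace function json_has_unique_keys(j json) returns boolean
language sql immutable strict as $$
    -- true when no top-level key appears more than once
    select count(key) = count(distinct key) from json_each(j);
$$;

create table docs (
    doc json check (json_has_unique_keys(doc))
);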
On 03/08/2013 09:39 PM, Robert Haas wrote:
> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
>
> I'm with Andrew. That's a rathole I emphatically don't want to go down. I wrote this code originally, and I had the thought clearly in mind that I wanted to accept JSON that was syntactically well-formed, not JSON that met certain semantic constraints.

If it does not meet these "semantic" constraints, then it is not really JSON - it is merely JSON-like.

This sounds very much like MySQL's decision to support the timestamp "0000-00-00 00:00" - syntactically correct, but semantically wrong.

> We could add functions like json_is_non_stupid(json) so that people can easily add a CHECK constraint that enforces this if they so desire. But enforcing it categorically seems like a bad plan, especially since at this point it would require a compatibility break with previous releases.

If we ever support "real" spec-compliant JSON (maybe based on recursive hstore?) then there will be a compatibility break anyway, so why not do it now? Or do you seriously believe that somebody is using "PostgreSQL JSON" to store these kinds of "json documents"?

Cheers

Hannu Krosing
Hannu Krosing escribió:
> On 03/08/2013 09:39 PM, Robert Haas wrote:
>> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
>>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
>>
>> I'm with Andrew. That's a rathole I emphatically don't want to go down. I wrote this code originally, and I had the thought clearly in mind that I wanted to accept JSON that was syntactically well-formed, not JSON that met certain semantic constraints.
>
> If it does not meet these "semantic" constraints, then it is not really JSON - it is merely JSON-like.
>
> This sounds very much like MySQL's decision to support the timestamp "0000-00-00 00:00" - syntactically correct, but semantically wrong.

Is it wrong? The standard cited says SHOULD, not MUST.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mar 8, 2013, at 1:01 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

>> If it does not meet these "semantic" constraints, then it is not really JSON - it is merely JSON-like.
>>
>> This sounds very much like MySQL's decision to support the timestamp "0000-00-00 00:00" - syntactically correct, but semantically wrong.
>
> Is it wrong? The standard cited says SHOULD, not MUST.

Yes, it is wrong, because multiple keys are specifically disallowed for accessing values. Hence this new error:

david=# select json_get('{"foo": 1, "foo": 2}', 'foo');
ERROR:  field name is not unique in json object

I really don’t think that should be possible.

Best,

David
Well, I would much prefer to find out sooner rather than later that there is a problem, so I would much prefer to know I've created a duplicate as soon as the system can detect it.

In general, PostgreSQL appears much better at this than MySQL.

On 09/03/13 10:01, Alvaro Herrera wrote:
> Hannu Krosing escribió:
>> On 03/08/2013 09:39 PM, Robert Haas wrote:
>>> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
>>>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
>>> I'm with Andrew. That's a rathole I emphatically don't want to go down. I wrote this code originally, and I had the thought clearly in mind that I wanted to accept JSON that was syntactically well-formed, not JSON that met certain semantic constraints.
>> If it does not meet these "semantic" constraints, then it is not really JSON - it is merely JSON-like.
>>
>> This sounds very much like MySQL's decision to support the timestamp "0000-00-00 00:00" - syntactically correct, but semantically wrong.
> Is it wrong? The standard cited says SHOULD, not MUST.
On 03/08/2013 04:01 PM, Alvaro Herrera wrote:
> Hannu Krosing escribió:
>> On 03/08/2013 09:39 PM, Robert Haas wrote:
>>> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
>>>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
>>> I'm with Andrew. That's a rathole I emphatically don't want to go down. I wrote this code originally, and I had the thought clearly in mind that I wanted to accept JSON that was syntactically well-formed, not JSON that met certain semantic constraints.
>> If it does not meet these "semantic" constraints, then it is not really JSON - it is merely JSON-like.
>>
>> This sounds very much like MySQL's decision to support the timestamp "0000-00-00 00:00" - syntactically correct, but semantically wrong.
> Is it wrong? The standard cited says SHOULD, not MUST.

Here's what rfc2119 says about that wording:

   4. SHOULD NOT   This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.

So we're allowed to do as Robert chose, and I think there are good reasons for doing so (apart from anything else, checking it would slow down the parser enormously).

Now you could argue that in that case the extractor functions should allow it too, and it's probably fairly easy to change them to allow it. In that case we need to decide who wins. We could treat a later field lexically as overriding an earlier field of the same name, which I think is what David expected. That's what plv8 does (i.e. it's how v8 interprets JSON):

andrew=# create or replace function jget(t json, fld text) returns text language plv8 as ' return t[fld]; ';
CREATE FUNCTION
andrew=# select jget('{"f1":"x","f1":"y"}','f1');
 jget
------
 y
(1 row)

Or you could take the view I originally took that in view of the RFC wording we should raise an error if this was found.

I can live with either view.

cheers

andrew
On Mar 8, 2013, at 1:21 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

> Here's what rfc2119 says about that wording:
>
>    4. SHOULD NOT   This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.

I suspect this was allowed for the JavaScript behavior where multiple keys are allowed, but the last key in the list wins.

> So we're allowed to do as Robert chose, and I think there are good reasons for doing so (apart from anything else, checking it would slow down the parser enormously).

Yes, but the implications are going to start biting us on the ass now.

> Now you could argue that in that case the extractor functions should allow it too, and it's probably fairly easy to change them to allow it. In that case we need to decide who wins. We could treat a later field lexically as overriding an earlier field of the same name, which I think is what David expected. That's what plv8 does (i.e. it's how v8 interprets JSON):
>
> andrew=# create or replace function jget(t json, fld text) returns text language plv8 as ' return t[fld]; ';
> CREATE FUNCTION
> andrew=# select jget('{"f1":"x","f1":"y"}','f1');
>  jget
> ------
>  y
> (1 row)
>
> Or you could take the view I originally took that in view of the RFC wording we should raise an error if this was found.
>
> I can live with either view.

I’m on the fence. On the one hand, I like the plv8 behavior, which is nice for a dynamic language. On the other hand, I don't much care for it in my database, where I want data storage requirements to be quite strict. I hate the idea of "0000-00-00" being allowed as a date, and am uncomfortable with allowing duplicate keys to be stored in the JSON data type.

So my order of preference for the options would be:

1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored

2. Throw an error when a JSON type has duplicate keys

3. Have the accessors find the last instance of a key and return that value

4. Let things remain as they are now

On second thought, I don't like 4 at all. It means that the JSON type thinks a value is valid while the accessor does not. They contradict one another.

Best,

David
On 03/08/2013 10:01 PM, Alvaro Herrera wrote:
> Hannu Krosing escribió:
>> On 03/08/2013 09:39 PM, Robert Haas wrote:
>>> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
>>>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
>>> I'm with Andrew. That's a rathole I emphatically don't want to go down. I wrote this code originally, and I had the thought clearly in mind that I wanted to accept JSON that was syntactically well-formed, not JSON that met certain semantic constraints.
>> If it does not meet these "semantic" constraints, then it is not really JSON - it is merely JSON-like.
>>
>> This sounds very much like MySQL's decision to support the timestamp "0000-00-00 00:00" - syntactically correct, but semantically wrong.
> Is it wrong? The standard cited says SHOULD, not MUST.

I think one MAY start implementation with a loose interpretation of SHOULD, but if at all possible we SHOULD implement the SHOULD-qualified features :)

http://www.ietf.org/rfc/rfc2119.txt:

   SHOULD   This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

We might start with just throwing a warning for duplicate keys, but I can see no good reason to do so - except ease of implementation, and performance with the current JSON-as-text implementation. And providing a boolean function is_really_json_object(json) to be used in check constraints seems plain weird.

OTOH, as the spec defines JSON as being designed to be a subset of JavaScript, it SHOULD accept

select '{"foo": 1, "foo": 2}'::json;

but turn it into '{"foo": 2}'::json for storage.

I do not think it would be a good idea to leave it to the data extraction functions to always get the last value for "foo" (in this case 2).

------------------
Hannu
On 03/08/2013 04:28 PM, David E. Wheeler wrote:
> On Mar 8, 2013, at 1:21 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>> Here's what rfc2119 says about that wording:
>>
>>    4. SHOULD NOT   This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.
> I suspect this was allowed for the JavaScript behavior where multiple keys are allowed, but the last key in the list wins.
>
>> So we're allowed to do as Robert chose, and I think there are good reasons for doing so (apart from anything else, checking it would slow down the parser enormously).
> Yes, but the implications are going to start biting us on the ass now.
>
>> Now you could argue that in that case the extractor functions should allow it too, and it's probably fairly easy to change them to allow it. In that case we need to decide who wins. We could treat a later field lexically as overriding an earlier field of the same name, which I think is what David expected. That's what plv8 does (i.e. it's how v8 interprets JSON):
>>
>> andrew=# create or replace function jget(t json, fld text) returns text language plv8 as ' return t[fld]; ';
>> CREATE FUNCTION
>> andrew=# select jget('{"f1":"x","f1":"y"}','f1');
>>  jget
>> ------
>>  y
>> (1 row)
>>
>> Or you could take the view I originally took that in view of the RFC wording we should raise an error if this was found.
>>
>> I can live with either view.
> I’m on the fence. On the one hand, I like the plv8 behavior, which is nice for a dynamic language. On the other hand, I don't much care for it in my database, where I want data storage requirements to be quite strict. I hate the idea of "0000-00-00" being allowed as a date, and am uncomfortable with allowing duplicate keys to be stored in the JSON data type.
>
> So my order of preference for the options would be:
>
> 1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored
>
> 2. Throw an error when a JSON type has duplicate keys
>
> 3. Have the accessors find the last instance of a key and return that value
>
> 4. Let things remain as they are now
>
> On second thought, I don't like 4 at all. It means that the JSON type thinks a value is valid while the accessor does not. They contradict one another.

You can forget 1. We are not going to have the parser collapse anything. Either the JSON it gets is valid or it's not. But the parser isn't going to try to MAKE it valid.

cheers

andrew
On 03/08/2013 10:42 PM, Andrew Dunstan wrote:
>
> On 03/08/2013 04:28 PM, David E. Wheeler wrote:
>> On Mar 8, 2013, at 1:21 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>
>>> Here's what rfc2119 says about that wording:
>>>
>>>    4. SHOULD NOT   This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label.
>> I suspect this was allowed for the JavaScript behavior where multiple keys are allowed, but the last key in the list wins.
>>
>>> So we're allowed to do as Robert chose, and I think there are good reasons for doing so (apart from anything else, checking it would slow down the parser enormously).
>> Yes, but the implications are going to start biting us on the ass now.
>>
>>> Now you could argue that in that case the extractor functions should allow it too, and it's probably fairly easy to change them to allow it. In that case we need to decide who wins. We could treat a later field lexically as overriding an earlier field of the same name, which I think is what David expected. That's what plv8 does (i.e. it's how v8 interprets JSON):
>>>
>>> andrew=# create or replace function jget(t json, fld text) returns text language plv8 as ' return t[fld]; ';
>>> CREATE FUNCTION
>>> andrew=# select jget('{"f1":"x","f1":"y"}','f1');
>>>  jget
>>> ------
>>>  y
>>> (1 row)
>>>
>>> Or you could take the view I originally took that in view of the RFC wording we should raise an error if this was found.
>>>
>>> I can live with either view.
>> I’m on the fence. On the one hand, I like the plv8 behavior, which is nice for a dynamic language. On the other hand, I don't much care for it in my database, where I want data storage requirements to be quite strict. I hate the idea of "0000-00-00" being allowed as a date, and am uncomfortable with allowing duplicate keys to be stored in the JSON data type.
>>
>> So my order of preference for the options would be:
>>
>> 1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored
>>
>> 2. Throw an error when a JSON type has duplicate keys
>>
>> 3. Have the accessors find the last instance of a key and return that value
>>
>> 4. Let things remain as they are now
>>
>> On second thought, I don't like 4 at all. It means that the JSON type thinks a value is valid while the accessor does not. They contradict one another.
>
> You can forget 1. We are not going to have the parser collapse anything. Either the JSON it gets is valid or it's not. But the parser isn't going to try to MAKE it valid.

Ok, so the "make valid" part will have to wait for http://www.pgcon.org/2013/schedule/events/518.en.html if this will ever happen ;)

Which means that all extractor functions will need to do much more work in the case of complex json. Think of

json_get('{"a":{"b":1},"a":{"1":"x"}, "a":[1,{"b":7}]}'::json, ["a","1","b"])

The true value in JavaScript is 7, but here the json_get function has several options to error out early, and real confusion as to where to report the error if in the end it is not found.

Essentially all extractor functions have to do what we omitted doing at input time.

Cheers
Hannu
On 03/08/2013 04:42 PM, Andrew Dunstan wrote:
>
>> So my order of preference for the options would be:
>>
>> 1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored
>>
>> 2. Throw an error when a JSON type has duplicate keys
>>
>> 3. Have the accessors find the last instance of a key and return that value
>>
>> 4. Let things remain as they are now
>>
>> On second thought, I don't like 4 at all. It means that the JSON type thinks a value is valid while the accessor does not. They contradict one another.
>
> You can forget 1. We are not going to have the parser collapse anything. Either the JSON it gets is valid or it's not. But the parser isn't going to try to MAKE it valid.

Actually, now that I think more about it, 3 is the best answer. Here's why: even the JSON generators can produce JSON with non-unique field names:

andrew=# select row_to_json(q) from (select x as a, y as a from generate_series(1,2) x, generate_series(3,4) y) q;
  row_to_json
---------------
 {"a":1,"a":3}
 {"a":1,"a":4}
 {"a":2,"a":3}
 {"a":2,"a":4}

So I think we have no option but to say, in terms of rfc 2119, that we have carefully considered and decided not to comply with the RFC's recommendation (and we should note that in the docs).

cheers

andrew
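As a quick illustration, duplicate-key output from a generator query like the one above is also accepted back by the json type's input function unchanged - expected behavior given the input path being discussed, not verified here:

select row_to_json(q)::text::json as roundtripped
  from (select 1 as a, 2 as a) q;
-- expected: {"a":1,"a":2}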
> Actually, now that I think more about it, 3 is the best answer. Here's why: even the JSON generators can produce JSON with non-unique field names:

+1

Also, I think we should add a json_normalize() function to the TODO list.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 03/08/2013 11:03 PM, Andrew Dunstan wrote:
>
> On 03/08/2013 04:42 PM, Andrew Dunstan wrote:
>>
>>> So my order of preference for the options would be:
>>>
>>> 1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored
>>>
>>> 2. Throw an error when a JSON type has duplicate keys
>>>
>>> 3. Have the accessors find the last instance of a key and return that value
>>>
>>> 4. Let things remain as they are now
>>>
>>> On second thought, I don't like 4 at all. It means that the JSON type thinks a value is valid while the accessor does not. They contradict one another.
>>
>> You can forget 1. We are not going to have the parser collapse anything. Either the JSON it gets is valid or it's not. But the parser isn't going to try to MAKE it valid.
>
> Actually, now that I think more about it, 3 is the best answer. Here's why: even the JSON generators can produce JSON with non-unique field names:

Yes, especially if you consider popular json generators vim and strcat() :)

It is not a "serialisation" of some existing object, but it is something that JavaScript could interpret as a valid subset of JavaScript which produces a JavaScript Object when interpreted. In this sense it is way better than the MySQL timestamp 0000-00-00 00:00.

So the loose (without implementing the SHOULD part) meaning of the JSON spec is "anything that can be read into JavaScript producing a JS Object" and not "a serialisation of a JavaScript Object" as I wanted to read it initially.

> andrew=# select row_to_json(q) from (select x as a, y as a from generate_series(1,2) x, generate_series(3,4) y) q;
>   row_to_json
> ---------------
>  {"a":1,"a":3}
>  {"a":1,"a":4}
>  {"a":2,"a":3}
>  {"a":2,"a":4}
>
> So I think we have no option but to say, in terms of rfc 2119, that we have carefully considered and decided not to comply with the RFC's recommendation

The downside is that we have just shifted the burden of JS Object generation to the getter functions.

I suspect that 99.98% of the time we will get valid and unique JS Object serializations or equivalent as input to json_in().

If we want the getter functions to handle the "loose JSON" to Object conversion side, assuming our stored JSON can contain non-unique keys, then these are bound to be slower, as they have to do these checks. They can't just grab the first matching one and return or recurse on that.

> (and we should note that in the docs).

definitely +1

> cheers
>
> andrew
On 03/08/2013 06:37 PM, Hannu Krosing wrote:
>
> I suspect that 99.98% of the time we will get valid and unique JS Object serializations or equivalent as input to json_in().
>
> If we want the getter functions to handle the "loose JSON" to Object conversion side, assuming our stored JSON can contain non-unique keys, then these are bound to be slower, as they have to do these checks. They can't just grab the first matching one and return or recurse on that.

No, there will be no slowdown. The parser doesn't short circuit. Read the code.

cheers

andrew
On Fri, Mar 08, 2013 at 10:34:20PM +0100, Hannu Krosing wrote:
> On 03/08/2013 10:01 PM, Alvaro Herrera wrote:
>> Hannu Krosing escribió:
>>> If it does not meet these "semantic" constraints, then it is not really JSON - it is merely JSON-like.
>> Is it wrong? The standard cited says SHOULD, not MUST.
>
> I think one MAY start implementation with a loose interpretation of SHOULD, but if at all possible we SHOULD implement the SHOULD-qualified features :)
>
> http://www.ietf.org/rfc/rfc2119.txt:
>
>    SHOULD   This word, or the adjective "RECOMMENDED", mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

That "SHOULD" in section 2.2 of RFC 4627 constrains JSON data, not JSON parsers. Section 4 addresses parsers, saying "A JSON parser MUST accept all texts that conform to the JSON grammar."

> We might start with just throwing a warning for duplicate keys, but I can see no good reason to do so - except ease of implementation, and performance with the current JSON-as-text implementation.

Since its departure from a "SHOULD" item does not impugn the conformance of an input text, it follows that json_in(), to be a conforming JSON parser, MUST not reject objects with duplicate keys.

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
On Fri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> So my order of preference for the options would be:
>>
>> 1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored
>>
>> 2. Throw an error when a JSON type has duplicate keys
>>
>> 3. Have the accessors find the last instance of a key and return that value
>>
>> 4. Let things remain as they are now
>>
>> On second thought, I don't like 4 at all. It means that the JSON type thinks a value is valid while the accessor does not. They contradict one another.
>
> You can forget 1. We are not going to have the parser collapse anything. Either the JSON it gets is valid or it's not. But the parser isn't going to try to MAKE it valid.

Why not? Because it's the wrong thing to do, or because it would be slower?

What I think is tricky here is that there's more than one way to conceptualize what the JSON data type really is. Is it a key-value store of sorts, or just a way to store text values that meet certain minimalist syntactic criteria? I had imagined it as the latter, in which case normalization isn't sensible. But if you think of it the first way, then normalization is not only sensible, but almost obligatory. For example, we don't feel bad about this:

rhaas=# select '1e1'::numeric;
 numeric
---------
      10
(1 row)

I think Andrew and I had envisioned this as basically a text data type that enforces some syntax checking on its input, hence the current design. But I'm not sure that's the ONLY sensible design.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 03/13/2013 08:17 PM, Robert Haas wrote:
> I think Andrew and I had envisioned this as basically a text data type that enforces some syntax checking on its input, hence the current design. But I'm not sure that's the ONLY sensible design.

We're probably stuck with it at this point, but it may well be worth considering the later introduction of a compatible `jsonobj` that stores parsed and normalized json objects in some internal format the client doesn't have to care about, like serialized V8 JS VM objects.

I suspect that such a type is better offered as a contrib module until/unless PL/V8 or something similar becomes a core language. It'd be nuts to try to re-implement all of the JSON and javascript object functionality in a javascript engine when we can just plug an existing one in and use its JSON and javascript object manipulation. The minimalist approach makes sense for the json type precisely because it's just validated text, but I don't think it makes sense to continually extend it and slowly reinvent a whole javascript engine in Pg.

If we're going to do things like normalizing json, I think that's a job for a real JavaScript engine that understands Javascript objects.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 03/13/2013 08:17 AM, Robert Haas wrote:
> On Fri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>> So my order of preference for the options would be:
>>>
>>> 1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored
>>>
>>> 2. Throw an error when a JSON type has duplicate keys
>>>
>>> 3. Have the accessors find the last instance of a key and return that value
>>>
>>> 4. Let things remain as they are now
>>>
>>> On second thought, I don't like 4 at all. It means that the JSON type thinks a value is valid while the accessor does not. They contradict one another.
>> You can forget 1. We are not going to have the parser collapse anything. Either the JSON it gets is valid or it's not. But the parser isn't going to try to MAKE it valid.
> Why not? Because it's the wrong thing to do, or because it would be slower?
>
> What I think is tricky here is that there's more than one way to conceptualize what the JSON data type really is. Is it a key-value store of sorts, or just a way to store text values that meet certain minimalist syntactic criteria? I had imagined it as the latter, in which case normalization isn't sensible. But if you think of it the first way, then normalization is not only sensible, but almost obligatory. For example, we don't feel bad about this:
>
> rhaas=# select '1e1'::numeric;
>  numeric
> ---------
>       10
> (1 row)
>
> I think Andrew and I had envisioned this as basically a text data type that enforces some syntax checking on its input, hence the current design. But I'm not sure that's the ONLY sensible design.

I think we've moved on from this point, because a) other implementations allow duplicate keys, b) it's trivially easy to make Postgres generate such json, and c) there is some dispute about exactly what the spec mandates.

I'll be posting a revised patch shortly that doesn't error out but simply uses the value for the later key lexically.

cheers

andrew
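In other words, with the revised patch the example that opened the thread would be expected to behave the way plv8 does. This is a sketch of the expected result only, not verified against the committed code, and json_get is simply the accessor name used earlier in the thread - the final function and operator names may differ:

select json_get('{"foo": 1, "foo": 2}', 'foo');
-- expected: 2 (the later key wins), rather than the
-- "field name is not unique in json object" error shown earlier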
On 2013-03-13 20:48:57 +0800, Craig Ringer wrote:
> On 03/13/2013 08:17 PM, Robert Haas wrote:
>> I think Andrew and I had envisioned this as basically a text data type that enforces some syntax checking on its input, hence the current design. But I'm not sure that's the ONLY sensible design.
> We're probably stuck with it at this point, but it may well be worth considering the later introduction of a compatible `jsonobj` that stores parsed and normalized json objects in some internal format the client doesn't have to care about, like serialized V8 JS VM objects.
>
> I suspect that such a type is better offered as a contrib module until/unless PL/V8 or something similar becomes a core language. It'd be nuts to try to re-implement all of the JSON and javascript object functionality in a javascript engine when we can just plug an existing one in and use its JSON and javascript object manipulation. The minimalist approach makes sense for the json type precisely because it's just validated text, but I don't think it makes sense to continually extend it and slowly reinvent a whole javascript engine in Pg.

While I am not convinced - but not convinced of the contrary either - that using something like V8 is a good idea, I wish the patch adding json had reserved the first byte in the varlena for the 'json encoding' or something similar. That would have left the road open for easily adding different encodings in the future. Now you're left with marking it with a null byte at the beginning, or similar atrocities...

Just wanted to say that we might want to think about such stuff the next time we add a type, now that we preserve cross-version compatibility of on-disk data.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mar 13, 2013, at 5:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> What I think is tricky here is that there's more than one way to conceptualize what the JSON data type really is. Is it a key-value store of sorts, or just a way to store text values that meet certain minimalist syntactic criteria? I had imagined it as the latter, in which case normalization isn't sensible. But if you think of it the first way, then normalization is not only sensible, but almost obligatory.

That makes a lot of sense. Given the restrictions I tend to prefer in my database data types, I had imagined it as the former. And since I'm using it now to store key/value pairs (killing off some awful EAV implementations in the process, BTW), I certainly think of it more formally as an object.

But I can live with the other interpretation, as long as the differences are clearly understood and documented. Perhaps a note could be added to the docs explaining this difference, and what one can do to adapt for it. A normalizing function would certainly help.

Best,

David
<div class="moz-cite-prefix">On 14/03/13 02:02, Andrew Dunstan wrote:<br /></div><blockquote cite="mid:514078E1.2090905@dunslane.net"type="cite"><br /> On 03/13/2013 08:17 AM, Robert Haas wrote: <br /><blockquote type="cite">OnFri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <a class="moz-txt-link-rfc2396E" href="mailto:andrew@dunslane.net"><andrew@dunslane.net></a>wrote: <br /><blockquote type="cite"><blockquote type="cite">Somy order of preference for the options would be: <br /><br /> 1. Have the JSON type collapse objects so thelast instance of a key wins <br /> and is actually stored <br /><br /> 2. Throw an error when a JSON type has duplicatekeys <br /><br /> 3. Have the accessors find the last instance of a key and return that <br /> value <br /><br />4. Let things remain as they are now <br /><br /> On second though, I don't like 4 at all. It means that the JSON type<br /> things a value is valid while the accessor does not. They contradict one <br /> another. <br /></blockquote> Youcan forget 1. We are not going to have the parser collapse anything. <br /> Either the JSON it gets is valid or it's not.But the parser isn't going to <br /> try to MAKE it valid. <br /></blockquote> Why not? Because it's the wrong thingto do, or because it would be slower? <br /><br /> What I think is tricky here is that there's more than one way to<br /> conceptualize what the JSON data type really is. Is it a key-value <br /> store of sorts, or just a way to storetext values that meet certain <br /> minimalist syntactic criteria? I had imagined it as the latter, in <br /> whichcase normalization isn't sensible. But if you think of it the <br /> first way, then normalization is not only sensible,but almost <br /> obligatory. For example, we don't feel bad about this: <br /><br /> rhaas=# select '1e1'::numeric;<br /> numeric <br /> --------- <br /> 10 <br /> (1 row) <br /><br /> I think Andrew and I had envisionedthis as basically a text data type <br /> that enforces some syntax checking on its input, hence the current <br/> design. But I'm not sure that's the ONLY sensible design. <br /><br /></blockquote><br /><br /> I think we've movedon from this point, because a) other implementations allow duplicate keys, b) it's trivially easy to make Postgres generatesuch json, and c) there is some dispute about exactly what the spec mandates. <br /><br /> I'll be posting a revisedpatch shortly that doesn't error out but simply uses the value for the later key lexically. <br /><br /> cheers <br/><br /> andrew <br /><br /><br /><br /><br /></blockquote><font size="-1">How about adding a new func<font size="-1">tionwith '_stric<font size="-1">t' added to the existing name, with an extra parameter </font></font></font>'coalesce'- or using other names, if considered more appropriate!<br /><br /> That way slower more stringentfunctionality can be added where required. This way, the existing function need not be changed.<br /><br /> Ifcoalesce = true, <br /> then: the last duplicate is used<br /> else: an error is returned when the new key is a duplicate.<br/><br /><br /> Cheers,<br /> Gavin<br /><br /><br /><br />
On 03/13/2013 12:51 PM, Gavin Flower wrote:
> On 14/03/13 02:02, Andrew Dunstan wrote:
>>
>> On 03/13/2013 08:17 AM, Robert Haas wrote:
>>> On Fri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>>>> So my order of preference for the options would be:
>>>>>
>>>>> 1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored
>>>>>
>>>>> 2. Throw an error when a JSON type has duplicate keys
>>>>>
>>>>> 3. Have the accessors find the last instance of a key and return that value
>>>>>
>>>>> 4. Let things remain as they are now
>>>>>
>>>>> On second thought, I don't like 4 at all. It means that the JSON type thinks a value is valid while the accessor does not. They contradict one another.
>>>> You can forget 1. We are not going to have the parser collapse anything. Either the JSON it gets is valid or it's not. But the parser isn't going to try to MAKE it valid.
>>> Why not? Because it's the wrong thing to do, or because it would be slower?
>>>
>>> What I think is tricky here is that there's more than one way to conceptualize what the JSON data type really is. Is it a key-value store of sorts, or just a way to store text values that meet certain minimalist syntactic criteria? I had imagined it as the latter, in which case normalization isn't sensible. But if you think of it the first way, then normalization is not only sensible, but almost obligatory. For example, we don't feel bad about this:
>>>
>>> rhaas=# select '1e1'::numeric;
>>>  numeric
>>> ---------
>>>       10
>>> (1 row)
>>>
>>> I think Andrew and I had envisioned this as basically a text data type that enforces some syntax checking on its input, hence the current design. But I'm not sure that's the ONLY sensible design.
>>
>> I think we've moved on from this point, because a) other implementations allow duplicate keys, b) it's trivially easy to make Postgres generate such json, and c) there is some dispute about exactly what the spec mandates.
>>
>> I'll be posting a revised patch shortly that doesn't error out but simply uses the value for the later key lexically.
>>
>> cheers
>>
>> andrew
>>
> How about adding a new function with '_strict' added to the existing name, with an extra parameter 'coalesce' - or using other names, if considered more appropriate!
>
> That way slower, more stringent functionality can be added where required. This way, the existing function need not be changed.
>
> If coalesce = true,
> then: the last duplicate is used
> else: an error is returned when the new key is a duplicate.

For good or ill, we now already have a json type that will accept strings with duplicate keys, and generator functions which can now generate such strings.

If someone wants functions to enforce a stricter validity check (e.g. via a check constraint on a domain), or to convert json to a canonical version which strips out prior keys of the same name and their associated values, then these should be relatively simple to implement given the parser API in the current patch. But they aren't part of the current patch, and I think it's way too late to be adding such things.

I have been persuaded by arguments made upthread that the best thing to do is exactly what other well known json-accepting implementations do (e.g. V8), which is to accept json with duplicate keys and to treat the later key/value as overriding the former key/value.
If I'd done that from the start nobody would now be talking about this at all. cheers andrew
On Mar 13, 2013, at 10:45 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

> If someone wants functions to enforce a stricter validity check (e.g. via a check constraint on a domain), or to convert json to a canonical version which strips out prior keys of the same name and their associated values, then these should be relatively simple to implement given the parser API in the current patch. But they aren't part of the current patch, and I think it's way too late to be adding such things.

I think it might be good to get something like this into core eventually, otherwise I suspect that there will be a different version of it for every JSON-using project out there. And my first cut at it won’t descend into sub-objects.

> I have been persuaded by arguments made upthread that the best thing to do is exactly what other well known json-accepting implementations do (e.g. V8), which is to accept json with duplicate keys and to treat the later key/value as overriding the former key/value. If I'd done that from the start, nobody would now be talking about this at all.

That’s true, though I might have started thinking about a canonicalizing function before long. :-)

Best,

David
On 03/13/2013 01:50 PM, David E. Wheeler wrote:
> On Mar 13, 2013, at 10:45 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>> If someone wants functions to enforce a stricter validity check (e.g. via a check constraint on a domain), or to convert json to a canonical version which strips out prior keys of the same name and their associated values, then these should be relatively simple to implement given the parser API in the current patch. But they aren't part of the current patch, and I think it's way too late to be adding such things.
> I think it might be good to get something like this into core eventually, otherwise I suspect that there will be a different version of it for every JSON-using project out there. And my first cut at it won’t descend into sub-objects.

Then you wouldn't be doing it right. The whole thing about a recursive descent parser is that it's, well, recursive.

cheers

andrew
On Mar 13, 2013, at 10:59 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

>> And my first cut at it won’t descend into sub-objects.
>
> Then you wouldn't be doing it right. The whole thing about a recursive descent parser is that it's, well, recursive.

Right, but it would serve my immediate needs. I have a column that just stores key/value pairs with no nesting. So I know I can write something like this and have it be good enough:

create or replace function json_smash(
    json
) RETURNS JSON language SQL STRICT IMMUTABLE AS $$
    SELECT format('{%s}', array_to_string(ARRAY(
        SELECT format('%s: %s', to_json(key), value)
          FROM (
              SELECT key, value, row_number() OVER (
                         partition by key order by rnum desc
                     ) AS rnum
                FROM (
                    SELECT key, value, row_number() OVER (
                               partition by key
                           ) AS rnum
                      FROM json_each($1)
                ) a
          ) b
         WHERE rnum = 1
    ), ','))::json;
$$;

And do you really want to see that unloosed on the world? :-P (Yes, I know there is no guarantee on the order of rows returned by json_each().)

Best,

David
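Applied to the example that opened the thread, a function along these lines would be expected to produce the collapsed form - untested, and subject to the json_each() row-ordering caveat David notes:

select json_smash('{"foo": 1, "foo": 2}');
-- expected: {"foo": 2}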
On 03/13/2013 12:40 PM, David E. Wheeler wrote:
> On Mar 13, 2013, at 5:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> What I think is tricky here is that there's more than one way to conceptualize what the JSON data type really is. Is it a key-value store of sorts, or just a way to store text values that meet certain minimalist syntactic criteria? I had imagined it as the latter, in which case normalization isn't sensible. But if you think of it the first way, then normalization is not only sensible, but almost obligatory.
> That makes a lot of sense. Given the restrictions I tend to prefer in my database data types, I had imagined it as the former. And since I'm using it now to store key/value pairs (killing off some awful EAV implementations in the process, BTW), I certainly think of it more formally as an object.
>
> But I can live with the other interpretation, as long as the differences are clearly understood and documented. Perhaps a note could be added to the docs explaining this difference, and what one can do to adapt for it. A normalizing function would certainly help.

I guess the easiest and most generic way to normalize is to actually convert to some internal representation and back.

In pl/python this would look like this:

hannu=# create function normalize(IN ij json, OUT oj json) language plpythonu as $$
import json
return json.dumps(json.loads(ij))
$$;
CREATE FUNCTION
hannu=# select normalize('{"a":1, "a":"b", "a":true}');
  normalize
-------------
 {"a": true}
(1 row)

If we wanted to be really fancy we could start storing our json in some format which is faster to parse, like tnetstrings, but it is probably too late in the release cycle to change this now.

Hannu