Thread: Duplicate JSON Object Keys

Duplicate JSON Object Keys

From
"David E. Wheeler"
Date:
This behavior surprised me a bit:
   david=# select '{"foo": 1, "foo": 2}'::json;            json            ----------------------    {"foo": 1, "foo":
2}

I had expected something more like this:
   david=# select '{"foo": 1, "foo": 2}'::json;       json       ------------    {"foo": 2}

This hasn’t been much of an issue before, but with Andrew’s JSON enhancements going in, it will start to cause
problems:
   david=# select json_get('{"foo": 1, "foo": 2}', 'foo');   ERROR:  field name is not unique in json object

Andrew tells me that the spec requires this. I think that’s fine, but I would rather that it never got to there.

In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
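
For what it's worth, last-key-wins is what most JSON parsers outside the
database already do; Python's json module, for example, behaves just like
the JavaScript engines here (shown purely as an illustration):

    import json

    # The parser silently keeps only the last value for a repeated key,
    # matching the JavaScript behavior described above.
    print(json.loads('{"foo": 1, "foo": 2}'))   # {'foo': 2}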

Thoughts?

Thanks,

David




Re: Duplicate JSON Object Keys

From
Andrew Dunstan
Date:
On 03/07/2013 02:48 PM, David E. Wheeler wrote:
> This behavior surprised me a bit:
>
>      david=# select '{"foo": 1, "foo": 2}'::json;
>               json
>      ----------------------
>       {"foo": 1, "foo": 2}
>
> I had expected something more like this:
>
>      david=# select '{"foo": 1, "foo": 2}'::json;
>          json
>      ------------
>       {"foo": 2}
>
> This hasn’t been much of an issue before, but with Andrew’s JSON enhancements going in, it will start to cause problems:
>
>      david=# select json_get('{"foo": 1, "foo": 2}', 'foo');
>      ERROR:  field name is not unique in json object
>
> Andrew tells me that the spec requires this. I think that’s fine, but I would rather that it never got to there.

Specifically, rfc4627 says (note last sentence):
   2.2.  Objects

      An object structure is represented as a pair of curly brackets
      surrounding zero or more name/value pairs (or members).  A name is
      a string.  A single colon comes after each name, separating the
      name from the value.  A single comma separates a value from a
      following name.  The names within an object SHOULD be unique.



cheers

andrew



Re: Duplicate JSON Object Keys

From
Robert Haas
Date:
On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.

I'm with Andrew.  That's a rathole I emphatically don't want to go
down.  I wrote this code originally, and I had the thought clearly in
mind that I wanted to accept JSON that was syntactically well-formed,
not JSON that met certain semantic constraints.  We could add
functions like json_is_non_stupid(json) so that people can easily add
a CHECK constraint that enforces this if they so desire.  But
enforcing it categorically seems like a bad plan, especially since at
this point it would require a compatibility break with previous
releases.
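
(To be clear, json_is_non_stupid is only a placeholder name; nothing like it
exists today. A rough sketch of the duplicate-key check such a function would
have to perform, written in Python rather than the backend's C purely for
illustration:)

    import json

    def keys_are_unique(text):
        """Return False if any object in the JSON text repeats a key."""
        def check(pairs):
            # object_pairs_hook sees each object's key/value pairs before
            # they collapse into a dict, so nested objects are checked too.
            keys = [k for k, _ in pairs]
            if len(keys) != len(set(keys)):
                raise ValueError("duplicate key in JSON object")
            return dict(pairs)
        try:
            json.loads(text, object_pairs_hook=check)
            return True
        except ValueError:
            return False

    print(keys_are_unique('{"foo": 1, "foo": 2}'))   # False
    print(keys_are_unique('{"foo": 1, "bar": 2}'))   # True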

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Duplicate JSON Object Keys

From
Hannu Krosing
Date:
On 03/08/2013 09:39 PM, Robert Haas wrote:
> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
 
> I'm with Andrew.  That's a rathole I emphatically don't want to go
> down.  I wrote this code originally, and I had the thought clearly in
> mind that I wanted to accept JSON that was syntactically well-formed,
> not JSON that met certain semantic constraints.

If it does not meet these "semantic" constraints, then it is not
really JSON - it is merely JSON-like.

this sounds very much like MySQLs decision to support timestamp
"0000-00-00 00:00" - syntactically correct, but semantically wrong.

> We could add
> functions like json_is_non_stupid(json) so that people can easily add
> a CHECK constraint that enforces this if they so desire.  But
> enforcing it categorically seems like a bad plan, especially since at
> this point it would require a compatibility break with previous
> releases
If we are ever going to support "real" spec-compliant JSON (maybe based
on recursive hstore?) then there will be a compatibility break
anyway, so why not do it now?

Or do you seriously believe that somebody is using "PostgreSQL JSON"
to store these kinds of "json documents"?

Cheers
Hannu Krosing




Re: Duplicate JSON Object Keys

From
Alvaro Herrera
Date:
Hannu Krosing escribió:
> On 03/08/2013 09:39 PM, Robert Haas wrote:
> >On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
> >>In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
> >I'm with Andrew.  That's a rathole I emphatically don't want to go
> >down.  I wrote this code originally, and I had the thought clearly in
> >mind that I wanted to accept JSON that was syntactically well-formed,
> >not JSON that met certain semantic constraints.
>
> If it does not meet these "semantic" constraints, then it is not
> really JSON - it is merely JSON-like.
>
> this sounds very much like MySQLs decision to support timestamp
> "0000-00-00 00:00" - syntactically correct, but semantically wrong.

Is it wrong?  The standard cited says SHOULD, not MUST.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Duplicate JSON Object Keys

From
"David E. Wheeler"
Date:
On Mar 8, 2013, at 1:01 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

>> If it does not meet these "semantic" constraints, then it is not
>> really JSON - it is merely JSON-like.
>>
>> this sounds very much like MySQLs decision to support timestamp
>> "0000-00-00 00:00" - syntactically correct, but semantically wrong.
>
> Is it wrong?  The standard cited says SHOULD, not MUST.

Yes, it is wrong, because multiple keys are specifically disallowed for accessing values. Hence this new error:
  david=# select json_get('{"foo": 1, "foo": 2}', 'foo');
  ERROR:  field name is not unique in json object

I really don’t think that should be possible.

Best,

David


Re: Duplicate JSON Object Keys

From
Gavin Flower
Date:
Well, I would much prefer to find out sooner rather than later that there
is a problem, so I would much prefer to know I've created a duplicate as
soon as the system can detect it.  In general, PostgreSQL appears much
better at this than MySQL.


On 09/03/13 10:01, Alvaro Herrera wrote:
> Hannu Krosing escribió:
>> On 03/08/2013 09:39 PM, Robert Haas wrote:
>>> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
>>>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
>>> I'm with Andrew.  That's a rathole I emphatically don't want to go
>>> down.  I wrote this code originally, and I had the thought clearly in
>>> mind that I wanted to accept JSON that was syntactically well-formed,
>>> not JSON that met certain semantic constraints.
>> If it does not meet these "semantic" constraints, then it is not
>> really JSON - it is merely JSON-like.
>>
>> this sounds very much like MySQLs decision to support timestamp
>> "0000-00-00 00:00" - syntactically correct, but semantically wrong.
> Is it wrong?  The standard cited says SHOULD, not MUST.
>




Re: Duplicate JSON Object Keys

From
Andrew Dunstan
Date:
On 03/08/2013 04:01 PM, Alvaro Herrera wrote:
> Hannu Krosing escribió:
>> On 03/08/2013 09:39 PM, Robert Haas wrote:
>>> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
>>>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
>>> I'm with Andrew.  That's a rathole I emphatically don't want to go
>>> down.  I wrote this code originally, and I had the thought clearly in
>>> mind that I wanted to accept JSON that was syntactically well-formed,
>>> not JSON that met certain semantic constraints.
>> If it does not meet these "semantic" constraints, then it is not
>> really JSON - it is merely JSON-like.
>>
>> this sounds very much like MySQLs decision to support timestamp
>> "0000-00-00 00:00" - syntactically correct, but semantically wrong.
> Is it wrong?  The standard cited says SHOULD, not MUST.


Here's what rfc2119 says about that wording:
   4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
   there may exist valid reasons in particular circumstances when the
   particular behavior is acceptable or even useful, but the full
   implications should be understood and the case carefully weighed
   before implementing any behavior described with this label.


So we're allowed to do as Robert chose, and I think there are good
reasons for doing so (apart from anything else, checking it would slow
down the parser enormously).

Now you could argue that in that case the extractor functions should
allow it too, and it's probably fairly easy to change them to allow it.
In that case we need to decide who wins. We could treat a later field
lexically as overriding an earlier field of the same name, which I think
is what David expected. That's what plv8 does (i.e. it's how v8
interprets JSON):
   andrew=# create or replace function jget(t json, fld text) returns
   text language plv8 as ' return t[fld]; ';
   CREATE FUNCTION
   andrew=# select jget('{"f1":"x","f1":"y"}','f1');
     jget
   ------
     y
   (1 row)


Or you could take the view I originally took that in view of the RFC
wording we should raise an error if this was found.

I can live with either view.

cheers

andrew



Re: Duplicate JSON Object Keys

From
"David E. Wheeler"
Date:
On Mar 8, 2013, at 1:21 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

> Here's what rfc2119 says about that wording:
>
>   4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
>   there may exist valid reasons in particular circumstances when the
>   particular behavior is acceptable or even useful, but the full
>   implications should be understood and the case carefully weighed
>   before implementing any behavior described with this label.

I suspect this was allowed for the JavaScript behavior where multiple keys are allowed, but the last key in the list
wins.

> So we're allowed to do as Robert chose, and I think there are good reasons for doing so (apart from anything else, checking it would slow down the parser enormously).

Yes, but the implications are going to start biting us on the ass now.

> Now you could argue that in that case the extractor functions should allow it too, and it's probably fairly easy to change them to allow it. In that case we need to decide who wins. We could treat a later field lexically as overriding an earlier field of the same name, which I think is what David expected. That's what plv8 does (i.e. it's how v8 interprets JSON):
>
>   andrew=# create or replace function jget(t json, fld text) returns
>   text language plv8 as ' return t[fld]; ';
>   CREATE FUNCTION
>   andrew=# select jget('{"f1":"x","f1":"y"}','f1');
>     jget
>   ------
>     y
>   (1 row)
>
>
> Or you could take the view I originally took that in view of the RFC wording we should raise an error if this was found.
>
> I can live with either view.

I’m on the fence. On the one hand, I like the plv8 behavior, which is nice for a dynamic language. On the other hand, I don't much care for it in my database, where I want data storage requirements to be quite strict. I hate the idea of "0000-00-00" being allowed as a date, and am uncomfortable with allowing duplicate keys to be stored in the JSON data type.

So my order of preference for the options would be:

1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored

2. Throw an error when a JSON type has duplicate keys

3. Have the accessors find the last instance of a key and return that value

4. Let things remain as they are now

On second thought, I don't like 4 at all. It means that the JSON type thinks a value is valid while the accessor does not. They contradict one another.

Best,

David


Re: Duplicate JSON Object Keys

From
Hannu Krosing
Date:
On 03/08/2013 10:01 PM, Alvaro Herrera wrote:
> Hannu Krosing escribió:
>> On 03/08/2013 09:39 PM, Robert Haas wrote:
>>> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david@justatheory.com> wrote:
>>>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
 
>>> I'm with Andrew.  That's a rathole I emphatically don't want to go
>>> down.  I wrote this code originally, and I had the thought clearly in
>>> mind that I wanted to accept JSON that was syntactically well-formed,
>>> not JSON that met certain semantic constraints.
>> If it does not meet these "semantic" constraints, then it is not
>> really JSON - it is merely JSON-like.
>>
>> this sounds very much like MySQLs decision to support timestamp
>> "0000-00-00 00:00" - syntactically correct, but semantically wrong.
> Is it wrong?  The standard cited says SHOULD, not MUST.


I think one MAY start implementation with loose interpretation of
SHOULD, but if at all possible we SHOULD implement the
SHOULD-qualified features :)

http://www.ietf.org/rfc/rfc2119.txt:

SHOULD   This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

We might start with just throwing a warning for duplicate keys, but I
can see no good reason to do so. Except ease of implementation and with
current JSON-AS-TEXT implementation performance.

And providing a boolean function is_really_json_object(json) to be used in check
constraints seems plain weird.

Otoh, as the spec defines JSON as being designed to be a subset of javascript,
it SHOULD accept select '{"foo": 1, "foo": 2}'::json; but turn it into
'{"foo": 2}'::json; for storage.

I do not think it would be a good idea to leave it to data extraction
functions to always get the last value for foo (in this case 2).

------------------
Hannu







Re: Duplicate JSON Object Keys

From
Andrew Dunstan
Date:
On 03/08/2013 04:28 PM, David E. Wheeler wrote:
> On Mar 8, 2013, at 1:21 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>> Here's what rfc2119 says about that wording:
>>
>>    4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
>>    there may exist valid reasons in particular circumstances when the
>>    particular behavior is acceptable or even useful, but the full
>>    implications should be understood and the case carefully weighed
>>    before implementing any behavior described with this label.
> I suspect this was allowed for the JavaScript behavior where multiple keys are allowed, but the last key in the list wins.
>
>> So we're allowed to do as Robert chose, and I think there are good reasons for doing so (apart from anything else, checking it would slow down the parser enormously).
> Yes, but the implications are going to start biting us on the ass now.
>
>> Now you could argue that in that case the extractor functions should allow it too, and it's probably fairly easy to change them to allow it. In that case we need to decide who wins. We could treat a later field lexically as overriding an earlier field of the same name, which I think is what David expected. That's what plv8 does (i.e. it's how v8 interprets JSON):
>>
>>    andrew=# create or replace function jget(t json, fld text) returns
>>    text language plv8 as ' return t[fld]; ';
>>    CREATE FUNCTION
>>    andrew=# select jget('{"f1":"x","f1":"y"}','f1');
>>      jget
>>    ------
>>      y
>>    (1 row)
>>
>>
>> Or you could take the view I originally took that in view of the RFC wording we should raise an error if this was found.
>>
>> I can live with either view.
> I’m on the fence. On the one hand, I like the plv8 behavior, which is nice for a dynamic language. On the other hand, I don't much care for it in my database, where I want data storage requirements to be quite strict. I hate the idea of "0000-00-00" being allowed as a date, and am uncomfortable with allowing duplicate keys to be stored in the JSON data type.
>
> So my order of preference for the options would be:
>
> 1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored
>
> 2. Throw an error when a JSON type has duplicate keys
>
> 3. Have the accessors find the last instance of a key and return that value
>
> 4. Let things remain as they are now
>
> On second thought, I don't like 4 at all. It means that the JSON type thinks a value is valid while the accessor does not. They contradict one another.
>
>


You can forget 1. We are not going to have the parser collapse anything.
Either the JSON it gets is valid or it's not. But the parser isn't going
to try to MAKE it valid.

cheers

andrew



Re: Duplicate JSON Object Keys

From
Hannu Krosing
Date:
On 03/08/2013 10:42 PM, Andrew Dunstan wrote:
>
> On 03/08/2013 04:28 PM, David E. Wheeler wrote:
>> On Mar 8, 2013, at 1:21 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>
>>> Here's what rfc2119 says about that wording:
>>>
>>>    4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
>>>    there may exist valid reasons in particular circumstances when the
>>>    particular behavior is acceptable or even useful, but the full
>>>    implications should be understood and the case carefully weighed
>>>    before implementing any behavior described with this label.
>> I suspect this was allowed for the JavaScript behavior where multiple 
>> keys are allowed, but the last key in the list wins.
>>
>>> So we're allowed to do as Robert chose, and I think there are good 
>>> reasons for doing so (apart from anything else, checking it would 
>>> slow down the parser enormously).
>> Yes, but the implications are going to start biting us on the ass now.
>>
>>> Now you could argue that in that case the extractor functions should 
>>> allow it too, and it's probably fairly easy to change them to allow 
>>> it. In that case we need to decide who wins. We could treat a later 
>>> field lexically as overriding an earlier field of the same name, 
>>> which I think is what David expected. That's what plv8 does (i.e. 
>>> it's how v8 interprets JSON):
>>>
>>>    andrew=# create or replace function jget(t json, fld text) returns
>>>    text language plv8 as ' return t[fld]; ';
>>>    CREATE FUNCTION
>>>    andrew=# select jget('{"f1":"x","f1":"y"}','f1');
>>>      jget
>>>    ------
>>>      y
>>>    (1 row)
>>>
>>>
>>> Or you could take the view I originally took that in view of the RFC 
>>> wording we should raise an error if this was found.
>>>
>>> I can live with either view.
>> I’m on the fence. On the one hand, I like the plv8 behavior, which is 
>> nice for a dynamic language. On the other hand, I don't much care for 
>> it in my database, where I want data storage requirements to be quite 
>> strict. I hate the idea of "0000-00-00" being allowed as a date, and 
>> am uncomfortable with allowing duplicate keys to be stored in the 
>> JSON data type.
>>
>> So my order of preference for the options would be:
>>
>> 1. Have the JSON type collapse objects so the last instance of a key 
>> wins and is actually stored
>>
>> 2. Throw an error when a JSON type has duplicate keys
>>
>> 3. Have the accessors find the last instance of a key and return that 
>> value
>>
>> 4. Let things remain as they are now
>>
>> On second thought, I don't like 4 at all. It means that the JSON type 
>> thinks a value is valid while the accessor does not. They contradict 
>> one another.
>>
>>
>
>
> You can forget 1. We are not going to have the parser collapse anything.
> Either the JSON it gets is valid or it's not. But the parser isn't 
> going to try to MAKE it valid.
Ok, so the "make valid" part will have to wait for 
http://www.pgcon.org/2013/schedule/events/518.en.html if this will ever 
happen ;)

Which means that all extractor functions will need to do much more work 
in case of complex json, think of

json_get('{"a":{"b":1},"a":{"1":"x"}, "a":[1,{"b":7}]}'::json,
         ["a","1","b"])

The true value in javascript is 7, but here the json_get function has several
options to error out early, and real confusion as to where to report the
error if in the end it is not found. Essentially, all extractor functions have
to do what we omitted doing at input time.
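
For comparison, the JavaScript-style answer can be checked with Python's json
module, which also keeps the last value for a repeated key (purely an
illustration, not a proposed API):

    import json

    # Under last-key-wins the final "a" (the array [1, {"b": 7}]) survives,
    # so walking the path ["a", "1", "b"] ends at 7.
    doc = '{"a":{"b":1},"a":{"1":"x"}, "a":[1,{"b":7}]}'
    obj = json.loads(doc)
    print(obj["a"][1]["b"])   # 7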

Cheers
Hannu



Re: Duplicate JSON Object Keys

From
Andrew Dunstan
Date:
On 03/08/2013 04:42 PM, Andrew Dunstan wrote:
>
>>
>> So my order of preference for the options would be:
>>
>> 1. Have the JSON type collapse objects so the last instance of a key 
>> wins and is actually stored
>>
>> 2. Throw an error when a JSON type has duplicate keys
>>
>> 3. Have the accessors find the last instance of a key and return that 
>> value
>>
>> 4. Let things remain as they are now
>>
>> On second thought, I don't like 4 at all. It means that the JSON type 
>> thinks a value is valid while the accessor does not. They contradict 
>> one another.
>>
>>
>
>
> You can forget 1. We are not going to have the parser collapse 
> anything. Either the JSON it gets is valid or it's not. But the parser 
> isn't going to try to MAKE it valid.


Actually, now that I think more about it, 3 is the best answer. Here's why: 
even the JSON generators can produce JSON with non-unique field names:
   andrew=# select row_to_json(q) from (select x as a, y as a from
   generate_series(1,2) x, generate_series(3,4) y) q;
      row_to_json
   ---------------
     {"a":1,"a":3}
     {"a":1,"a":4}
     {"a":2,"a":3}
     {"a":2,"a":4}
 


So I think we have no option but to say, in terms of rfc 2119, that we 
have carefully considered and decided not to comply with the RFC's 
recommendation (and we should note that in the docs).

cheers

andrew





Re: Duplicate JSON Object Keys

From
Josh Berkus
Date:
> Actually, now I think more about it 3 is the best answer. Here's why:
> even the JSON generators can produce JSON with non-unique field names:

+1

Also, I think we should add a json_normalize() function to the TODO list.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Duplicate JSON Object Keys

From
Hannu Krosing
Date:
On 03/08/2013 11:03 PM, Andrew Dunstan wrote:
>
> On 03/08/2013 04:42 PM, Andrew Dunstan wrote:
>>
>>>
>>> So my order of preference for the options would be:
>>>
>>> 1. Have the JSON type collapse objects so the last instance of a key 
>>> wins and is actually stored
>>>
>>> 2. Throw an error when a JSON type has duplicate keys
>>>
>>> 3. Have the accessors find the last instance of a key and return 
>>> that value
>>>
>>> 4. Let things remain as they are now
>>>
>>> On second thought, I don't like 4 at all. It means that the JSON type 
>>> thinks a value is valid while the accessor does not. They contradict 
>>> one another.
>>>
>>>
>>
>>
>> You can forget 1. We are not going to have the parser collapse 
>> anything. Either the JSON it gets is valid or it's not. But the 
>> parser isn't going to try to MAKE it valid.
>
>
> Actually, now I think more about it 3 is the best answer.
> Here's why: even the JSON generators can produce JSON with non-unique 
> field names:
Yes, especially if you consider popular json generators vim and strcat() :)

It is not a "serialisation" of some existing object, but it is something
that JavaScript could interpret as valid subset of JavaScript which
producees a JavaScript Object when interpreted.
In this sense it is way better than MySQL timestamp 0000-00-00 00:00

So the loose (without implementing the SHOULD part) meaning of the
JSON spec is "anything that can be read into JavaScript producing
a JS Object" and not "serialisation of a JavaScript Object" as I wanted
to read it initially.

>
>    andrew=# select row_to_json(q) from (select x as a, y as a from
>    generate_series(1,2) x, generate_series(3,4) y) q;
>       row_to_json
>    ---------------
>      {"a":1,"a":3}
>      {"a":1,"a":4}
>      {"a":2,"a":3}
>      {"a":2,"a":4}
>
>
> So I think we have no option but to say, in terms of rfc 2119, that we 
> have carefully considered and decided not to comply with the RFC's 
> recommendation
The downside is that we have just shifted the burden of JS Object 
generation to the getter functions.

I suspect that 99.98% of the time we will get valid and unique JS Object 
serializations or equivalent as input to json_in()

If we want the getter functions to handle the "loose JSON" to Object conversion
side, assuming our stored JSON can contain non-unique keys, then these are
bound to be slower, as they have to do these checks. They can't just grab the
first matching one and return or recurse on that.

> (and we should note that in the docs).
definitely +1
>
> cheers
>
> andrew
>
>
>
>




Re: Duplicate JSON Object Keys

From
Andrew Dunstan
Date:
On 03/08/2013 06:37 PM, Hannu Krosing wrote:

>
> I suspect that 99.98% of the time we will get valid and unique JS 
> Object serializations or equivalent as input to json_in()
>
> If we want the getter functions to handle the "loose JSON" to Object 
> conversion
> side  assuming our stored JSON can contain non-unique keys then these are
> bound to be slower, as they have to do these checks. They can't just 
> grab the first
> matching one and return or recurse on that.


No, there will be no slowdown. The parser doesn't short circuit.

Read the code.

cheers

andrew



Re: Duplicate JSON Object Keys

From
Noah Misch
Date:
On Fri, Mar 08, 2013 at 10:34:20PM +0100, Hannu Krosing wrote:
> On 03/08/2013 10:01 PM, Alvaro Herrera wrote:
>> Hannu Krosing escribió:
>>> If it does not meet these "semantic" constraints, then it is not
>>> really JSON - it is merely JSON-like.

>> Is it wrong?  The standard cited says SHOULD, not MUST.
>
>
> I think one MAY start implementation with loose interpretation of
> SHOULD, but if at all possible we SHOULD implement the
> SHOULD-qualified features :)
>
> http://www.ietf.org/rfc/rfc2119.txt:
>
> SHOULD   This word, or the adjective "RECOMMENDED", mean that there
> may exist valid reasons in particular circumstances to ignore a
> particular item, but the full implications must be understood and
> carefully weighed before choosing a different course.

That "SHOULD" in section 2.2 of RFC 4627 constrains JSON data, not JSON
parsers.  Section 4 addresses parsers, saying "A JSON parser MUST accept all
texts that conform to the JSON grammar."  

> We might start with just throwing a warning for duplicate keys, but I
> can see no good reason to do so. Except ease of implementation and with
> current JSON-AS-TEXT implementation performance.

Since its departure from a "SHOULD" item does not impugn the conformance of an
input text, it follows that json_in(), to be a conforming JSON parser, MUST
not reject objects with duplicate keys.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: Duplicate JSON Object Keys

From
Robert Haas
Date:
On Fri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> So my order of preference for the options would be:
>>
>> 1. Have the JSON type collapse objects so the last instance of a key wins
>> and is actually stored
>>
>> 2. Throw an error when a JSON type has duplicate keys
>>
>> 3. Have the accessors find the last instance of a key and return that
>> value
>>
>> 4. Let things remain as they are now
>>
>> On second thought, I don't like 4 at all. It means that the JSON type
>> thinks a value is valid while the accessor does not. They contradict one
>> another.
>
> You can forget 1. We are not going to have the parser collapse anything.
> Either the JSON it gets is valid or it's not. But the parser isn't going to
> try to MAKE it valid.

Why not?  Because it's the wrong thing to do, or because it would be slower?

What I think is tricky here is that there's more than one way to
conceptualize what the JSON data type really is.  Is it a key-value
store of sorts, or just a way to store text values that meet certain
minimalist syntactic criteria?  I had imagined it as the latter, in
which case normalization isn't sensible.  But if you think of it the
first way, then normalization is not only sensible, but almost
obligatory.  For example, we don't feel bad about this:

rhaas=# select '1e1'::numeric;
 numeric
---------
      10
(1 row)

I think Andrew and I had envisioned this as basically a text data type
that enforces some syntax checking on its input, hence the current
design.  But I'm not sure that's the ONLY sensible design.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Duplicate JSON Object Keys

From
Craig Ringer
Date:
On 03/13/2013 08:17 PM, Robert Haas wrote:
> I think Andrew and I had envisioned this as basically a text data type
> that enforces some syntax checking on its input, hence the current
> design.  But I'm not sure that's the ONLY sensible design.
We're probably stuck with it at this point, but it may well be worth
considering the later introduction of a compatible `jsonobj` that stores
parsed and normalized json objects in some internal format the client
doesn't have to care about, like serialized V8 JS VM objects.

I suspect that such a type is better offered by a contrib until/unless
PL/V8 or a similar becomes a core language. It'd be nuts to try to
re-implement all of the JSON and javascript object functionality in a
javascript engine when we can just plug an existing one in and use its
JSON and javascript object manipulation. The minimalist approach makes
sense for the json type precisely because it's just validated text, but
I don't think it makes sense to continually extend it and slowly
reinvent a whole javascript engine in Pg.

If we're going to do things like normalizing json I think that's a job
for a real JavaScript engine that understands Javascript objects.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Duplicate JSON Object Keys

From
Andrew Dunstan
Date:
On 03/13/2013 08:17 AM, Robert Haas wrote:
> On Fri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>> So my order of preference for the options would be:
>>>
>>> 1. Have the JSON type collapse objects so the last instance of a key wins
>>> and is actually stored
>>>
>>> 2. Throw an error when a JSON type has duplicate keys
>>>
>>> 3. Have the accessors find the last instance of a key and return that
>>> value
>>>
>>> 4. Let things remain as they are now
>>>
>>> On second thought, I don't like 4 at all. It means that the JSON type
>>> thinks a value is valid while the accessor does not. They contradict one
>>> another.
>> You can forget 1. We are not going to have the parser collapse anything.
>> Either the JSON it gets is valid or it's not. But the parser isn't going to
>> try to MAKE it valid.
> Why not?  Because it's the wrong thing to do, or because it would be slower?
>
> What I think is tricky here is that there's more than one way to
> conceptualize what the JSON data type really is.  Is it a key-value
> store of sorts, or just a way to store text values that meet certain
> minimalist syntactic criteria?  I had imagined it as the latter, in
> which case normalization isn't sensible.  But if you think of it the
> first way, then normalization is not only sensible, but almost
> obligatory.  For example, we don't feel bad about this:
>
> rhaas=# select '1e1'::numeric;
>   numeric
> ---------
>        10
> (1 row)
>
> I think Andrew and I had envisioned this as basically a text data type
> that enforces some syntax checking on its input, hence the current
> design.  But I'm not sure that's the ONLY sensible design.
>


I think we've moved on from this point, because a) other implementations 
allow duplicate keys, b) it's trivially easy to make Postgres generate 
such json, and c) there is some dispute about exactly what the spec 
mandates.

I'll be posting a revised patch shortly that doesn't error out but 
simply uses the value for the later key lexically.

cheers

andrew





Re: Duplicate JSON Object Keys

From
Andres Freund
Date:
On 2013-03-13 20:48:57 +0800, Craig Ringer wrote:
> On 03/13/2013 08:17 PM, Robert Haas wrote:
> > I think Andrew and I had envisioned this as basically a text data type
> > that enforces some syntax checking on its input, hence the current
> > design.  But I'm not sure that's the ONLY sensible design.
> We're probably stuck with it at this point, but it may well be worth
> considering the later introduction of a compatible `jsonobj` that stores
> parsed and normalized json objects in some internal format the client
> doesn't have to care about, like serialized V8 JS VM objects.
> 
> I suspect that such a type is better offered by a contrib until/unless
> PL/V8 or a similar becomes a core language. It'd be nuts to try to
> re-implement all of the JSON and javascript object functionality in a
> javascript engine when we can just plug an existing one in and use its
> JSON and javascript object manipulation. The minimalist approach makes
> sense for the json type precisely because it's just validated text, but
> I don't think it makes sense to continually extend it and slowly
> reinvent a whole javascript engine in Pg.

While I am not convinced - but not the contrary either - that using
something like V8 is a good idea, I wish the patch adding json had
reserved the first byte in the varlena for the 'json encoding' or
something similar. That would have left the road open for easily adding
different encodings in the future. Now you're left with marking it with a
nullbyte in the beginning or similar atrocities...

Just wanted to say that we might want to think about such stuff, now that
we preserve cross-version compatibility of on-disk data, the next time we
add a type.

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Duplicate JSON Object Keys

From
"David E. Wheeler"
Date:
On Mar 13, 2013, at 5:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> What I think is tricky here is that there's more than one way to
> conceptualize what the JSON data type really is.  Is it a key-value
> store of sorts, or just a way to store text values that meet certain
> minimalist syntactic criteria?  I had imagined it as the latter, in
> which case normalization isn't sensible.  But if you think of it the
> first way, then normalization is not only sensible, but almost
> obligatory.

That makes a lot of sense. Given the restrictions I tend to prefer in my database data types, I had imagined it as the former. And since I'm using it now to store key/value pairs (killing off some awful EAV implementations in the process, BTW), I certainly think of it more formally as an object.


But I can live with the other interpretation, as long as the differences are clearly understood and documented. Perhaps a note could be added to the docs explaining this difference, and what one can do to adapt for it. A normalizing function would certainly help.

Best,

David




Re: Duplicate JSON Object Keys

From
Gavin Flower
Date:
<div class="moz-cite-prefix">On 14/03/13 02:02, Andrew Dunstan wrote:<br /></div><blockquote
cite="mid:514078E1.2090905@dunslane.net"type="cite"><br /> On 03/13/2013 08:17 AM, Robert Haas wrote: <br /><blockquote
type="cite">OnFri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <a class="moz-txt-link-rfc2396E"
href="mailto:andrew@dunslane.net"><andrew@dunslane.net></a>wrote: <br /><blockquote type="cite"><blockquote
type="cite">Somy order of preference for the options would be: <br /><br /> 1. Have the JSON type collapse objects so
thelast instance of a key wins <br /> and is actually stored <br /><br /> 2. Throw an error when a JSON type has
duplicatekeys <br /><br /> 3. Have the accessors find the last instance of a key and return that <br /> value <br /><br
/>4. Let things remain as they are now <br /><br /> On second though, I don't like 4 at all. It means that the JSON
type<br /> things a value is valid while the accessor does not. They contradict one <br /> another. <br /></blockquote>
Youcan forget 1. We are not going to have the parser collapse anything. <br /> Either the JSON it gets is valid or it's
not.But the parser isn't going to <br /> try to MAKE it valid. <br /></blockquote> Why not?  Because it's the wrong
thingto do, or because it would be slower? <br /><br /> What I think is tricky here is that there's more than one way
to<br /> conceptualize what the JSON data type really is.  Is it a key-value <br /> store of sorts, or just a way to
storetext values that meet certain <br /> minimalist syntactic criteria?  I had imagined it as the latter, in <br />
whichcase normalization isn't sensible.  But if you think of it the <br /> first way, then normalization is not only
sensible,but almost <br /> obligatory.  For example, we don't feel bad about this: <br /><br /> rhaas=# select
'1e1'::numeric;<br />   numeric <br /> --------- <br />        10 <br /> (1 row) <br /><br /> I think Andrew and I had
envisionedthis as basically a text data type <br /> that enforces some syntax checking on its input, hence the current
<br/> design.  But I'm not sure that's the ONLY sensible design. <br /><br /></blockquote><br /><br /> I think we've
movedon from this point, because a) other implementations allow duplicate keys, b) it's trivially easy to make Postgres
generatesuch json, and c) there is some dispute about exactly what the spec mandates. <br /><br /> I'll be posting a
revisedpatch shortly that doesn't error out but simply uses the value for the later key lexically. <br /><br /> cheers
<br/><br /> andrew <br /><br /><br /><br /><br /></blockquote><font size="-1">How about adding a new func<font
size="-1">tionwith '_stric<font size="-1">t' added to the existing name, with an extra parameter
</font></font></font>'coalesce'- or using other names, if considered more appropriate!<br /><br /> That way slower more
stringentfunctionality can be added where required.  This way, the existing function need not be changed.<br /><br />
Ifcoalesce = true, <br /> then: the last duplicate is used<br /> else: an error is returned when the new key is a
duplicate.<br/><br /><br /> Cheers,<br /> Gavin<br /><br /><br /><br /> 

Re: Duplicate JSON Object Keys

From
Andrew Dunstan
Date:
On 03/13/2013 12:51 PM, Gavin Flower wrote:
> On 14/03/13 02:02, Andrew Dunstan wrote:
>>
>> On 03/13/2013 08:17 AM, Robert Haas wrote:
>>> On Fri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <andrew@dunslane.net> 
>>> wrote:
>>>>> So my order of preference for the options would be:
>>>>>
>>>>> 1. Have the JSON type collapse objects so the last instance of a 
>>>>> key wins
>>>>> and is actually stored
>>>>>
>>>>> 2. Throw an error when a JSON type has duplicate keys
>>>>>
>>>>> 3. Have the accessors find the last instance of a key and return that
>>>>> value
>>>>>
>>>>> 4. Let things remain as they are now
>>>>>
>>>>> On second thought, I don't like 4 at all. It means that the JSON type
>>>>> thinks a value is valid while the accessor does not. They 
>>>>> contradict one
>>>>> another.
>>>> You can forget 1. We are not going to have the parser collapse 
>>>> anything.
>>>> Either the JSON it gets is valid or it's not. But the parser isn't 
>>>> going to
>>>> try to MAKE it valid.
>>> Why not?  Because it's the wrong thing to do, or because it would be 
>>> slower?
>>>
>>> What I think is tricky here is that there's more than one way to
>>> conceptualize what the JSON data type really is.  Is it a key-value
>>> store of sorts, or just a way to store text values that meet certain
>>> minimalist syntactic criteria?  I had imagined it as the latter, in
>>> which case normalization isn't sensible.  But if you think of it the
>>> first way, then normalization is not only sensible, but almost
>>> obligatory.  For example, we don't feel bad about this:
>>>
>>> rhaas=# select '1e1'::numeric;
>>>   numeric
>>> ---------
>>>        10
>>> (1 row)
>>>
>>> I think Andrew and I had envisioned this as basically a text data type
>>> that enforces some syntax checking on its input, hence the current
>>> design.  But I'm not sure that's the ONLY sensible design.
>>>
>>
>>
>> I think we've moved on from this point, because a) other 
>> implementations allow duplicate keys, b) it's trivially easy to make 
>> Postgres generate such json, and c) there is some dispute about 
>> exactly what the spec mandates.
>>
>> I'll be posting a revised patch shortly that doesn't error out but 
>> simply uses the value for the later key lexically.
>>
>> cheers
>>
>> andrew
>>
>>
>>
>>
> How about adding a new function with '_strict' added to the existing 
> name, with an extra parameter 'coalesce' - or using other names, if 
> considered more appropriate!
>
> That way slower more stringent functionality can be added where 
> required.  This way, the existing function need not be changed.
>
> If coalesce = true,
> then: the last duplicate is used
> else: an error is returned when the new key is a duplicate.
>
>
>


For good or ill, we now already have a json type that will accept 
strings with duplicate keys, and generator functions which can now 
generate such strings. If someone wants functions to enforce a stricter 
validity check (e.g. via a check constraint on a domain), or to convert 
json to a canonical version which strips out prior keys of the same name 
and their associated values, then these should be relatively simple to 
implement given the parser API in the current patch. But they aren't 
part of the current patch, and I think it's way too late to be adding 
such things. I have been persuaded by arguments made upthread that the 
best thing to do is exactly what other well known json-accepting 
implementations do (e.g. V8), which is to accept json with duplicate 
keys and to treat the later key/value as overriding the former 
key/value. If I'd done that from the start nobody would now be talking 
about this at all.

cheers

andrew



Re: Duplicate JSON Object Keys

From
"David E. Wheeler"
Date:
On Mar 13, 2013, at 10:45 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

> If someone wants functions to enforce a stricter validity check (e.g. via a check constraint on a domain), or to convert json to a canonical version which strips out prior keys of the same name and their associated values, then these should be relatively simple to implement given the parser API in the current patch. But they aren't part of the current patch, and I think it's way too late to be adding such things.

I think it might be good to get something like this into core eventually, otherwise I suspect that there will be a different version of it for every JSON-using project out there. And my first cut at it won’t descend into sub-objects.

> I have been persuaded by arguments made upthread that the best thing to do is exactly what other well known json-accepting implementations do (e.g. V8), which is to accept json with duplicate keys and to treat the later key/value as overriding the former key/value. If I'd done that from the start nobody would now be talking about this at all.

That’s true, though I might have started thinking about a canonicalizing function before long. :-)

Best,

David




Re: Duplicate JSON Object Keys

From
Andrew Dunstan
Date:
On 03/13/2013 01:50 PM, David E. Wheeler wrote:
> On Mar 13, 2013, at 10:45 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>> If someone wants functions to enforce a stricter validity check (e.g. via a check constraint on a domain), or to convert json to a canonical version which strips out prior keys of the same name and their associated values, then these should be relatively simple to implement given the parser API in the current patch. But they aren't part of the current patch, and I think it's way too late to be adding such things.
> I think it might be good to get something like this into core eventually, otherwise I suspect that there will be a different version of it for every JSON-using project out there. And my first cut at it won’t descend into sub-objects.


Then you wouldn't be doing it right. The whole thing about a recursive
descent parser is that it's, well, recursive.
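
To illustrate the point, a canonicalizer built on a real parser gets the
nesting for free, because last-key-wins is applied at every level of the
recursion. A quick sketch with Python's json module standing in for the
parser (not part of the patch):

    import json

    # The duplicate "b" inside the nested object and the duplicate outer "a"
    # are each resolved by the same rule, one object per level of recursion.
    doc = '{"a": {"b": 1, "b": 2}, "a": {"c": 3, "c": 4}}'
    print(json.dumps(json.loads(doc)))   # {"a": {"c": 4}}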

cheers

andrew



Re: Duplicate JSON Object Keys

From
"David E. Wheeler"
Date:
On Mar 13, 2013, at 10:59 AM, Andrew Dunstan <andrew@dunslane.net> wrote:

>> And my first cut at it won’t descend into sub-objects.
>
>
> The you wouldn't be doing it right. The whole thing about a recursive descent parser is that it's, well, recursive.

Right, but it would serve my immediate needs. I have a column that just stores key/value pairs with no nesting. So I know I can write something like this and have it be good enough:
   create or replace function json_smash(
       json
   ) RETURNS JSON language SQL STRICT IMMUTABLE AS $$
       SELECT format('{%s}', array_to_string(ARRAY(
           SELECT format('%s: %s', to_json(key), value)
             FROM (
                  SELECT key, value, row_number() OVER (
                             partition by key order by rnum desc
                         ) AS rnum
                    FROM (
                        SELECT key, value, row_number() OVER (
                                   partition by key
                               ) AS rnum
                          FROM json_each($1)
                    ) a
             ) b
            WHERE rnum = 1
       ), ','))::json;
   $$;

And do you really want to see that unloosed on the world? :-P (Yes, I know there is no guarantee on the order of rows returned by json_each()).

Best,

David


Re: Duplicate JSON Object Keys

From
Hannu Krosing
Date:
On 03/13/2013 12:40 PM, David E. Wheeler wrote:
> On Mar 13, 2013, at 5:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> What I think is tricky here is that there's more than one way to
>> conceptualize what the JSON data type really is.  Is it a key-value
>> store of sorts, or just a way to store text values that meet certain
>> minimalist syntactic criteria?  I had imagined it as the latter, in
>> which case normalization isn't sensible.  But if you think of it the
>> first way, then normalization is not only sensible, but almost
>> obligatory.
> That makes a lot of sense. Given the restrictions I tend to prefer in my database data types, I had imagined it as the former. And since I'm using it now to store key/value pairs (killing off some awful EAV implementations in the process, BTW), I certainly think of it more formally as an object.
 
>
>
> But I can live with the other interpretation, as long as the differences are clearly understood and documented. Perhaps a note could be added to the docs explaining this difference, and what one can do to adapt for it. A normalizing function would certainly help.
 
I guess the easiest and most generic way to normalize is to actually 
convert to some internal representation and back.

in pl/python this would look like this:

hannu=# create function normalize(IN ij json, OUT oj json) language 
plpythonu as $$
import json
return json.dumps(json.loads(ij))
$$;
CREATE FUNCTION
hannu=# select normalize('{"a":1, "a":"b", "a":true}');
  normalize
-------------
 {"a": true}
(1 row)

If we wanted to be really fancy we could start storing our json in
some format which is faster to parse, like tnetstrings, but probably
it is too late in the release cycle to change this now.

Hannu
>
> Best,
>
> David
>