Thread: jsonb, unicode escapes and escaped backslashes

jsonb, unicode escapes and escaped backslashes

From
Andrew Dunstan
Date:
The following case has just been brought to my attention (look at the 
differing number of backslashes):
   andrew=# select jsonb '"\\u0000"';
      jsonb
   ----------
    "\u0000"
   (1 row)

   andrew=# select jsonb '"\u0000"';
      jsonb
   ----------
    "\u0000"
   (1 row)

   andrew=# select json '"\u0000"';
      json
   ----------
    "\u0000"
   (1 row)

   andrew=# select json '"\\u0000"';
      json
   -----------
    "\\u0000"
   (1 row)

The problem is that jsonb uses the parsed, unescaped value of the 
string, while json does not. When the string parser sees the input with 
the two backslashes, it outputs a single backslash, and then it encounters 
the remaining characters and emits them as is, resulting in a token of 
'\u0000'. When it encounters the input with one backslash, it recognizes 
a unicode escape, and because it's for U+0000 emits '\u0000'. All other 
unicode escapes are resolved, so the only ambiguity on input concerns 
this case.
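
For contrast, a valid non-NUL escape is fully resolved on input, so no 
such collision arises (illustrative session):

   andrew=# select jsonb '"\u0041"';
    jsonb
   -------
    "A"
   (1 row)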

Things get worse, though. On output, '\uabcd' for any four hex digits is 
recognized as a unicode escape, and thus the backslash is not escaped, 
so that we get:
   andrew=# select jsonb '"\\uabcd"';
      jsonb
   ----------
    "\uabcd"
   (1 row)


We could probably fix this fairly easily for non-U+0000 cases by having 
jsonb_to_cstring use a different escape_json routine.
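
Roughly, such a routine would always escape the backslash, with no \uXXXX 
exception -- an untested sketch for json.c (name invented; this is 
essentially what escape_json did before the exception was introduced):

static void
escape_json_strict(StringInfo buf, const char *str)
{
    const char *p;

    appendStringInfoCharMacro(buf, '"');
    for (p = str; *p; p++)
    {
        switch (*p)
        {
            case '\b': appendStringInfoString(buf, "\\b");  break;
            case '\f': appendStringInfoString(buf, "\\f");  break;
            case '\n': appendStringInfoString(buf, "\\n");  break;
            case '\r': appendStringInfoString(buf, "\\r");  break;
            case '\t': appendStringInfoString(buf, "\\t");  break;
            case '"':  appendStringInfoString(buf, "\\\""); break;
            case '\\':
                /* always escape the backslash; no lookahead for "uXXXX" */
                appendStringInfoString(buf, "\\\\");
                break;
            default:
                if ((unsigned char) *p < ' ')
                    appendStringInfo(buf, "\\u%04x", (int) *p);
                else
                    appendStringInfoCharMacro(buf, *p);
                break;
        }
    }
    appendStringInfoCharMacro(buf, '"');
}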

But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 
case would look like. Maybe we should detect such input and emit a 
warning of ambiguity? It's likely to be rare enough, but clearly not as 
rare as we'd like, since this is a report from the field.

cheers

andrew



Re: jsonb, unicode escapes and escaped backslashes

From
Noah Misch
Date:
On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:
> The following case has just been brought to my attention (look at the
> differing number of backslashes):
> 
>    andrew=# select jsonb '"\\u0000"';
>       jsonb
>    ----------
>      "\u0000"
>    (1 row)
> 
>    andrew=# select jsonb '"\u0000"';
>       jsonb
>    ----------
>      "\u0000"
>    (1 row)

A mess indeed.  The input is unambiguous, but the jsonb representation can't
distinguish "\u0000" from "\\u0000".  Some operations on the original json
type have similar problems, since they use an in-memory binary representation
with the same shortcoming:

[local] test=# select json_array_element_text($$["\u0000"]$$, 0) =
test-# json_array_element_text($$["\\u0000"]$$, 0);
 ?column? 
----------
 t
(1 row)

> Things get worse, though. On output, '\uabcd' for any four hex digits is
> recognized as a unicode escape, and thus the backslash is not escaped, so
> that we get:
> 
>    andrew=# select jsonb '"\\uabcd"';
>       jsonb
>    ----------
>      "\uabcd"
>    (1 row)
> 
> 
>> We could probably fix this fairly easily for non-U+0000 cases by having
> jsonb_to_cstring use a different escape_json routine.

Sounds reasonable.  For 9.4.1, before more people upgrade?

> But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 case
> would look like.

Agreed.  When a string unescape algorithm removes some kinds of backslash
escapes and not others, it's nigh inevitable that two semantically-distinct
inputs can yield the same output.  json_lex_string() fell into that trap by
making an exception for \u0000.  To fix this, the result needs to be fully
unescaped (\u0000 converted to the NUL byte) or retain all backslash escapes.
(Changing that either way is no fun now that an on-disk format is at stake.)
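
A minimal demonstration of the resulting collision (same instance):

[local] test=# select jsonb $$"\u0000"$$ = jsonb $$"\\u0000"$$;
 ?column? 
----------
 t
(1 row)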

> Maybe we should detect such input and emit a warning of
> ambiguity? It's likely to be rare enough, but clearly not as rare as we'd
> like, since this is a report from the field.

Perhaps.  Something like "WARNING:  jsonb cannot represent \\u0000; reading as
\u0000"?  Alas, but I do prefer that to silent data corruption.

Thanks,
nm



Re: jsonb, unicode escapes and escaped backslashes

From
Andrew Dunstan
Date:
On 01/23/2015 02:18 AM, Noah Misch wrote:
> On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:
>> The following case has just been brought to my attention (look at the
>> differing number of backslashes):
>>
>>     andrew=# select jsonb '"\\u0000"';
>>        jsonb
>>     ----------
>>       "\u0000"
>>     (1 row)
>>
>>     andrew=# select jsonb '"\u0000"';
>>        jsonb
>>     ----------
>>       "\u0000"
>>     (1 row)
> A mess indeed.  The input is unambiguous, but the jsonb representation can't
> distinguish "\u0000" from "\\u0000".  Some operations on the original json
> type have similar problems, since they use an in-memory binary representation
> with the same shortcoming:
>
> [local] test=# select json_array_element_text($$["\u0000"]$$, 0) =
> test-# json_array_element_text($$["\\u0000"]$$, 0);
>   ?column?
> ----------
>   t
> (1 row)
>
>> Things get worse, though. On output, '\uabcd' for any four hex digits is
>> recognized as a unicode escape, and thus the backslash is not escaped, so
>> that we get:
>>
>>     andrew=# select jsonb '"\\uabcd"';
>>        jsonb
>>     ----------
>>       "\uabcd"
>>     (1 row)
>>
>>
>> We could probably fix this fairly easily for non-U+0000 cases by having
>> jsonb_to_cstring use a different escape_json routine.
> Sounds reasonable.  For 9.4.1, before more people upgrade?
>
>> But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 case
>> would look like.
> Agreed.  When a string unescape algorithm removes some kinds of backslash
> escapes and not others, it's nigh inevitable that two semantically-distinct
> inputs can yield the same output.  json_lex_string() fell into that trap by
> making an exception for \u0000.  To fix this, the result needs to be fully
> unescaped (\u0000 converted to the NUL byte) or retain all backslash escapes.
> (Changing that either way is no fun now that an on-disk format is at stake.)
>
>> Maybe we should detect such input and emit a warning of
>> ambiguity? It's likely to be rare enough, but clearly not as rare as we'd
>> like, since this is a report from the field.
> Perhaps.  Something like "WARNING:  jsonb cannot represent \\u0000; reading as
> \u0000"?  Alas, but I do prefer that to silent data corruption.
>



Maybe something like this patch. I have two concerns about it, though.
The first is the possible performance impact of looking for the
offending string in every jsonb input, and the second is that the test
isn't quite right, since input of \\\u0000 doesn't raise this issue -
i.e. the problem arises when u0000 is preceded by an even number of
backslashes.
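
Something like this untested sketch is what the test would need to do 
instead (helper name invented):

static bool
u0000_follows_escaped_backslash(const char *strstart, const char *u)
{
    int     nbackslashes = 0;

    /* count the run of backslashes immediately preceding the 'u' */
    while (u > strstart && u[-1] == '\\')
    {
        nbackslashes++;
        u--;
    }

    /* even and nonzero: an escaped backslash plus literal "u0000" */
    return (nbackslashes > 0 && nbackslashes % 2 == 0);
}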

For the moment, maybe I could commit the fix for the non-U+0000 case for
escape_json, and we could think some more about detecting and warning
about the problem strings.

cheers

andrew

Attachment

Re: jsonb, unicode escapes and escaped backslashes

From
Noah Misch
Date:
On Mon, Jan 26, 2015 at 09:20:54AM -0500, Andrew Dunstan wrote:
> On 01/23/2015 02:18 AM, Noah Misch wrote:
> >On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:
> >>We could probably fix this fairly easily for non-U+0000 cases by having
> >>jsonb_to_cstring use a different escape_json routine.

> >>Maybe we should detect such input and emit a warning of
> >>ambiguity? It's likely to be rare enough, but clearly not as rare as we'd
> >>like, since this is a report from the field.

> >Perhaps.  Something like "WARNING:  jsonb cannot represent \\u0000; reading as
> >\u0000"?  Alas, but I do prefer that to silent data corruption.
> 
> Maybe something like this patch. I have two concerns about it, though. The
> first is the possible performance impact of looking for the offending string
> in every jsonb input, and the second is that the test isn't quite right,
> since input of \\\u0000 doesn't raise this issue - i.e. the problem arises
> when u0000 is preceded by an even number of backslashes.

I share your second concern.  It is important that this warning not cry wolf;
it should never fire for an input that is actually unaffected.

> For the moment, maybe I could commit the fix for the non-U+0000 case for
> escape_json, and we could think some more about detecting and warning about
> the problem strings.

+1 for splitting development that way.  Fixing the use of escape_json() is
objective, but the choices around the warning are more subtle.



Re: jsonb, unicode escapes and escaped backslashes

From
Andrew Dunstan
Date:
On 01/27/2015 12:24 AM, Noah Misch wrote:
>> For the moment, maybe I could commit the fix for the non-U+0000 case for
>> escape_json, and we could think some more about detecting and warning about
>> the problem strings.
> +1 for splitting development that way.  Fixing the use of escape_json() is
> objective, but the choices around the warning are more subtle.
>


OK, so something like this patch? I'm mildly concerned that you and I
are the only ones commenting on this.

cheers

andrew



Attachment

Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 01/27/2015 12:24 AM, Noah Misch wrote:
>> +1 for splitting development that way.  Fixing the use of escape_json() is
>> objective, but the choices around the warning are more subtle.

> OK, so something like this patch? I'm mildly concerned that you and I 
> are the only ones commenting on this.

Doesn't seem to me like this fixes anything.  If the content of a jsonb
value is correct, the output will be the same with or without this patch;
and if it's not, this isn't going to really improve matters.

I think coding anything is premature until we decide how we're going to
deal with the fundamental ambiguity.
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Andrew Dunstan
Date:
On 01/27/2015 12:23 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 01/27/2015 12:24 AM, Noah Misch wrote:
>>> +1 for splitting development that way.  Fixing the use of escape_json() is
>>> objective, but the choices around the warning are more subtle.
>> OK, so something like this patch? I'm mildly concerned that you and I
>> are the only ones commenting on this.
> Doesn't seem to me like this fixes anything.  If the content of a jsonb
> value is correct, the output will be the same with or without this patch;
> and if it's not, this isn't going to really improve matters.
>
> I think coding anything is premature until we decide how we're going to
> deal with the fundamental ambiguity.
>
>             


The input \\uabcd will be stored correctly as \uabcd, but this will in 
turn be rendered as \uabcd, whereas it should be rendered as \\uabcd. 
That's what the patch fixes.

There are two problems here and this addresses one of them. The other 
problem is the ambiguity regarding \\u0000 and \u0000.

cheers

andrew




Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 01/27/2015 12:23 PM, Tom Lane wrote:
>> I think coding anything is premature until we decide how we're going to
>> deal with the fundamental ambiguity.

> The input \\uabcd will be stored correctly as \uabcd, but this will in 
> turn be rendered as \uabcd, whereas it should be rendered as \\uabcd. 
> That's what the patch fixes.
> There are two problems here and this addresses one of them. The other 
> problem is the ambiguity regarding \\u0000 and \u0000.

It's the same problem really, and until we have an answer about
what to do with \u0000, I think any patch is half-baked and possibly
counterproductive.

In particular, I would like to suggest that the current representation of
\u0000 is fundamentally broken and that we have to change it, not try to
band-aid around it.  This will mean an on-disk incompatibility for jsonb
data containing U+0000, but hopefully there is very little of that out
there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
consider such solutions.

The most obvious way to store such data unambiguously is to just go ahead
and store U+0000 as a NUL byte (\000).  The only problem with that is that
then such a string cannot be considered to be a valid value of type TEXT,
which would mean that we'd need to throw an error if we were asked to
convert a JSON field containing such a character to text.  I don't
particularly have a problem with that, except possibly for the time cost
of checking for \000 before allowing a conversion to occur.  While a
memchr() check might be cheap enough, we could also consider inventing a
new JEntry type code for string-containing-null, so that there's a
distinction in the type system between strings that are coercible to text
and those that are not.
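
The memchr() check itself would be trivial -- a sketch, assuming a
JsonbValue string "v" on its way out to text:

    if (memchr(v->val.string.val, '\0', v->val.string.len) != NULL)
        ereport(ERROR,
                (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
                 errmsg("JSON string with embedded NUL cannot be converted to text")));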

If we went down a path like that, the currently proposed patch would be
quite useless.
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Andrew Dunstan
Date:
On 01/27/2015 01:40 PM, Tom Lane wrote:

> In particular, I would like to suggest that the current representation of
> \u0000 is fundamentally broken and that we have to change it, not try to
> band-aid around it.  This will mean an on-disk incompatibility for jsonb
> data containing U+0000, but hopefully there is very little of that out
> there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
> consider such solutions.
>
>

Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on 
the table what I suggested is unnecessary.

cheers

andrew



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 01/27/2015 01:40 PM, Tom Lane wrote:
>> In particular, I would like to suggest that the current representation of
>> \u0000 is fundamentally broken and that we have to change it, not try to
>> band-aid around it.  This will mean an on-disk incompatibility for jsonb
>> data containing U+0000, but hopefully there is very little of that out
>> there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
>> consider such solutions.

> Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on 
> the table what I suggested is unnecessary.

Well, we can either fix it now or suffer with a broken representation
forever.  I'm not wedded to the exact solution I described, but I think
we'll regret it if we don't change the representation.

The only other plausible answer seems to be to flat out reject \u0000.
But I assume nobody likes that.
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Andrew Dunstan
Date:
On 01/27/2015 02:28 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 01/27/2015 01:40 PM, Tom Lane wrote:
>>> In particular, I would like to suggest that the current representation of
>>> \u0000 is fundamentally broken and that we have to change it, not try to
>>> band-aid around it.  This will mean an on-disk incompatibility for jsonb
>>> data containing U+0000, but hopefully there is very little of that out
>>> there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
>>> consider such solutions.
>> Hmm, OK. I had thought we'd be ruling that out, but I agree if it's on
>> the table what I suggested is unnecessary.
> Well, we can either fix it now or suffer with a broken representation
> forever.  I'm not wedded to the exact solution I described, but I think
> we'll regret it if we don't change the representation.
>
> The only other plausible answer seems to be to flat out reject \u0000.
> But I assume nobody likes that.
>
>             

I don't think we can be in the business of rejecting valid JSON.

cheers

andrew



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 01/27/2015 02:28 PM, Tom Lane wrote:
>> Well, we can either fix it now or suffer with a broken representation
>> forever.  I'm not wedded to the exact solution I described, but I think
>> we'll regret it if we don't change the representation.
>> 
>> The only other plausible answer seems to be to flat out reject \u0000.
>> But I assume nobody likes that.

> I don't think we can be in the business of rejecting valid JSON.

Actually, after studying the code a bit, I wonder if we wouldn't be best
off to do exactly that, at least for 9.4.x.  At minimum we're talking
about an API change for JsonSemAction functions (which currently get the
already-de-escaped string as a C string; not gonna work for embedded
nulls).  I'm not sure if there are already third-party extensions using
that API, but it seems possible, in which case changing it in a minor
release wouldn't be nice.  Even ignoring that risk, making sure
we'd fixed everything seems like more than a day's work, which is as
much as I for one could spare before 9.4.1.

Also, while the idea of throwing error only when a \0 needs to be
converted to text seems logically clean, it looks like that might pretty
much cripple the usability of such values anyhow, because we convert to
text at the drop of a hat.  So some investigation and probably additional
work would be needed to ensure you could do at least basic things with
such values.  (A function for direct conversion to bytea might be useful
too.)

I think the "it would mean rejecting valid JSON" argument is utter
hogwash.  We already reject, eg, "\u00A0" if you're not using a UTF8
encoding.  And we reject "1e10000", not because that's invalid JSON
but because of an implementation restriction of our underlying numeric
type.  I don't see any moral superiority of that over rejecting "\u0000"
because of an implementation restriction of our underlying text type.

So at this point I propose that we reject \u0000 when de-escaping JSON.
Anybody who's seriously unhappy with that can propose a patch to fix it
properly in 9.5 or later.
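
Concretely, the rejection would live in json_lex_string()'s \u handling,
on the order of (sketch only):

    if (ch == 0)
        ereport(ERROR,
                (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER),
                 errmsg("unsupported Unicode escape sequence"),
                 errdetail("\\u0000 cannot be converted to text."),
                 report_json_context(lex)));

applied whenever the caller actually needs the de-escaped value.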

We probably need to rethink the re-escaping behavior as well; I'm not
sure if your latest patch is the right answer for that.
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Merlin Moncure
Date:
On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 01/27/2015 12:23 PM, Tom Lane wrote:
>>> I think coding anything is premature until we decide how we're going to
>>> deal with the fundamental ambiguity.
>
>> The input \\uabcd will be stored correctly as \uabcd, but this will in
>> turn be rendered as \uabcd, whereas it should be rendered as \\uabcd.
>> That's what the patch fixes.
>> There are two problems here and this addresses one of them. The other
>> problem is the ambiguity regarding \\u0000 and \u0000.
>
> It's the same problem really, and until we have an answer about
> what to do with \u0000, I think any patch is half-baked and possibly
> counterproductive.
>
> In particular, I would like to suggest that the current representation of
> \u0000 is fundamentally broken and that we have to change it, not try to
> band-aid around it.  This will mean an on-disk incompatibility for jsonb
> data containing U+0000, but hopefully there is very little of that out
> there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
> consider such solutions.
>
> The most obvious way to store such data unambiguously is to just go ahead
> and store U+0000 as a NUL byte (\000).  The only problem with that is that
> then such a string cannot be considered to be a valid value of type TEXT,
> which would mean that we'd need to throw an error if we were asked to
> convert a JSON field containing such a character to text.

Hm, does this include text out operations for display purposes?   If
so, then any query selecting jsonb objects with null bytes would fail.
How come we have to error out?  How about a warning indicating the
string was truncated?

merlin



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> On Tue, Jan 27, 2015 at 12:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In particular, I would like to suggest that the current representation of
>> \u0000 is fundamentally broken and that we have to change it, not try to
>> band-aid around it.  This will mean an on-disk incompatibility for jsonb
>> data containing U+0000, but hopefully there is very little of that out
>> there yet.  If we can get a fix into 9.4.1, I think it's reasonable to
>> consider such solutions.
>> 
>> The most obvious way to store such data unambiguously is to just go ahead
>> and store U+0000 as a NUL byte (\000).  The only problem with that is that
>> then such a string cannot be considered to be a valid value of type TEXT,
>> which would mean that we'd need to throw an error if we were asked to
>> convert a JSON field containing such a character to text.

> Hm, does this include text out operations for display purposes?   If
> so, then any query selecting jsonb objects with null bytes would fail.
> How come we have to error out?  How about a warning indicating the
> string was truncated?

No, that's not a problem, because jsonb_out would produce an escaped
textual representation, so any null would come out as \u0000 again.
The trouble comes up when you do something that's supposed to produce
a *non escaped* text equivalent of a JSON string value, such as
the ->> operator.

Arguably, ->> is broken already with the current coding, in that
these results are entirely inconsistent:

regression=# select '{"a":"foo\u0040bar"}'::jsonb ->> 'a';?column? 
----------foo@bar
(1 row)

regression=# select '{"a":"foo\u0000bar"}'::jsonb ->> 'a';  ?column?   
--------------foo\u0000bar
(1 row)

regression=# select '{"a":"foo\\u0000bar"}'::jsonb ->> 'a';  ?column?   
--------------foo\u0000bar
(1 row)
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Noah Misch
Date:
On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > On 01/27/2015 02:28 PM, Tom Lane wrote:
> >> Well, we can either fix it now or suffer with a broken representation
> >> forever.  I'm not wedded to the exact solution I described, but I think
> >> we'll regret it if we don't change the representation.

> So at this point I propose that we reject \u0000 when de-escaping JSON.

I would have agreed on 2014-12-09, and this release is the last chance to make
such a change.  It is a bold wager that could pay off, but -1 from me anyway.
I can already envision the blog post from the DBA staying on 9.4.0 because
9.4.1 pulled his ability to store U+0000 in jsonb.  jsonb was *the* top-billed
9.4 feature, and this thread started with Andrew conveying a field report of a
scenario more obscure than storing U+0000.  Therefore, we have to assume many
users will notice the change.  This move would also add to the growing
evidence that our .0 releases are really beta(N+1) releases in disguise.

> Anybody who's seriously unhappy with that can propose a patch to fix it
> properly in 9.5 or later.

Someone can still do that by introducing a V2 of the jsonb binary format and
preserving the ability to read both formats.  (Too bad Andres's proposal to
include a format version didn't inform the final format, but we can wing it.)
I agree that storing U+0000 as 0x00 is the best end state.

> We probably need to rethink the re-escaping behavior as well; I'm not
> sure if your latest patch is the right answer for that.

Yes, we do.  No change to the representation of U+0000 is going to fix the
following bug, but that patch does fix it:

[local] test=# select
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb,
test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb::text::jsonb;
 ?column? | ?column? 
----------+----------
 t        | f
(1 row)



Re: jsonb, unicode escapes and escaped backslashes

From
Andrew Dunstan
Date:
On 01/28/2015 12:50 AM, Noah Misch wrote:
> On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> On 01/27/2015 02:28 PM, Tom Lane wrote:
>>>> Well, we can either fix it now or suffer with a broken representation
>>>> forever.  I'm not wedded to the exact solution I described, but I think
>>>> we'll regret it if we don't change the representation.
>> So at this point I propose that we reject \u0000 when de-escaping JSON.
> I would have agreed on 2014-12-09, and this release is the last chance to make
> such a change.  It is a bold wager that could pay off, but -1 from me anyway.
> I can already envision the blog post from the DBA staying on 9.4.0 because
> 9.4.1 pulled his ability to store U+0000 in jsonb.  jsonb was *the* top-billed
> 9.4 feature, and this thread started with Andrew conveying a field report of a
> scenario more obscure than storing U+0000.  Therefore, we have to assume many
> users will notice the change.  This move would also add to the growing
> evidence that our .0 releases are really beta(N+1) releases in disguise.
>
>> Anybody who's seriously unhappy with that can propose a patch to fix it
>> properly in 9.5 or later.
> Someone can still do that by introducing a V2 of the jsonb binary format and
> preserving the ability to read both formats.  (Too bad Andres's proposal to
> include a format version didn't inform the final format, but we can wing it.)
> I agree that storing U+0000 as 0x00 is the best end state.
>
>


We need to make up our minds about this pretty quickly. The more radical 
move is likely to involve quite a bit of work, ISTM.

It's not clear to me how we should represent a unicode null. i.e. given 
a json of '["foo\u0000bar"]', I get that we'd store the element as 
'foo\x00bar', but what is the result of
   (jsonb '["foo\u0000bar"')->>0

It's defined to be text so we can't just shove a binary null in the 
middle of it. Do we throw an error?

And I still want to hear more voices on the whole direction we want to 
take this.

cheers

andrew



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
>> So at this point I propose that we reject \u0000 when de-escaping JSON.

> I would have agreed on 2014-12-09, and this release is the last chance to make
> such a change.  It is a bold wager that could pay off, but -1 from me anyway.

You only get to vote -1 if you have a credible alternative.  I don't see
one.

> I can already envision the blog post from the DBA staying on 9.4.0 because
> 9.4.1 pulled his ability to store U+0000 in jsonb.

Those will be more or less the same people who bitch about text not
storing NULs; the world has not fallen.

> jsonb was *the* top-billed
> 9.4 feature, and this thread started with Andrew conveying a field report of a
> scenario more obscure than storing U+0000.

There is a separate issue, which is that our earlier attempt to make the
world safe for \u0000 actually broke other things.  We still need to fix
that, but I think the fix probably consists of reverting that patch and
instead disallowing \u0000.

>> Anybody who's seriously unhappy with that can propose a patch to fix it
>> properly in 9.5 or later.

> Someone can still do that by introducing a V2 of the jsonb binary format and
> preserving the ability to read both formats.  (Too bad Andres's proposal to
> include a format version didn't inform the final format, but we can wing it.)
> I agree that storing U+0000 as 0x00 is the best end state.

We will not need a v2 format, merely values that contain NULs.  Existing
data containing \u0000 will be read as those six ASCII characters, which
is not really wrong since that might well be what it was anyway.

>> We probably need to rethink the re-escaping behavior as well; I'm not
>> sure if your latest patch is the right answer for that.

> Yes, we do.  No change to the representation of U+0000 is going to fix the
> following bug, but that patch does fix it:

> [local] test=# select
> test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb,
> test-# $$"\\u05e2"$$::jsonb = $$"\\u05e2"$$::jsonb::text::jsonb;

The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
which I'm inclined to think we need to simply revert, not render even
more squirrely.
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> It's not clear to me how we should represent a unicode null. i.e. given 
> a json of '["foo\u0000bar"]', I get that we'd store the element as 
> 'foo\x00bar', but what is the result of

>     (jsonb '["foo\u0000bar"')->>0

> It's defined to be text so we can't just shove a binary null in the 
> middle of it. Do we throw an error?

Yes, that is what I was proposing upthread.  Obviously, this needs some
thought to ensure that there's *something* useful you can do with a field
containing a nul, but we'd have little choice but to throw an error if
the user asks us to convert such a field to unescaped text.

I'd be a bit inclined to reject nuls in object field names even if we
allow them in field values, since just about everything you can usefully
do with a field name involves regarding it as text.

Another interesting implementation problem is what does indexing do with
such values --- ISTR there's an implicit conversion to C strings in there
too, at least in GIN indexes.

Anyway, there is a significant amount of work involved here, and there's
no way we're getting it done for 9.4.1, or probably 9.4.anything.  I think
our only realistic choice right now is to throw error for \u0000 so that
we can preserve our options for doing something useful with it later.
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
David G Johnston
Date:
Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 01/27/2015 02:28 PM, Tom Lane wrote:
>>> Well, we can either fix it now or suffer with a broken representation
>>> forever.  I'm not wedded to the exact solution I described, but I think
>>> we'll regret it if we don't change the representation.
>>> 
>>> The only other plausible answer seems to be to flat out reject \u0000.
>>> But I assume nobody likes that.
> 
>> I don't think we can be in the business of rejecting valid JSON.
> 
> Actually, after studying the code a bit, I wonder if we wouldn't be best
> off to do exactly that, at least for 9.4.x.  At minimum we're talking
> about an API change for JsonSemAction functions (which currently get the
> already-de-escaped string as a C string; not gonna work for embedded
> nulls).  I'm not sure if there are already third-party extensions using
> that API, but it seems possible, in which case changing it in a minor
> release wouldn't be nice.  Even ignoring that risk, making sure
> we'd fixed everything seems like more than a day's work, which is as
> much as I for one could spare before 9.4.1.
> 
> So at this point I propose that we reject \u0000 when de-escaping JSON.
> Anybody who's seriously unhappy with that can propose a patch to fix it
> properly in 9.5 or later.

The hybrid option is to reject the values for 9.4.1 but then commit to
removing that hack and fixing this properly in 9.4.2; we can always call
that release 9.5...


> I think the "it would mean rejecting valid JSON" argument is utter
> hogwash.  We already reject, eg, "\u00A0" if you're not using a UTF8
> encoding.  And we reject "1e10000", not because that's invalid JSON
> but because of an implementation restriction of our underlying numeric
> type.  I don't see any moral superiority of that over rejecting "\u0000"
> because of an implementation restriction of our underlying text type.

Am I missing something or has there been no consideration in this "forbid"
plan on whether users will be able to retrieve, even if partially
incorrectly, any jsonb data that has already been stored?  If we mangled
their data on input we should at least return the data and provide them a
chance to manually (or automatically depending on their data) fix our
mistake.

Given we already disallow NUL in text ISTM that allowing said data in other
text-like areas is asking for just the kind of trouble we are seeing here. 
I'm OK with the proposition that those wishing to utilize NUL are relegated
to working with bytea.

From the commit Tom references down-thread:

"However, this led to some perverse results in the case of Unicode
sequences."

Given that said results are not documented in the commit its hard to judge
whether a complete revert is being overly broad...

Anyway, just some observations since I'm not currently a user of JSON. 
Tom's arguments and counter-arguments ring true to me in the general sense. 
The DBA staying on 9.4.0 because of this change probably just needs to be
told to go back to using "json" and then run the update.  Their data has
issues even if they stay on 9.4.0 with the more accepting version of jsonb.

David J.







Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
David G Johnston <david.g.johnston@gmail.com> writes:
> Am I missing something or has there been no consideration in this "forbid"
> plan on whether users will be able to retrieve, even if partially
> incorrectly, any jsonb data that has already been stored?

Data that's already been stored would look like the six characters \u0000,
which indeed might have been what it was anyway, since part of the
problem here is that we fail to distinguish "\\u0000" from "\u0000".
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Jim Nasby
Date:
On 1/28/15 11:36 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> It's not clear to me how we should represent a unicode null. i.e. given
>> a json of '["foo\u0000bar"]', I get that we'd store the element as
>> 'foo\x00bar', but what is the result of
>
>>      (jsonb '["foo\u0000bar"')->>0
>
>> It's defined to be text so we can't just shove a binary null in the
>> middle of it. Do we throw an error?
>
> Yes, that is what I was proposing upthread.  Obviously, this needs some
> thought to ensure that there's *something* useful you can do with a field
> containing a nul, but we'd have little choice but to throw an error if
> the user asks us to convert such a field to unescaped text.
>
> I'd be a bit inclined to reject nuls in object field names even if we
> allow them in field values, since just about everything you can usefully
> do with a field name involves regarding it as text.
>
> Another interesting implementation problem is what does indexing do with
> such values --- ISTR there's an implicit conversion to C strings in there
> too, at least in GIN indexes.
>
> Anyway, there is a significant amount of work involved here, and there's
> no way we're getting it done for 9.4.1, or probably 9.4.anything.  I think
> our only realistic choice right now is to throw error for \u0000 so that
> we can preserve our options for doing something useful with it later.

My $0.01:

While I sympathize with Noah's sentiments, the only thing that makes sense 
to me is that a JSON text field is treated the same way as we treat text. 
Right now, that means NUL is not allowed, period.

If no one has bitched about this with text, is it really that big a problem with JSON?
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
> While I sympathize with Noah's sentiments, the only thing that makes sense
> to me is that a JSON text field is treated the same way as we treat text.
> Right now, that means NUL is not allowed, period.

> If no one has bitched about this with text, is it really that big a problem with JSON?

Oh, people have bitched about it all right ;-).  But I think your real
question is how many applications find that restriction to be fatal.
The answer clearly is "not a lot" for text, and I don't see why the
answer would be different for JSON.
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Jeff Janes
Date:
On Wed, Jan 28, 2015 at 1:13 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

> My $0.01:
>
> While I sympathize with Noah's sentiments, the only thing that makes sense
> to me is that a JSON text field is treated the same way as we treat text.
> Right now, that means NUL is not allowed, period.
>
> If no one has bitched about this with text, is it really that big a
> problem with JSON?

I would bitch about it for text if I thought that it would do any good.
 

Re: jsonb, unicode escapes and escaped backslashes

From
Bruce Momjian
Date:
On Wed, Jan 28, 2015 at 03:13:47PM -0600, Jim Nasby wrote:
> While I sympathize with Noah's sentiments, the only thing that makes
> sense to me is that a JSON text field is treated the same way as we
> treat text. Right now, that means NUL is not allowed, period.

+1

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: jsonb, unicode escapes and escaped backslashes

From
Noah Misch
Date:
On Wed, Jan 28, 2015 at 12:48:45PM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Tue, Jan 27, 2015 at 03:56:22PM -0500, Tom Lane wrote:
> >> So at this point I propose that we reject \u0000 when de-escaping JSON.
> 
> > I would have agreed on 2014-12-09, and this release is the last chance to make
> > such a change.  It is a bold wager that could pay off, but -1 from me anyway.
> 
> You only get to vote -1 if you have a credible alternative.  I don't see
> one.

I don't love json enough to keep participating in a thread where you dismiss
patches and comments from Andrew and myself as "quite useless", "utter
hogwash" and non-"credible".

nm



Re: jsonb, unicode escapes and escaped backslashes

From
Robert Haas
Date:
On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
> which I'm inclined to think we need to simply revert, not render even
> more squirrely.

Yes, that commit looks broken.  If you convert from text to JSON you
should get a JSON string equivalent to the text you started out with.
That commit departs from that in favor of allowing \uXXXX sequences in
the text being converted to turn into a single character (or perhaps
an encoding error) after a text->json->text roundtrip.  Maybe I
haven't had enough caffeine today, but I can't see how that can
possibly be a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: jsonb, unicode escapes and escaped backslashes

From
Andrew Dunstan
Date:
On 01/29/2015 12:10 PM, Robert Haas wrote:
> On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
>> which I'm inclined to think we need to simply revert, not render even
>> more squirrely.
> Yes, that commit looks broken.  If you convert from text to JSON you
> should get a JSON string equivalent to the text you started out with.
> That commit departs from that in favor of allowing \uXXXX sequences in
> the text being converted to turn into a single character (or perhaps
> an encoding error) after a text->json->text roundtrip.  Maybe I
> haven't had enough caffeine today, but I can't see how that can
> possibly be a good idea.
>


Possibly.

I'm coming down more and more on the side of Tom's suggestion just to 
ban \u0000 in jsonb. I think that would let us have some fairly simple 
and consistent rules. I'm not too worried that we'll be disallowing 
input that we've previously allowed. We've done that often in the past, 
although less often in point releases. I certainly don't want to wait a 
full release cycle to fix this if possible.

cheers

andrew





Re: jsonb, unicode escapes and escaped backslashes

From
Andres Freund
Date:
On 2015-01-29 16:33:36 -0500, Andrew Dunstan wrote:
> 
> On 01/29/2015 12:10 PM, Robert Haas wrote:
> >On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
> >>which I'm inclined to think we need to simply revert, not render even
> >>more squirrely.
> >Yes, that commit looks broken.  If you convert from text to JSON you
> >should get a JSON string equivalent to the text you started out with.
> >That commit departs from that in favor of allowing \uXXXX sequences in
> >the text being converted to turn into a single character (or perhaps
> >an encoding error) after a text->json->text roundtrip.  Maybe I
> >haven't had enough caffeine today, but I can't see how that can
> >possibly be a good idea.

> I'm coming down more and more on the side of Tom's suggestion just to ban
> \u0000 in jsonb. I think that would let us have some fairly simple and
> consistent rules. I'm not too worried that we'll be disallowing input that
> we've previously allowed. We've done that often in the past, although less
> often in point releases. I certainly don't want to wait a full release cycle
> to fix this if possible.

I think waiting a full cycle would be likely to make things much worse,
given 9.4's current age.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: jsonb, unicode escapes and escaped backslashes

From
Robert Haas
Date:
On Thu, Jan 29, 2015 at 4:33 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> On 01/29/2015 12:10 PM, Robert Haas wrote:
>> On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
>>> which I'm inclined to think we need to simply revert, not render even
>>> more squirrely.
>> Yes, that commit looks broken.  If you convert from text to JSON you
>> should get a JSON string equivalent to the text you started out with.
>> That commit departs from that in favor of allowing \uXXXX sequences in
>> the text being converted to turn into a single character (or perhaps
>> an encoding error) after a text->json->text roundtrip.  Maybe I
>> haven't had enough caffeine today, but I can't see how that can
>> possibly be a good idea.
>
> Possibly.
>
> I'm coming down more and more on the side of Tom's suggestion just to ban
> \u0000 in jsonb. I think that would let us have some fairly simple and
> consistent rules. I'm not too worried that we'll be disallowing input that
> we've previously allowed. We've done that often in the past, although less
> often in point releases. I certainly don't want to wait a full release cycle
> to fix this if possible.

I have yet to understand what we fix by banning \u0000.  How is 0000
different from any other four-digit hexadecimal number that's not a
valid character in the current encoding?  What does banning that one
particular value do?

In any case, whatever we do about that issue, the idea that the text
-> json string transformation can *change the input string into some
other string* seems like an independent problem.
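
An illustrative session against 9.4.0, if I'm reading that commit right:

regression=# select to_json($$\u05e2$$::text);
  to_json  
-----------
 "\u05e2"
(1 row)

regression=# select to_json(array[$$\u05e2$$::text]) ->> 0;
 ?column? 
----------
 ע
(1 row)

The six characters that went in come back out as one Hebrew letter.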

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 29, 2015 at 4:33 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> I'm coming down more and more on the side of Tom's suggestion just to ban
>> \u0000 in jsonb.

> I have yet to understand what we fix by banning \u0000.  How is 0000
> different from any other four-digit hexadecimal number that's not a
> valid character in the current encoding?  What does banning that one
> particular value do?

As Andrew pointed out upthread, it avoids having to answer the question of
what to return for

select (jsonb '["foo\u0000bar"]')->>0;

or any other construct which is supposed to return an *unescaped* text
representation of some JSON string value.

Right now you get
   ?column?   
--------------
 foo\u0000bar
(1 row)

Which is wrong IMO, first because it violates the premise that the output
should be unescaped, and second because this output cannot be
distinguished from the (correct) output of

regression=# select (jsonb '["foo\\u0000bar"]')->>0;
   ?column?   
--------------
 foo\u0000bar
(1 row)

There is no way to deliver an output that is not confusable with some
other value's correct output, other than by emitting a genuine \0 byte
which unfortunately we cannot support in a TEXT result.

Potential solutions for this have been mooted upthread, but none
of them look like they're something we can do in the very short run.
So the proposal is to ban \u0000 until such time as we can do something
sane with it.

> In any case, whatever we do about that issue, the idea that the text
> -> json string transformation can *change the input string into some
> other string* seems like an independent problem.

No, it's exactly the same problem, because the reason for that breakage
is an ill-advised attempt to make it safe to include \u0000 in JSONB.
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Andrew Dunstan
Date:
On 01/29/2015 04:59 PM, Robert Haas wrote:
> On Thu, Jan 29, 2015 at 4:33 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> On 01/29/2015 12:10 PM, Robert Haas wrote:
>>> On Wed, Jan 28, 2015 at 12:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> The cause of this bug is commit 0ad1a816320a2b539a51628e2a0b1e83ff096b1d,
>>>> which I'm inclined to think we need to simply revert, not render even
>>>> more squirrely.
>>> Yes, that commit looks broken.  If you convert from text to JSON you
>>> should get a JSON string equivalent to the text you started out with.
>>> That commit departs from that in favor of allowing \uXXXX sequences in
>>> the text being converted to turn into a single character (or perhaps
>>> an encoding error) after a text->json->text roundtrip.  Maybe I
>>> haven't had enough caffeine today, but I can't see how that can
>>> possibly be a good idea.
>> Possibly.
>>
>> I'm coming down more and more on the side of Tom's suggestion just to ban
>> \u0000 in jsonb. I think that would let us have some fairly simple and
>> consistent rules. I'm not too worried that we'll be disallowing input that
>> we've previously allowed. We've done that often in the past, although less
>> often in point releases. I certainly don't want to wait a full release cycle
>> to fix this if possible.
> I have yet to understand what we fix by banning \u0000.  How is 0000
> different from any other four-digit hexadecimal number that's not a
> valid character in the current encoding?  What does banning that one
> particular value do?
>
> In any case, whatever we do about that issue, the idea that the text
> -> json string transformation can *change the input string into some
> other string* seems like an independent problem.
>

jsonb stores string values as postgres text values, with the unicode 
escapes resolved, just as it also resolves numbers and booleans into 
their native representation.  If you want the input perfectly preserved, 
use json, not jsonb.  I think that's made pretty clear in the docs.

so text->jsonb->text is not and has never been expected to be a noop.
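
For example (illustrative session -- the escape and the E-notation number 
are both resolved on input):

andrew=# select jsonb '"\u0041"' as s, jsonb '1e2' as n;
  s  |  n  
-----+-----
 "A" | 100
(1 row)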

cheers

andrew



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I have yet to understand what we fix by banning \u0000.  How is 0000
> different from any other four-digit hexadecimal number that's not a
> valid character in the current encoding?  What does banning that one
> particular value do?

BTW, as to the point about encoding violations: we *already* ban \uXXXX
sequences that don't correspond to valid characters in the current
encoding.  The attempt to exclude U+0000 from the set of banned characters
was ill-advised, plain and simple.
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Andrew Dunstan
Date:
On 01/29/2015 05:39 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I have yet to understand what we fix by banning \u0000.  How is 0000
>> different from any other four-digit hexadecimal number that's not a
>> valid character in the current encoding?  What does banning that one
>> particular value do?
> BTW, as to the point about encoding violations: we *already* ban \uXXXX
> sequences that don't correspond to valid characters in the current
> encoding.  The attempt to exclude U+0000 from the set of banned characters
> was ill-advised, plain and simple.
>
>             

Actually, unless the encoding is utf8 we ban all non-ascii unicode 
escapes even if they might designate a valid character in the current 
encoding. This was arrived at after some discussion here. So adding 
\u0000 to the list of banned characters is arguably just making us more 
consistent.
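
E.g., in a LATIN1 database even an escape that LATIN1 could represent is 
rejected (message text approximate):

latin1=# select jsonb '"\u00e9"';
ERROR:  invalid input syntax for type json
DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.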


cheers

andrew



Re: jsonb, unicode escapes and escaped backslashes

From
Robert Haas
Date:
On Thu, Jan 29, 2015 at 5:30 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> jsonb stores string values as postgres text values, with the unicode escapes
> resolved, just as it also resolves numbers and booleans into their native
> representation.  If you want the input perfectly preserved, use json, not
> jsonb.  I think that's made pretty clear in the docs.
>
> so text->jsonb->text is not and has never been expected to be a noop.

If you can't store text in a jsonb object and get it back out again,
it doesn't seem like a very useful data type.

Where exactly do you think this is documented?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: jsonb, unicode escapes and escaped backslashes

From
Robert Haas
Date:
On Thu, Jan 29, 2015 at 5:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I have yet to understand what we fix by banning \u0000.  How is 0000
>> different from any other four-digit hexadecimal number that's not a
>> valid character in the current encoding?  What does banning that one
>> particular value do?
>
> BTW, as to the point about encoding violations: we *already* ban \uXXXX
> sequences that don't correspond to valid characters in the current
> encoding.  The attempt to exclude U+0000 from the set of banned characters
> was ill-advised, plain and simple.

Oh.  Well, that's hard to argue with, then.  I can't imagine why we'd
disallow all bytes invalid in the current encoding *except* for \0.
When I originally coded up the JSON data type, I intended for it to
store invalidly-encoded data that was nevertheless valid JSON without
trying to interpret it.  It seems we've drifted pretty far off of that
principle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: jsonb, unicode escapes and escaped backslashes

From
Stephen Frost
Date:
* Robert Haas (robertmhaas@gmail.com) wrote:
> On Thu, Jan 29, 2015 at 5:30 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> > jsonb stores string values as postgres text values, with the unicode escapes
> > resolved, just as it also resolves numbers and booleans into their native
> > representation.  If you want the input perfectly preserved, use json, not
> > jsonb.  I think that's made pretty clear in the docs.
> >
> > so text->jsonb->text is not and has never been expected to be a noop.
>
> If you can't store text in a jsonb object and get it back out again,
> it doesn't seem like a very useful data type.

Hrm.  That sentence strikes me as slightly confused.

You *can* store text in a jsonb object and get it back out again,
however, that's not the same as taking a JSON structure which is
represented as text, converting it to jsonb, and then converting it back
into a text representation.

> Where exactly do you think this is documented?

http://www.postgresql.org/docs/9.4/static/datatype-json.html

"Because the json type stores an exact copy of the input text, it will
preserve semantically-insignificant white space between tokens, as well
as the order of keys within JSON objects. Also, if a JSON object within
the value contains the same key more than once, all the key/value pairs
are kept. (The processing functions consider the last value as the
operative one.) By contrast, jsonb does not preserve white space, does
not preserve the order of object keys, and does not keep duplicate
object keys. If duplicate keys are specified in the input, only the last
value is kept."

There is further discussion about how the JSON primitive types are
mapped on to native PostgreSQL types, which both constrains what values
are valid and may impact the formatting of values, eg:

"One semantically-insignificant detail worth noting is that in jsonb,
numbers will be printed according to the behavior of the underlying
numeric type. In practice this means that numbers entered with E
notation will be printed without it, for example:"
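
An illustrative case of that:

=# select '10e2'::json, '10e2'::jsonb;
 json | jsonb 
------+-------
 10e2 | 1000
(1 row)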

That, overall, isn't really the issue at hand though, from what I
gather.  The problem is that strings can be represented in JSON which
can't be represented in our 'text' datatype and therefore in our jsonb
type (since jsonb converts it to a 'text' datatype underneath).  There
were attempts to make it work anyway, but that ended up causing other
issues.  In the end, I tend to agree with Tom on this one and, while it
really stinks to break things for folks who have just started using 9.4,
we're at least early enough on that we hopefully won't break it for
*too* many people.
Thanks,
    Stephen

Re: jsonb, unicode escapes and escaped backslashes

From
Peter Geoghegan
Date:
On Wed, Jan 28, 2015 at 9:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Anyway, there is a significant amount of work involved here, and there's
> no way we're getting it done for 9.4.1, or probably 9.4.anything.  I think
> our only realistic choice right now is to throw error for \u0000 so that
> we can preserve our options for doing something useful with it later.

+1

I looked into it, and it turns out that MongoDB does not accept NUL in
at least some contexts (for object keys). Apparently it wasn't always
so. MongoDB previously had a security issue that was fixed by
introducing this restriction. Their JSON-centric equivalent of
per-column privileges was for a time compromised, because "NUL
injection" was possible:

https://www.idontplaydarts.com/2011/02/mongodb-null-byte-injection-attacks/

It's easy to bash MongoDB, but this is still an interesting data
point. They changed this after the fact, and yet I can find no
evidence of any grumbling about it from end users. No one really
noticed.

I agree that the restriction on NUL is consistent with the design of
JSONB. Disallowing all bytes invalid in the current encoding with the
sole exception of NUL was a mistake, IMV.
-- 
Peter Geoghegan



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> I looked into it, and it turns out that MongoDB does not accept NUL in
> at least some contexts (for object keys). Apparently it wasn't always
> so. MongoDB previously had a security issue that was fixed by
> introducing this restriction. Their JSON-centric equivalent of
> per-column privileges was for a time compromised, because "NUL
> injection" was possible:

> https://www.idontplaydarts.com/2011/02/mongodb-null-byte-injection-attacks/

> It's easy to bash MongoDB, but this is still an interesting data
> point. They changed this after the fact, and yet I can find no
> evidence of any grumbling about it from end users. No one really
> noticed.

Hoo, that's interesting.  Lends some support to my half-baked idea that
we might disallow NUL in object keys even if we are able to allow it
elsewhere in JSON strings.
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Attached is a draft patch for this.  It basically reverts commit
0ad1a816320a2b539a51628e2a0b1e83ff096b1d, adds a ban of \u0000 if
that would need to be converted to text (so it still works in the
plain json type, so long as you don't do much processing), and adds
some regression tests.

I made the \u0000 error be errcode(ERRCODE_INVALID_TEXT_REPRESENTATION)
and errmsg("invalid input syntax for type json"), by analogy to what's
thrown for non-ASCII Unicode escapes in non-UTF8 encoding.  I'm not
terribly happy with that, though.  ISTM that for both cases, this is
not "invalid syntax" at all, but an implementation restriction that
forces us to reject perfectly valid syntax.  So I think we ought to
use a different ERRCODE and text message, though I'm not entirely
sure what it should be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is
one possibility.

            regards, tom lane

diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 8feb2fb..b4b97a7 100644
*** a/doc/src/sgml/json.sgml
--- b/doc/src/sgml/json.sgml
***************
*** 69,80 ****
    regardless of the database encoding, and are checked only for syntactic
    correctness (that is, that four hex digits follow <literal>\u</>).
    However, the input function for <type>jsonb</> is stricter: it disallows
!   Unicode escapes for non-ASCII characters (those
!   above <literal>U+007F</>) unless the database encoding is UTF8.  It also
!   insists that any use of Unicode surrogate pairs to designate characters
!   outside the Unicode Basic Multilingual Plane be correct.  Valid Unicode
!   escapes, except for <literal>\u0000</>, are then converted to the
!   equivalent ASCII or UTF8 character for storage.
   </para>

   <note>
--- 69,82 ----
    regardless of the database encoding, and are checked only for syntactic
    correctness (that is, that four hex digits follow <literal>\u</>).
    However, the input function for <type>jsonb</> is stricter: it disallows
!   Unicode escapes for non-ASCII characters (those above <literal>U+007F</>)
!   unless the database encoding is UTF8.  The <type>jsonb</> type also
!   rejects <literal>\u0000</> (because that cannot be represented in
!   <productname>PostgreSQL</productname>'s <type>text</> type), and it insists
!   that any use of Unicode surrogate pairs to designate characters outside
!   the Unicode Basic Multilingual Plane be correct.  Valid Unicode escapes
!   are converted to the equivalent ASCII or UTF8 character for storage;
!   this includes folding surrogate pairs into a single character.
   </para>

   <note>
***************
*** 134,140 ****
         <row>
          <entry><type>string</></entry>
          <entry><type>text</></entry>
!         <entry>See notes above concerning encoding restrictions</entry>
         </row>
         <row>
          <entry><type>number</></entry>
--- 136,143 ----
         <row>
          <entry><type>string</></entry>
          <entry><type>text</></entry>
!         <entry><literal>\u0000</> is disallowed, as are non-ASCII Unicode
!          escapes if database encoding is not UTF8</entry>
         </row>
         <row>
          <entry><type>number</></entry>
diff --git a/doc/src/sgml/release-9.4.sgml b/doc/src/sgml/release-9.4.sgml
index 961e461..11bbf3b 100644
*** a/doc/src/sgml/release-9.4.sgml
--- b/doc/src/sgml/release-9.4.sgml
***************
*** 103,124 ****

      <listitem>
       <para>
-       Unicode escapes in <link linkend="datatype-json"><type>JSON</type></link>
-       text values are no longer rendered with the backslash escaped
-       (Andrew Dunstan)
-      </para>
-
-      <para>
-       Previously, all backslashes in text values being formed into JSON
-       were escaped. Now a backslash followed by <literal>u</> and four
-       hexadecimal digits is not escaped, as this is a legal sequence in a
-       JSON string value, and escaping the backslash led to some perverse
-       results.
-      </para>
-     </listitem>
-
-     <listitem>
-      <para>
        When converting values of type <type>date</>, <type>timestamp</>
        or <type>timestamptz</>
        to <link linkend="datatype-json"><type>JSON</type></link>, render the
--- 103,108 ----
diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 3c137ea..4e46b0a 100644
*** a/src/backend/utils/adt/json.c
--- b/src/backend/utils/adt/json.c
*************** json_lex_string(JsonLexContext *lex)
*** 806,819 ****
                       * For UTF8, replace the escape sequence by the actual
                       * utf8 character in lex->strval. Do this also for other
                       * encodings if the escape designates an ASCII character,
!                      * otherwise raise an error. We don't ever unescape a
!                      * \u0000, since that would result in an impermissible nul
!                      * byte.
                       */

                      if (ch == 0)
                      {
!                         appendStringInfoString(lex->strval, "\\u0000");
                      }
                      else if (GetDatabaseEncoding() == PG_UTF8)
                      {
--- 806,822 ----
                       * For UTF8, replace the escape sequence by the actual
                       * utf8 character in lex->strval. Do this also for other
                       * encodings if the escape designates an ASCII character,
!                      * otherwise raise an error.
                       */

                      if (ch == 0)
                      {
!                         /* We can't allow this, since our TEXT type doesn't */
!                         ereport(ERROR,
!                                 (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
!                                  errmsg("invalid input syntax for type json"),
!                                  errdetail("\\u0000 cannot be converted to text."),
!                                  report_json_context(lex)));
                      }
                      else if (GetDatabaseEncoding() == PG_UTF8)
                      {
*************** escape_json(StringInfo buf, const char *
*** 2382,2411 ****
                  appendStringInfoString(buf, "\\\"");
                  break;
              case '\\':
!
!                 /*
!                  * Unicode escapes are passed through as is. There is no
!                  * requirement that they denote a valid character in the
!                  * server encoding - indeed that is a big part of their
!                  * usefulness.
!                  *
!                  * All we require is that they consist of \uXXXX where the Xs
!                  * are hexadecimal digits. It is the responsibility of the
!                  * caller of, say, to_json() to make sure that the unicode
!                  * escape is valid.
!                  *
!                  * In the case of a jsonb string value being escaped, the only
!                  * unicode escape that should be present is \u0000, all the
!                  * other unicode escapes will have been resolved.
!                  */
!                 if (p[1] == 'u' &&
!                     isxdigit((unsigned char) p[2]) &&
!                     isxdigit((unsigned char) p[3]) &&
!                     isxdigit((unsigned char) p[4]) &&
!                     isxdigit((unsigned char) p[5]))
!                     appendStringInfoCharMacro(buf, *p);
!                 else
!                     appendStringInfoString(buf, "\\\\");
                  break;
              default:
                  if ((unsigned char) *p < ' ')
--- 2385,2391 ----
                  appendStringInfoString(buf, "\\\"");
                  break;
              case '\\':
!                 appendStringInfoString(buf, "\\\\");
                  break;
              default:
                  if ((unsigned char) *p < ' ')
diff --git a/src/test/regress/expected/json.out b/src/test/regress/expected/json.out
index e435d3e..63d6cf6 100644
*** a/src/test/regress/expected/json.out
--- b/src/test/regress/expected/json.out
*************** select to_json(timestamptz '2014-05-28 1
*** 426,445 ****
  (1 row)

  COMMIT;
- -- unicode escape - backslash is not escaped
- select to_json(text '\uabcd');
-  to_json
- ----------
-  "\uabcd"
- (1 row)
-
- -- any other backslash is escaped
- select to_json(text '\abcd');
-  to_json
- ----------
-  "\\abcd"
- (1 row)
-
  --json_agg
  SELECT json_agg(q)
    FROM ( SELECT $$a$$ || x AS b, y AS c,
--- 426,431 ----
*************** ERROR:  invalid input syntax for type js
*** 1400,1405 ****
--- 1386,1421 ----
  DETAIL:  Unicode low surrogate must follow a high surrogate.
  CONTEXT:  JSON data, line 1: { "a":...
  --handling of simple unicode escapes
+ select json '{ "a":  "the Copyright \u00a9 sign" }' as correct_in_utf8;
+             correct_in_utf8
+ ---------------------------------------
+  { "a":  "the Copyright \u00a9 sign" }
+ (1 row)
+
+ select json '{ "a":  "dollar \u0024 character" }' as correct_everywhere;
+          correct_everywhere
+ -------------------------------------
+  { "a":  "dollar \u0024 character" }
+ (1 row)
+
+ select json '{ "a":  "dollar \\u0024 character" }' as not_an_escape;
+             not_an_escape
+ --------------------------------------
+  { "a":  "dollar \\u0024 character" }
+ (1 row)
+
+ select json '{ "a":  "null \u0000 escape" }' as not_unescaped;
+          not_unescaped
+ --------------------------------
+  { "a":  "null \u0000 escape" }
+ (1 row)
+
+ select json '{ "a":  "null \\u0000 escape" }' as not_an_escape;
+           not_an_escape
+ ---------------------------------
+  { "a":  "null \\u0000 escape" }
+ (1 row)
+
  select json '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
     correct_in_utf8
  ----------------------
*************** select json '{ "a":  "dollar \u0024 char
*** 1412,1419 ****
   dollar $ character
  (1 row)

! select json '{ "a":  "null \u0000 escape" }' ->> 'a' as not_unescaped;
!    not_unescaped
  --------------------
   null \u0000 escape
  (1 row)
--- 1428,1445 ----
   dollar $ character
  (1 row)

! select json '{ "a":  "dollar \\u0024 character" }' ->> 'a' as not_an_escape;
!       not_an_escape
! -------------------------
!  dollar \u0024 character
! (1 row)
!
! select json '{ "a":  "null \u0000 escape" }' ->> 'a' as fails;
! ERROR:  invalid input syntax for type json
! DETAIL:  \u0000 cannot be converted to text.
! CONTEXT:  JSON data, line 1: { "a":...
! select json '{ "a":  "null \\u0000 escape" }' ->> 'a' as not_an_escape;
!    not_an_escape
  --------------------
   null \u0000 escape
  (1 row)
diff --git a/src/test/regress/expected/json_1.out b/src/test/regress/expected/json_1.out
index 106b481..550a14a 100644
*** a/src/test/regress/expected/json_1.out
--- b/src/test/regress/expected/json_1.out
*************** select to_json(timestamptz '2014-05-28 1
*** 426,445 ****
  (1 row)

  COMMIT;
- -- unicode escape - backslash is not escaped
- select to_json(text '\uabcd');
-  to_json
- ----------
-  "\uabcd"
- (1 row)
-
- -- any other backslash is escaped
- select to_json(text '\abcd');
-  to_json
- ----------
-  "\\abcd"
- (1 row)
-
  --json_agg
  SELECT json_agg(q)
    FROM ( SELECT $$a$$ || x AS b, y AS c,
--- 426,431 ----
*************** ERROR:  invalid input syntax for type js
*** 1398,1403 ****
--- 1384,1419 ----
  DETAIL:  Unicode low surrogate must follow a high surrogate.
  CONTEXT:  JSON data, line 1: { "a":...
  --handling of simple unicode escapes
+ select json '{ "a":  "the Copyright \u00a9 sign" }' as correct_in_utf8;
+             correct_in_utf8
+ ---------------------------------------
+  { "a":  "the Copyright \u00a9 sign" }
+ (1 row)
+
+ select json '{ "a":  "dollar \u0024 character" }' as correct_everywhere;
+          correct_everywhere
+ -------------------------------------
+  { "a":  "dollar \u0024 character" }
+ (1 row)
+
+ select json '{ "a":  "dollar \\u0024 character" }' as not_an_escape;
+             not_an_escape
+ --------------------------------------
+  { "a":  "dollar \\u0024 character" }
+ (1 row)
+
+ select json '{ "a":  "null \u0000 escape" }' as not_unescaped;
+          not_unescaped
+ --------------------------------
+  { "a":  "null \u0000 escape" }
+ (1 row)
+
+ select json '{ "a":  "null \\u0000 escape" }' as not_an_escape;
+           not_an_escape
+ ---------------------------------
+  { "a":  "null \\u0000 escape" }
+ (1 row)
+
  select json '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
  ERROR:  invalid input syntax for type json
  DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
*************** select json '{ "a":  "dollar \u0024 char
*** 1408,1415 ****
   dollar $ character
  (1 row)

! select json '{ "a":  "null \u0000 escape" }' ->> 'a' as not_unescaped;
!    not_unescaped
  --------------------
   null \u0000 escape
  (1 row)
--- 1424,1441 ----
   dollar $ character
  (1 row)

! select json '{ "a":  "dollar \\u0024 character" }' ->> 'a' as not_an_escape;
!       not_an_escape
! -------------------------
!  dollar \u0024 character
! (1 row)
!
! select json '{ "a":  "null \u0000 escape" }' ->> 'a' as fails;
! ERROR:  invalid input syntax for type json
! DETAIL:  \u0000 cannot be converted to text.
! CONTEXT:  JSON data, line 1: { "a":...
! select json '{ "a":  "null \\u0000 escape" }' ->> 'a' as not_an_escape;
!    not_an_escape
  --------------------
   null \u0000 escape
  (1 row)
diff --git a/src/test/regress/expected/jsonb.out b/src/test/regress/expected/jsonb.out
index aa5686f..8d5da21 100644
*** a/src/test/regress/expected/jsonb.out
--- b/src/test/regress/expected/jsonb.out
*************** LINE 1: SELECT '"\u000g"'::jsonb;
*** 60,71 ****
                 ^
  DETAIL:  "\u" must be followed by four hexadecimal digits.
  CONTEXT:  JSON data, line 1: "\u000g...
! SELECT '"\u0000"'::jsonb;        -- OK, legal escape
!   jsonb
! ----------
!  "\u0000"
  (1 row)

  -- use octet_length here so we don't get an odd unicode char in the
  -- output
  SELECT octet_length('"\uaBcD"'::jsonb::text); -- OK, uppercase and lower case both OK
--- 60,77 ----
                 ^
  DETAIL:  "\u" must be followed by four hexadecimal digits.
  CONTEXT:  JSON data, line 1: "\u000g...
! SELECT '"\u0045"'::jsonb;        -- OK, legal escape
!  jsonb
! -------
!  "E"
  (1 row)

+ SELECT '"\u0000"'::jsonb;        -- ERROR, we don't support U+0000
+ ERROR:  invalid input syntax for type json
+ LINE 1: SELECT '"\u0000"'::jsonb;
+                ^
+ DETAIL:  \u0000 cannot be converted to text.
+ CONTEXT:  JSON data, line 1: ...
  -- use octet_length here so we don't get an odd unicode char in the
  -- output
  SELECT octet_length('"\uaBcD"'::jsonb::text); -- OK, uppercase and lower case both OK
*************** select to_jsonb(timestamptz '2014-05-28
*** 324,343 ****
  (1 row)

  COMMIT;
- -- unicode escape - backslash is not escaped
- select to_jsonb(text '\uabcd');
-  to_jsonb
- ----------
-  "\uabcd"
- (1 row)
-
- -- any other backslash is escaped
- select to_jsonb(text '\abcd');
-  to_jsonb
- ----------
-  "\\abcd"
- (1 row)
-
  --jsonb_agg
  CREATE TEMP TABLE rows AS
  SELECT x, 'txt' || x as y
--- 330,335 ----
*************** LINE 1: SELECT jsonb '{ "a":  "\ude04X"
*** 1971,1990 ****
  DETAIL:  Unicode low surrogate must follow a high surrogate.
  CONTEXT:  JSON data, line 1: { "a":...
  -- handling of simple unicode escapes
! SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a' AS correct_in_utf8;
     correct_in_utf8
  ----------------------
   the Copyright © sign
  (1 row)

! SELECT jsonb '{ "a":  "dollar \u0024 character" }' ->> 'a' AS correct_everyWHERE;
   correct_everywhere
  --------------------
   dollar $ character
  (1 row)

! SELECT jsonb '{ "a":  "null \u0000 escape" }' ->> 'a' AS not_unescaped;
!    not_unescaped
  --------------------
   null \u0000 escape
  (1 row)
--- 1963,2024 ----
  DETAIL:  Unicode low surrogate must follow a high surrogate.
  CONTEXT:  JSON data, line 1: { "a":...
  -- handling of simple unicode escapes
! SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' as correct_in_utf8;
!         correct_in_utf8
! -------------------------------
!  {"a": "the Copyright �� sign"}
! (1 row)
!
! SELECT jsonb '{ "a":  "dollar \u0024 character" }' as correct_everywhere;
!      correct_everywhere
! -----------------------------
!  {"a": "dollar $ character"}
! (1 row)
!
! SELECT jsonb '{ "a":  "dollar \\u0024 character" }' as not_an_escape;
!            not_an_escape
! -----------------------------------
!  {"a": "dollar \\u0024 character"}
! (1 row)
!
! SELECT jsonb '{ "a":  "null \u0000 escape" }' as fails;
! ERROR:  invalid input syntax for type json
! LINE 1: SELECT jsonb '{ "a":  "null \u0000 escape" }' as fails;
!                      ^
! DETAIL:  \u0000 cannot be converted to text.
! CONTEXT:  JSON data, line 1: { "a":...
! SELECT jsonb '{ "a":  "null \\u0000 escape" }' as not_an_escape;
!         not_an_escape
! ------------------------------
!  {"a": "null \\u0000 escape"}
! (1 row)
!
! SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
     correct_in_utf8
  ----------------------
   the Copyright © sign
  (1 row)

! SELECT jsonb '{ "a":  "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
   correct_everywhere
  --------------------
   dollar $ character
  (1 row)

! SELECT jsonb '{ "a":  "dollar \\u0024 character" }' ->> 'a' as not_an_escape;
!       not_an_escape
! -------------------------
!  dollar \u0024 character
! (1 row)
!
! SELECT jsonb '{ "a":  "null \u0000 escape" }' ->> 'a' as fails;
! ERROR:  invalid input syntax for type json
! LINE 1: SELECT jsonb '{ "a":  "null \u0000 escape" }' ->> 'a' as fai...
!                      ^
! DETAIL:  \u0000 cannot be converted to text.
! CONTEXT:  JSON data, line 1: { "a":...
! SELECT jsonb '{ "a":  "null \\u0000 escape" }' ->> 'a' as not_an_escape;
!    not_an_escape
  --------------------
   null \u0000 escape
  (1 row)
diff --git a/src/test/regress/expected/jsonb_1.out b/src/test/regress/expected/jsonb_1.out
index 687ae63..b7b0201 100644
*** a/src/test/regress/expected/jsonb_1.out
--- b/src/test/regress/expected/jsonb_1.out
*************** LINE 1: SELECT '"\u000g"'::jsonb;
*** 60,71 ****
                 ^
  DETAIL:  "\u" must be followed by four hexadecimal digits.
  CONTEXT:  JSON data, line 1: "\u000g...
! SELECT '"\u0000"'::jsonb;        -- OK, legal escape
!   jsonb
! ----------
!  "\u0000"
  (1 row)

  -- use octet_length here so we don't get an odd unicode char in the
  -- output
  SELECT octet_length('"\uaBcD"'::jsonb::text); -- OK, uppercase and lower case both OK
--- 60,77 ----
                 ^
  DETAIL:  "\u" must be followed by four hexadecimal digits.
  CONTEXT:  JSON data, line 1: "\u000g...
! SELECT '"\u0045"'::jsonb;        -- OK, legal escape
!  jsonb
! -------
!  "E"
  (1 row)

+ SELECT '"\u0000"'::jsonb;        -- ERROR, we don't support U+0000
+ ERROR:  invalid input syntax for type json
+ LINE 1: SELECT '"\u0000"'::jsonb;
+                ^
+ DETAIL:  \u0000 cannot be converted to text.
+ CONTEXT:  JSON data, line 1: ...
  -- use octet_length here so we don't get an odd unicode char in the
  -- output
  SELECT octet_length('"\uaBcD"'::jsonb::text); -- OK, uppercase and lower case both OK
*************** select to_jsonb(timestamptz '2014-05-28
*** 324,343 ****
  (1 row)

  COMMIT;
- -- unicode escape - backslash is not escaped
- select to_jsonb(text '\uabcd');
-  to_jsonb
- ----------
-  "\uabcd"
- (1 row)
-
- -- any other backslash is escaped
- select to_jsonb(text '\abcd');
-  to_jsonb
- ----------
-  "\\abcd"
- (1 row)
-
  --jsonb_agg
  CREATE TEMP TABLE rows AS
  SELECT x, 'txt' || x as y
--- 330,335 ----
*************** LINE 1: SELECT jsonb '{ "a":  "\ude04X"
*** 1971,1990 ****
  DETAIL:  Unicode low surrogate must follow a high surrogate.
  CONTEXT:  JSON data, line 1: { "a":...
  -- handling of simple unicode escapes
! SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a' AS correct_in_utf8;
  ERROR:  invalid input syntax for type json
  LINE 1: SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a'...
                       ^
  DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
  CONTEXT:  JSON data, line 1: { "a":...
! SELECT jsonb '{ "a":  "dollar \u0024 character" }' ->> 'a' AS correct_everyWHERE;
   correct_everywhere
  --------------------
   dollar $ character
  (1 row)

! SELECT jsonb '{ "a":  "null \u0000 escape" }' ->> 'a' AS not_unescaped;
!    not_unescaped
  --------------------
   null \u0000 escape
  (1 row)
--- 1963,2024 ----
  DETAIL:  Unicode low surrogate must follow a high surrogate.
  CONTEXT:  JSON data, line 1: { "a":...
  -- handling of simple unicode escapes
! SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' as correct_in_utf8;
! ERROR:  invalid input syntax for type json
! LINE 1: SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' as corr...
!                      ^
! DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
! CONTEXT:  JSON data, line 1: { "a":...
! SELECT jsonb '{ "a":  "dollar \u0024 character" }' as correct_everywhere;
!      correct_everywhere
! -----------------------------
!  {"a": "dollar $ character"}
! (1 row)
!
! SELECT jsonb '{ "a":  "dollar \\u0024 character" }' as not_an_escape;
!            not_an_escape
! -----------------------------------
!  {"a": "dollar \\u0024 character"}
! (1 row)
!
! SELECT jsonb '{ "a":  "null \u0000 escape" }' as fails;
! ERROR:  invalid input syntax for type json
! LINE 1: SELECT jsonb '{ "a":  "null \u0000 escape" }' as fails;
!                      ^
! DETAIL:  \u0000 cannot be converted to text.
! CONTEXT:  JSON data, line 1: { "a":...
! SELECT jsonb '{ "a":  "null \\u0000 escape" }' as not_an_escape;
!         not_an_escape
! ------------------------------
!  {"a": "null \\u0000 escape"}
! (1 row)
!
! SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
  ERROR:  invalid input syntax for type json
  LINE 1: SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a'...
                       ^
  DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
  CONTEXT:  JSON data, line 1: { "a":...
! SELECT jsonb '{ "a":  "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
   correct_everywhere
  --------------------
   dollar $ character
  (1 row)

! SELECT jsonb '{ "a":  "dollar \\u0024 character" }' ->> 'a' as not_an_escape;
!       not_an_escape
! -------------------------
!  dollar \u0024 character
! (1 row)
!
! SELECT jsonb '{ "a":  "null \u0000 escape" }' ->> 'a' as fails;
! ERROR:  invalid input syntax for type json
! LINE 1: SELECT jsonb '{ "a":  "null \u0000 escape" }' ->> 'a' as fai...
!                      ^
! DETAIL:  \u0000 cannot be converted to text.
! CONTEXT:  JSON data, line 1: { "a":...
! SELECT jsonb '{ "a":  "null \\u0000 escape" }' ->> 'a' as not_an_escape;
!    not_an_escape
  --------------------
   null \u0000 escape
  (1 row)
diff --git a/src/test/regress/sql/json.sql b/src/test/regress/sql/json.sql
index 36a6674..53a37a8 100644
*** a/src/test/regress/sql/json.sql
--- b/src/test/regress/sql/json.sql
*************** SET LOCAL TIME ZONE -8;
*** 111,124 ****
  select to_json(timestamptz '2014-05-28 12:22:35.614298-04');
  COMMIT;

- -- unicode escape - backslash is not escaped
-
- select to_json(text '\uabcd');
-
- -- any other backslash is escaped
-
- select to_json(text '\abcd');
-
  --json_agg

  SELECT json_agg(q)
--- 111,116 ----
*************** select json '{ "a":  "\ude04X" }' -> 'a'
*** 401,409 ****

  --handling of simple unicode escapes

  select json '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
  select json '{ "a":  "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
! select json '{ "a":  "null \u0000 escape" }' ->> 'a' as not_unescaped;

  --json_typeof() function
  select value, json_typeof(value)
--- 393,409 ----

  --handling of simple unicode escapes

+ select json '{ "a":  "the Copyright \u00a9 sign" }' as correct_in_utf8;
+ select json '{ "a":  "dollar \u0024 character" }' as correct_everywhere;
+ select json '{ "a":  "dollar \\u0024 character" }' as not_an_escape;
+ select json '{ "a":  "null \u0000 escape" }' as not_unescaped;
+ select json '{ "a":  "null \\u0000 escape" }' as not_an_escape;
+
  select json '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
  select json '{ "a":  "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
! select json '{ "a":  "dollar \\u0024 character" }' ->> 'a' as not_an_escape;
! select json '{ "a":  "null \u0000 escape" }' ->> 'a' as fails;
! select json '{ "a":  "null \\u0000 escape" }' ->> 'a' as not_an_escape;

  --json_typeof() function
  select value, json_typeof(value)
diff --git a/src/test/regress/sql/jsonb.sql b/src/test/regress/sql/jsonb.sql
index a846103..53cc239 100644
*** a/src/test/regress/sql/jsonb.sql
--- b/src/test/regress/sql/jsonb.sql
*************** SELECT '"\v"'::jsonb;            -- ERROR, not a
*** 10,16 ****
  SELECT '"\u"'::jsonb;            -- ERROR, incomplete escape
  SELECT '"\u00"'::jsonb;            -- ERROR, incomplete escape
  SELECT '"\u000g"'::jsonb;        -- ERROR, g is not a hex digit
! SELECT '"\u0000"'::jsonb;        -- OK, legal escape
  -- use octet_length here so we don't get an odd unicode char in the
  -- output
  SELECT octet_length('"\uaBcD"'::jsonb::text); -- OK, uppercase and lower case both OK
--- 10,17 ----
  SELECT '"\u"'::jsonb;            -- ERROR, incomplete escape
  SELECT '"\u00"'::jsonb;            -- ERROR, incomplete escape
  SELECT '"\u000g"'::jsonb;        -- ERROR, g is not a hex digit
! SELECT '"\u0045"'::jsonb;        -- OK, legal escape
! SELECT '"\u0000"'::jsonb;        -- ERROR, we don't support U+0000
  -- use octet_length here so we don't get an odd unicode char in the
  -- output
  SELECT octet_length('"\uaBcD"'::jsonb::text); -- OK, uppercase and lower case both OK
*************** SET LOCAL TIME ZONE -8;
*** 73,86 ****
  select to_jsonb(timestamptz '2014-05-28 12:22:35.614298-04');
  COMMIT;

- -- unicode escape - backslash is not escaped
-
- select to_jsonb(text '\uabcd');
-
- -- any other backslash is escaped
-
- select to_jsonb(text '\abcd');
-
  --jsonb_agg

  CREATE TEMP TABLE rows AS
--- 74,79 ----
*************** SELECT jsonb '{ "a":  "\ud83dX" }' -> 'a
*** 488,496 ****
  SELECT jsonb '{ "a":  "\ude04X" }' -> 'a'; -- orphan low surrogate

  -- handling of simple unicode escapes
! SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a' AS correct_in_utf8;
! SELECT jsonb '{ "a":  "dollar \u0024 character" }' ->> 'a' AS correct_everyWHERE;
! SELECT jsonb '{ "a":  "null \u0000 escape" }' ->> 'a' AS not_unescaped;

  -- jsonb_to_record and jsonb_to_recordset

--- 481,498 ----
  SELECT jsonb '{ "a":  "\ude04X" }' -> 'a'; -- orphan low surrogate

  -- handling of simple unicode escapes
!
! SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' as correct_in_utf8;
! SELECT jsonb '{ "a":  "dollar \u0024 character" }' as correct_everywhere;
! SELECT jsonb '{ "a":  "dollar \\u0024 character" }' as not_an_escape;
! SELECT jsonb '{ "a":  "null \u0000 escape" }' as fails;
! SELECT jsonb '{ "a":  "null \\u0000 escape" }' as not_an_escape;
!
! SELECT jsonb '{ "a":  "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
! SELECT jsonb '{ "a":  "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
! SELECT jsonb '{ "a":  "dollar \\u0024 character" }' ->> 'a' as not_an_escape;
! SELECT jsonb '{ "a":  "null \u0000 escape" }' ->> 'a' as fails;
! SELECT jsonb '{ "a":  "null \\u0000 escape" }' ->> 'a' as not_an_escape;

  -- jsonb_to_record and jsonb_to_recordset


Re: jsonb, unicode escapes and escaped backslashes

From
Peter Geoghegan
Date:
On Thu, Jan 29, 2015 at 10:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I made the \u0000 error be errcode(ERRCODE_INVALID_TEXT_REPRESENTATION)
> and errmsg("invalid input syntax for type json"), by analogy to what's
> thrown for non-ASCII Unicode escapes in non-UTF8 encoding.  I'm not
> terribly happy with that, though.  ISTM that for both cases, this is
> not "invalid syntax" at all, but an implementation restriction that
> forces us to reject perfectly valid syntax.  So I think we ought to
> use a different ERRCODE and text message, though I'm not entirely
> sure what it should be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is
> one possibility.

I personally prefer what you have here.

The point of JSONB is that we take a position on certain aspects like
this. We're bridging a pointedly loosey goosey interchange format,
JSON, with native PostgreSQL types. For example, we take a firm
position on encoding. The JSON type is a bit more permissive, to about
the extent that that's possible. The whole point is that we're
interpreting JSON data in a way that's consistent with *Postgres*
conventions. You'd have to interpret the data according to *some*
convention in order to do something non-trivial with it in any case,
and users usually want that.

It's really nice the way encoding is a strict implementation detail
within Postgres in general, in the sense that you know that if your
application code is concerned about encoding, you're probably thinking
about the problem incorrectly (at least once data has crossed the
database encoding "border"). MySQL's lackadaisical attitudes here appear
to have been an enormous mistake.
-- 
Peter Geoghegan



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Thu, Jan 29, 2015 at 10:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I made the \u0000 error be errcode(ERRCODE_INVALID_TEXT_REPRESENTATION)
>> and errmsg("invalid input syntax for type json"), by analogy to what's
>> thrown for non-ASCII Unicode escapes in non-UTF8 encoding.  I'm not
>> terribly happy with that, though.  ISTM that for both cases, this is
>> not "invalid syntax" at all, but an implementation restriction that
>> forces us to reject perfectly valid syntax.  So I think we ought to
>> use a different ERRCODE and text message, though I'm not entirely
>> sure what it should be instead.  ERRCODE_FEATURE_NOT_SUPPORTED is
>> one possibility.

> I personally prefer what you have here.

> The point of JSONB is that we take a position on certain aspects like
> this. We're bridging a pointedly loosey goosey interchange format,
> JSON, with native PostgreSQL types. For example, we take a firm
> position on encoding. The JSON type is a bit more permissive, to about
> the extent that that's possible. The whole point is that we're
> interpreting JSON data in a way that's consistent with *Postgres*
> conventions. You'd have to interpret the data according to *some*
> convention in order to do something non-trivial with it in any case,
> and users usually want that.

I quite agree with you, actually, in terms of that perspective.  But my
point remains: "\u0000" is not invalid JSON syntax, and neither is
"\u1234".  If we choose to throw an error because we can't interpret or
process that according to our conventions, fine, but we should call it
something other than "invalid syntax".

ERRCODE_UNTRANSLATABLE_CHARACTER or ERRCODE_CHARACTER_NOT_IN_REPERTOIRE
seem more apropos from here.  And I still think there's a case to be
made for ERRCODE_FEATURE_NOT_SUPPORTED, because it's at least possible
that we'd relax this restriction in future (eg, allow Unicode characters
that can be converted to the database's encoding).
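
For reference, a hypothetical sketch of how each choice would surface
to clients (SQLSTATEs per the error codes appendix; none of this is in
the posted patch):

   select '"\u0000"'::jsonb;
   -- ERRCODE_UNTRANSLATABLE_CHARACTER    -> SQLSTATE 22P05
   -- ERRCODE_CHARACTER_NOT_IN_REPERTOIRE -> SQLSTATE 22021
   -- ERRCODE_FEATURE_NOT_SUPPORTED       -> SQLSTATE 0A000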
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Peter Geoghegan
Date:
On Thu, Jan 29, 2015 at 11:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The point of JSONB is that we take a position on certain aspects like
>> this. We're bridging a pointedly loosey goosey interchange format,
>> JSON, with native PostgreSQL types. For example, we take a firm
>> position on encoding. The JSON type is a bit more permissive, to about
>> the extent that that's possible. The whole point is that we're
>> interpreting JSON data in a way that's consistent with *Postgres*
>> conventions. You'd have to interpret the data according to *some*
>> convention in order to do something non-trivial with it in any case,
>> and users usually want that.
>
> I quite agree with you, actually, in terms of that perspective.

Sure, but I wasn't sure that that was evident to others.

To emphasize: I think it's appropriate that the JSON spec takes a
somewhat hands-off approach to things like encoding and the
precision of numbers. I also think it's appropriate that JSONB does
not, up to and including where JSONB forbids things that the JSON spec
supposes could be useful. We haven't failed users by (say) not
accepting NULs, even though the spec suggests that that might be
useful - we have provided them with a reasonable, concrete
interpretation of that JSON data, with lots of useful operators, that
they may take or leave. It really isn't a historical accident that we
have both a JSON and a JSONB type. For other examples of this, see
every "document database" in existence.

Depart from this perspective, as an interchange standard author, and
you end up with something like XML, which while easy to reason about
isn't all that useful, or BSON, the binary interchange format, which
is an oxymoron.

> But my point remains: "\u0000" is not invalid JSON syntax, and neither is
> "\u1234".  If we choose to throw an error because we can't interpret or
> process that according to our conventions, fine, but we should call it
> something other than "invalid syntax".
>
> ERRCODE_UNTRANSLATABLE_CHARACTER or ERRCODE_CHARACTER_NOT_IN_REPERTOIRE
> seem more apropos from here.

I see. I'd go with ERRCODE_UNTRANSLATABLE_CHARACTER, then.
-- 
Peter Geoghegan



Re: jsonb, unicode escapes and escaped backslashes

From
Robert Haas
Date:
On Thu, Jan 29, 2015 at 10:09 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
>> On Thu, Jan 29, 2015 at 5:30 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> > jsonb stores string values as postgres text values, with the unicode escapes
>> > resolved, just as it also resolves numbers and booleans into their native
>> > representation.  If you want the input perfectly preserved, use json, not
>> > jsonb.  I think that's made pretty clear in the docs.
>> >
>> > so text->jsonb->text is not and has never been expected to be a noop.
>>
>> If you can't store text in a jsonb object and get it back out again,
>> it doesn't seem like a very useful data type.
>
> Hrm.  That sentence strikes me as slightly confused.
>
> You *can* store text in a jsonb object and get it back out again,
> however, that's not the same as taking a JSON structure which is
> represented as text, converting it to jsonb, and then converting it back
> into a text representation.

I understand Andrew to be saying that if you take a 6-character string
and convert it to a JSON string and then back to text, you will
*usually* get back the same 6 characters you started with ... unless
the first character was \, the second u, and the remainder hexadecimal
digits.  Then you'll get back a one-character string or an error
instead.  It's not hard to imagine that leading to surprising
behavior, or even security vulnerabilities in applications that aren't
expecting such a translation to happen under them.
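
For instance (a reconstruction of the pre-revert 9.4.0 behavior,
assuming a UTF8 database):

   select array_to_json(array[text '\uabcd']) ->> 0;
   -- 9.4.0: a single character (U+ABCD), the escape having been resolved
   -- after the revert: the original six characters, \uabcd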

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: jsonb, unicode escapes and escaped backslashes

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I understand Andrew to be saying that if you take a 6-character string
> and convert it to a JSON string and then back to text, you will
> *usually* get back the same 6 characters you started with ... unless
> the first character was \, the second u, and the remainder hexadecimal
> digits.  Then you'll get back a one-character string or an error
> instead.  It's not hard to imagine that leading to surprising
> behavior, or even security vulnerabilities in applications that aren't
> expecting such a translation to happen under them.

That *was* the case, with the now-reverted patch that changed the escaping
rules.  It's not anymore:

regression=# select to_json('\u1234'::text);
  to_json
-----------
 "\\u1234"
(1 row)

When you convert that back to text, you'll get \u1234, no more and no
less.  For example:

regression=# select array_to_json(array['\u1234'::text]);
 array_to_json
---------------
 ["\\u1234"]
(1 row)

regression=# select array_to_json(array['\u1234'::text])->0;
 ?column?
-----------
 "\\u1234"
(1 row)

regression=# select array_to_json(array['\u1234'::text])->>0;
 ?column?
----------
 \u1234
(1 row)

Now, if you put in '"\u1234"'::jsonb and extract that string as text,
you get some Unicode character or other.  But I'd say that a JSON user
who is surprised by that doesn't understand JSON, and definitely that they
hadn't read more than about one paragraph of our description of the JSON
types.
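
Concretely, a sketch (assuming a UTF8 database):

   select '["\u1234"]'::jsonb ->> 0;
   -- returns the single character U+1234, not the six-character escape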
        regards, tom lane



Re: jsonb, unicode escapes and escaped backslashes

From
Robert Haas
Date:
On Sat, Jan 31, 2015 at 8:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I understand Andrew to be saying that if you take a 6-character string
>> and convert it to a JSON string and then back to text, you will
>> *usually* get back the same 6 characters you started with ... unless
>> the first character was \, the second u, and the remainder hexadecimal
>> digits.  Then you'll get back a one-character string or an error
>> instead.  It's not hard to imagine that leading to surprising
>> behavior, or even security vulnerabilities in applications that aren't
>> expecting such a translation to happen under them.
>
> That *was* the case, with the now-reverted patch that changed the escaping
> rules.  It's not anymore:
>
> regression=# select to_json('\u1234'::text);
>   to_json
> -----------
>  "\\u1234"
> (1 row)
>
> When you convert that back to text, you'll get \u1234, no more and no
> less.  For example:
>
> regression=# select array_to_json(array['\u1234'::text]);
>  array_to_json
> ---------------
>  ["\\u1234"]
> (1 row)
>
> regression=# select array_to_json(array['\u1234'::text])->0;
>  ?column?
> -----------
>  "\\u1234"
> (1 row)
>
> regression=# select array_to_json(array['\u1234'::text])->>0;
>  ?column?
> ----------
>  \u1234
> (1 row)
>
> Now, if you put in '"\u1234"'::jsonb and extract that string as text,
> you get some Unicode character or other.  But I'd say that a JSON user
> who is surprised by that doesn't understand JSON, and definitely that they
> hadn't read more than about one paragraph of our description of the JSON
> types.

Totally agree.  That's why I think reverting the patch was the right
thing to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company