Re: JSON and unicode surrogate pairs - Mailing list pgsql-hackers

From	Andrew Dunstan
Subject	Re: JSON and unicode surrogate pairs
Date	June 10, 2013 09:16:51
Msg-id	51B56F2C.3020305@dunslane.net
In response to	Re: JSON and unicode surrogate pairs (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: JSON and unicode surrogate pairs
List	pgsql-hackers

On 06/09/2013 07:47 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I did that, but it's evident from the buildfarm that there's more work
>> to do. The problem is that we do the de-escaping as we lex the json to
>> construct the look ahead token, and at that stage we don't know whether
>> or not it's really going to be needed. That means we can cause errors to
>> be raised in far too many places. It's failing on this line:
>>      converted = pg_any_to_server(utf8str, utf8len, PG_UTF8);
>> even though the operator in use ("->") doesn't even use the de-escaped
>> value.
>> The real solution is going to be to delay the de-escaping of the string
>> until it is known to be wanted. That's unfortunately going to be a bit
>> invasive, but I can't see a better solution. I'll work on it ASAP.
> Not sure that this idea isn't a dead end.  IIUC, you're proposing to
> jump through hoops in order to avoid complaining about illegal JSON
> data, essentially just for backwards compatibility with 9.2's failure to
> complain about it.  If we switch over to a pre-parsed (binary) storage
> format for JSON values, won't we be forced to throw these errors anyway?
> If so, maybe we should just take the compatibility hit now while there's
> still a relatively small amount of stored JSON data in the wild.
>
>             


No, I probably haven't explained it very well. Here is the regression 
diff from jacana:
  ERROR:  cannot call json_populate_recordset on a nested object
  -- handling of unicode surrogate pairs
  select json '{ "a":  "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct;
!           correct
! ----------------------------
!  "\ud83d\ude04\ud83d\udc36"
! (1 row)
!
  select json '{ "a":  "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row
  ERROR:  invalid input syntax for type json
  DETAIL:  high order surrogate must not follow a high order surrogate.
--- 922,928 ----
  ERROR:  cannot call json_populate_recordset on a nested object
  -- handling of unicode surrogate pairs
  select json '{ "a":  "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct;
! ERROR:  character with byte sequence 0xf0 0x9f 0x98 0x84 in encoding "UTF8" has no equivalent in encoding "WIN1252"
  select json '{ "a":  "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row
  ERROR:  invalid input syntax for type json
  DETAIL:  high order surrogate must not follow a high order surrogate.
 


The sequence in question is two perfectly valid surrogate pairs.
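
To make that concrete, here is a small Python sketch (not PostgreSQL code; the helper names combine_surrogates and fits_in are made up for illustration) that combines the two pairs from the test into code points, shows the UTF-8 bytes quoted in the error message, and checks whether the character exists in Windows-1252. It assumes Python's "cp1252" codec is a fair stand-in for the server's WIN1252 encoding.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def fits_in(encoding: str, code_point: int) -> bool:
    """Return True if the code point can be represented in the encoding."""
    try:
        chr(code_point).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The two pairs from the regression test.
smile = combine_surrogates(0xD83D, 0xDE04)  # U+1F604
dog = combine_surrogates(0xD83D, 0xDC36)    # U+1F436

# UTF-8 bytes of the first character: f0 9f 98 84, as in the error message.
smile_utf8 = chr(smile).encode("utf-8")

# Neither character has a Windows-1252 equivalent, hence the conversion error.
smile_in_win1252 = fits_in("cp1252", smile)
```

Both pairs are well formed, so the failure comes from converting the result to the server encoding, not from the JSON itself.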

...

After thinking about this some more I have come to the conclusion that 
we should de-escape \uxxxx sequences, whether or not they are for BMP 
characters, only when the server encoding is utf8. Using any other 
encoding is already a violation of the JSON standard and should be 
avoided if you're dealing with JSON, so in that case we should just pass 
the escapes through unchanged, even in text output. This will be a simple 
and very localized fix.
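
As a rough illustration of that rule, here is a Python sketch, not the actual C change. The function name deescape and its server_encoding argument are hypothetical, and the sketch assumes the pass-through path does no validation at all, which the real lexer may well still do.

```python
import re

# Matches any JSON backslash escape. Group 1 is set only for \uXXXX.
# Scanning whole escapes means an escaped backslash followed by "u0041"
# is not mistaken for a unicode escape.
_ESCAPE = re.compile(r"\\(?:u([0-9a-fA-F]{4})|.)", re.DOTALL)


def deescape(text: str, server_encoding: str) -> str:
    """De-escape JSON unicode escapes only when the server encoding is UTF8."""
    if server_encoding.upper() != "UTF8":
        # Any other encoding: pass the escapes through untouched.
        return text

    def to_code_unit(match: re.Match) -> str:
        if match.group(1) is None:
            return match.group(0)  # some other escape; leave it alone
        return chr(int(match.group(1), 16))

    units = _ESCAPE.sub(to_code_unit, text)
    try:
        # Round-tripping through UTF-16 joins surrogate pairs into single
        # characters and rejects unpaired surrogates.
        return units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    except UnicodeDecodeError as exc:
        raise ValueError("invalid surrogate sequence") from exc
```

With a WIN1252 server the test string from the diff comes back exactly as written, escapes and all, so no conversion to the server encoding is ever attempted. With UTF8 it becomes the two actual characters.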

We'll still have to deal with this issue when we get to binary storage 
of JSON, but that's not something we need to confront today.

cheers

andrew
