Re: JSON and unicode surrogate pairs - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: JSON and unicode surrogate pairs
Date
Msg-id 51B5EEAD.50208@dunslane.net
In response to Re: JSON and unicode surrogate pairs  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: JSON and unicode surrogate pairs
Re: JSON and unicode surrogate pairs
List pgsql-hackers
On 06/10/2013 10:18 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> After thinking about this some more I have come to the conclusion that
>> we should only do any de-escaping of \uxxxx sequences, whether or not
>> they are for BMP characters, when the server encoding is utf8. For any
>> other encoding, which is already a violation of the JSON standard
>> anyway, and should be avoided if you're dealing with JSON, we should
>> just pass them through even in text output. This will be a simple and
>> very localized fix.
> Hmm.  I'm not sure that users will like this definition --- it will seem
> pretty arbitrary to them that conversion of \u sequences happens in some
> databases and not others.

Then what should we do when a \u escape has no matching character in
the database encoding? First we'd have to defer the de-escaping until
the value is actually needed, so it's not done over-eagerly, and then
we'd have to try the conversion and throw an error if it doesn't work.
The second part is what's happening now, but the deferred evaluation
is not.
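
To make the "delay, then try the conversion" idea concrete, here is a
minimal sketch in Python rather than the server's C. Everything in it is
an assumption made for illustration: the names text_output and
decode_run are hypothetical, and Python codec names such as "latin-1"
stand in for server encodings. It is not how the backend does it. The
stored string keeps its escapes, and only a request for text output
de-escapes them, pairs up surrogates, and fails if the target encoding
has no such character.

```python
import re

# Either an escaped backslash (left untouched) or a run of \uXXXX escapes.
ESCAPES = re.compile(r'\\\\|(?:\\u[0-9a-fA-F]{4})+')


def decode_run(run):
    """Turn a run of \\uXXXX escapes into text, combining surrogate pairs."""
    units = [int(run[i + 2:i + 6], 16) for i in range(0, len(run), 6)]
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(chr(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)))
                i += 2
                continue
            raise ValueError("high surrogate not followed by low surrogate")
        if 0xDC00 <= u <= 0xDFFF:
            raise ValueError("low surrogate without preceding high surrogate")
        out.append(chr(u))
        i += 1
    return "".join(out)


def text_output(stored, target_encoding):
    """De-escape \\u sequences only now, when text output is requested.

    Raises UnicodeEncodeError if a character has no representation in
    target_encoding (a Python codec name standing in for the server encoding).
    Other escapes such as \\n are deliberately left alone in this sketch.
    """
    def replace(match):
        token = match.group(0)
        if token == '\\\\':
            return token              # escaped backslash, not a \u escape
        text = decode_run(token)
        text.encode(target_encoding)  # the conversion attempt; may raise
        return text

    return ESCAPES.sub(replace, stored)
```

With this shape, storing or copying a value never touches the escapes,
so a string like "\u20ac" can sit in a LATIN1 database without trouble
until someone actually asks for it as text, at which point the
conversion error is raised.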

Or we could abandon the conversion altogether, but that doesn't seem 
very friendly either. I suspect the most common reason people use these 
sequences is that the database is UTF8 but the client encoding is not.

Frankly, if you want to use Unicode escapes, you should really be using 
a UTF8 encoded database if at all possible.


>
>> We'll still have to deal with this issue when we get to binary storage
>> of JSON, but that's not something we need to confront today.
> Well, if we have to break backwards compatibility when we try to do
> binary storage, we're not going to be happy either.  So I think we'd
> better have a plan in mind for what will happen then.
>

I don't see any reason why we couldn't store the JSON strings with the 
Unicode escape sequences intact in the binary format. What the binary 
format buys us is that it has decomposed the JSON into a tree structure, 
so instead of parsing the JSON we can just walk the tree, but the leaf 
nodes of the tree are still (in the case of the nodes under discussion) 
text-like objects.
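
As a rough illustration of that last point, the sketch below builds by
hand a tiny tree of the kind a binary format might hold. No such format
exists yet, so the layout and the names RawString and walk are purely
hypothetical. The point is only that reaching a leaf means walking the
tree, with no re-parsing, while the leaf itself still carries the
escape sequences exactly as they were written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawString:
    """Leaf node: a JSON string body kept as written, escapes included."""
    raw: str


# Hand-built stand-in for the decomposed form of
#   {"name": "caf\u00e9", "tags": ["\ud83d\ude00"]}
tree = {
    "name": RawString(r"caf\u00e9"),
    "tags": [RawString(r"\ud83d\ude00")],
}


def walk(node, path):
    """Follow object keys and array indexes down to a node; no parsing."""
    for step in path:
        node = node[step]
    return node


leaf = walk(tree, ["tags", 0])
print(leaf.raw)  # the escapes are still there, untouched
```

Whether and when such a leaf gets de-escaped is then the same question
as for the text format, which is why the binary format doesn't have to
force a different answer.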

cheers

andrew


