Re: JSON and unicode surrogate pairs - Mailing list pgsql-hackers

From: Andrew Dunstan
Subject: Re: JSON and unicode surrogate pairs
Msg-id: 51B60669.3080107@dunslane.net
In response to: Re: JSON and unicode surrogate pairs (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: JSON and unicode surrogate pairs (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
On 06/10/2013 11:43 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Or we could abandon the conversion altogether, but that doesn't seem
>> very friendly either. I suspect the biggest case for people to use these
>> sequences is where the database is UTF8 but the client encoding is not.
> Well, if that's actually the biggest use-case, then maybe we should just
> say we're *not* in the business of converting those escapes.  That would
> make things nice and consistent regardless of the DB encoding, and it
> would avoid the problem of being able to input a value and then not
> being able to output it again.
>
> It's legal, is it not, to just write the equivalent Unicode character in
> the JSON string and not use the escapes?  If so I would think that that
> would be the most common usage.  If someone's writing an escape, they
> probably had a reason for doing it that way, and might not appreciate
> our overriding their decision.


We never store the converted values in the JSON object, nor do we return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text instead
of JSON, and in those cases, when the value returned is a JSON string,
we do the following to it:

  * strip the outside quotes
  * de-escape the various escaped characters (i.e. everything preceded
    by a backslash in the railroad diagram for string at
    <http://www.json.org/>)


Here's an example of the difference:

    andrew=# select '{ "a": "\u00a9"}'::json -> 'a';
      ?column?
    ----------
      "\u00a9"
    (1 row)

    andrew=# select '{ "a": "\u00a9"}'::json ->>'a';
      ?column?
    ----------
      ©
    (1 row)

It's the process of producing the latter that is giving us a headache in
non-UTF8 databases.
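
To make that concrete, here is a hypothetical session in a LATIN1
database (illustrative only, not a real transcript: "latin1" is a
made-up database name, and the error text is my guess at the encoding
conversion failure; depending on the code path the result could just
as well be invalidly encoded output):

    -- hypothetical; U+20AC (Euro sign) has no LATIN1 equivalent
    latin1=# select '{ "a": "\u20ac"}'::json ->> 'a';
    ERROR:  character with byte sequence 0xe2 0x82 0xac in encoding "UTF8" has no equivalent in encoding "LATIN1"

The JSON-returning operator can hand back the escaped form untouched,
but the text-returning one has to produce an actual Euro sign, and
LATIN1 has no way to represent it.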

... [ more caffeine is consumed ] ...

I have just realized that the problem is actually quite a lot bigger
than that. We also use this value for field name comparison. So, let us
suppose that we have a LATIN1 database and a piece of JSON with a field
name containing the Euro sign ("\u20ac"), a character that is not in
LATIN1. Making that processable so it doesn't blow up would be mighty
tricky and error-prone (see the sketch below). The non-orthogonality I
suggested as a solution upthread is, by contrast, very small and easy to
manage, and not terribly hard to explain; see attached.
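
Here is the field-name trap sketched as a hypothetical session (again
not a real transcript, and the error text is illustrative): the key
being looked up is plain ASCII, but since the stored field names are
de-escaped before comparison, the lookup still has to materialize a
character LATIN1 cannot hold:

    -- hypothetical; matching 'b' still de-escapes "\u20ac" first
    latin1=# select '{ "\u20ac": 1, "b": 2 }'::json -> 'b';
    ERROR:  character with byte sequence 0xe2 0x82 0xac in encoding "UTF8" has no equivalent in encoding "LATIN1"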

cheers

andrew


