Re: JSON and unicode surrogate pairs - Mailing list pgsql-hackers

From: Andrew Dunstan
Subject: Re: JSON and unicode surrogate pairs
Msg-id: 51B60669.3080107@dunslane.net
In response to: Re: JSON and unicode surrogate pairs (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: JSON and unicode surrogate pairs (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
On 06/10/2013 11:43 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Or we could abandon the conversion altogether, but that doesn't seem
>> very friendly either. I suspect the biggest case for people to use these
>> sequences is where the database is UTF8 but the client encoding is not.
> Well, if that's actually the biggest use-case, then maybe we should just
> say we're *not* in the business of converting those escapes.  That would
> make things nice and consistent regardless of the DB encoding, and it
> would avoid the problem of being able to input a value and then not
> being able to output it again.
>
> It's legal, is it not, to just write the equivalent Unicode character in
> the JSON string and not use the escapes?  If so I would think that that
> would be the most common usage.  If someone's writing an escape, they
> probably had a reason for doing it that way, and might not appreciate
> our overriding their decision.


We never store the converted values in the JSON object, nor do we return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text instead
of JSON, and in those cases, when the value returned is a JSON string,
we do the following to it:

  * strip the outside quotes
  * de-escape the various escaped characters (i.e. everything preceded
    by a backslash in the railroad diagram for string at
    <http://www.json.org/>)


Here's an example of the difference:

    andrew=# select '{ "a": "\u00a9"}'::json -> 'a';
      ?column?
    ----------
      "\u00a9"
    (1 row)

    andrew=# select '{ "a": "\u00a9"}'::json ->>'a';
      ?column?
    ----------
      ©
    (1 row)

It's the process of producing the latter that is giving us a headache in
non-UTF8 databases.
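
To make that concrete, here is a hypothetical session in a LATIN1
database (illustrative only, not a real transcript: "latin1" is a
made-up database name, and the error text is my guess at the encoding
conversion failure; depending on the code path the result could just
as well be invalidly encoded output):

    -- hypothetical; U+20AC (Euro sign) has no LATIN1 equivalent
    latin1=# select '{ "a": "\u20ac"}'::json ->> 'a';
    ERROR:  character with byte sequence 0xe2 0x82 0xac in encoding "UTF8" has no equivalent in encoding "LATIN1"

The JSON-returning operator can hand back the escaped form untouched,
but the text-returning one has to produce an actual Euro sign, and
LATIN1 has no way to represent it.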

... [ more caffeine is consumed ] ...

I have just realized that the problem is actually quite a lot bigger
than that. We also use this value for field name comparison. So, let us
suppose that we have a LATIN1 database and a piece of JSON with a field
name containing the Euro sign ("\u20ac"), a character that is not in
LATIN1. Making that processable so it doesn't blow up would be mighty
tricky and error-prone (see the sketch below). The non-orthogonality I
suggested as a solution upthread is, by contrast, very small and easy to
manage, and not terribly hard to explain; see attached.
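
Here is the field-name trap sketched as a hypothetical session (again
not a real transcript, and the error text is illustrative): the key
being looked up is plain ASCII, but since the stored field names are
de-escaped before comparison, the lookup still has to materialize a
character LATIN1 cannot hold:

    -- hypothetical; matching 'b' still de-escapes "\u20ac" first
    latin1=# select '{ "\u20ac": 1, "b": 2 }'::json -> 'b';
    ERROR:  character with byte sequence 0xe2 0x82 0xac in encoding "UTF8" has no equivalent in encoding "LATIN1"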

cheers

andrew


