Re: JSON and unicode surrogate pairs - Mailing list pgsql-hackers

From Andres Freund
Subject Re: JSON and unicode surrogate pairs
Msg-id 20130611084717.GB2428@alap2.anarazel.de
In response to Re: JSON and unicode surrogate pairs  (Andrew Dunstan <andrew@dunslane.net>)
Responses Re: JSON and unicode surrogate pairs
List pgsql-hackers
On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:
> >It's legal, is it not, to just write the equivalent Unicode character in
> >the JSON string and not use the escapes?  If so I would think that that
> >would be the most common usage.  If someone's writing an escape, they
> >probably had a reason for doing it that way, and might not appreciate
> >our overriding their decision.

> We never store the converted values in the JSON object, nor do we return
> them from functions that return JSON. But many of the functions and
> operators that process the JSON have variants that return text instead of
> JSON, and in those cases, when the value returned is a JSON string, we do
> the following to it:
> 

> I have just realized that the problem is actually quite a lot bigger than
> that. We also use this value for field name comparison. So, let us suppose
> that we have a LATIN1 database and a piece of JSON with a field name
> containing the Euro sign ("\u20ac"), a character that is not in LATIN1.
> Making that processable so it doesn't blow up would be mighty tricky and
> error prone. The non-orthogonality I suggested as a solution upthread is, by
> contrast, very small and easy to manage, and not terribly hard to explain -
> see attached.

I think this all shows pretty clearly that it was a mistake to allow
json data in the database that we cannot entirely display with the
database's encoding. All the proposed ugly workarounds are only
necessary because we don't throw an error when originally validating the
json.
Even in a UTF-8 database you can get errors due to \u unescaping (at
attribute access time, *NOT* at json_in() time) due to invalid
surrogate pairs.
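
To make the deferred failure concrete, here is a small analogy in Python
rather than PostgreSQL's own code (the helper name try_encode is made up
for this sketch). Python's json module happens to behave similarly: it
accepts both escapes at parse time, and the error only shows up later,
when the decoded value has to be expressed in a particular encoding.

```python
import json

# Parsing succeeds in both cases: the escapes are syntactically valid JSON.
euro = json.loads('"\\u20ac"')   # U+20AC EURO SIGN, not present in LATIN1
lone = json.loads('"\\ud83d"')   # high surrogate with no low half following


def try_encode(value, encoding):
    """Return None on success, or the error text if value cannot be encoded.

    Hypothetical helper: it stands in for "convert the unescaped string to
    the database encoding", which is where the trouble appears.
    """
    try:
        value.encode(encoding)
        return None
    except UnicodeEncodeError as exc:
        return str(exc)


latin1_error = try_encode(euro, "latin-1")  # fails: no Euro sign in LATIN1
utf8_error = try_encode(lone, "utf-8")      # fails: unpaired surrogate
utf8_ok = try_encode(euro, "utf-8")         # succeeds
```

Neither failure is visible when the value is first accepted, which is the
same shape of problem as described above.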

I think this goes contrary to the normal postgres approach of validating
data as strictly as necessary. And I think we are going to regret not
fixing this while there are still relatively few users out there.
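
As a rough sketch of what stricter validation at input time could look
like (again Python, not a patch; the function name check_unicode_escapes
and its encoding parameter are assumptions made for illustration, and the
input is assumed to be syntactically valid JSON already), one could walk
the raw text and reject any \u escape that is an unpaired surrogate or
that has no representation in the target encoding:

```python
def check_unicode_escapes(json_text, encoding="utf-8"):
    """Raise ValueError if a \\uXXXX escape cannot be represented.

    Illustrative only: assumes json_text is otherwise valid JSON, so a
    backslash can only occur inside a string literal.
    """
    i, n = 0, len(json_text)
    high = None  # pending high surrogate waiting for its low half

    while i < n:
        is_u_escape = json_text.startswith("\\u", i)
        if high is not None and not is_u_escape:
            raise ValueError("high surrogate not followed by low surrogate")
        if not is_u_escape:
            # Skip other escapes (\\, \", \n, ...) as a two-character unit
            # so that the text \\u20ac is not mistaken for an escape.
            i += 2 if json_text[i] == "\\" else 1
            continue

        code = int(json_text[i + 2:i + 6], 16)
        i += 6

        if 0xD800 <= code <= 0xDBFF:
            if high is not None:
                raise ValueError("two high surrogates in a row")
            high = code
            continue
        if 0xDC00 <= code <= 0xDFFF:
            if high is None:
                raise ValueError("low surrogate without high surrogate")
            # Combine the pair into a single code point above U+FFFF.
            code = 0x10000 + ((high - 0xD800) << 10) + (code - 0xDC00)
            high = None
        elif high is not None:
            raise ValueError("high surrogate not followed by low surrogate")

        try:
            chr(code).encode(encoding)
        except UnicodeEncodeError:
            raise ValueError(
                "U+%04X is not representable in %s" % (code, encoding)
            ) from None

    if high is not None:
        raise ValueError("input ends after a high surrogate")
```

With this, check_unicode_escapes('{"a": "\\u20ac"}', "latin-1") raises
immediately, while the same input with "utf-8" passes, so the error is
reported when the data comes in rather than at attribute access time.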

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


