Re: Unicode escapes with any backend encoding - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Unicode escapes with any backend encoding
Msg-id 7317.1579014636@sss.pgh.pa.us
In response to Re: Unicode escapes with any backend encoding  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Unicode escapes with any backend encoding  (Chapman Flack <chap@anastigmatix.net>)
List pgsql-hackers
I wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On Tue, Jan 14, 2020 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Grepping for other direct uses of unicode_to_utf8(), I notice that
>>> there are a couple of places in the JSON code where we have a similar
>>> restriction that you can only write a Unicode escape in UTF8 server
>>> encoding.  I'm not sure whether these same semantics could be
>>> applied there, so I didn't touch that.

>> Off the cuff I'd be inclined to say we should keep the text escape
>> rules the same. We've already extended the JSON standard by allowing
>> non-UTF8 encodings.

> Right.  I'm just thinking though that if you can write "é" literally
> in a JSON string, even though you're using LATIN1 not UTF8, then why
> not allow writing that as "\u00E9" instead?  The latter is arguably
> truer to spec.
> However, if JSONB collapses "\u00E9" to LATIN1 "é", that would be bad,
> unless we have a way to undo it on printout.  So there might be
> some more moving parts here than I thought.

On third thought, what would be so bad about that?  Let's suppose
I write:

    INSERT ... values('{"x": "\u00E9"}'::jsonb);

and the jsonb parsing logic chooses to collapse the backslash to
the represented character, i.e., "é".  Why should it matter whether
the database encoding is UTF8 or LATIN1?  If I am using UTF8
client encoding, I will see the "é" in UTF8 encoding either way,
because of output encoding conversion.  If I am using LATIN1
client encoding, I will see the "é" in LATIN1 either way --- or
at least, I will if the database encoding is UTF8.  Right now I get
an error for that when the database encoding is LATIN1 ... but if
I store the "é" as literal "é", it works, either way.  So it seems
to me that this error is just useless pedantry.  As long as the DB
encoding can represent the desired character, it should be transparent
to users.
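
The argument above can be checked with a small model. This is only a sketch: Python's codecs stand in for the server's encoding conversion, and the helper names collapse_escape and send_to_client are hypothetical, not PostgreSQL functions. It suggests that an escaped and a literal "é" reach the client as the same bytes for every server/client encoding pair, and that only a character the server encoding cannot represent needs an error.

```python
import json

def collapse_escape(json_text: str, server_encoding: str) -> bytes:
    """Parse JSON (which collapses \\uXXXX escapes) and store the value of
    key "x" as bytes in the server encoding. Hypothetical stand-in for
    jsonb input; raises UnicodeEncodeError if the character cannot be
    represented in server_encoding."""
    value = json.loads(json_text)["x"]
    return value.encode(server_encoding)

def send_to_client(stored: bytes, server_encoding: str, client_encoding: str) -> bytes:
    """Model output encoding conversion from server to client encoding."""
    return stored.decode(server_encoding).encode(client_encoding)

escaped = '{"x": "\\u00E9"}'   # JSON text containing the escape sequence
literal = '{"x": "\u00e9"}'    # JSON text containing the character itself

results = {}
for server in ("utf-8", "latin-1"):
    for client in ("utf-8", "latin-1"):
        from_escape = send_to_client(collapse_escape(escaped, server), server, client)
        from_literal = send_to_client(collapse_escape(literal, server), server, client)
        # Both spellings yield identical client-side bytes.
        assert from_escape == from_literal
        results[(server, client)] = from_escape

# A character outside LATIN1 (here the euro sign) is the case that would
# still have to be rejected when the server encoding is LATIN1.
try:
    collapse_escape('{"x": "\\u20AC"}', "latin-1")
    euro_rejected = False
except UnicodeEncodeError:
    euro_rejected = True
```

In this model the client-side bytes depend only on the client encoding, never on the server encoding, which is the transparency being argued for.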

            regards, tom lane


