Re: Unicode escapes with any backend encoding - Mailing list pgsql-hackers

From: Chapman Flack
Subject: Re: Unicode escapes with any backend encoding
Msg-id: ef2648e8-66dc-c95c-c5ad-72ff05191c2c@anastigmatix.net
In response to: Re: Unicode escapes with any backend encoding (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

On 1/14/20 4:25 PM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack <chap@anastigmatix.net> wrote:
>>> On 1/14/20 10:10 AM, Tom Lane wrote:
>>>> to me that this error is just useless pedantry.  As long as the DB
>>>> encoding can represent the desired character, it should be transparent
>>>> to users.
> 
>>> That's my position too.
> 
>> and mine.
> 
> I'm confused --- yesterday you seemed to be against this idea.
> Have you changed your mind?
> 
> I'll gladly go change the patch if people are on board with this.

Hmm, well, let me clarify for my own part what I think I'm agreeing
with ... perhaps it's misaligned with something further upthread.

In an ideal world (which may be ideal in more ways than are in scope
for the present discussion) I would expect to see these principles:

1. On input, whether a Unicode escape is or isn't allowed should
   not depend on any encoding settings. It should be lexically
   allowed always, and if it represents a character that exists
   in the server encoding, it should mean that character. If it's
   not representable in the server encoding, it should produce an
   error that says that.

2. If it happens that the character is representable in both the
   storage encoding and the client encoding, it shouldn't matter
   whether it arrives literally as an é or as an escape. Either
   should get stored on disk as the same bytes.

3. On output, as long as the character is representable in the client
   encoding, there is nothing to worry about. It will be sent as its
   representation in the client encoding (which may be different bytes
   than its representation in the server encoding).

4. If a character to be output isn't representable in the client
   encoding, whether it can be escaped depends on the datatype.
   For example, xml_out could produce &#x????; forms, and json_out
   could produce \u???? forms.

5. If the datatype being output has no escaping rules available
   (as would be the case for an ordinary text column, say), then
   the unrepresentable character has to be reported in an error.
   (Encoding conversions often have the option of substituting
   a replacement character like ? but I don't believe a DBMS has
   any business making such changes to data, unless by explicit
   opt-in. If it can't give you the data you wanted, it should
   say "here's why I can't give you that.")

6. While 'text' in general provides no escaping mechanism, some
   functions that produce text may still have that option. For
   example, quote_literal and quote_ident could conceivably
   produce the U&'...' or U&"..." forms, respectively, if
   the argument contains characters that won't go in the client
   encoding.
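
To make principles 1 through 5 concrete, here is a minimal sketch in
Python rather than backend C. Everything in it is an assumption made
for illustration: the function names (decode_escapes, store, send,
json_escape, xml_escape) are hypothetical, Python codecs stand in for
the server and client encodings, and only four-digit \uXXXX escapes
are handled.

```python
import re

# Hypothetical names throughout; Python codecs stand in for PostgreSQL encodings.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_escapes(literal: str) -> str:
    """Turn each \\uXXXX escape into the character it denotes (four hex digits only)."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), literal)


def store(literal: str, server_encoding: str) -> bytes:
    """Principles 1 and 2: escapes are always accepted lexically; the only
    question is whether the resulting character exists in the server encoding."""
    text = decode_escapes(literal)
    try:
        return text.encode(server_encoding)
    except UnicodeEncodeError as e:
        bad = text[e.start]
        raise ValueError(
            f"character U+{ord(bad):04X} is not representable "
            f"in server encoding {server_encoding}"
        ) from None


def send(stored: bytes, server_encoding: str, client_encoding: str, escape=None) -> bytes:
    """Principles 3 to 5: convert to the client encoding; fall back to a
    datatype-specific escape if one is given, otherwise report an error."""
    out = []
    for ch in stored.decode(server_encoding):
        try:
            out.append(ch.encode(client_encoding))
        except UnicodeEncodeError:
            if escape is None:
                # No silent '?' substitution: say why the data can't be delivered.
                raise ValueError(
                    f"character U+{ord(ch):04X} is not representable "
                    f"in client encoding {client_encoding}"
                ) from None
            out.append(escape(ch).encode(client_encoding))
    return b"".join(out)


def json_escape(ch: str) -> str:
    """What a json_out-like function might emit (BMP characters only in this sketch)."""
    return f"\\u{ord(ch):04x}"


def xml_escape(ch: str) -> str:
    """What an xml_out-like function might emit."""
    return f"&#x{ord(ch):X};"
```

With a LATIN1-like server encoding, store("caf\\u00e9", "latin-1") and
store("café", "latin-1") yield the same bytes. Sending those bytes to an
ASCII-only client succeeds with json_escape or xml_escape and raises an
error when no escape function is supplied, as it would for plain text.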

I understand that on the way from 1 to 6 I will have drifted
further from what's discussed in this thread; for example, I bet
that quote_literal/quote_ident never produce U& forms now, and
that no one is proposing to change that, and I'm pretending not
to notice the question of how astonishing such behavior could be.
(Not to mention, how would they know whether they are returning
a value that's destined to go across the client encoding, rather
than to be used in a purely server-side expression? Maybe distinct
versions of those functions could take an encoding argument, and
produce the U& forms when the content won't go in the specified
encoding. That would avoid astonishing changes to existing functions.)
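
A rough sketch of what such an encoding-aware variant might do, written
in Python for brevity. The name quote_literal_for is hypothetical and
this is not how the existing quote_literal behaves; the plain form
assumes standard_conforming_strings is on, so backslashes need no
special treatment there. The U&'...' form uses the default backslash
escape character, with \XXXX for four-digit and \+XXXXXX for six-digit
code points.

```python
def quote_literal_for(text: str, encoding: str) -> str:
    """Hypothetical variant of quote_literal taking a target encoding.

    Returns an ordinary '...' literal when every character fits in the
    given encoding, and a U&'...' literal otherwise.
    """

    def representable(ch: str) -> bool:
        try:
            ch.encode(encoding)
            return True
        except UnicodeEncodeError:
            return False

    if all(representable(ch) for ch in text):
        # Plain literal: only single quotes need doubling
        # (assumes standard_conforming_strings = on).
        return "'" + text.replace("'", "''") + "'"

    parts = []
    for ch in text:
        if ch == "'":
            parts.append("''")
        elif ch == "\\":
            # Backslash is the escape character in U& literals, so double it.
            parts.append("\\\\")
        elif representable(ch):
            parts.append(ch)
        else:
            cp = ord(ch)
            parts.append(f"\\{cp:04X}" if cp <= 0xFFFF else f"\\+{cp:06X}")
    return "U&'" + "".join(parts) + "'"
```

Because the caller names the encoding explicitly, the result does not
depend on where the value is headed, and the existing functions keep
their current behavior.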

Regards,
-Chap


