Re: The "char" type versus non-ASCII characters - Mailing list pgsql-hackers

From Tom Lane
Subject Re: The "char" type versus non-ASCII characters
Msg-id 2849759.1638723714@sss.pgh.pa.us
In response to Re: The "char" type versus non-ASCII characters (Chapman Flack <chap@anastigmatix.net>)
List pgsql-hackers
Chapman Flack <chap@anastigmatix.net> writes:
> On 12/04/21 11:34, Tom Lane wrote:
>> So I'm visualizing it as a uint8 that we happen to like to store
>> ASCII codes in, and that's what prompts the thought of using a
>> numeric representation for non-ASCII values.

> I'm in substantial agreement, though I also see that it is nearly always
> set from a quoted literal, and tested against a quoted literal, and calls
> itself "char", so I guess I am thinking for consistency's sake it might
> be better not to invent some all-new convention for its text representation,
> but adopt something that's already familiar, like bytea escaped format.
> So it would always look and act like a one-octet bytea.

Hmm.  I don't have any great objection to that ... except that
I observe that bytea rejects a bare backslash:

regression=# select '\'::bytea;
ERROR:  invalid input syntax for type bytea

which would be incompatible with "char"'s existing behavior.  But as
long as we don't do that, I'd be okay with having high-bit-set char
values map to backslash-followed-by-three-octal-digits, which is
what bytea escape format would produce.

> Maybe have charin
> accept either bytea-escaped or bytea-hex form too.

That seems like more complexity than is warranted, although I suppose
that allowing easy interchange between char and bytea is worth
something.

One other point in this area is that charin does not currently object
to multiple input characters; it just discards the extras:

regression=# select 'foo'::"char";
 char 
------
 f
(1 row)

I think that was justified by analogy to

regression=# select 'foo'::char(1);
 bpchar 
--------
 f
(1 row)

but I think it would be a bad idea to preserve it once we introduce
any sort of mapping, because it'd mask mistakes.  So I'm envisioning
that charin should accept any single-byte string (including non-ASCII,
for backwards compatibility), but for multi-byte input throw an error
if it doesn't look like whatever numeric-ish mapping we settle on.

>> Yup, cstring is definitely presumed to be in the server's encoding.

> Without proposing to change it, I observe that by defining both cstring
> and unknown in this way (with the latter being expressly the type of
> any literal from the client destined for a type we don't know yet), we're
> a bit painted into the corner as far as supporting types like NCHAR.

Yeah, I'm not sure what to do about that.  We convert the query text
to server encoding before ever attempting to parse it, and I don't
think I want to contemplate trying to postpone that (... especially
not if the client encoding is an unsafe one like SJIS, as you
probably could not avoid SQL-injection hazards).  So an in-line
literal in some other encoding is basically impossible, or at least
pointless.  I'm inclined to think that NCHAR is another one in a
rather long list of not-that-well-thought-out SQL features.

            regards, tom lane


