Re: The "char" type versus non-ASCII characters - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: The "char" type versus non-ASCII characters
Date:
Msg-id: 2849759.1638723714@sss.pgh.pa.us
In response to: Re: The "char" type versus non-ASCII characters (Chapman Flack <chap@anastigmatix.net>)
Responses: Re: The "char" type versus non-ASCII characters
List: pgsql-hackers
Chapman Flack <chap@anastigmatix.net> writes:
> On 12/04/21 11:34, Tom Lane wrote:
>> So I'm visualizing it as a uint8 that we happen to like to store
>> ASCII codes in, and that's what prompts the thought of using a
>> numeric representation for non-ASCII values.

> I'm in substantial agreement, though I also see that it is nearly always
> set from a quoted literal, and tested against a quoted literal, and calls
> itself "char", so I guess I am thinking for consistency's sake it might
> be better not to invent some all-new convention for its text representation,
> but adopt something that's already familiar, like bytea escaped format.
> So it would always look and act like a one-octet bytea.

Hmm. I don't have any great objection to that ... except that I observe
that bytea rejects a bare backslash:

regression=# select '\'::bytea;
ERROR:  invalid input syntax for type bytea

which would be incompatible with "char"'s existing behavior. But as long
as we don't do that, I'd be okay with having high-bit-set char values map
to backslash-followed-by-three-octal-digits, which is what bytea escape
format would produce.

> Maybe have charin
> accept either bytea-escaped or bytea-hex form too.

That seems like more complexity than is warranted, although I suppose
that allowing easy interchange between char and bytea is worth something.

One other point in this area is that charin does not currently object
to multiple input characters, it just discards the extra:

regression=# select 'foo'::"char";
 char
------
 f
(1 row)

I think that was justified by analogy to

regression=# select 'foo'::char(1);
 bpchar
--------
 f
(1 row)

but I think it would be a bad idea to preserve it once we introduce any
sort of mapping, because it'd mask mistakes. So I'm envisioning that
charin should accept any single-byte string (including non-ASCII, for
backwards compatibility), but for multi-byte input throw an error if it
doesn't look like whatever numeric-ish mapping we settle on.
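[As a reader's aid: the mapping being proposed above can be sketched in
Python. This is not PostgreSQL code; the function names char_out/char_in
are hypothetical stand-ins for the C I/O functions, and the rules are an
assumption based on the message: high-bit-set bytes render as
backslash-plus-three-octal-digits (bytea escape style), a bare backslash
stays accepted on input and output (unlike bytea), single-byte input is
accepted as-is, and multi-byte input must be the octal form or it is an
error.]

```python
def char_out(b: int) -> str:
    # High-bit-set bytes become \ooo (three octal digits), as in bytea
    # escape format; everything else, including a bare backslash, is
    # emitted literally (bytea would reject/double the backslash).
    if b >= 0x80:
        return '\\%03o' % b
    return chr(b)

def char_in(data: bytes) -> int:
    # Any single byte is accepted as-is (including non-ASCII bytes,
    # for backwards compatibility with existing "char" behavior).
    if len(data) == 1:
        return data[0]
    # Multi-byte input must look like \ooo; anything else is an error,
    # rather than silently discarding the extra bytes as today.
    if (len(data) == 4 and data[:1] == b'\\'
            and all(0x30 <= c <= 0x37 for c in data[1:])):
        return int(data[1:].decode('ascii'), 8)
    raise ValueError('invalid input for type "char": %r' % (data,))
```

Round-tripping a high-bit-set value: char_out(0x9F) gives '\237', and
char_in(b'\\237') gives 0x9F back, while char_in(b'foo') now raises
instead of returning ord('f').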
>> Yup, cstring is definitely presumed to be in the server's encoding.

> Without proposing to change it, I observe that by defining both cstring
> and unknown in this way (with the latter being expressly the type of
> any literal from the client destined for a type we don't know yet), we're
> a bit painted into the corner as far as supporting types like NCHAR.

Yeah, I'm not sure what to do about that. We convert the query text to
server encoding before ever attempting to parse it, and I don't think
I want to contemplate trying to postpone that (... especially not if
the client encoding is an unsafe one like SJIS, as you probably could
not avoid SQL-injection hazards). So an in-line literal in some other
encoding is basically impossible, or at least pointless. I'm inclined
to think that NCHAR is another one in a rather long list of
not-that-well-thought-out SQL features.

			regards, tom lane
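[Reader's note: the "unsafe encoding" aside refers to a well-known
property of Shift-JIS that this short Python check illustrates. Some
SJIS characters have 0x5C, the ASCII backslash, as their second byte, so
a parser that scans raw client bytes before converting to the server
encoding can be tricked about where a quoted literal ends. This is an
illustration of the hazard, not PostgreSQL code.]

```python
# U+8868 (a common kanji) encodes in Shift-JIS with a trailing 0x5C
# byte -- indistinguishable, at the byte level, from a backslash.
sjis = '\u8868'.encode('shift_jis')
assert sjis[1] == 0x5C  # second byte is ASCII backslash

# A naive byte-level scanner would treat that 0x5C as an escape
# character, misjudging quote boundaries -- hence the injection risk
# if parsing happened before conversion to the server encoding.
```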