Thread: The "char" type versus non-ASCII characters
[ breaking off a different new thread ] Chapman Flack <chap@anastigmatix.net> writes: > Then there's "char". It's category S, but does not apply the server > encoding. You could call it an 8-bit int type, but it's typically used > as a character, making it well-defined for ASCII values and not so > for others, just like SQL_ASCII encoding. You could as well say that > the "char" type has a defined encoding of SQL_ASCII at all times, > regardless of the database encoding. This reminds me of something I've been intending to bring up, which is that the "char" type is not very encoding-safe. charout() for example just regurgitates the single byte as-is. I think we deemed that okay the last time anyone thought about it, but that was when single-byte encodings were the mainstream usage for non-ASCII data. If you're using UTF8 or another multi-byte server encoding, it's quite easy to get an invalidly-encoded string this way, which at minimum is going to break dump/restore scenarios. I can think of at least three ways we might address this: * Forbid all non-ASCII values for type "char". This results in simple and portable semantics, but it might break usages that work okay today. * Allow such values only in single-byte server encodings. This is a bit messy, but it wouldn't break any cases that are not problematic already. * Continue to allow non-ASCII values, but change charin/charout, char_text, etc so that the external representation is encoding-safe (perhaps make it an octal or decimal number). Either of the first two ways would have to contemplate what to do with disallowed values that snuck into the DB via pg_upgrade. That leads me to think that the third way might be the most preferable, even though it's not terribly backward-compatible. There's a nearby issue that we might do something about at the same time, which is that chartoi4() and i4tochar() think that the byte value of a "char" is signed, while all the other operations treat it as unsigned. I wouldn't be too surprised if this behavior is the direct cause of the bug fixed in a6bd28beb. The issue vanishes if we forbid non-ASCII values, but otherwise I'd be inclined to change these functions to treat the byte values as unsigned. Thoughts? regards, tom lane
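For concreteness, here is a sketch of the failure mode described above, in a database assumed to have been created with a UTF8 server encoding (statements only, since the interesting output is precisely the invalidly-encoded byte):

-- 'é' is the two bytes 0xC3 0xA9 in UTF8; charin() silently keeps only the first byte
SELECT 'é'::"char";
-- casting back to text yields a one-byte string that is not valid UTF8,
-- which is what breaks dump/restore
SELECT 'é'::"char"::text;
-- the signedness wart: chartoi4() treats the byte as signed, so on platforms
-- where C "char" is signed this comes out negative rather than 195
SELECT 'é'::"char"::int;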
On 12/3/21 14:12, Tom Lane wrote: > [ breaking off a different new thread ] > > Chapman Flack <chap@anastigmatix.net> writes: >> Then there's "char". It's category S, but does not apply the server >> encoding. You could call it an 8-bit int type, but it's typically used >> as a character, making it well-defined for ASCII values and not so >> for others, just like SQL_ASCII encoding. You could as well say that >> the "char" type has a defined encoding of SQL_ASCII at all times, >> regardless of the database encoding. > This reminds me of something I've been intending to bring up, which > is that the "char" type is not very encoding-safe. charout() for > example just regurgitates the single byte as-is. I think we deemed > that okay the last time anyone thought about it, but that was when > single-byte encodings were the mainstream usage for non-ASCII data. > If you're using UTF8 or another multi-byte server encoding, it's > quite easy to get an invalidly-encoded string this way, which at > minimum is going to break dump/restore scenarios. > > I can think of at least three ways we might address this: > > * Forbid all non-ASCII values for type "char". This results in > simple and portable semantics, but it might break usages that > work okay today. > > * Allow such values only in single-byte server encodings. This > is a bit messy, but it wouldn't break any cases that are not > problematic already. > > * Continue to allow non-ASCII values, but change charin/charout, > char_text, etc so that the external representation is encoding-safe > (perhaps make it an octal or decimal number). > > Either of the first two ways would have to contemplate what to do > with disallowed values that snuck into the DB via pg_upgrade. > That leads me to think that the third way might be the most > preferable, even though it's not terribly backward-compatible. > I don't like #2. Is #3 going to change the external representation only for non-ASCII values? If so, that seems OK. Changing it for ASCII values seems ugly. #1 is the simplest to implement and to understand, and I suspect it would break very little in practice, but others might disagree with that assessment. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 12/3/21 14:12, Tom Lane wrote: >> I can think of at least three ways we might address this: >> >> * Forbid all non-ASCII values for type "char". This results in >> simple and portable semantics, but it might break usages that >> work okay today. >> >> * Allow such values only in single-byte server encodings. This >> is a bit messy, but it wouldn't break any cases that are not >> problematic already. >> >> * Continue to allow non-ASCII values, but change charin/charout, >> char_text, etc so that the external representation is encoding-safe >> (perhaps make it an octal or decimal number). > I don't like #2. Yeah, it's definitely messy --- for example, maybe é works in a latin1 database but is rejected when you try to restore into a DB with utf8 encoding. > Is #3 going to change the external representation only > for non-ASCII values? If so, that seems OK. Right, I envisioned that ASCII behaves the same but we'd use a numeric representation for high-bit-set values. These cases could be told apart fairly easily by charin(), since the numeric representation would always be three digits. > #1 is the simplest to implement and to understand, > and I suspect it would break very little in practice, but others might > disagree with that assessment. We'd still have to decide what to do with pg_upgrade'd non-ASCII values, so there's messiness there too. Having charout() throw an error seems not very nice. regards, tom lane
On 12/3/21 14:42, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 12/3/21 14:12, Tom Lane wrote: >>> I can think of at least three ways we might address this: >>> >>> * Forbid all non-ASCII values for type "char". This results in >>> simple and portable semantics, but it might break usages that >>> work okay today. >>> >>> * Allow such values only in single-byte server encodings. This >>> is a bit messy, but it wouldn't break any cases that are not >>> problematic already. >>> >>> * Continue to allow non-ASCII values, but change charin/charout, >>> char_text, etc so that the external representation is encoding-safe >>> (perhaps make it an octal or decimal number). >> Is #3 going to change the external representation only >> for non-ASCII values? If so, that seems OK. > Right, I envisioned that ASCII behaves the same but we'd use > a numeric representation for high-bit-set values. These > cases could be told apart fairly easily by charin(), since > the numeric representation would always be three digits. OK, this seems the most attractive. Can we also allow 2 hex digits? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 12/3/21 14:42, Tom Lane wrote: >> Right, I envisioned that ASCII behaves the same but we'd use >> a numeric representation for high-bit-set values. These >> cases could be told apart fairly easily by charin(), since >> the numeric representation would always be three digits. > OK, this seems the most attractive. Can we also allow 2 hex digits? I think we should pick one base and stick to it. I don't mind hex if you have a preference for that. regards, tom lane
On 12/03/21 14:12, Tom Lane wrote: > This reminds me of something I've been intending to bring up, which > is that the "char" type is not very encoding-safe. charout() for > example just regurgitates the single byte as-is. I wonder if maybe what to do about that lies downstream of some other thought about encoding-related type properties. ISTM we don't, at present, have a clear story for types that have an encoding (or repertoire) property that isn't one of (inapplicable, server_encoding). And yet such things exist, and more such things could or should exist (NCHAR, healthier versions of xml or json, ...). "char" is an existing example, because its current behavior is exactly as if it declared "I am one byte of SQL_ASCII regardless of server setting". Which is no trouble at all when the server setting is also SQL_ASCII. But what does it mean when the server setting and the inherent repertoire property of a type can be different? The present answer isn't pretty. When can charout() be called? typoutput functions don't have any 'internal' parameters, so nothing stops user code from calling them; I don't know how often that's done, and that's a complication. The canonical place for it to be called is inside printtup(), when the client driver has requested format 0 for that attribute. Up to that point, we could have known it was a type with SQL_ASCII wired in, but after charout() we have a cstring, and printtup treats that type as having the server encoding, and it goes through encoding conversion from that to the client encoding in pq_sendcountedtext. Indeed, cstring behaves completely as if it is a type with the server encoding. If you send a cstring with format 1 rather than format 0, while it is no longer subject to the encoding conversion done in pq_sendcountedtext, it will dutifully perform the same conversion in its own cstring_send. unknownsend is the same way. But of course a "char" column in format 1 would never go through cstring; char_send would be called, and just plop the byte in the buffer unchanged (which is the same operation as an encoding conversion from SQL_ASCII to anything). Ever since I figured out I have to look at the send/recv functions for a type to find out if it is encoding-dependent, I have to walk myself through those steps again every time I forget why that is. Having the type's character-encoding details show up in its send/recv functions and not in its in/out functions never stops being counterintuitive to me. But for server-encoding-dependent types, that's how it is: you don't see it in the typoutput function, because on the format-0 path, the transcoding happens in pq_sendcountedtext. But on the format-1 path, the same transcoding happens, this time under the type's own control in its typsend function. That was the second thing that surprised me: we have what we call a text and a binary path, but for an encoding-dependent type, neither one is a path where transcoding doesn't happen! The difference is, the format-0 transcoding is applied blindly, in pq_sendcountedtext, with no surviving information about the data type (which has become cstring by that point). In contrast, on the format-1 path, the type's typsend is in control. In theory, that would allow type-aware conversion; a smarter xml_send could use &#n; form for characters that won't go in the client encoding, while the blind pq transcoding on format 0 would just botch the data.
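As a small concrete illustration of the point above that nothing stops user code from calling a typoutput function directly, charout() is an ordinary SQL-callable function:

-- invoked from user SQL rather than from printtup(); the result is a cstring
SELECT charout('A'::"char");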
XML, in an ideal world, might live on disk in a form that cares nothing for the server encoding, and be sent directly over the wire to a client (it declares what encoding it's in) and presented to the application over an XML-aware API that isn't hamstrung by the client's default text encoding either. But in the present world, we have somehow arrived at a setup where there are only two paths that can take, and either one is a funnel that can only be passed by data that survives both the client and the server encoding. The FE/BE docs have said "Text has format code zero, binary has format code one, and all other format codes are reserved for future definition" ever since 7.4. Maybe the time will come for a format 2, where you say "here's an encoding ID and some bytes"? This rambled on a bit far afield from "what should charout do with non-ASCII values?". But honestly, either nobody is storing non-ASCII values in "char", and we could make any choice there and nothing would break, or somebody is doing that and their stuff would be broken by any choice of change. So, is the current "char" situation so urgent that it demands some one-off solution be chosen for it, or could it be neglected with minimal risk until someday we've defined what "this datatype has encoding X that's different from the server encoding" means, and that takes care of it? Regards, -Chap
Chapman Flack <chap@anastigmatix.net> writes: > On 12/03/21 14:12, Tom Lane wrote: >> This reminds me of something I've been intending to bring up, which >> is that the "char" type is not very encoding-safe. charout() for >> example just regurgitates the single byte as-is. > I wonder if maybe what to do about that lies downstream of some other > thought about encoding-related type properties. As you mentioned upthread, it's probably wrong to think of "char" as character data at all. The catalogs use it as a poor man's enum type, and it's just for convenience that we assign readable ASCII codes for the enum values of a given column. The only reason to think of it as encoding-dependent would be if you have ambitions to store a non-ASCII character in a "char". But I think that's something we want to strongly discourage, even if we don't prohibit it altogether. The whole point of the type is to be one byte, so only in legacy encodings can it possibly represent a non-ASCII character. So I'm visualizing it as a uint8 that we happen to like to store ASCII codes in, and that's what prompts the thought of using a numeric representation for non-ASCII values. I think you're just in for pain if you want to consider such values as character data rather than numbers. > ... "char" is an existing > example, because its current behavior is exactly as if it declared > "I am one byte of SQL_ASCII regardless of server setting". But it's not quite that. If we treated it as SQL_ASCII, we'd refuse to convert it to some other encoding unless the value passes encoding verification, which is exactly what charout() is not doing. > Indeed, cstring behaves completely as if it is a type with the server > encoding. Yup, cstring is definitely presumed to be in the server's encoding. > So, is the current "char" situation so urgent that it demands some > one-off solution be chosen for it, or could it be neglected with minimal > risk until someday we've defined what "this datatype has encoding X that's > different from the server encoding" means, and that takes care of it? I'm not willing to leave it broken in the rather faint hope that someday there will be a more general solution, especially since I don't buy the premise that "char" ought to participate in any such solution. regards, tom lane
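An example of that poor-man's-enum usage in the catalogs: pg_class.relkind (like several other catalog columns) is of type "char" and holds one-letter codes:

-- 'r' = ordinary table, 'i' = index, 'v' = view, and so on
SELECT relname, relkind FROM pg_class WHERE relkind = 'v' LIMIT 3;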
On 12/04/21 11:34, Tom Lane wrote: > Chapman Flack <chap@anastigmatix.net> writes: >> "I am one byte of SQL_ASCII regardless of server setting". > > But it's not quite that. If we treated it as SQL_ASCII, we'd refuse > to convert it to some other encoding unless the value passes encoding > verification, which is exactly what charout() is not doing. Ah, good point. I remembered noticing pg_do_encoding_conversion returning the src pointer unchanged when SQL_ASCII is involved, but see that it does verify the dest_encoding when SQL_ASCII is the source. > encoding-dependent would be if you have ambitions to store a non-ASCII > character in a "char". But I think that's something we want to > strongly discourage, even if we don't prohibit it altogether. ... > So I'm visualizing it as a uint8 that we happen to like to store > ASCII codes in, and that's what prompts the thought of using a > numeric representation for non-ASCII values. I'm in substantial agreement, though I also see that it is nearly always set from a quoted literal, and tested against a quoted literal, and calls itself "char", so I guess I am thinking for consistency's sake it might be better not to invent some all-new convention for its text representation, but adopt something that's already familiar, like bytea escaped format. So it would always look and act like a one-octet bytea. Maybe have charin accept either bytea-escaped or bytea-hex form too. (Or, never mind; when restricted to one octet, bytea-hex and the \xhh bytea-escape form are indistinguishable anyway.) Then for free we get the property that if somebody today uses 'ű' as an enum value, it might start appearing as '\xfb' now in dumps, etc., but their existing CASE WHEN thing = 'ű' code doesn't stop working (as long as they haven't done something silly like change the encoding), and they have the flexibility to update it to WHEN thing = '\xfb' as time permits if they choose. If they don't, they accept the risk that by switching to another encoding in the future, they may either see their existing tests stop matching, or their existing literals fail to parse, but there won't be invalidly-encoded strings created. > Yup, cstring is definitely presumed to be in the server's encoding. Without proposing to change it, I observe that by defining both cstring and unknown in this way (with the latter being expressly the type of any literal from the client destined for a type we don't know yet), we're a bit painted into the corner as far as supporting types like NCHAR. (I suppose clients could be banned from sending such values as literals, and required to use extended form and bind them with a binary message.) It's analogous to the way format-0 and format-1 both act as filters that no encoding-dependent data can squish through without surviving both the client and the server encoding, even if it is of a type that's defined to be independent of either. Regards, -Chap
Chapman Flack <chap@anastigmatix.net> writes: > On 12/04/21 11:34, Tom Lane wrote: >> So I'm visualizing it as a uint8 that we happen to like to store >> ASCII codes in, and that's what prompts the thought of using a >> numeric representation for non-ASCII values. > I'm in substantial agreement, though I also see that it is nearly always > set from a quoted literal, and tested against a quoted literal, and calls > itself "char", so I guess I am thinking for consistency's sake it might > be better not to invent some all-new convention for its text representation, > but adopt something that's already familiar, like bytea escaped format. > So it would always look and act like a one-octet bytea. Hmm. I don't have any great objection to that ... except that I observe that bytea rejects a bare backslash: regression=# select '\'::bytea; ERROR: invalid input syntax for type bytea which would be incompatible with "char"'s existing behavior. But as long as we don't do that, I'd be okay with having high-bit-set char values map to backslash-followed-by-three-octal-digits, which is what bytea escape format would produce. > Maybe have charin > accept either bytea-escaped or bytea-hex form too. That seems like more complexity than is warranted, although I suppose that allowing easy interchange between char and bytea is worth something. One other point in this area is that charin does not currently object to multiple input characters, it just discards the extra: regression=# select 'foo'::"char"; char ------ f (1 row) I think that was justified by analogy to regression=# select 'foo'::char(1); bpchar -------- f (1 row) but I think it would be a bad idea to preserve it once we introduce any sort of mapping, because it'd mask mistakes. So I'm envisioning that charin should accept any single-byte string (including non-ASCII, for backwards compatibility), but for multi-byte input throw an error if it doesn't look like whatever numeric-ish mapping we settle on. >> Yup, cstring is definitely presumed to be in the server's encoding. > Without proposing to change it, I observe that by defining both cstring > and unknown in this way (with the latter being expressly the type of > any literal from the client destined for a type we don't know yet), we're > a bit painted into the corner as far as supporting types like NCHAR. Yeah, I'm not sure what to do about that. We convert the query text to server encoding before ever attempting to parse it, and I don't think I want to contemplate trying to postpone that (... especially not if the client encoding is an unsafe one like SJIS, as you probably could not avoid SQL-injection hazards). So an in-line literal in some other encoding is basically impossible, or at least pointless. I'm inclined to think that NCHAR is another one in a rather long list of not-that-well-thought-out SQL features. regards, tom lane
On 12/05/21 12:01, Tom Lane wrote: > regression=# select '\'::bytea; > ERROR: invalid input syntax for type bytea > > which would be incompatible with "char"'s existing behavior. But as > long as we don't do that, I'd be okay with having high-bit-set char > values map to backslash-followed-by-three-octal-digits, which is > what bytea escape format would produce. Is that a proposal to change nothing about the current treatment of values < 128, or just to avoid rejecting bare '\'? It seems defensible to relax the error treatment of bare backslash, as it isn't inherently ambiguous; it functions more as an "are you sure you weren't starting to write an escape sequence here?" check. If it's a backslash with nothing after it and you assume the user wrote what they meant, then it's not hard to tell what they meant. If there's a way to factor out and reuse the good parts of byteain, that would mean '\\' would also be accepted to mean a backslash, and the \r \n \t usual escapes would be accepted too, and \ooo and \xhh. >> Maybe have charin >> accept either bytea-escaped or bytea-hex form too. > > That seems like more complexity than is warranted I think it ends up being no more complexity at all, because a single octet in bytea-hex form looks like \xhh, which is exactly what a single \xhh in bytea-escape form looks like. I suppose it's important to consider what comparisons like c = '\' and c = '\\' mean, which should be just fine when the type analysis produces char = char or char = unknown, but could be surprising if it doesn't. Regards, -Chap
Chapman Flack <chap@anastigmatix.net> writes: > On 12/05/21 12:01, Tom Lane wrote: >> regression=# select '\'::bytea; >> ERROR: invalid input syntax for type bytea >> >> which would be incompatible with "char"'s existing behavior. But as >> long as we don't do that, I'd be okay with having high-bit-set char >> values map to backslash-followed-by-three-octal-digits, which is >> what bytea escape format would produce. > Is that a proposal to change nothing about the current treatment > of values < 128, or just to avoid rejecting bare '\'? I intended to change nothing about charin's treatment of ASCII characters, nor anything about bytea's behavior. I don't think we should relax the error checks in the latter. That does mean that backslash becomes a problem for the idea of transparent conversion from char to bytea or vice versa. We could think about emitting backslash as '\\' in charout, I suppose. I'm not really convinced though that bytea compatibility is worth changing a case that's non-problematic today. > If there's a way to factor out and reuse the good parts of byteain, > that would mean '\\' would also be accepted to mean a backslash, > and the \r \n \t usual escapes would be accepted too, and \ooo and > \xhh. Uh, what? regression=# select '\n'::bytea; ERROR: invalid input syntax for type bytea But I doubt that sharing code here would be worth the trouble. The vast majority of byteain is concerned with managing the string length, which is a nonissue for charin. > I think it ends up being no more complexity at all, because a single > octet in bytea-hex form looks like \xhh, which is exactly what > a single \xhh in bytea-escape form looks like. I'm confused by this statement too. AFAIK the alternatives in bytea are \xhh or \ooo: regression=# select '\xEE'::bytea; bytea ------- \xee (1 row) regression=# set bytea_output to escape; SET regression=# select '\xEE'::bytea; bytea ------- \356 (1 row) regards, tom lane
On 12/05/21 14:51, Tom Lane wrote: > Uh, what? > > regression=# select '\n'::bytea; > ERROR: invalid input syntax for type bytea Wow, I was completely out to lunch there somehow. Sorry. None of those other escaped forms are known to byteain, other than '\\' and '''' according to table 8.7. I can't even explain why I thought that. > I'm confused by this statement too. AFAIK the alternatives in > bytea are \xhh or \ooo: Here I think I can at least tell where I went wrong; I saw both an octal and a hex column in table 8.7, which I saw located under the "bytea escape format" heading, and without testing carefully enough, I assumed it was telling me that either format would be recognized on input, which would certainly be possible, but clearly I was carrying over too many assumptions from other escape formats where I'm used to that being the case. If I wanted to prevent another reader making my exact mistake, I might re-title those two table columns to be "In bytea escape format" and "In bytea hex format" to make it more clear the table is combining information for both formats. I'm sure I did test SELECT '\x41'::bytea, but that proved nothing, being simply interpreted as the hex input format. I should have tried SELECT 'A\x41'::bytea, and would have immediately seen it rejected. I've just looked at datatypes.sgml, where I was expecting to see that table 8.7 actually falls outside of the sect2 for "bytea escape format", and that I had simply misinterpreted it because the structural nesting isn't obvious in the rendered HTML. But what I found was that the table actually /is/ nested inside the "bytea escape format" section, and in the generated HTML it is within the div for that section, and the table's own div has the ID DATATYPE-BINARY-SQLESC. The change history there appears complex. The table already existed at the time of a2a8c7a, which made a "bytea escape format" sect2 out of the existing text that included the table, and added a separate "bytea hex format" sect2. But the table at that point showed only the input and output representations for the escape format, didn't say anything about hex format, and wasn't touched in that commit. Nine years later, f77de4b changed the values in the rightmost column to hex form, but only because that was then the "output representation" column and the default output format had been changed to hex. Five months after that, f10a20e changed the heading of that column from "output representation" to "hex representation", probably because the values in that column by then were hex. So it ended up as a table that is structurally part of the "bytea escape format" section, whose rightmost column shows a hex format, and therefore (ahem) could suggest to a reader (who doesn't rush to psql and test it thoroughly) that a hex format is accepted there. Still, I could have avoided that if I had better tested my reading first. Regards, -Chap
On 03.12.21 21:13, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 12/3/21 14:42, Tom Lane wrote: >>> Right, I envisioned that ASCII behaves the same but we'd use >>> a numeric representation for high-bit-set values. These >>> cases could be told apart fairly easily by charin(), since >>> the numeric representation would always be three digits. > >> OK, this seems the most attractive. Can we also allow 2 hex digits? > > I think we should pick one base and stick to it. I don't mind > hex if you have a preference for that. I think we could consider char to be a single-byte bytea and use the escape format of bytea for char. That way there is some precedent and we don't add yet another encoding or escape format.
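For reference, a quick sketch of how bytea's escape format treats the two cases at issue here, a high-bit-set byte and a backslash:

SET bytea_output = escape;
SELECT '\xc3'::bytea;   -- a high-bit-set byte is displayed as three octal digits: \303
SELECT '\x5c'::bytea;   -- a backslash byte is displayed doubled: \\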
Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > I think we could consider char to be a single-byte bytea and use the > escape format of bytea for char. That way there is some precedent and > we don't add yet another encoding or escape format. Do you want to take that as far as changing backslash to print as '\\' ? regards, tom lane
I wrote: > Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: >> I think we could consider char to be a single-byte bytea and use the >> escape format of bytea for char. That way there is some precedent and >> we don't add yet another encoding or escape format. > Do you want to take that as far as changing backslash to print > as '\\' ? This came up again today [1], so here's a concrete proposal. Let's use \ooo for high-bit-set chars, but keep backslash as just backslash (so it's only semi-compatible with bytea). regards, tom lane [1] https://www.postgresql.org/message-id/CAFM5RapGbBQm%2BdH%3D7K80HcvBvEWiV5Tm7N%3DNRaYURfm98YWc8A%40mail.gmail.com diff --git a/src/backend/utils/adt/char.c b/src/backend/utils/adt/char.c index 0df41c2253..e50293bf14 100644 --- a/src/backend/utils/adt/char.c +++ b/src/backend/utils/adt/char.c @@ -20,6 +20,11 @@ #include "libpq/pqformat.h" #include "utils/builtins.h" +#define ISOCTAL(c) (((c) >= '0') && ((c) <= '7')) +#define TOOCTAL(c) ((c) + '0') +#define FROMOCTAL(c) ((unsigned char) (c) - '0') + + /***************************************************************************** * USER I/O ROUTINES * *****************************************************************************/ @@ -27,31 +32,53 @@ /* * charin - converts "x" to 'x' * - * Note that an empty input string will implicitly be converted to \0. + * This accepts the formats charout produces. If we have multibyte input + * that is not in the form '\ooo', then we take its first byte as the value + * and silently discard the rest; this is a backwards-compatibility provision. */ Datum charin(PG_FUNCTION_ARGS) { char *ch = PG_GETARG_CSTRING(0); + if (strlen(ch) == 4 && ch[0] == '\\' && + ISOCTAL(ch[1]) && ISOCTAL(ch[2]) && ISOCTAL(ch[3])) + PG_RETURN_CHAR((FROMOCTAL(ch[1]) << 6) + + (FROMOCTAL(ch[2]) << 3) + + FROMOCTAL(ch[3])); + /* This will do the right thing for a zero-length input string */ PG_RETURN_CHAR(ch[0]); } /* * charout - converts 'x' to "x" * - * Note that if the char value is \0, the resulting string will appear - * to be empty (null-terminated after zero characters). So this is the - * inverse of the charin() function for such data. + * The possible output formats are: + * 1. 0x00 is represented as an empty string. + * 2. 0x01..0x7F are represented as a single ASCII byte. + * 3. 0x80..0xFF are represented as \ooo (backslash and 3 octal digits). + * Case 3 is meant to match the traditional "escape" format of bytea. */ Datum charout(PG_FUNCTION_ARGS) { char ch = PG_GETARG_CHAR(0); - char *result = (char *) palloc(2); + char *result = (char *) palloc(5); - result[0] = ch; - result[1] = '\0'; + if (IS_HIGHBIT_SET(ch)) + { + result[0] = '\\'; + result[1] = TOOCTAL(((unsigned char) ch) >> 6); + result[2] = TOOCTAL((((unsigned char) ch) >> 3) & 07); + result[3] = TOOCTAL(((unsigned char) ch) & 07); + result[4] = '\0'; + } + else + { + /* This produces acceptable results for 0x00 as well */ + result[0] = ch; + result[1] = '\0'; + } PG_RETURN_CSTRING(result); } @@ -176,15 +203,20 @@ Datum text_char(PG_FUNCTION_ARGS) { text *arg1 = PG_GETARG_TEXT_PP(0); + char *ch = VARDATA_ANY(arg1); char result; /* - * An empty input string is converted to \0 (for consistency with charin). - * If the input is longer than one character, the excess data is silently - * discarded. + * Conversion rules are the same as in charin(), but here we need to + * handle the empty-string case honestly. 
*/ - if (VARSIZE_ANY_EXHDR(arg1) > 0) - result = *(VARDATA_ANY(arg1)); + if (VARSIZE_ANY_EXHDR(arg1) == 4 && ch[0] == '\\' && + ISOCTAL(ch[1]) && ISOCTAL(ch[2]) && ISOCTAL(ch[3])) + result = (FROMOCTAL(ch[1]) << 6) + + (FROMOCTAL(ch[2]) << 3) + + FROMOCTAL(ch[3]); + else if (VARSIZE_ANY_EXHDR(arg1) > 0) + result = ch[0]; else result = '\0'; @@ -195,13 +227,21 @@ Datum char_text(PG_FUNCTION_ARGS) { char arg1 = PG_GETARG_CHAR(0); - text *result = palloc(VARHDRSZ + 1); + text *result = palloc(VARHDRSZ + 4); /* - * Convert \0 to an empty string, for consistency with charout (and - * because the text stuff doesn't like embedded nulls all that well). + * Conversion rules are the same as in charout(), but here we need to be + * honest about converting 0x00 to an empty string. */ - if (arg1 != '\0') + if (IS_HIGHBIT_SET(arg1)) + { + SET_VARSIZE(result, VARHDRSZ + 4); + (VARDATA(result))[0] = '\\'; + (VARDATA(result))[1] = TOOCTAL(((unsigned char) arg1) >> 6); + (VARDATA(result))[2] = TOOCTAL((((unsigned char) arg1) >> 3) & 07); + (VARDATA(result))[3] = TOOCTAL(((unsigned char) arg1) & 07); + } + else if (arg1 != '\0') { SET_VARSIZE(result, VARHDRSZ + 1); *(VARDATA(result)) = arg1;
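In sketch form, the user-visible behavior under this proposal would be as follows (these match the regression cases added in the fleshed-out patch downthread):

SELECT 'a'::"char";            -- ASCII values are unchanged: a
SELECT '\101'::"char";         -- backslash plus three octal digits is accepted on input: A
SELECT '\377'::"char";         -- a high-bit-set byte now round-trips as \377
SELECT '\377'::"char"::text;   -- char_text() produces the same escaped form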
In a message of Friday, 3 December 2021 22:12:10 MSK, Tom Lane wrote: > which > is that the "char" type is not very encoding-safe. charout() for > example just regurgitates the single byte as-is. I think we deemed > that okay the last time anyone thought about it, but that was when > single-byte encodings were the mainstream usage for non-ASCII data. > If you're using UTF8 or another multi-byte server encoding, it's > quite easy to get an invalidly-encoded string this way, which at > minimum is going to break dump/restore scenarios. As I've mentioned in another thread, I was very surprised when I first saw the "char" type name, and I was also very confused. This leads me to the idea that, as we fix "char" behaviour, maybe we should also change its name to something more self-explanatory, like ascii_char or something like that. Or better, add ascii_char with the behaviour we need, update the system tables to use it, and keep "char" with the old behaviour in "deprecated" status in case somebody is still using it, to give them time to change to something more decent: ascii_char or char(1). I've also talked to a guy who knows postgres history very well; he told me that "char" existed at least as far back as postgres version 3.1, which also had "char16", and in v.4 "char2", "char4" and "char8" were added. But later on they were all removed, and we have only "char". Also, "char" has nothing in common with the SQL standard. Actually it looks very unnatural. Maybe it is time to get rid of it too, if we are changing this part of the code... > I can think of at least three ways we might address this: > > * Forbid all non-ASCII values for type "char". This results in > simple and portable semantics, but it might break usages that > work okay today. > > * Allow such values only in single-byte server encodings. This > is a bit messy, but it wouldn't break any cases that are not > problematic already. > > * Continue to allow non-ASCII values, but change charin/charout, > char_text, etc so that the external representation is encoding-safe > (perhaps make it an octal or decimal number). This would give us a clean #1 for ascii_char, and deprecation and removal of "char" later on. -- Nikolay Shaplov aka Nataraj Fuzzing Engineer at Postgres Professional Matrix IM: @dhyan:nataraj.su
Nikolay Shaplov <dhyan@nataraj.su> writes: > This leads me to an idea that may be as we fix "char" behaviour, we should also > change it's name to something more speaking for itself. I don't think this is going to happen. It's especially not going to happen in the back branches. But in any case, what I'm looking for is the minimum compatibility breakage needed to fix the encoding-unsafety problem. Renaming the type goes far beyond that. It'd likely break some client code that examines the system catalogs, for little gain. regards, tom lane
I wrote: > This came up again today [1], so here's a concrete proposal. > Let's use \ooo for high-bit-set chars, but keep backslash as just > backslash (so it's only semi-compatible with bytea). Hearing no howls of protest, here's a fleshed out, potentially-committable version. I added some regression test coverage for the modified code. (I also fixed an astonishingly obsolete comment about what the regular char type does.) I looked at the SGML docs too, but I don't think there is anything to change there. The docs say "single-byte internal type" and are silent about "char" beyond that. I think that's exactly where we want to be: any more detail would encourage people to use the type, which we don't really want. Possibly we could change the text to "single-byte internal type, meant to hold ASCII characters" but I'm not sure that's better. The next question is what to do with this. I propose to commit it into HEAD and v15 before next week's beta3 release. If we don't get a lot of pushback, we could consider back-patching further for the November releases; but I'm hesitant to shove something like this into stable branches with only a week's notice. regards, tom lane diff --git a/src/backend/utils/adt/char.c b/src/backend/utils/adt/char.c index 0df41c2253..e50293bf14 100644 --- a/src/backend/utils/adt/char.c +++ b/src/backend/utils/adt/char.c @@ -20,6 +20,11 @@ #include "libpq/pqformat.h" #include "utils/builtins.h" +#define ISOCTAL(c) (((c) >= '0') && ((c) <= '7')) +#define TOOCTAL(c) ((c) + '0') +#define FROMOCTAL(c) ((unsigned char) (c) - '0') + + /***************************************************************************** * USER I/O ROUTINES * *****************************************************************************/ @@ -27,31 +32,53 @@ /* * charin - converts "x" to 'x' * - * Note that an empty input string will implicitly be converted to \0. + * This accepts the formats charout produces. If we have multibyte input + * that is not in the form '\ooo', then we take its first byte as the value + * and silently discard the rest; this is a backwards-compatibility provision. */ Datum charin(PG_FUNCTION_ARGS) { char *ch = PG_GETARG_CSTRING(0); + if (strlen(ch) == 4 && ch[0] == '\\' && + ISOCTAL(ch[1]) && ISOCTAL(ch[2]) && ISOCTAL(ch[3])) + PG_RETURN_CHAR((FROMOCTAL(ch[1]) << 6) + + (FROMOCTAL(ch[2]) << 3) + + FROMOCTAL(ch[3])); + /* This will do the right thing for a zero-length input string */ PG_RETURN_CHAR(ch[0]); } /* * charout - converts 'x' to "x" * - * Note that if the char value is \0, the resulting string will appear - * to be empty (null-terminated after zero characters). So this is the - * inverse of the charin() function for such data. + * The possible output formats are: + * 1. 0x00 is represented as an empty string. + * 2. 0x01..0x7F are represented as a single ASCII byte. + * 3. 0x80..0xFF are represented as \ooo (backslash and 3 octal digits). + * Case 3 is meant to match the traditional "escape" format of bytea. 
*/ Datum charout(PG_FUNCTION_ARGS) { char ch = PG_GETARG_CHAR(0); - char *result = (char *) palloc(2); + char *result = (char *) palloc(5); - result[0] = ch; - result[1] = '\0'; + if (IS_HIGHBIT_SET(ch)) + { + result[0] = '\\'; + result[1] = TOOCTAL(((unsigned char) ch) >> 6); + result[2] = TOOCTAL((((unsigned char) ch) >> 3) & 07); + result[3] = TOOCTAL(((unsigned char) ch) & 07); + result[4] = '\0'; + } + else + { + /* This produces acceptable results for 0x00 as well */ + result[0] = ch; + result[1] = '\0'; + } PG_RETURN_CSTRING(result); } @@ -176,15 +203,20 @@ Datum text_char(PG_FUNCTION_ARGS) { text *arg1 = PG_GETARG_TEXT_PP(0); + char *ch = VARDATA_ANY(arg1); char result; /* - * An empty input string is converted to \0 (for consistency with charin). - * If the input is longer than one character, the excess data is silently - * discarded. + * Conversion rules are the same as in charin(), but here we need to + * handle the empty-string case honestly. */ - if (VARSIZE_ANY_EXHDR(arg1) > 0) - result = *(VARDATA_ANY(arg1)); + if (VARSIZE_ANY_EXHDR(arg1) == 4 && ch[0] == '\\' && + ISOCTAL(ch[1]) && ISOCTAL(ch[2]) && ISOCTAL(ch[3])) + result = (FROMOCTAL(ch[1]) << 6) + + (FROMOCTAL(ch[2]) << 3) + + FROMOCTAL(ch[3]); + else if (VARSIZE_ANY_EXHDR(arg1) > 0) + result = ch[0]; else result = '\0'; @@ -195,13 +227,21 @@ Datum char_text(PG_FUNCTION_ARGS) { char arg1 = PG_GETARG_CHAR(0); - text *result = palloc(VARHDRSZ + 1); + text *result = palloc(VARHDRSZ + 4); /* - * Convert \0 to an empty string, for consistency with charout (and - * because the text stuff doesn't like embedded nulls all that well). + * Conversion rules are the same as in charout(), but here we need to be + * honest about converting 0x00 to an empty string. */ - if (arg1 != '\0') + if (IS_HIGHBIT_SET(arg1)) + { + SET_VARSIZE(result, VARHDRSZ + 4); + (VARDATA(result))[0] = '\\'; + (VARDATA(result))[1] = TOOCTAL(((unsigned char) arg1) >> 6); + (VARDATA(result))[2] = TOOCTAL((((unsigned char) arg1) >> 3) & 07); + (VARDATA(result))[3] = TOOCTAL(((unsigned char) arg1) & 07); + } + else if (arg1 != '\0') { SET_VARSIZE(result, VARHDRSZ + 1); *(VARDATA(result)) = arg1; diff --git a/src/test/regress/expected/char.out b/src/test/regress/expected/char.out index 2d78f90f3b..ea9b0b8eeb 100644 --- a/src/test/regress/expected/char.out +++ b/src/test/regress/expected/char.out @@ -1,8 +1,8 @@ -- -- CHAR -- --- fixed-length by value --- internally passed by value if <= 4 bytes in storage +-- Per SQL standard, CHAR means character(1), that is a varlena type +-- with a constraint restricting it to one character (not byte) SELECT char 'c' = char 'c' AS true; true ------ @@ -119,3 +119,62 @@ SELECT * FROM CHAR_TBL; abcd (4 rows) +-- +-- Also test "char", which is an ad-hoc one-byte type. It can only +-- really store ASCII characters, but we allow high-bit-set characters +-- to be accessed via bytea-like escapes. 
+-- +SELECT 'a'::"char"; + char +------ + a +(1 row) + +SELECT '\101'::"char"; + char +------ + A +(1 row) + +SELECT '\377'::"char"; + char +------ + \377 +(1 row) + +SELECT 'a'::"char"::text; + text +------ + a +(1 row) + +SELECT '\377'::"char"::text; + text +------ + \377 +(1 row) + +SELECT '\000'::"char"::text; + text +------ + +(1 row) + +SELECT 'a'::text::"char"; + char +------ + a +(1 row) + +SELECT '\377'::text::"char"; + char +------ + \377 +(1 row) + +SELECT ''::text::"char"; + char +------ + +(1 row) + diff --git a/src/test/regress/expected/char_1.out b/src/test/regress/expected/char_1.out index fa6644d692..ffd31551de 100644 --- a/src/test/regress/expected/char_1.out +++ b/src/test/regress/expected/char_1.out @@ -1,8 +1,8 @@ -- -- CHAR -- --- fixed-length by value --- internally passed by value if <= 4 bytes in storage +-- Per SQL standard, CHAR means character(1), that is a varlena type +-- with a constraint restricting it to one character (not byte) SELECT char 'c' = char 'c' AS true; true ------ @@ -119,3 +119,62 @@ SELECT * FROM CHAR_TBL; abcd (4 rows) +-- +-- Also test "char", which is an ad-hoc one-byte type. It can only +-- really store ASCII characters, but we allow high-bit-set characters +-- to be accessed via bytea-like escapes. +-- +SELECT 'a'::"char"; + char +------ + a +(1 row) + +SELECT '\101'::"char"; + char +------ + A +(1 row) + +SELECT '\377'::"char"; + char +------ + \377 +(1 row) + +SELECT 'a'::"char"::text; + text +------ + a +(1 row) + +SELECT '\377'::"char"::text; + text +------ + \377 +(1 row) + +SELECT '\000'::"char"::text; + text +------ + +(1 row) + +SELECT 'a'::text::"char"; + char +------ + a +(1 row) + +SELECT '\377'::text::"char"; + char +------ + \377 +(1 row) + +SELECT ''::text::"char"; + char +------ + +(1 row) + diff --git a/src/test/regress/expected/char_2.out b/src/test/regress/expected/char_2.out index 09434a44cd..56818f824b 100644 --- a/src/test/regress/expected/char_2.out +++ b/src/test/regress/expected/char_2.out @@ -1,8 +1,8 @@ -- -- CHAR -- --- fixed-length by value --- internally passed by value if <= 4 bytes in storage +-- Per SQL standard, CHAR means character(1), that is a varlena type +-- with a constraint restricting it to one character (not byte) SELECT char 'c' = char 'c' AS true; true ------ @@ -119,3 +119,62 @@ SELECT * FROM CHAR_TBL; abcd (4 rows) +-- +-- Also test "char", which is an ad-hoc one-byte type. It can only +-- really store ASCII characters, but we allow high-bit-set characters +-- to be accessed via bytea-like escapes. 
+-- +SELECT 'a'::"char"; + char +------ + a +(1 row) + +SELECT '\101'::"char"; + char +------ + A +(1 row) + +SELECT '\377'::"char"; + char +------ + \377 +(1 row) + +SELECT 'a'::"char"::text; + text +------ + a +(1 row) + +SELECT '\377'::"char"::text; + text +------ + \377 +(1 row) + +SELECT '\000'::"char"::text; + text +------ + +(1 row) + +SELECT 'a'::text::"char"; + char +------ + a +(1 row) + +SELECT '\377'::text::"char"; + char +------ + \377 +(1 row) + +SELECT ''::text::"char"; + char +------ + +(1 row) + diff --git a/src/test/regress/sql/char.sql b/src/test/regress/sql/char.sql index 9c83c45e34..120fed53e5 100644 --- a/src/test/regress/sql/char.sql +++ b/src/test/regress/sql/char.sql @@ -2,8 +2,8 @@ -- CHAR -- --- fixed-length by value --- internally passed by value if <= 4 bytes in storage +-- Per SQL standard, CHAR means character(1), that is a varlena type +-- with a constraint restricting it to one character (not byte) SELECT char 'c' = char 'c' AS true; @@ -71,3 +71,19 @@ DROP TABLE CHAR_TBL; INSERT INTO CHAR_TBL (f1) VALUES ('abcde'); SELECT * FROM CHAR_TBL; + +-- +-- Also test "char", which is an ad-hoc one-byte type. It can only +-- really store ASCII characters, but we allow high-bit-set characters +-- to be accessed via bytea-like escapes. +-- + +SELECT 'a'::"char"; +SELECT '\101'::"char"; +SELECT '\377'::"char"; +SELECT 'a'::"char"::text; +SELECT '\377'::"char"::text; +SELECT '\000'::"char"::text; +SELECT 'a'::text::"char"; +SELECT '\377'::text::"char"; +SELECT ''::text::"char";
On 2022-07-31 Su 18:25, Tom Lane wrote: > I wrote: >> This came up again today [1], so here's a concrete proposal. >> Let's use \ooo for high-bit-set chars, but keep backslash as just >> backslash (so it's only semi-compatible with bytea). > Hearing no howls of protest, here's a fleshed out, potentially-committable > version. I added some regression test coverage for the modified code. > (I also fixed an astonishingly obsolete comment about what the regular > char type does.) I looked at the SGML docs too, but I don't think there > is anything to change there. The docs say "single-byte internal type" > and are silent about "char" beyond that. I think that's exactly where > we want to be: any more detail would encourage people to use the type, > which we don't really want. Possibly we could change the text to > "single-byte internal type, meant to hold ASCII characters" but I'm > not sure that's better. > > The next question is what to do with this. I propose to commit it into > HEAD and v15 before next week's beta3 release. If we don't get a lot > of pushback, we could consider back-patching further for the November > releases; but I'm hesitant to shove something like this into stable > branches with only a week's notice. > > Maybe we should add some words to the docs explicitly discouraging its use in user tables. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2022-07-31 Su 18:25, Tom Lane wrote: >> ... I looked at the SGML docs too, but I don't think there >> is anything to change there. The docs say "single-byte internal type" >> and are silent about "char" beyond that. I think that's exactly where >> we want to be: any more detail would encourage people to use the type, >> which we don't really want. Possibly we could change the text to >> "single-byte internal type, meant to hold ASCII characters" but I'm >> not sure that's better. > Maybe we should add some words to the docs explicitly discouraging its > use in user tables. Hmm, I thought we already did --- but you're right, the intro para for Table 8.5 only explicitly discourages use of "name". We probably want similar wording for both types. Maybe like: "There are two other fixed-length character types in PostgreSQL, shown in Table 8.5. Both are used in the system catalogs and are not intended for use in user tables. The name type ..." regards, tom lane