Thread: Unicode escapes with any backend encoding
I threatened to do this in another thread [1], so here it is. This patch removes the restriction that the server encoding must be UTF-8 in order to write any Unicode escape with a value outside the ASCII range. Instead, we'll allow the notation and convert to the server encoding if that's possible. (If it isn't, of course you get an encoding conversion failure.) In the cases that were already supported, namely ASCII characters or UTF-8 server encoding, this should be only immeasurably slower than before. Otherwise, it calls the appropriate encoding conversion procedure, which of course will take a little time. But that's better than failing, surely. One way in which this is slightly less good than before is that you no longer get a syntax error cursor pointing at the problematic escape when conversion fails. If we were really excited about that, something could be done with setting up an errcontext stack entry. But that would add a few cycles, so I wasn't sure whether to do it. Grepping for other direct uses of unicode_to_utf8(), I notice that there are a couple of places in the JSON code where we have a similar restriction that you can only write a Unicode escape in UTF8 server encoding. I'm not sure whether these same semantics could be applied there, so I didn't touch that. Thoughts? regards, tom lane [1] https://www.postgresql.org/message-id/flat/CACPNZCvaoa3EgVWm5yZhcSTX6RAtaLgniCPcBVOCwm8h3xpWkw%40mail.gmail.com diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml index c908e0b..e134877 100644 --- a/doc/src/sgml/syntax.sgml +++ b/doc/src/sgml/syntax.sgml @@ -189,6 +189,23 @@ UPDATE "my_table" SET "a" = 5; ampersands. The length limitation still applies. </para> + <para> + Quoting an identifier also makes it case-sensitive, whereas + unquoted names are always folded to lower case. For example, the + identifiers <literal>FOO</literal>, <literal>foo</literal>, and + <literal>"foo"</literal> are considered the same by + <productname>PostgreSQL</productname>, but + <literal>"Foo"</literal> and <literal>"FOO"</literal> are + different from these three and each other. (The folding of + unquoted names to lower case in <productname>PostgreSQL</productname> is + incompatible with the SQL standard, which says that unquoted names + should be folded to upper case. Thus, <literal>foo</literal> + should be equivalent to <literal>"FOO"</literal> not + <literal>"foo"</literal> according to the standard. If you want + to write portable applications you are advised to always quote a + particular name or never quote it.) + </para> + <indexterm> <primary>Unicode escape</primary> <secondary>in identifiers</secondary> @@ -230,7 +247,8 @@ U&"d!0061t!+000061" UESCAPE '!' The escape character can be any single character other than a hexadecimal digit, the plus sign, a single quote, a double quote, or a whitespace character. Note that the escape character is - written in single quotes, not double quotes. + written in single quotes, not double quotes, + after <literal>UESCAPE</literal>. </para> <para> @@ -239,32 +257,18 @@ U&"d!0061t!+000061" UESCAPE '!' </para> <para> - The Unicode escape syntax works only when the server encoding is - <literal>UTF8</literal>. When other server encodings are used, only code - points in the ASCII range (up to <literal>\007F</literal>) can be - specified. 
Both the 4-digit and the 6-digit form can be used to + Either the 4-digit or the 6-digit escape form can be used to specify UTF-16 surrogate pairs to compose characters with code points larger than U+FFFF, although the availability of the 6-digit form technically makes this unnecessary. (Surrogate - pairs are not stored directly, but combined into a single - code point that is then encoded in UTF-8.) + pairs are not stored directly, but are combined into a single + code point.) </para> <para> - Quoting an identifier also makes it case-sensitive, whereas - unquoted names are always folded to lower case. For example, the - identifiers <literal>FOO</literal>, <literal>foo</literal>, and - <literal>"foo"</literal> are considered the same by - <productname>PostgreSQL</productname>, but - <literal>"Foo"</literal> and <literal>"FOO"</literal> are - different from these three and each other. (The folding of - unquoted names to lower case in <productname>PostgreSQL</productname> is - incompatible with the SQL standard, which says that unquoted names - should be folded to upper case. Thus, <literal>foo</literal> - should be equivalent to <literal>"FOO"</literal> not - <literal>"foo"</literal> according to the standard. If you want - to write portable applications you are advised to always quote a - particular name or never quote it.) + If the server encoding is not UTF-8, the Unicode code point identified + by one of these escape sequences is converted to the actual server + encoding; an error is reported if that's not possible. </para> </sect2> @@ -427,25 +431,11 @@ SELECT 'foo' 'bar'; <para> It is your responsibility that the byte sequences you create, especially when using the octal or hexadecimal escapes, compose - valid characters in the server character set encoding. When the - server encoding is UTF-8, then the Unicode escapes or the + valid characters in the server character set encoding. + A useful alternative is to use Unicode escapes or the alternative Unicode escape syntax, explained - in <xref linkend="sql-syntax-strings-uescape"/>, should be used - instead. (The alternative would be doing the UTF-8 encoding by - hand and writing out the bytes, which would be very cumbersome.) - </para> - - <para> - The Unicode escape syntax works fully only when the server - encoding is <literal>UTF8</literal>. When other server encodings are - used, only code points in the ASCII range (up - to <literal>\u007F</literal>) can be specified. Both the 4-digit and - the 8-digit form can be used to specify UTF-16 surrogate pairs to - compose characters with code points larger than U+FFFF, although - the availability of the 8-digit form technically makes this - unnecessary. (When surrogate pairs are used when the server - encoding is <literal>UTF8</literal>, they are first combined into a - single code point that is then encoded in UTF-8.) + in <xref linkend="sql-syntax-strings-uescape"/>; then the server + will check that the character conversion is possible. </para> <caution> @@ -524,16 +514,23 @@ U&'d!0061t!+000061' UESCAPE '!' </para> <para> - The Unicode escape syntax works only when the server encoding is - <literal>UTF8</literal>. When other server encodings are used, only - code points in the ASCII range (up to <literal>\007F</literal>) - can be specified. Both the 4-digit and the 6-digit form can be - used to specify UTF-16 surrogate pairs to compose characters with - code points larger than U+FFFF, although the availability of the - 6-digit form technically makes this unnecessary. 
(When surrogate - pairs are used when the server encoding is <literal>UTF8</literal>, they - are first combined into a single code point that is then encoded - in UTF-8.) + To include the escape character in the string literally, write + it twice. + </para> + + <para> + Either the 4-digit or the 6-digit escape form can be used to + specify UTF-16 surrogate pairs to compose characters with code + points larger than U+FFFF, although the availability of the + 6-digit form technically makes this unnecessary. (Surrogate + pairs are not stored directly, but are combined into a single + code point.) + </para> + + <para> + If the server encoding is not UTF-8, the Unicode code point identified + by one of these escape sequences is converted to the actual server + encoding; an error is reported if that's not possible. </para> <para> @@ -546,11 +543,6 @@ U&'d!0061t!+000061' UESCAPE '!' parameter is set to off, this syntax will be rejected with an error message. </para> - - <para> - To include the escape character in the string literally, write it - twice. - </para> </sect3> <sect3 id="sql-syntax-dollar-quoting"> diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c index 1bf1144..e88a5e0 100644 --- a/src/backend/parser/parser.c +++ b/src/backend/parser/parser.c @@ -292,7 +292,7 @@ hexval(unsigned char c) return 0; /* not reached */ } -/* is Unicode code point acceptable in database's encoding? */ +/* is Unicode code point acceptable? */ static void check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner) { @@ -302,12 +302,6 @@ check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner) (errcode(ERRCODE_SYNTAX_ERROR), errmsg("invalid Unicode escape value"), scanner_errposition(pos, yyscanner))); - - if (c > 0x7F && GetDatabaseEncoding() != PG_UTF8) - ereport(ERROR, - (errcode(ERRCODE_SYNTAX_ERROR), - errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"), - scanner_errposition(pos, yyscanner))); } /* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */ @@ -338,18 +332,30 @@ str_udeescape(const char *str, char escape, const char *in; char *new, *out; + size_t new_len; pg_wchar pair_first = 0; /* - * This relies on the subtle assumption that a UTF-8 expansion cannot be - * longer than its escaped representation. + * Guesstimate that result will be no longer than input, but allow enough + * padding for Unicode conversion.
*/ - new = palloc(strlen(str) + 1); + new_len = strlen(str) + MAX_UNICODE_EQUIVALENT_STRING + 1; + new = palloc(new_len); in = str; out = new; while (*in) { + /* Enlarge string if needed */ + size_t out_dist = out - new; + + if (out_dist > new_len - (MAX_UNICODE_EQUIVALENT_STRING + 1)) + { + new_len *= 2; + new = repalloc(new, new_len); + out = new + out_dist; + } + if (in[0] == escape) { if (in[1] == escape) @@ -390,8 +396,8 @@ str_udeescape(const char *str, char escape, pair_first = unicode; else { - unicode_to_utf8(unicode, (unsigned char *) out); - out += pg_mblen(out); + pg_unicode_to_server(unicode, (unsigned char *) out); + out += strlen(out); } in += 5; } @@ -431,8 +437,8 @@ str_udeescape(const char *str, char escape, pair_first = unicode; else { - unicode_to_utf8(unicode, (unsigned char *) out); - out += pg_mblen(out); + pg_unicode_to_server(unicode, (unsigned char *) out); + out += strlen(out); } in += 8; } @@ -457,13 +463,6 @@ str_udeescape(const char *str, char escape, goto invalid_pair; *out = '\0'; - - /* - * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII - * codes; but it's probably not worth the trouble, since this isn't likely - * to be a performance-critical path. - */ - pg_verifymbstr(new, out - new, false); return new; invalid_pair: diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l index 84c7391..3903df8 100644 --- a/src/backend/parser/scan.l +++ b/src/backend/parser/scan.l @@ -1226,19 +1226,18 @@ process_integer_literal(const char *token, YYSTYPE *lval) static void addunicode(pg_wchar c, core_yyscan_t yyscanner) { - char buf[8]; + char buf[MAX_UNICODE_EQUIVALENT_STRING + 1]; /* See also check_unicode_value() in parser.c */ if (c == 0 || c > 0x10FFFF) yyerror("invalid Unicode escape value"); - if (c > 0x7F) - { - if (GetDatabaseEncoding() != PG_UTF8) - yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"); - yyextra->saw_non_ascii = true; - } - unicode_to_utf8(c, (unsigned char *) buf); - addlit(buf, pg_mblen(buf), yyscanner); + + /* + * We expect that pg_unicode_to_server() will complain about any + * unconvertible code point, so we don't have to set saw_non_ascii. + */ + pg_unicode_to_server(c, (unsigned char *) buf); + addlit(buf, strlen(buf), yyscanner); } static unsigned char diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 3808c30..a2d2a0b 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -2086,26 +2086,6 @@ map_sql_identifier_to_xml_name(const char *ident, bool fully_escaped, /* - * Map a Unicode codepoint into the current server encoding. - */ -static char * -unicode_to_sqlchar(pg_wchar c) -{ - char utf8string[8]; /* need room for trailing zero */ - char *result; - - memset(utf8string, 0, sizeof(utf8string)); - unicode_to_utf8(c, (unsigned char *) utf8string); - - result = pg_any_to_server(utf8string, strlen(utf8string), PG_UTF8); - /* if pg_any_to_server didn't strdup, we must */ - if (result == utf8string) - result = pstrdup(result); - return result; -} - - -/* * Map XML name to SQL identifier; see SQL/XML:2008 section 9.3.
*/ char * @@ -2125,10 +2105,12 @@ map_xml_name_to_sql_identifier(const char *name) && isxdigit((unsigned char) *(p + 5)) && *(p + 6) == '_') { + char cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1]; unsigned int u; sscanf(p + 2, "%X", &u); - appendStringInfoString(&buf, unicode_to_sqlchar(u)); + pg_unicode_to_server(u, (unsigned char *) cbuf); + appendStringInfoString(&buf, cbuf); p += 6; } else diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c index 5d7cc74..7d90ac9 100644 --- a/src/backend/utils/mb/mbutils.c +++ b/src/backend/utils/mb/mbutils.c @@ -68,6 +68,13 @@ static FmgrInfo *ToServerConvProc = NULL; static FmgrInfo *ToClientConvProc = NULL; /* + * This variable stores the conversion function to convert from UTF-8 + * to the server encoding. It's NULL if the server encoding *is* UTF-8, + * or if we lack a conversion function for this. + */ +static FmgrInfo *Utf8ToServerConvProc = NULL; + +/* * These variables track the currently-selected encodings. */ static const pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII]; @@ -273,6 +280,8 @@ SetClientEncoding(int encoding) void InitializeClientEncoding(void) { + int current_server_encoding; + Assert(!backend_startup_complete); backend_startup_complete = true; @@ -289,6 +298,35 @@ InitializeClientEncoding(void) pg_enc2name_tbl[pending_client_encoding].name, GetDatabaseEncodingName()))); } + + /* + * Also look up the UTF8-to-server conversion function if needed. Since + * the server encoding is fixed within any one backend process, we don't + * have to do this more than once. + */ + current_server_encoding = GetDatabaseEncoding(); + if (current_server_encoding != PG_UTF8 && + current_server_encoding != PG_SQL_ASCII) + { + Oid utf8_to_server_proc; + + Assert(IsTransactionState()); + utf8_to_server_proc = + FindDefaultConversionProc(PG_UTF8, + current_server_encoding); + /* If there's no such conversion, just leave the pointer as NULL */ + if (OidIsValid(utf8_to_server_proc)) + { + FmgrInfo *finfo; + + finfo = (FmgrInfo *) MemoryContextAlloc(TopMemoryContext, + sizeof(FmgrInfo)); + fmgr_info_cxt(utf8_to_server_proc, finfo, + TopMemoryContext); + /* Set Utf8ToServerConvProc only after data is fully valid */ + Utf8ToServerConvProc = finfo; + } + } } /* @@ -752,6 +790,73 @@ perform_default_encoding_conversion(const char *src, int len, return result; } +/* + * Convert a single Unicode code point into a string in the server encoding. + * + * The code point given by "c" is converted and stored at *s, which must + * have at least MAX_UNICODE_EQUIVALENT_STRING+1 bytes available. + * The output will have a trailing '\0'. Throws error if the conversion + * cannot be performed. + * + * Note that this relies on having previously looked up any required + * conversion function. That's partly for speed but mostly because the parser + * may call this outside any transaction, or in an aborted transaction. + */ +void +pg_unicode_to_server(pg_wchar c, unsigned char *s) +{ + unsigned char c_as_utf8[MAX_MULTIBYTE_CHAR_LEN + 1]; + int c_as_utf8_len; + int server_encoding; + + /* + * Complain if invalid Unicode code point. The choice of errcode here is + * debatable, but really our caller should have checked this anyway. 
+ */ + if (c == 0 || c > 0x10FFFF) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("invalid Unicode code point"))); + + /* Otherwise, if it's in ASCII range, conversion is trivial */ + if (c <= 0x7F) + { + s[0] = (unsigned char) c; + s[1] = '\0'; + return; + } + + /* If the server encoding is UTF-8, we just need to reformat the code */ + server_encoding = GetDatabaseEncoding(); + if (server_encoding == PG_UTF8) + { + unicode_to_utf8(c, s); + s[pg_utf_mblen(s)] = '\0'; + return; + } + + /* For all other cases, we must have a conversion function available */ + if (Utf8ToServerConvProc == NULL) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("conversion between %s and %s is not supported", + pg_enc2name_tbl[PG_UTF8].name, + GetDatabaseEncodingName()))); + + /* Construct UTF-8 source string */ + unicode_to_utf8(c, c_as_utf8); + c_as_utf8_len = pg_utf_mblen(c_as_utf8); + c_as_utf8[c_as_utf8_len] = '\0'; + + /* Convert, or throw error if we can't */ + FunctionCall5(Utf8ToServerConvProc, + Int32GetDatum(PG_UTF8), + Int32GetDatum(server_encoding), + CStringGetDatum(c_as_utf8), + CStringGetDatum(s), + Int32GetDatum(c_as_utf8_len)); +} + /* convert a multibyte string to a wchar */ int diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h index 7fb5fa4..2daf301 100644 --- a/src/include/mb/pg_wchar.h +++ b/src/include/mb/pg_wchar.h @@ -316,6 +316,15 @@ typedef enum pg_enc #define MAX_CONVERSION_GROWTH 4 /* + * Maximum byte length of the string equivalent to any one Unicode code point, + * in any backend encoding. The current value assumes that a 4-byte UTF-8 + * character might expand by MAX_CONVERSION_GROWTH, which is a huge + * overestimate. But in current usage we don't allocate large multiples of + * this, so there's little point in being stingy. + */ +#define MAX_UNICODE_EQUIVALENT_STRING 16 + +/* * Table for mapping an encoding number to official encoding name and * possibly other subsidiary data. Be careful to check encoding number * before accessing a table entry! @@ -602,6 +611,8 @@ extern char *pg_server_to_client(const char *s, int len); extern char *pg_any_to_server(const char *s, int len, int encoding); extern char *pg_server_to_any(const char *s, int len, int encoding); +extern void pg_unicode_to_server(pg_wchar c, unsigned char *s); + extern unsigned short BIG5toCNS(unsigned short big5, unsigned char *lc); extern unsigned short CNStoBIG5(unsigned short cns, unsigned char lc);
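To make the behavior change concrete: with this patch applied, a session against a LATIN1 database might go roughly like this (a reconstructed illustration, not taken from the patch's regression tests; the failure case produces the usual untranslatable-character message from the conversion machinery):

SET client_encoding = 'UTF8';

-- pure-ASCII escapes worked before and still do, in any server encoding
SELECT U&'d\0061t\+000061';
 ?column?
----------
 data
(1 row)

-- U+00E9 has a LATIN1 equivalent, so the new conversion path succeeds
SELECT U&'\00E9';
 ?column?
----------
 é
(1 row)

-- Cyrillic U+0441 has no LATIN1 equivalent, so conversion fails
SELECT U&'\0441';
ERROR:  character with byte sequence 0xd1 0x81 in encoding "UTF8" has no equivalent in encoding "LATIN1"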
On Tue, Jan 14, 2020 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > Grepping for other direct uses of unicode_to_utf8(), I notice that > there are a couple of places in the JSON code where we have a similar > restriction that you can only write a Unicode escape in UTF8 server > encoding. I'm not sure whether these same semantics could be > applied there, so I didn't touch that. > Off the cuff I'd be inclined to say we should keep the text escape rules the same. We've already extended the JSON standard by allowing non-UTF8 encodings. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On Tue, Jan 14, 2020 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Grepping for other direct uses of unicode_to_utf8(), I notice that >> there are a couple of places in the JSON code where we have a similar >> restriction that you can only write a Unicode escape in UTF8 server >> encoding. I'm not sure whether these same semantics could be >> applied there, so I didn't touch that. > Off the cuff I'd be inclined to say we should keep the text escape > rules the same. We've already extended the JSON standard by allowing > non-UTF8 encodings. Right. I'm just thinking though that if you can write "é" literally in a JSON string, even though you're using LATIN1 not UTF8, then why not allow writing that as "\u00E9" instead? The latter is arguably truer to spec. However, if JSONB collapses "\u00E9" to LATIN1 "é", that would be bad, unless we have a way to undo it on printout. So there might be some more moving parts here than I thought. regards, tom lane
I wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> On Tue, Jan 14, 2020 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Grepping for other direct uses of unicode_to_utf8(), I notice that >>> there are a couple of places in the JSON code where we have a similar >>> restriction that you can only write a Unicode escape in UTF8 server >>> encoding. I'm not sure whether these same semantics could be >>> applied there, so I didn't touch that. >> Off the cuff I'd be inclined to say we should keep the text escape >> rules the same. We've already extended the JSON standard by allowing >> non-UTF8 encodings. > Right. I'm just thinking though that if you can write "é" literally > in a JSON string, even though you're using LATIN1 not UTF8, then why > not allow writing that as "\u00E9" instead? The latter is arguably > truer to spec. > However, if JSONB collapses "\u00E9" to LATIN1 "é", that would be bad, > unless we have a way to undo it on printout. So there might be > some more moving parts here than I thought. On third thought, what would be so bad about that? Let's suppose I write: INSERT ... values('{"x": "\u00E9"}'::jsonb); and the jsonb parsing logic chooses to collapse the backslash to the represented character, i.e., "é". Why should it matter whether the database encoding is UTF8 or LATIN1? If I am using UTF8 client encoding, I will see the "é" in UTF8 encoding either way, because of output encoding conversion. If I am using LATIN1 client encoding, I will see the "é" in LATIN1 either way --- or at least, I will if the database encoding is UTF8. Right now I get an error for that when the database encoding is LATIN1 ... but if I store the "é" as literal "é", it works, either way. So it seems to me that this error is just useless pedantry. As long as the DB encoding can represent the desired character, it should be transparent to users. regards, tom lane
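To spell out the transparency argument: under the relaxed rules, a LATIN1 database would treat the escaped and literal spellings identically (a hypothetical session, shown only for illustration):

SELECT '{"x": "\u00E9"}'::jsonb AS via_escape,
       '{"x": "é"}'::jsonb AS via_literal;
 via_escape | via_literal
------------+-------------
 {"x": "é"} | {"x": "é"}
(1 row)

Whichever way the value went in, output conversion then renders it correctly in the client's encoding.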
On 1/14/20 10:10 AM, Tom Lane wrote: > to me that this error is just useless pedantry. As long as the DB > encoding can represent the desired character, it should be transparent > to users. That's my position too. Regards, -Chap
On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack <chap@anastigmatix.net> wrote: > > On 1/14/20 10:10 AM, Tom Lane wrote: > > to me that this error is just useless pedantry. As long as the DB > > encoding can represent the desired character, it should be transparent > > to users. > > That's my position too. > and mine. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack <chap@anastigmatix.net> wrote: >> On 1/14/20 10:10 AM, Tom Lane wrote: >>> to me that this error is just useless pedantry. As long as the DB >>> encoding can represent the desired character, it should be transparent >>> to users. >> That's my position too. > and mine. I'm confused --- yesterday you seemed to be against this idea. Have you changed your mind? I'll gladly go change the patch if people are on board with this. regards, tom lane
On 1/14/20 4:25 PM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack <chap@anastigmatix.net> wrote: >>> On 1/14/20 10:10 AM, Tom Lane wrote: >>>> to me that this error is just useless pedantry. As long as the DB >>>> encoding can represent the desired character, it should be transparent >>>> to users. > >>> That's my position too. > >> and mine. > > I'm confused --- yesterday you seemed to be against this idea. > Have you changed your mind? > > I'll gladly go change the patch if people are on board with this. Hmm, well, let me clarify for my own part what I think I'm agreeing with ... perhaps it's misaligned with something further upthread. In an ideal world (which may be ideal in more ways than are in scope for the present discussion) I would expect to see these principles: 1. On input, whether a Unicode escape is or isn't allowed should not depend on any encoding settings. It should be lexically allowed always, and if it represents a character that exists in the server encoding, it should mean that character. If it's not representable in the storage format, it should produce an error that says that. 2. If it happens that the character is representable in both the storage encoding and the client encoding, it shouldn't matter whether it arrives literally as an é or as an escape. Either should get stored on disk as the same bytes. 3. On output, as long as the character is representable in the client encoding, there is nothing to worry about. It will be sent as its representation in the client encoding (which may be different bytes than its representation in the server encoding). 4. If a character to be output isn't in the client encoding, it will be datatype-dependent whether there is any way to escape. For example, xml_out could produce &#x????; forms, and json_out could produce \u???? forms. 5. If the datatype being output has no escaping rules available (as would be the case for an ordinary text column, say), then the unrepresentable character has to be reported in an error. (Encoding conversions often have the option of substituting a replacement character like ? but I don't believe a DBMS has any business making such changes to data, unless by explicit opt-in. If it can't give you the data you wanted, it should say "here's why I can't give you that.") 6. While 'text' in general provides no escaping mechanism, some functions that produce text may still have that option. For example, quote_literal and quote_ident could conceivably produce the U&'...' or U&"..." forms, respectively, if the argument contains characters that won't go in the client encoding. I understand that on the way from 1 to 6 I will have drifted further from what's discussed in this thread; for example, I bet that quote_literal/quote_ident never produce U& forms now, and that no one is proposing to change that, and I'm pretending not to notice the question of how astonishing such behavior could be. (Not to mention, how would they know whether they are returning a value that's destined to go across the client encoding, rather than to be used in a purely server-side expression? Maybe distinct versions of those functions could take an encoding argument, and produce the U& forms when the content won't go in the specified encoding. That would avoid astonishing changes to existing functions.) Regards, -Chap
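As a concrete reading of principle 2 above: in a LATIN1 database the literal and escaped spellings ought to end up byte-identical on disk (a sketch under the proposed semantics; the table name is made up):

CREATE TABLE enc_test (s text);
INSERT INTO enc_test VALUES ('é'), (U&'\00E9');
SELECT DISTINCT s, octet_length(s) FROM enc_test;
 s | octet_length
---+--------------
 é |            1
(1 row)

Both rows collapse to one distinct value, one byte long in LATIN1.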
On Wed, Jan 15, 2020 at 7:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > > On Wed, Jan 15, 2020 at 4:25 AM Chapman Flack <chap@anastigmatix.net> wrote: > >> On 1/14/20 10:10 AM, Tom Lane wrote: > >>> to me that this error is just useless pedantry. As long as the DB > >>> encoding can represent the desired character, it should be transparent > >>> to users. > > >> That's my position too. > > > and mine. > > I'm confused --- yesterday you seemed to be against this idea. > Have you changed your mind? > > I'll gladly go change the patch if people are on board with this. > > Perhaps I expressed myself badly. What I meant was that we should keep the json and text escape rules in sync, as they are now. Since we're changing the text rules to allow resolvable non-ascii unicode escapes in non-utf8 locales, we should do the same for json. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > Perhaps I expressed myself badly. What I meant was that we should keep > the json and text escape rules in sync, as they are now. Since we're > changing the text rules to allow resolvable non-ascii unicode escapes > in non-utf8 locales, we should do the same for json. Got it. I'll make the patch do that in a little bit. regards, tom lane
I wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> Perhaps I expressed myself badly. What I meant was that we should keep >> the json and text escape rules in sync, as they are now. Since we're >> changing the text rules to allow resolvable non-ascii unicode escapes >> in non-utf8 locales, we should do the same for json. > Got it. I'll make the patch do that in a little bit. OK, here's v2, which brings JSONB into the fold and also makes some effort to produce an accurate error cursor for invalid Unicode escapes. As it's set up, we only pay the extra cost of setting up an error context callback when we're actually processing a Unicode escape, so I think that's an acceptable cost. (It's not much of a cost, anyway.) The callback support added here is pretty much a straight copy-and-paste of the existing functions setup_parser_errposition_callback() and friends. That's slightly annoying --- we could perhaps merge those into one. But I didn't see a good common header to put such a thing into, so I just did it like this. Another note is that we could use the additional scanner infrastructure to produce more accurate error pointers for other cases where we're whining about a bad escape sequence, or some other sub-part of a lexical token. I think that'd likely be a good idea, since the existing cursor placement at the start of the token isn't too helpful if e.g. you're dealing with a very long string constant. But to keep this focused, I only touched the behavior for Unicode escapes. The rest could be done as a separate patch. This also mops up after 7f380c59 by making use of the new pg_wchar.c exports is_utf16_surrogate_first() etc everyplace that they're relevant (which is just the JSON code I was touching anyway, as it happens). I also made a bit of an effort to ensure test coverage of all the code touched in that patch and this one. regards, tom lane diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml index 6ff8751..0f0d0c6 100644 --- a/doc/src/sgml/json.sgml +++ b/doc/src/sgml/json.sgml @@ -61,8 +61,8 @@ </para> <para> - <productname>PostgreSQL</productname> allows only one character set - encoding per database. It is therefore not possible for the JSON + RFC 7159 specifies that JSON strings should be encoded in UTF8. + It is therefore not possible for the JSON types to conform rigidly to the JSON specification unless the database encoding is UTF8. Attempts to directly include characters that cannot be represented in the database encoding will fail; conversely, @@ -77,13 +77,13 @@ regardless of the database encoding, and are checked only for syntactic correctness (that is, that four hex digits follow <literal>\u</literal>). However, the input function for <type>jsonb</type> is stricter: it disallows - Unicode escapes for non-ASCII characters (those above <literal>U+007F</literal>) - unless the database encoding is UTF8. The <type>jsonb</type> type also + Unicode escapes for characters that cannot be represented in the database + encoding. The <type>jsonb</type> type also rejects <literal>\u0000</literal> (because that cannot be represented in <productname>PostgreSQL</productname>'s <type>text</type> type), and it insists that any use of Unicode surrogate pairs to designate characters outside the Unicode Basic Multilingual Plane be correct. Valid Unicode escapes - are converted to the equivalent ASCII or UTF8 character for storage; + are converted to the equivalent single character for storage; this includes folding surrogate pairs into a single character. 
</para> @@ -96,9 +96,8 @@ not <type>jsonb</type>. The fact that the <type>json</type> input function does not make these checks may be considered a historical artifact, although it does allow for simple storage (without processing) of JSON Unicode - escapes in a non-UTF8 database encoding. In general, it is best to - avoid mixing Unicode escapes in JSON with a non-UTF8 database encoding, - if possible. + escapes in a database encoding that does not support the represented + characters. </para> </note> @@ -144,8 +143,8 @@ <row> <entry><type>string</type></entry> <entry><type>text</type></entry> - <entry><literal>\u0000</literal> is disallowed, as are non-ASCII Unicode - escapes if database encoding is not UTF8</entry> + <entry><literal>\u0000</literal> is disallowed, as are Unicode escapes + representing characters not available in the database encoding</entry> </row> <row> <entry><type>number</type></entry> diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml index c908e0b..e134877 100644 --- a/doc/src/sgml/syntax.sgml +++ b/doc/src/sgml/syntax.sgml @@ -189,6 +189,23 @@ UPDATE "my_table" SET "a" = 5; ampersands. The length limitation still applies. </para> + <para> + Quoting an identifier also makes it case-sensitive, whereas + unquoted names are always folded to lower case. For example, the + identifiers <literal>FOO</literal>, <literal>foo</literal>, and + <literal>"foo"</literal> are considered the same by + <productname>PostgreSQL</productname>, but + <literal>"Foo"</literal> and <literal>"FOO"</literal> are + different from these three and each other. (The folding of + unquoted names to lower case in <productname>PostgreSQL</productname> is + incompatible with the SQL standard, which says that unquoted names + should be folded to upper case. Thus, <literal>foo</literal> + should be equivalent to <literal>"FOO"</literal> not + <literal>"foo"</literal> according to the standard. If you want + to write portable applications you are advised to always quote a + particular name or never quote it.) + </para> + <indexterm> <primary>Unicode escape</primary> <secondary>in identifiers</secondary> @@ -230,7 +247,8 @@ U&"d!0061t!+000061" UESCAPE '!' The escape character can be any single character other than a hexadecimal digit, the plus sign, a single quote, a double quote, or a whitespace character. Note that the escape character is - written in single quotes, not double quotes. + written in single quotes, not double quotes, + after <literal>UESCAPE</literal>. </para> <para> @@ -239,32 +257,18 @@ U&"d!0061t!+000061" UESCAPE '!' </para> <para> - The Unicode escape syntax works only when the server encoding is - <literal>UTF8</literal>. When other server encodings are used, only code - points in the ASCII range (up to <literal>\007F</literal>) can be - specified. Both the 4-digit and the 6-digit form can be used to + Either the 4-digit or the 6-digit escape form can be used to specify UTF-16 surrogate pairs to compose characters with code points larger than U+FFFF, although the availability of the 6-digit form technically makes this unnecessary. (Surrogate - pairs are not stored directly, but combined into a single - code point that is then encoded in UTF-8.) + pairs are not stored directly, but are combined into a single + code point.) </para> <para> - Quoting an identifier also makes it case-sensitive, whereas - unquoted names are always folded to lower case. 
For example, the - identifiers <literal>FOO</literal>, <literal>foo</literal>, and - <literal>"foo"</literal> are considered the same by - <productname>PostgreSQL</productname>, but - <literal>"Foo"</literal> and <literal>"FOO"</literal> are - different from these three and each other. (The folding of - unquoted names to lower case in <productname>PostgreSQL</productname> is - incompatible with the SQL standard, which says that unquoted names - should be folded to upper case. Thus, <literal>foo</literal> - should be equivalent to <literal>"FOO"</literal> not - <literal>"foo"</literal> according to the standard. If you want - to write portable applications you are advised to always quote a - particular name or never quote it.) + If the server encoding is not UTF-8, the Unicode code point identified + by one of these escape sequences is converted to the actual server + encoding; an error is reported if that's not possible. </para> </sect2> @@ -427,25 +431,11 @@ SELECT 'foo' 'bar'; <para> It is your responsibility that the byte sequences you create, especially when using the octal or hexadecimal escapes, compose - valid characters in the server character set encoding. When the - server encoding is UTF-8, then the Unicode escapes or the + valid characters in the server character set encoding. + A useful alternative is to use Unicode escapes or the alternative Unicode escape syntax, explained - in <xref linkend="sql-syntax-strings-uescape"/>, should be used - instead. (The alternative would be doing the UTF-8 encoding by - hand and writing out the bytes, which would be very cumbersome.) - </para> - - <para> - The Unicode escape syntax works fully only when the server - encoding is <literal>UTF8</literal>. When other server encodings are - used, only code points in the ASCII range (up - to <literal>\u007F</literal>) can be specified. Both the 4-digit and - the 8-digit form can be used to specify UTF-16 surrogate pairs to - compose characters with code points larger than U+FFFF, although - the availability of the 8-digit form technically makes this - unnecessary. (When surrogate pairs are used when the server - encoding is <literal>UTF8</literal>, they are first combined into a - single code point that is then encoded in UTF-8.) + in <xref linkend="sql-syntax-strings-uescape"/>; then the server + will check that the character conversion is possible. </para> <caution> @@ -524,16 +514,23 @@ U&'d!0061t!+000061' UESCAPE '!' </para> <para> - The Unicode escape syntax works only when the server encoding is - <literal>UTF8</literal>. When other server encodings are used, only - code points in the ASCII range (up to <literal>\007F</literal>) - can be specified. Both the 4-digit and the 6-digit form can be - used to specify UTF-16 surrogate pairs to compose characters with - code points larger than U+FFFF, although the availability of the - 6-digit form technically makes this unnecessary. (When surrogate - pairs are used when the server encoding is <literal>UTF8</literal>, they - are first combined into a single code point that is then encoded - in UTF-8.) + To include the escape character in the string literally, write + it twice. + </para> + + <para> + Either the 4-digit or the 6-digit escape form can be used to + specify UTF-16 surrogate pairs to compose characters with code + points larger than U+FFFF, although the availability of the + 6-digit form technically makes this unnecessary. (Surrogate + pairs are not stored directly, but are combined into a single + code point.) 
+ </para> + + <para> + If the server encoding is not UTF-8, the Unicode code point identified + by one of these escape sequences is converted to the actual server + encoding; an error is reported if that's not possible. </para> <para> @@ -546,11 +543,6 @@ U&'d!0061t!+000061' UESCAPE '!' parameter is set to off, this syntax will be rejected with an error message. </para> - - <para> - To include the escape character in the string literally, write it - twice. - </para> </sect3> <sect3 id="sql-syntax-dollar-quoting"> diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c index 1bf1144..22c9479 100644 --- a/src/backend/parser/parser.c +++ b/src/backend/parser/parser.c @@ -292,22 +292,15 @@ hexval(unsigned char c) return 0; /* not reached */ } -/* is Unicode code point acceptable in database's encoding? */ +/* is Unicode code point acceptable? */ static void -check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner) +check_unicode_value(pg_wchar c) { /* See also addunicode() in scan.l */ if (c == 0 || c > 0x10FFFF) ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), - errmsg("invalid Unicode escape value"), - scanner_errposition(pos, yyscanner))); - - if (c > 0x7F && GetDatabaseEncoding() != PG_UTF8) - ereport(ERROR, - (errcode(ERRCODE_SYNTAX_ERROR), - errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"), - scanner_errposition(pos, yyscanner))); + errmsg("invalid Unicode escape value"))); } /* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */ @@ -338,20 +331,39 @@ str_udeescape(const char *str, char escape, const char *in; char *new, *out; + size_t new_len; pg_wchar pair_first = 0; + ScannerCallbackState scbstate; /* - * This relies on the subtle assumption that a UTF-8 expansion cannot be - * longer than its escaped representation. + * Guesstimate that result will be no longer than input, but allow enough + * padding for Unicode conversion. */ - new = palloc(strlen(str) + 1); + new_len = strlen(str) + MAX_UNICODE_EQUIVALENT_STRING + 1; + new = palloc(new_len); in = str; out = new; while (*in) { + /* Enlarge string if needed */ + size_t out_dist = out - new; + + if (out_dist > new_len - (MAX_UNICODE_EQUIVALENT_STRING + 1)) + { + new_len *= 2; + new = repalloc(new, new_len); + out = new + out_dist; + } + if (in[0] == escape) { + /* + * Any errors reported while processing this escape sequence will + * have an error cursor pointing at the escape.
+ */ + setup_scanner_errposition_callback(&scbstate, yyscanner, + in - str + position + 3); /* 3 for U&" */ if (in[1] == escape) { if (pair_first) @@ -370,9 +382,7 @@ str_udeescape(const char *str, char escape, (hexval(in[2]) << 8) + (hexval(in[3]) << 4) + hexval(in[4]); - check_unicode_value(unicode, - in - str + position + 3, /* 3 for U&" */ - yyscanner); + check_unicode_value(unicode); if (pair_first) { if (is_utf16_surrogate_second(unicode)) @@ -390,8 +400,8 @@ str_udeescape(const char *str, char escape, pair_first = unicode; else { - unicode_to_utf8(unicode, (unsigned char *) out); - out += pg_mblen(out); + pg_unicode_to_server(unicode, (unsigned char *) out); + out += strlen(out); } in += 5; } @@ -411,9 +421,7 @@ str_udeescape(const char *str, char escape, (hexval(in[5]) << 8) + (hexval(in[6]) << 4) + hexval(in[7]); - check_unicode_value(unicode, - in - str + position + 3, /* 3 for U&" */ - yyscanner); + check_unicode_value(unicode); if (pair_first) { if (is_utf16_surrogate_second(unicode)) @@ -431,17 +439,18 @@ str_udeescape(const char *str, char escape, pair_first = unicode; else { - unicode_to_utf8(unicode, (unsigned char *) out); - out += pg_mblen(out); + pg_unicode_to_server(unicode, (unsigned char *) out); + out += strlen(out); } in += 8; } else ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), - errmsg("invalid Unicode escape value"), - scanner_errposition(in - str + position + 3, /* 3 for U&" */ - yyscanner))); + errmsg("invalid Unicode escape"), + errhint("Unicode escapes must be \\XXXX or \\+XXXXXX."))); + + cancel_scanner_errposition_callback(&scbstate); } else { @@ -457,15 +466,13 @@ str_udeescape(const char *str, char escape, goto invalid_pair; *out = '\0'; + return new; /* - * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII - * codes; but it's probably not worth the trouble, since this isn't likely - * to be a performance-critical path. + * We might get here with the error callback active, or not. Call + * scanner_errposition to make sure an error cursor appears; if the + * callback is active, this is duplicative but harmless. */ - pg_verifymbstr(new, out - new, false); - return new; - invalid_pair: ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l index 84c7391..685aa84 100644 --- a/src/backend/parser/scan.l +++ b/src/backend/parser/scan.l @@ -106,6 +106,18 @@ const uint16 ScanKeywordTokens[] = { */ #define ADVANCE_YYLLOC(delta) ( *(yylloc) += (delta) ) +/* + * Sometimes, we do want yylloc to point into the middle of a token; this is + * useful for instance to throw an error about an escape sequence within a + * string literal. But if we find no error there, we want to revert yylloc + * to the token start, so that that's the location reported to the parser. + * Use PUSH_YYLLOC/POP_YYLLOC to save/restore yylloc around such code. + * (Currently the implied "stack" is just one location, but someday we might + * need to nest these.) + */ +#define PUSH_YYLLOC() (yyextra->save_yylloc = *(yylloc)) +#define POP_YYLLOC() (*(yylloc) = yyextra->save_yylloc) + #define startlit() ( yyextra->literallen = 0 ) static void addlit(char *ytext, int yleng, core_yyscan_t yyscanner); static void addlitchar(unsigned char ychar, core_yyscan_t yyscanner); @@ -605,8 +617,18 @@ other . <xe>{xeunicode} { pg_wchar c = strtoul(yytext + 2, NULL, 16); + /* + * For consistency with other productions, issue any + * escape warning with cursor pointing to start of string. + * We might want to change that, someday. 
+ */ check_escape_warning(yyscanner); + /* Remember start of overall string token ... */ + PUSH_YYLLOC(); + /* ... and set the error cursor to point at this esc seq */ + SET_YYLLOC(); + if (is_utf16_surrogate_first(c)) { yyextra->utf16_first_part = c; @@ -616,10 +638,18 @@ other . yyerror("invalid Unicode surrogate pair"); else addunicode(c, yyscanner); + + /* Restore yylloc to be start of string token */ + POP_YYLLOC(); } <xeu>{xeunicode} { pg_wchar c = strtoul(yytext + 2, NULL, 16); + /* Remember start of overall string token ... */ + PUSH_YYLLOC(); + /* ... and set the error cursor to point at this esc seq */ + SET_YYLLOC(); + if (!is_utf16_surrogate_second(c)) yyerror("invalid Unicode surrogate pair"); @@ -627,12 +657,21 @@ other . addunicode(c, yyscanner); + /* Restore yylloc to be start of string token */ + POP_YYLLOC(); + BEGIN(xe); } -<xeu>. { yyerror("invalid Unicode surrogate pair"); } -<xeu>\n { yyerror("invalid Unicode surrogate pair"); } -<xeu><<EOF>> { yyerror("invalid Unicode surrogate pair"); } +<xeu>. | +<xeu>\n | +<xeu><<EOF>> { + /* Set the error cursor to point at missing esc seq */ + SET_YYLLOC(); + yyerror("invalid Unicode surrogate pair"); + } <xe,xeu>{xeunicodefail} { + /* Set the error cursor to point at malformed esc seq */ + SET_YYLLOC(); ereport(ERROR, (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), errmsg("invalid Unicode escape"), @@ -1029,12 +1068,13 @@ other . * scanner_errposition * Report a lexer or grammar error cursor position, if possible. * - * This is expected to be used within an ereport() call. The return value + * This is expected to be used within an ereport() call, or via an error + * callback such as setup_scanner_errposition_callback(). The return value * is a dummy (always 0, in fact). * * Note that this can only be used for messages emitted during raw parsing - * (essentially, scan.l and gram.y), since it requires the yyscanner struct - * to still be available. + * (essentially, scan.l, parser.c, and gram.y), since it requires the + * yyscanner struct to still be available. */ int scanner_errposition(int location, core_yyscan_t yyscanner) @@ -1051,6 +1091,62 @@ scanner_errposition(int location, core_yyscan_t yyscanner) } /* + * Error context callback for inserting scanner error location. + * + * Note that this will be called for *any* error occurring while the + * callback is installed. We avoid inserting an irrelevant error location + * if the error is a query cancel --- are there any other important cases? + */ +static void +scb_error_callback(void *arg) +{ + ScannerCallbackState *scbstate = (ScannerCallbackState *) arg; + + if (geterrcode() != ERRCODE_QUERY_CANCELED) + (void) scanner_errposition(scbstate->location, scbstate->yyscanner); +} + +/* + * setup_scanner_errposition_callback + * Arrange for non-scanner errors to report an error position + * + * Sometimes the scanner calls functions that aren't part of the scanner + * subsystem and can't reasonably be passed the yyscanner pointer; yet + * we would like any errors thrown in those functions to be tagged with an + * error location. Use this function to set up an error context stack + * entry that will accomplish that. Usage pattern: + * + * declare a local variable "ScannerCallbackState scbstate" + * ... 
* setup_scanner_errposition_callback(&scbstate, yyscanner, location); * call function that might throw error; * cancel_scanner_errposition_callback(&scbstate); */ +void +setup_scanner_errposition_callback(ScannerCallbackState *scbstate, + core_yyscan_t yyscanner, + int location) +{ + /* Setup error traceback support for ereport() */ + scbstate->yyscanner = yyscanner; + scbstate->location = location; + scbstate->errcallback.callback = scb_error_callback; + scbstate->errcallback.arg = (void *) scbstate; + scbstate->errcallback.previous = error_context_stack; + error_context_stack = &scbstate->errcallback; +} + +/* + * Cancel a previously-set-up errposition callback. + */ +void +cancel_scanner_errposition_callback(ScannerCallbackState *scbstate) +{ + /* Pop the error context stack */ + error_context_stack = scbstate->errcallback.previous; +} + +/* * scanner_yyerror * Report a lexer or grammar error. * @@ -1226,19 +1322,21 @@ process_integer_literal(const char *token, YYSTYPE *lval) static void addunicode(pg_wchar c, core_yyscan_t yyscanner) { - char buf[8]; + ScannerCallbackState scbstate; + char buf[MAX_UNICODE_EQUIVALENT_STRING + 1]; /* See also check_unicode_value() in parser.c */ if (c == 0 || c > 0x10FFFF) yyerror("invalid Unicode escape value"); - if (c > 0x7F) - { - if (GetDatabaseEncoding() != PG_UTF8) - yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"); - yyextra->saw_non_ascii = true; - } - unicode_to_utf8(c, (unsigned char *) buf); - addlit(buf, pg_mblen(buf), yyscanner); + + /* + * We expect that pg_unicode_to_server() will complain about any + * unconvertible code point, so we don't have to set saw_non_ascii. + */ + setup_scanner_errposition_callback(&scbstate, yyscanner, *(yylloc)); + pg_unicode_to_server(c, (unsigned char *) buf); + cancel_scanner_errposition_callback(&scbstate); + addlit(buf, strlen(buf), yyscanner); } static unsigned char diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c index 458505a..62b97f5 100644 --- a/src/backend/utils/adt/json.c +++ b/src/backend/utils/adt/json.c @@ -831,10 +831,10 @@ json_lex_string(JsonLexContext *lex) } if (lex->strval != NULL) { - char utf8str[5]; - int utf8len; - - if (ch >= 0xd800 && ch <= 0xdbff) + /* + * Combine surrogate pairs. + */ + if (is_utf16_surrogate_first(ch)) { if (hi_surrogate != -1) ereport(ERROR, @@ -843,10 +843,10 @@ json_lex_string(JsonLexContext *lex) "json"), errdetail("Unicode high surrogate must not follow a high surrogate."), report_json_context(lex))); - hi_surrogate = (ch & 0x3ff) << 10; + hi_surrogate = ch; continue; } - else if (ch >= 0xdc00 && ch <= 0xdfff) + else if (is_utf16_surrogate_second(ch)) { if (hi_surrogate == -1) ereport(ERROR, @@ -854,7 +854,7 @@ json_lex_string(JsonLexContext *lex) errmsg("invalid input syntax for type %s", "json"), errdetail("Unicode low surrogate must follow a high surrogate."), report_json_context(lex))); - ch = 0x10000 + hi_surrogate + (ch & 0x3ff); + ch = surrogate_pair_to_codepoint(hi_surrogate, ch); hi_surrogate = -1; } @@ -866,12 +866,8 @@ json_lex_string(JsonLexContext *lex) report_json_context(lex))); /* - * For UTF8, replace the escape sequence by the actual - * utf8 character in lex->strval. Do this also for other - * encodings if the escape designates an ASCII character, - * otherwise raise an error. + * Add the represented character to lex->strval.
*/ - if (ch == 0) { /* We can't allow this, since our TEXT type doesn't */ @@ -881,30 +877,13 @@ json_lex_string(JsonLexContext *lex) errdetail("\\u0000 cannot be converted to text."), report_json_context(lex))); } - else if (GetDatabaseEncoding() == PG_UTF8) - { - unicode_to_utf8(ch, (unsigned char *) utf8str); - utf8len = pg_utf_mblen((unsigned char *) utf8str); - appendBinaryStringInfo(lex->strval, utf8str, utf8len); - } - else if (ch <= 0x007f) - { - /* - * This is the only way to designate things like a - * form feed character in JSON, so it's useful in all - * encodings. - */ - appendStringInfoChar(lex->strval, (char) ch); - } else { - ereport(ERROR, - (errcode(ERRCODE_UNTRANSLATABLE_CHARACTER), - errmsg("unsupported Unicode escape sequence"), - errdetail("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8."), - report_json_context(lex))); - } + char cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1]; + pg_unicode_to_server(ch, (unsigned char *) cbuf); + appendStringInfoString(lex->strval, cbuf); + } } } else if (lex->strval != NULL) diff --git a/src/backend/utils/adt/jsonpath_scan.l b/src/backend/utils/adt/jsonpath_scan.l index 70681b7..be0a2cf 100644 --- a/src/backend/utils/adt/jsonpath_scan.l +++ b/src/backend/utils/adt/jsonpath_scan.l @@ -486,13 +486,6 @@ hexval(char c) static void addUnicodeChar(int ch) { - /* - * For UTF8, replace the escape sequence by the actual - * utf8 character in lex->strval. Do this also for other - * encodings if the escape designates an ASCII character, - * otherwise raise an error. - */ - if (ch == 0) { /* We can't allow this, since our TEXT type doesn't */ @@ -501,40 +494,20 @@ addUnicodeChar(int ch) errmsg("unsupported Unicode escape sequence"), errdetail("\\u0000 cannot be converted to text."))); } - else if (GetDatabaseEncoding() == PG_UTF8) - { - char utf8str[5]; - int utf8len; - - unicode_to_utf8(ch, (unsigned char *) utf8str); - utf8len = pg_utf_mblen((unsigned char *) utf8str); - addstring(false, utf8str, utf8len); - } - else if (ch <= 0x007f) - { - /* - * This is the only way to designate things like a - * form feed character in JSON, so it's useful in all - * encodings.
- */ - addchar(false, (char) ch); - } else { - ereport(ERROR, - (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), - errmsg("invalid input syntax for type %s", "jsonpath"), - errdetail("Unicode escape values cannot be used for code " - "point values above 007F when the server encoding " - "is not UTF8."))); + char cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1]; + + pg_unicode_to_server(ch, (unsigned char *) cbuf); + addstring(false, cbuf, strlen(cbuf)); } } -/* Add unicode character and process its hi surrogate */ +/* Add unicode character, processing any surrogate pairs */ static void addUnicode(int ch, int *hi_surrogate) { - if (ch >= 0xd800 && ch <= 0xdbff) + if (is_utf16_surrogate_first(ch)) { if (*hi_surrogate != -1) ereport(ERROR, @@ -542,10 +515,10 @@ addUnicode(int ch, int *hi_surrogate) errmsg("invalid input syntax for type %s", "jsonpath"), errdetail("Unicode high surrogate must not follow " "a high surrogate."))); - *hi_surrogate = (ch & 0x3ff) << 10; + *hi_surrogate = ch; return; } - else if (ch >= 0xdc00 && ch <= 0xdfff) + else if (is_utf16_surrogate_second(ch)) { if (*hi_surrogate == -1) ereport(ERROR, @@ -553,7 +526,7 @@ addUnicode(int ch, int *hi_surrogate) errmsg("invalid input syntax for type %s", "jsonpath"), errdetail("Unicode low surrogate must follow a high " "surrogate."))); - ch = 0x10000 + *hi_surrogate + (ch & 0x3ff); + ch = surrogate_pair_to_codepoint(*hi_surrogate, ch); *hi_surrogate = -1; } else if (*hi_surrogate != -1) diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 3808c30..a2d2a0b 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -2086,26 +2086,6 @@ map_sql_identifier_to_xml_name(const char *ident, bool fully_escaped, /* - * Map a Unicode codepoint into the current server encoding. - */ -static char * -unicode_to_sqlchar(pg_wchar c) -{ - char utf8string[8]; /* need room for trailing zero */ - char *result; - - memset(utf8string, 0, sizeof(utf8string)); - unicode_to_utf8(c, (unsigned char *) utf8string); - - result = pg_any_to_server(utf8string, strlen(utf8string), PG_UTF8); - /* if pg_any_to_server didn't strdup, we must */ - if (result == utf8string) - result = pstrdup(result); - return result; -} - - -/* * Map XML name to SQL identifier; see SQL/XML:2008 section 9.3. */ char * @@ -2125,10 +2105,12 @@ map_xml_name_to_sql_identifier(const char *name) && isxdigit((unsigned char) *(p + 5)) && *(p + 6) == '_') { + char cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1]; unsigned int u; sscanf(p + 2, "%X", &u); - appendStringInfoString(&buf, unicode_to_sqlchar(u)); + pg_unicode_to_server(u, (unsigned char *) cbuf); + appendStringInfoString(&buf, cbuf); p += 6; } else diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c index 5d7cc74..7d90ac9 100644 --- a/src/backend/utils/mb/mbutils.c +++ b/src/backend/utils/mb/mbutils.c @@ -68,6 +68,13 @@ static FmgrInfo *ToServerConvProc = NULL; static FmgrInfo *ToClientConvProc = NULL; /* + * This variable stores the conversion function to convert from UTF-8 + * to the server encoding. It's NULL if the server encoding *is* UTF-8, + * or if we lack a conversion function for this. + */ +static FmgrInfo *Utf8ToServerConvProc = NULL; + +/* * These variables track the currently-selected encodings. 
*/ static const pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII]; @@ -273,6 +280,8 @@ SetClientEncoding(int encoding) void InitializeClientEncoding(void) { + int current_server_encoding; + Assert(!backend_startup_complete); backend_startup_complete = true; @@ -289,6 +298,35 @@ InitializeClientEncoding(void) pg_enc2name_tbl[pending_client_encoding].name, GetDatabaseEncodingName()))); } + + /* + * Also look up the UTF8-to-server conversion function if needed. Since + * the server encoding is fixed within any one backend process, we don't + * have to do this more than once. + */ + current_server_encoding = GetDatabaseEncoding(); + if (current_server_encoding != PG_UTF8 && + current_server_encoding != PG_SQL_ASCII) + { + Oid utf8_to_server_proc; + + Assert(IsTransactionState()); + utf8_to_server_proc = + FindDefaultConversionProc(PG_UTF8, + current_server_encoding); + /* If there's no such conversion, just leave the pointer as NULL */ + if (OidIsValid(utf8_to_server_proc)) + { + FmgrInfo *finfo; + + finfo = (FmgrInfo *) MemoryContextAlloc(TopMemoryContext, + sizeof(FmgrInfo)); + fmgr_info_cxt(utf8_to_server_proc, finfo, + TopMemoryContext); + /* Set Utf8ToServerConvProc only after data is fully valid */ + Utf8ToServerConvProc = finfo; + } + } } /* @@ -752,6 +790,73 @@ perform_default_encoding_conversion(const char *src, int len, return result; } +/* + * Convert a single Unicode code point into a string in the server encoding. + * + * The code point given by "c" is converted and stored at *s, which must + * have at least MAX_UNICODE_EQUIVALENT_STRING+1 bytes available. + * The output will have a trailing '\0'. Throws error if the conversion + * cannot be performed. + * + * Note that this relies on having previously looked up any required + * conversion function. That's partly for speed but mostly because the parser + * may call this outside any transaction, or in an aborted transaction. + */ +void +pg_unicode_to_server(pg_wchar c, unsigned char *s) +{ + unsigned char c_as_utf8[MAX_MULTIBYTE_CHAR_LEN + 1]; + int c_as_utf8_len; + int server_encoding; + + /* + * Complain if invalid Unicode code point. The choice of errcode here is + * debatable, but really our caller should have checked this anyway. 
+ */ + if (c == 0 || c > 0x10FFFF) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("invalid Unicode code point"))); + + /* Otherwise, if it's in ASCII range, conversion is trivial */ + if (c <= 0x7F) + { + s[0] = (unsigned char) c; + s[1] = '\0'; + return; + } + + /* If the server encoding is UTF-8, we just need to reformat the code */ + server_encoding = GetDatabaseEncoding(); + if (server_encoding == PG_UTF8) + { + unicode_to_utf8(c, s); + s[pg_utf_mblen(s)] = '\0'; + return; + } + + /* For all other cases, we must have a conversion function available */ + if (Utf8ToServerConvProc == NULL) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("conversion between %s and %s is not supported", + pg_enc2name_tbl[PG_UTF8].name, + GetDatabaseEncodingName()))); + + /* Construct UTF-8 source string */ + unicode_to_utf8(c, c_as_utf8); + c_as_utf8_len = pg_utf_mblen(c_as_utf8); + c_as_utf8[c_as_utf8_len] = '\0'; + + /* Convert, or throw error if we can't */ + FunctionCall5(Utf8ToServerConvProc, + Int32GetDatum(PG_UTF8), + Int32GetDatum(server_encoding), + CStringGetDatum(c_as_utf8), + CStringGetDatum(s), + Int32GetDatum(c_as_utf8_len)); +} + /* convert a multibyte string to a wchar */ int diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h index 7fb5fa4..2daf301 100644 --- a/src/include/mb/pg_wchar.h +++ b/src/include/mb/pg_wchar.h @@ -316,6 +316,15 @@ typedef enum pg_enc #define MAX_CONVERSION_GROWTH 4 /* + * Maximum byte length of the string equivalent to any one Unicode code point, + * in any backend encoding. The current value assumes that a 4-byte UTF-8 + * character might expand by MAX_CONVERSION_GROWTH, which is a huge + * overestimate. But in current usage we don't allocate large multiples of + * this, so there's little point in being stingy. + */ +#define MAX_UNICODE_EQUIVALENT_STRING 16 + +/* * Table for mapping an encoding number to official encoding name and * possibly other subsidiary data. Be careful to check encoding number * before accessing a table entry! @@ -602,6 +611,8 @@ extern char *pg_server_to_client(const char *s, int len); extern char *pg_any_to_server(const char *s, int len, int encoding); extern char *pg_server_to_any(const char *s, int len, int encoding); +extern void pg_unicode_to_server(pg_wchar c, unsigned char *s); + extern unsigned short BIG5toCNS(unsigned short big5, unsigned char *lc); extern unsigned short CNStoBIG5(unsigned short cns, unsigned char lc); diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h index 7a0e5e5..a27352a 100644 --- a/src/include/parser/scanner.h +++ b/src/include/parser/scanner.h @@ -99,9 +99,13 @@ typedef struct core_yy_extra_type int literallen; /* actual current string length */ int literalalloc; /* current allocated buffer size */ + /* + * Random assorted scanner state. + */ int state_before_str_stop; /* start cond. 
before end quote */ int xcdepth; /* depth of nesting in slash-star comments */ char *dolqstart; /* current $foo$ quote start string */ + YYLTYPE save_yylloc; /* one-element stack for PUSH_YYLLOC() */ /* first part of UTF16 surrogate pair for Unicode escapes */ int32 utf16_first_part; @@ -116,6 +120,14 @@ typedef struct core_yy_extra_type */ typedef void *core_yyscan_t; +/* Support for scanner_errposition_callback function */ +typedef struct ScannerCallbackState +{ + core_yyscan_t yyscanner; + int location; + ErrorContextCallback errcallback; +} ScannerCallbackState; + /* Constant data exported from parser/scan.l */ extern PGDLLIMPORT const uint16 ScanKeywordTokens[]; @@ -129,6 +141,10 @@ extern void scanner_finish(core_yyscan_t yyscanner); extern int core_yylex(core_YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner); extern int scanner_errposition(int location, core_yyscan_t yyscanner); +extern void setup_scanner_errposition_callback(ScannerCallbackState *scbstate, + core_yyscan_t yyscanner, + int location); +extern void cancel_scanner_errposition_callback(ScannerCallbackState *scbstate); extern void scanner_yyerror(const char *message, core_yyscan_t yyscanner) pg_attribute_noreturn(); #endif /* SCANNER_H */ diff --git a/src/test/regress/expected/json_encoding.out b/src/test/regress/expected/json_encoding.out index d8d34f4..f343f74 100644 --- a/src/test/regress/expected/json_encoding.out +++ b/src/test/regress/expected/json_encoding.out @@ -1,4 +1,19 @@ +-- -- encoding-sensitive tests for json and jsonb +-- +-- We provide expected-results files for UTF8 (json_encoding.out) +-- and for SQL_ASCII (json_encoding_1.out). Skip otherwise. +SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII') + AS skip_test \gset +\if :skip_test +\quit +\endif +SELECT getdatabaseencoding(); -- just to label the results files + getdatabaseencoding +--------------------- + UTF8 +(1 row) + -- first json -- basic unicode input SELECT '"\u"'::json; -- ERROR, incomplete escape diff --git a/src/test/regress/expected/json_encoding_1.out b/src/test/regress/expected/json_encoding_1.out index 79ed78e..e2fc131 100644 --- a/src/test/regress/expected/json_encoding_1.out +++ b/src/test/regress/expected/json_encoding_1.out @@ -1,4 +1,19 @@ +-- -- encoding-sensitive tests for json and jsonb +-- +-- We provide expected-results files for UTF8 (json_encoding.out) +-- and for SQL_ASCII (json_encoding_1.out). Skip otherwise. +SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII') + AS skip_test \gset +\if :skip_test +\quit +\endif +SELECT getdatabaseencoding(); -- just to label the results files + getdatabaseencoding +--------------------- + SQL_ASCII +(1 row) + -- first json -- basic unicode input SELECT '"\u"'::json; -- ERROR, incomplete escape @@ -33,9 +48,7 @@ SELECT '"\uaBcD"'::json; -- OK, uppercase and lower case both OK -- handling of unicode surrogate pairs select json '{ "a": "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct_in_utf8; -ERROR: unsupported Unicode escape sequence -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. -CONTEXT: JSON data, line 1: { "a":... +ERROR: conversion between UTF8 and SQL_ASCII is not supported select json '{ "a": "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row ERROR: invalid input syntax for type json DETAIL: Unicode high surrogate must not follow a high surrogate. 
@@ -84,9 +97,7 @@ select json '{ "a": "null \\u0000 escape" }' as not_an_escape; (1 row) select json '{ "a": "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8; -ERROR: unsupported Unicode escape sequence -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. -CONTEXT: JSON data, line 1: { "a":... +ERROR: conversion between UTF8 and SQL_ASCII is not supported select json '{ "a": "dollar \u0024 character" }' ->> 'a' as correct_everywhere; correct_everywhere -------------------- @@ -144,18 +155,14 @@ CONTEXT: JSON data, line 1: ... -- use octet_length here so we don't get an odd unicode char in the -- output SELECT octet_length('"\uaBcD"'::jsonb::text); -- OK, uppercase and lower case both OK -ERROR: unsupported Unicode escape sequence +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: SELECT octet_length('"\uaBcD"'::jsonb::text); ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. -CONTEXT: JSON data, line 1: ... -- handling of unicode surrogate pairs SELECT octet_length((jsonb '{ "a": "\ud83d\ude04\ud83d\udc36" }' -> 'a')::text) AS correct_in_utf8; -ERROR: unsupported Unicode escape sequence +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: SELECT octet_length((jsonb '{ "a": "\ud83d\ude04\ud83d\udc3... ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. -CONTEXT: JSON data, line 1: { "a":... SELECT jsonb '{ "a": "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row ERROR: invalid input syntax for type json LINE 1: SELECT jsonb '{ "a": "\ud83d\ud83d" }' -> 'a'; @@ -182,11 +189,9 @@ DETAIL: Unicode low surrogate must follow a high surrogate. CONTEXT: JSON data, line 1: { "a":... -- handling of simple unicode escapes SELECT jsonb '{ "a": "the Copyright \u00a9 sign" }' as correct_in_utf8; -ERROR: unsupported Unicode escape sequence +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: SELECT jsonb '{ "a": "the Copyright \u00a9 sign" }' as corr... ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. -CONTEXT: JSON data, line 1: { "a":... SELECT jsonb '{ "a": "dollar \u0024 character" }' as correct_everywhere; correct_everywhere ----------------------------- @@ -212,11 +217,9 @@ SELECT jsonb '{ "a": "null \\u0000 escape" }' as not_an_escape; (1 row) SELECT jsonb '{ "a": "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8; -ERROR: unsupported Unicode escape sequence +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: SELECT jsonb '{ "a": "the Copyright \u00a9 sign" }' ->> 'a'... ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. -CONTEXT: JSON data, line 1: { "a":... SELECT jsonb '{ "a": "dollar \u0024 character" }' ->> 'a' as correct_everywhere; correct_everywhere -------------------- diff --git a/src/test/regress/expected/json_encoding_2.out b/src/test/regress/expected/json_encoding_2.out new file mode 100644 index 0000000..4fc8f02 --- /dev/null +++ b/src/test/regress/expected/json_encoding_2.out @@ -0,0 +1,9 @@ +-- +-- encoding-sensitive tests for json and jsonb +-- +-- We provide expected-results files for UTF8 (json_encoding.out) +-- and for SQL_ASCII (json_encoding_1.out). Skip otherwise. 
+SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII') + AS skip_test \gset +\if :skip_test +\quit diff --git a/src/test/regress/expected/jsonpath_encoding.out b/src/test/regress/expected/jsonpath_encoding.out index ecffe09..7cbfb6a 100644 --- a/src/test/regress/expected/jsonpath_encoding.out +++ b/src/test/regress/expected/jsonpath_encoding.out @@ -1,4 +1,19 @@ +-- -- encoding-sensitive tests for jsonpath +-- +-- We provide expected-results files for UTF8 (jsonpath_encoding.out) +-- and for SQL_ASCII (jsonpath_encoding_1.out). Skip otherwise. +SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII') + AS skip_test \gset +\if :skip_test +\quit +\endif +SELECT getdatabaseencoding(); -- just to label the results files + getdatabaseencoding +--------------------- + UTF8 +(1 row) + -- checks for double-quoted values -- basic unicode input SELECT '"\u"'::jsonpath; -- ERROR, incomplete escape diff --git a/src/test/regress/expected/jsonpath_encoding_1.out b/src/test/regress/expected/jsonpath_encoding_1.out index c8cc217..005136c 100644 --- a/src/test/regress/expected/jsonpath_encoding_1.out +++ b/src/test/regress/expected/jsonpath_encoding_1.out @@ -1,4 +1,19 @@ +-- -- encoding-sensitive tests for jsonpath +-- +-- We provide expected-results files for UTF8 (jsonpath_encoding.out) +-- and for SQL_ASCII (jsonpath_encoding_1.out). Skip otherwise. +SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII') + AS skip_test \gset +\if :skip_test +\quit +\endif +SELECT getdatabaseencoding(); -- just to label the results files + getdatabaseencoding +--------------------- + SQL_ASCII +(1 row) + -- checks for double-quoted values -- basic unicode input SELECT '"\u"'::jsonpath; -- ERROR, incomplete escape @@ -19,16 +34,14 @@ LINE 1: SELECT '"\u0000"'::jsonpath; ^ DETAIL: \u0000 cannot be converted to text. SELECT '"\uaBcD"'::jsonpath; -- OK, uppercase and lower case both OK -ERROR: invalid input syntax for type jsonpath +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: SELECT '"\uaBcD"'::jsonpath; ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. -- handling of unicode surrogate pairs select '"\ud83d\ude04\ud83d\udc36"'::jsonpath as correct_in_utf8; -ERROR: invalid input syntax for type jsonpath +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: select '"\ud83d\ude04\ud83d\udc36"'::jsonpath as correct_in_... ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. select '"\ud83d\ud83d"'::jsonpath; -- 2 high surrogates in a row ERROR: invalid input syntax for type jsonpath LINE 1: select '"\ud83d\ud83d"'::jsonpath; @@ -51,10 +64,9 @@ LINE 1: select '"\ude04X"'::jsonpath; DETAIL: Unicode low surrogate must follow a high surrogate. --handling of simple unicode escapes select '"the Copyright \u00a9 sign"'::jsonpath as correct_in_utf8; -ERROR: invalid input syntax for type jsonpath +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: select '"the Copyright \u00a9 sign"'::jsonpath as correct_in... ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. select '"dollar \u0024 character"'::jsonpath as correct_everywhere; correct_everywhere ---------------------- @@ -98,16 +110,14 @@ LINE 1: SELECT '$."\u0000"'::jsonpath; ^ DETAIL: \u0000 cannot be converted to text. 
SELECT '$."\uaBcD"'::jsonpath; -- OK, uppercase and lower case both OK -ERROR: invalid input syntax for type jsonpath +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: SELECT '$."\uaBcD"'::jsonpath; ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. -- handling of unicode surrogate pairs select '$."\ud83d\ude04\ud83d\udc36"'::jsonpath as correct_in_utf8; -ERROR: invalid input syntax for type jsonpath +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: select '$."\ud83d\ude04\ud83d\udc36"'::jsonpath as correct_i... ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. select '$."\ud83d\ud83d"'::jsonpath; -- 2 high surrogates in a row ERROR: invalid input syntax for type jsonpath LINE 1: select '$."\ud83d\ud83d"'::jsonpath; @@ -130,10 +140,9 @@ LINE 1: select '$."\ude04X"'::jsonpath; DETAIL: Unicode low surrogate must follow a high surrogate. --handling of simple unicode escapes select '$."the Copyright \u00a9 sign"'::jsonpath as correct_in_utf8; -ERROR: invalid input syntax for type jsonpath +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: select '$."the Copyright \u00a9 sign"'::jsonpath as correct_... ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. select '$."dollar \u0024 character"'::jsonpath as correct_everywhere; correct_everywhere ------------------------ diff --git a/src/test/regress/expected/jsonpath_encoding_2.out b/src/test/regress/expected/jsonpath_encoding_2.out new file mode 100644 index 0000000..bb71bfe --- /dev/null +++ b/src/test/regress/expected/jsonpath_encoding_2.out @@ -0,0 +1,9 @@ +-- +-- encoding-sensitive tests for jsonpath +-- +-- We provide expected-results files for UTF8 (jsonpath_encoding.out) +-- and for SQL_ASCII (jsonpath_encoding_1.out). Skip otherwise. +SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII') + AS skip_test \gset +\if :skip_test +\quit diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out index 60cb861..6c4443a 100644 --- a/src/test/regress/expected/strings.out +++ b/src/test/regress/expected/strings.out @@ -35,6 +35,12 @@ SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*'; dat\+000061 (1 row) +SELECT U&'a\\b' AS "a\b"; + a\b +----- + a\b +(1 row) + SELECT U&' \' UESCAPE '!' AS "tricky"; tricky -------- @@ -48,13 +54,15 @@ SELECT 'tricky' AS U&"\" UESCAPE '!'; (1 row) SELECT U&'wrong: \061'; -ERROR: invalid Unicode escape value +ERROR: invalid Unicode escape LINE 1: SELECT U&'wrong: \061'; ^ +HINT: Unicode escapes must be \XXXX or \+XXXXXX. SELECT U&'wrong: \+0061'; -ERROR: invalid Unicode escape value +ERROR: invalid Unicode escape LINE 1: SELECT U&'wrong: \+0061'; ^ +HINT: Unicode escapes must be \XXXX or \+XXXXXX. 
SELECT U&'wrong: +0061' UESCAPE +; ERROR: UESCAPE must be followed by a simple string literal at or near "+" LINE 1: SELECT U&'wrong: +0061' UESCAPE +; @@ -63,6 +71,77 @@ SELECT U&'wrong: +0061' UESCAPE '+'; ERROR: invalid Unicode escape character at or near "'+'" LINE 1: SELECT U&'wrong: +0061' UESCAPE '+'; ^ +SELECT U&'wrong: \db99'; +ERROR: invalid Unicode surrogate pair +LINE 1: SELECT U&'wrong: \db99'; + ^ +SELECT U&'wrong: \db99xy'; +ERROR: invalid Unicode surrogate pair +LINE 1: SELECT U&'wrong: \db99xy'; + ^ +SELECT U&'wrong: \db99\\'; +ERROR: invalid Unicode surrogate pair +LINE 1: SELECT U&'wrong: \db99\\'; + ^ +SELECT U&'wrong: \db99\0061'; +ERROR: invalid Unicode surrogate pair +LINE 1: SELECT U&'wrong: \db99\0061'; + ^ +SELECT U&'wrong: \+00db99\+000061'; +ERROR: invalid Unicode surrogate pair +LINE 1: SELECT U&'wrong: \+00db99\+000061'; + ^ +SELECT U&'wrong: \+2FFFFF'; +ERROR: invalid Unicode escape value +LINE 1: SELECT U&'wrong: \+2FFFFF'; + ^ +-- while we're here, check the same cases in E-style literals +SELECT E'd\u0061t\U00000061' AS "data"; + data +------ + data +(1 row) + +SELECT E'a\\b' AS "a\b"; + a\b +----- + a\b +(1 row) + +SELECT E'wrong: \u061'; +ERROR: invalid Unicode escape +LINE 1: SELECT E'wrong: \u061'; + ^ +HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX. +SELECT E'wrong: \U0061'; +ERROR: invalid Unicode escape +LINE 1: SELECT E'wrong: \U0061'; + ^ +HINT: Unicode escapes must be \uXXXX or \UXXXXXXXX. +SELECT E'wrong: \udb99'; +ERROR: invalid Unicode surrogate pair at or near "'" +LINE 1: SELECT E'wrong: \udb99'; + ^ +SELECT E'wrong: \udb99xy'; +ERROR: invalid Unicode surrogate pair at or near "x" +LINE 1: SELECT E'wrong: \udb99xy'; + ^ +SELECT E'wrong: \udb99\\'; +ERROR: invalid Unicode surrogate pair at or near "\" +LINE 1: SELECT E'wrong: \udb99\\'; + ^ +SELECT E'wrong: \udb99\u0061'; +ERROR: invalid Unicode surrogate pair at or near "\u0061" +LINE 1: SELECT E'wrong: \udb99\u0061'; + ^ +SELECT E'wrong: \U0000db99\U00000061'; +ERROR: invalid Unicode surrogate pair at or near "\U00000061" +LINE 1: SELECT E'wrong: \U0000db99\U00000061'; + ^ +SELECT E'wrong: \U002FFFFF'; +ERROR: invalid Unicode escape value at or near "\U002FFFFF" +LINE 1: SELECT E'wrong: \U002FFFFF'; + ^ SET standard_conforming_strings TO off; SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061"; ERROR: unsafe use of string constant with Unicode escapes diff --git a/src/test/regress/sql/json_encoding.sql b/src/test/regress/sql/json_encoding.sql index 87a2d56..d7fac69 100644 --- a/src/test/regress/sql/json_encoding.sql +++ b/src/test/regress/sql/json_encoding.sql @@ -1,5 +1,16 @@ - +-- -- encoding-sensitive tests for json and jsonb +-- + +-- We provide expected-results files for UTF8 (json_encoding.out) +-- and for SQL_ASCII (json_encoding_1.out). Skip otherwise. +SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII') + AS skip_test \gset +\if :skip_test +\quit +\endif + +SELECT getdatabaseencoding(); -- just to label the results files -- first json diff --git a/src/test/regress/sql/jsonpath_encoding.sql b/src/test/regress/sql/jsonpath_encoding.sql index 3a23b72..55d9e30 100644 --- a/src/test/regress/sql/jsonpath_encoding.sql +++ b/src/test/regress/sql/jsonpath_encoding.sql @@ -1,5 +1,16 @@ - +-- -- encoding-sensitive tests for jsonpath +-- + +-- We provide expected-results files for UTF8 (jsonpath_encoding.out) +-- and for SQL_ASCII (jsonpath_encoding_1.out). Skip otherwise. 
+SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII') + AS skip_test \gset +\if :skip_test +\quit +\endif + +SELECT getdatabaseencoding(); -- just to label the results files -- checks for double-quoted values diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql index c5cd151..3e28cd1 100644 --- a/src/test/regress/sql/strings.sql +++ b/src/test/regress/sql/strings.sql @@ -21,6 +21,7 @@ SET standard_conforming_strings TO on; SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061"; SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*'; +SELECT U&'a\\b' AS "a\b"; SELECT U&' \' UESCAPE '!' AS "tricky"; SELECT 'tricky' AS U&"\" UESCAPE '!'; @@ -30,6 +31,25 @@ SELECT U&'wrong: \+0061'; SELECT U&'wrong: +0061' UESCAPE +; SELECT U&'wrong: +0061' UESCAPE '+'; +SELECT U&'wrong: \db99'; +SELECT U&'wrong: \db99xy'; +SELECT U&'wrong: \db99\\'; +SELECT U&'wrong: \db99\0061'; +SELECT U&'wrong: \+00db99\+000061'; +SELECT U&'wrong: \+2FFFFF'; + +-- while we're here, check the same cases in E-style literals +SELECT E'd\u0061t\U00000061' AS "data"; +SELECT E'a\\b' AS "a\b"; +SELECT E'wrong: \u061'; +SELECT E'wrong: \U0061'; +SELECT E'wrong: \udb99'; +SELECT E'wrong: \udb99xy'; +SELECT E'wrong: \udb99\\'; +SELECT E'wrong: \udb99\u0061'; +SELECT E'wrong: \U0000db99\U00000061'; +SELECT E'wrong: \U002FFFFF'; + SET standard_conforming_strings TO off; SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
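(To make the behavior change concrete, here is a hypothetical psql session in
a LATIN1-encoded database; this is a sketch of the intended semantics, and the
conversion-failure message shown is the generic one produced by the existing
encoding-conversion machinery, not new wording introduced by this patch.)

    -- U+00E9 has a LATIN1 equivalent, so this is now accepted and converted:
    SELECT U&'\00E9t\00E9' AS word;
     word
    ------
     été
    (1 row)

    -- U+4E2D has no LATIN1 equivalent, so the conversion fails cleanly:
    SELECT U&'\4E2D';
    ERROR:  character with byte sequence 0xe4 0xb8 0xad in encoding "UTF8" has no equivalent in encoding "LATIN1"

Previously, both statements would have been rejected up front with "Unicode
escape values cannot be used for code point values above 007F when the server
encoding is not UTF8".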
I wrote: > [ unicode-escapes-with-other-server-encodings-2.patch ] I see this patch got sideswiped by the recent refactoring of JSON lexing. Here's an attempt at fixing it up. Since the frontend code isn't going to have access to encoding conversion facilities, this creates a difference between frontend and backend handling of JSON Unicode escapes, which is mildly annoying but probably isn't going to bother anyone in the real world. Outside of jsonapi.c, there are no changes from v2. regards, tom lane diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml index 1b6aaf0..a9c68c7 100644 --- a/doc/src/sgml/json.sgml +++ b/doc/src/sgml/json.sgml @@ -61,8 +61,8 @@ </para> <para> - <productname>PostgreSQL</productname> allows only one character set - encoding per database. It is therefore not possible for the JSON + RFC 7159 specifies that JSON strings should be encoded in UTF8. + It is therefore not possible for the JSON types to conform rigidly to the JSON specification unless the database encoding is UTF8. Attempts to directly include characters that cannot be represented in the database encoding will fail; conversely, @@ -77,13 +77,13 @@ regardless of the database encoding, and are checked only for syntactic correctness (that is, that four hex digits follow <literal>\u</literal>). However, the input function for <type>jsonb</type> is stricter: it disallows - Unicode escapes for non-ASCII characters (those above <literal>U+007F</literal>) - unless the database encoding is UTF8. The <type>jsonb</type> type also + Unicode escapes for characters that cannot be represented in the database + encoding. The <type>jsonb</type> type also rejects <literal>\u0000</literal> (because that cannot be represented in <productname>PostgreSQL</productname>'s <type>text</type> type), and it insists that any use of Unicode surrogate pairs to designate characters outside the Unicode Basic Multilingual Plane be correct. Valid Unicode escapes - are converted to the equivalent ASCII or UTF8 character for storage; + are converted to the equivalent single character for storage; this includes folding surrogate pairs into a single character. </para> @@ -96,9 +96,8 @@ not <type>jsonb</type>. The fact that the <type>json</type> input function does not make these checks may be considered a historical artifact, although it does allow for simple storage (without processing) of JSON Unicode - escapes in a non-UTF8 database encoding. In general, it is best to - avoid mixing Unicode escapes in JSON with a non-UTF8 database encoding, - if possible. + escapes in a database encoding that does not support the represented + characters. </para> </note> @@ -144,8 +143,8 @@ <row> <entry><type>string</type></entry> <entry><type>text</type></entry> - <entry><literal>\u0000</literal> is disallowed, as are non-ASCII Unicode - escapes if database encoding is not UTF8</entry> + <entry><literal>\u0000</literal> is disallowed, as are Unicode escapes + representing characters not available in the database encoding</entry> </row> <row> <entry><type>number</type></entry> diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml index c908e0b..e134877 100644 --- a/doc/src/sgml/syntax.sgml +++ b/doc/src/sgml/syntax.sgml @@ -189,6 +189,23 @@ UPDATE "my_table" SET "a" = 5; ampersands. The length limitation still applies. </para> + <para> + Quoting an identifier also makes it case-sensitive, whereas + unquoted names are always folded to lower case. 
For example, the + identifiers <literal>FOO</literal>, <literal>foo</literal>, and + <literal>"foo"</literal> are considered the same by + <productname>PostgreSQL</productname>, but + <literal>"Foo"</literal> and <literal>"FOO"</literal> are + different from these three and each other. (The folding of + unquoted names to lower case in <productname>PostgreSQL</productname> is + incompatible with the SQL standard, which says that unquoted names + should be folded to upper case. Thus, <literal>foo</literal> + should be equivalent to <literal>"FOO"</literal> not + <literal>"foo"</literal> according to the standard. If you want + to write portable applications you are advised to always quote a + particular name or never quote it.) + </para> + <indexterm> <primary>Unicode escape</primary> <secondary>in identifiers</secondary> @@ -230,7 +247,8 @@ U&"d!0061t!+000061" UESCAPE '!' The escape character can be any single character other than a hexadecimal digit, the plus sign, a single quote, a double quote, or a whitespace character. Note that the escape character is - written in single quotes, not double quotes. + written in single quotes, not double quotes, + after <literal>UESCAPE</literal>. </para> <para> @@ -239,32 +257,18 @@ U&"d!0061t!+000061" UESCAPE '!' </para> <para> - The Unicode escape syntax works only when the server encoding is - <literal>UTF8</literal>. When other server encodings are used, only code - points in the ASCII range (up to <literal>\007F</literal>) can be - specified. Both the 4-digit and the 6-digit form can be used to + Either the 4-digit or the 6-digit escape form can be used to specify UTF-16 surrogate pairs to compose characters with code points larger than U+FFFF, although the availability of the 6-digit form technically makes this unnecessary. (Surrogate - pairs are not stored directly, but combined into a single - code point that is then encoded in UTF-8.) + pairs are not stored directly, but are combined into a single + code point.) </para> <para> - Quoting an identifier also makes it case-sensitive, whereas - unquoted names are always folded to lower case. For example, the - identifiers <literal>FOO</literal>, <literal>foo</literal>, and - <literal>"foo"</literal> are considered the same by - <productname>PostgreSQL</productname>, but - <literal>"Foo"</literal> and <literal>"FOO"</literal> are - different from these three and each other. (The folding of - unquoted names to lower case in <productname>PostgreSQL</productname> is - incompatible with the SQL standard, which says that unquoted names - should be folded to upper case. Thus, <literal>foo</literal> - should be equivalent to <literal>"FOO"</literal> not - <literal>"foo"</literal> according to the standard. If you want - to write portable applications you are advised to always quote a - particular name or never quote it.) + If the server encoding is not UTF-8, the Unicode code point identified + by one of these escape sequences is converted to the actual server + encoding; an error is reported if that's not possible. </para> </sect2> @@ -427,25 +431,11 @@ SELECT 'foo' 'bar'; <para> It is your responsibility that the byte sequences you create, especially when using the octal or hexadecimal escapes, compose - valid characters in the server character set encoding. When the - server encoding is UTF-8, then the Unicode escapes or the + valid characters in the server character set encoding. 
+ A useful alternative is to use Unicode escapes or the alternative Unicode escape syntax, explained - in <xref linkend="sql-syntax-strings-uescape"/>, should be used - instead. (The alternative would be doing the UTF-8 encoding by - hand and writing out the bytes, which would be very cumbersome.) - </para> - - <para> - The Unicode escape syntax works fully only when the server - encoding is <literal>UTF8</literal>. When other server encodings are - used, only code points in the ASCII range (up - to <literal>\u007F</literal>) can be specified. Both the 4-digit and - the 8-digit form can be used to specify UTF-16 surrogate pairs to - compose characters with code points larger than U+FFFF, although - the availability of the 8-digit form technically makes this - unnecessary. (When surrogate pairs are used when the server - encoding is <literal>UTF8</literal>, they are first combined into a - single code point that is then encoded in UTF-8.) + in <xref linkend="sql-syntax-strings-uescape"/>; then the server + will check that the character conversion is possible. </para> <caution> @@ -524,16 +514,23 @@ U&'d!0061t!+000061' UESCAPE '!' </para> <para> - The Unicode escape syntax works only when the server encoding is - <literal>UTF8</literal>. When other server encodings are used, only - code points in the ASCII range (up to <literal>\007F</literal>) - can be specified. Both the 4-digit and the 6-digit form can be - used to specify UTF-16 surrogate pairs to compose characters with - code points larger than U+FFFF, although the availability of the - 6-digit form technically makes this unnecessary. (When surrogate - pairs are used when the server encoding is <literal>UTF8</literal>, they - are first combined into a single code point that is then encoded - in UTF-8.) + To include the escape character in the string literally, write + it twice. + </para> + + <para> + Either the 4-digit or the 6-digit escape form can be used to + specify UTF-16 surrogate pairs to compose characters with code + points larger than U+FFFF, although the availability of the + 6-digit form technically makes this unnecessary. (Surrogate + pairs are not stored directly, but are combined into a single + code point.) + </para> + + <para> + If the server encoding is not UTF-8, the Unicode code point identified + by one of these escape sequences is converted to the actual server + encoding; an error is reported if that's not possible. </para> <para> @@ -546,11 +543,6 @@ U&'d!0061t!+000061' UESCAPE '!' parameter is set to off, this syntax will be rejected with an error message. </para> - - <para> - To include the escape character in the string literally, write it - twice. - </para> </sect3> <sect3 id="sql-syntax-dollar-quoting"> diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c index 1bf1144..22c9479 100644 --- a/src/backend/parser/parser.c +++ b/src/backend/parser/parser.c @@ -292,22 +292,15 @@ hexval(unsigned char c) return 0; /* not reached */ } -/* is Unicode code point acceptable in database's encoding? */ +/* is Unicode code point acceptable? 
 */
 static void
-check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner)
+check_unicode_value(pg_wchar c)
 {
 	/* See also addunicode() in scan.l */
 	if (c == 0 || c > 0x10FFFF)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
-				 errmsg("invalid Unicode escape value"),
-				 scanner_errposition(pos, yyscanner)));
-
-	if (c > 0x7F && GetDatabaseEncoding() != PG_UTF8)
-		ereport(ERROR,
-				(errcode(ERRCODE_SYNTAX_ERROR),
-				 errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"),
-				 scanner_errposition(pos, yyscanner)));
+				 errmsg("invalid Unicode escape value")));
 }
 
 /* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
@@ -338,20 +331,39 @@ str_udeescape(const char *str, char escape,
 	const char *in;
 	char	   *new,
 			   *out;
+	size_t		new_len;
 	pg_wchar	pair_first = 0;
+	ScannerCallbackState scbstate;
 
 	/*
-	 * This relies on the subtle assumption that a UTF-8 expansion cannot be
-	 * longer than its escaped representation.
-	 */
-	new = palloc(strlen(str) + 1);
+	 * Guesstimate that result will be no longer than input, but allow enough
+	 * padding for Unicode conversion.
+	 */
+	new_len = strlen(str) + MAX_UNICODE_EQUIVALENT_STRING + 1;
+	new = palloc(new_len);
 
 	in = str;
 	out = new;
 	while (*in)
 	{
+		/* Enlarge string if needed */
+		size_t		out_dist = out - new;
+
+		if (out_dist > new_len - (MAX_UNICODE_EQUIVALENT_STRING + 1))
+		{
+			new_len *= 2;
+			new = repalloc(new, new_len);
+			out = new + out_dist;
+		}
+
 		if (in[0] == escape)
 		{
+			/*
+			 * Any errors reported while processing this escape sequence will
+			 * have an error cursor pointing at the escape.
+			 */
+			setup_scanner_errposition_callback(&scbstate, yyscanner,
+											   in - str + position + 3);	/* 3 for U&" */
 			if (in[1] == escape)
 			{
 				if (pair_first)
@@ -370,9 +382,7 @@ str_udeescape(const char *str, char escape,
 					(hexval(in[2]) << 8) +
 					(hexval(in[3]) << 4) +
 					hexval(in[4]);
-				check_unicode_value(unicode,
-									in - str + position + 3,	/* 3 for U&" */
-									yyscanner);
+				check_unicode_value(unicode);
 				if (pair_first)
 				{
 					if (is_utf16_surrogate_second(unicode))
@@ -390,8 +400,8 @@ str_udeescape(const char *str, char escape,
 					pair_first = unicode;
 				else
 				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
+					pg_unicode_to_server(unicode, (unsigned char *) out);
+					out += strlen(out);
 				}
 				in += 5;
 			}
@@ -411,9 +421,7 @@ str_udeescape(const char *str, char escape,
 					(hexval(in[5]) << 8) +
 					(hexval(in[6]) << 4) +
 					hexval(in[7]);
-				check_unicode_value(unicode,
-									in - str + position + 3,	/* 3 for U&" */
-									yyscanner);
+				check_unicode_value(unicode);
 				if (pair_first)
 				{
 					if (is_utf16_surrogate_second(unicode))
@@ -431,17 +439,18 @@ str_udeescape(const char *str, char escape,
 					pair_first = unicode;
 				else
 				{
-					unicode_to_utf8(unicode, (unsigned char *) out);
-					out += pg_mblen(out);
+					pg_unicode_to_server(unicode, (unsigned char *) out);
+					out += strlen(out);
 				}
 				in += 8;
 			}
 			else
 				ereport(ERROR,
 						(errcode(ERRCODE_SYNTAX_ERROR),
-						 errmsg("invalid Unicode escape value"),
-						 scanner_errposition(in - str + position + 3,	/* 3 for U&" */
-											 yyscanner)));
+						 errmsg("invalid Unicode escape"),
+						 errhint("Unicode escapes must be \\XXXX or \\+XXXXXX.")));
+
+			cancel_scanner_errposition_callback(&scbstate);
 		}
 		else
 		{
@@ -457,15 +466,13 @@ str_udeescape(const char *str, char escape,
 		goto invalid_pair;
 
 	*out = '\0';
+	return new;
 
 	/*
-	 * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
-	 * codes; but it's probably not worth the trouble, since this isn't likely
-	 * to be a performance-critical path.
+ * We might get here with the error callback active, or not. Call + * scanner_errposition to make sure an error cursor appears; if the + * callback is active, this is duplicative but harmless. */ - pg_verifymbstr(new, out - new, false); - return new; - invalid_pair: ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l index 84c7391..685aa84 100644 --- a/src/backend/parser/scan.l +++ b/src/backend/parser/scan.l @@ -106,6 +106,18 @@ const uint16 ScanKeywordTokens[] = { */ #define ADVANCE_YYLLOC(delta) ( *(yylloc) += (delta) ) +/* + * Sometimes, we do want yylloc to point into the middle of a token; this is + * useful for instance to throw an error about an escape sequence within a + * string literal. But if we find no error there, we want to revert yylloc + * to the token start, so that that's the location reported to the parser. + * Use PUSH_YYLLOC/POP_YYLLOC to save/restore yylloc around such code. + * (Currently the implied "stack" is just one location, but someday we might + * need to nest these.) + */ +#define PUSH_YYLLOC() (yyextra->save_yylloc = *(yylloc)) +#define POP_YYLLOC() (*(yylloc) = yyextra->save_yylloc) + #define startlit() ( yyextra->literallen = 0 ) static void addlit(char *ytext, int yleng, core_yyscan_t yyscanner); static void addlitchar(unsigned char ychar, core_yyscan_t yyscanner); @@ -605,8 +617,18 @@ other . <xe>{xeunicode} { pg_wchar c = strtoul(yytext + 2, NULL, 16); + /* + * For consistency with other productions, issue any + * escape warning with cursor pointing to start of string. + * We might want to change that, someday. + */ check_escape_warning(yyscanner); + /* Remember start of overall string token ... */ + PUSH_YYLLOC(); + /* ... and set the error cursor to point at this esc seq */ + SET_YYLLOC(); + if (is_utf16_surrogate_first(c)) { yyextra->utf16_first_part = c; @@ -616,10 +638,18 @@ other . yyerror("invalid Unicode surrogate pair"); else addunicode(c, yyscanner); + + /* Restore yylloc to be start of string token */ + POP_YYLLOC(); } <xeu>{xeunicode} { pg_wchar c = strtoul(yytext + 2, NULL, 16); + /* Remember start of overall string token ... */ + PUSH_YYLLOC(); + /* ... and set the error cursor to point at this esc seq */ + SET_YYLLOC(); + if (!is_utf16_surrogate_second(c)) yyerror("invalid Unicode surrogate pair"); @@ -627,12 +657,21 @@ other . addunicode(c, yyscanner); + /* Restore yylloc to be start of string token */ + POP_YYLLOC(); + BEGIN(xe); } -<xeu>. { yyerror("invalid Unicode surrogate pair"); } -<xeu>\n { yyerror("invalid Unicode surrogate pair"); } -<xeu><<EOF>> { yyerror("invalid Unicode surrogate pair"); } +<xeu>. | +<xeu>\n | +<xeu><<EOF>> { + /* Set the error cursor to point at missing esc seq */ + SET_YYLLOC(); + yyerror("invalid Unicode surrogate pair"); + } <xe,xeu>{xeunicodefail} { + /* Set the error cursor to point at malformed esc seq */ + SET_YYLLOC(); ereport(ERROR, (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE), errmsg("invalid Unicode escape"), @@ -1029,12 +1068,13 @@ other . * scanner_errposition * Report a lexer or grammar error cursor position, if possible. * - * This is expected to be used within an ereport() call. The return value + * This is expected to be used within an ereport() call, or via an error + * callback such as setup_scanner_errposition_callback(). The return value * is a dummy (always 0, in fact). 
 *
 * Note that this can only be used for messages emitted during raw parsing
- * (essentially, scan.l and gram.y), since it requires the yyscanner struct
- * to still be available.
+ * (essentially, scan.l, parser.c, and gram.y), since it requires the
+ * yyscanner struct to still be available.
  */
 int
 scanner_errposition(int location, core_yyscan_t yyscanner)
@@ -1051,6 +1091,62 @@ scanner_errposition(int location, core_yyscan_t yyscanner)
 }
 
 /*
+ * Error context callback for inserting scanner error location.
+ *
+ * Note that this will be called for *any* error occurring while the
+ * callback is installed.  We avoid inserting an irrelevant error location
+ * if the error is a query cancel --- are there any other important cases?
+ */
+static void
+scb_error_callback(void *arg)
+{
+	ScannerCallbackState *scbstate = (ScannerCallbackState *) arg;
+
+	if (geterrcode() != ERRCODE_QUERY_CANCELED)
+		(void) scanner_errposition(scbstate->location, scbstate->yyscanner);
+}
+
+/*
+ * setup_scanner_errposition_callback
+ *		Arrange for non-scanner errors to report an error position
+ *
+ * Sometimes the scanner calls functions that aren't part of the scanner
+ * subsystem and can't reasonably be passed the yyscanner pointer; yet
+ * we would like any errors thrown in those functions to be tagged with an
+ * error location.  Use this function to set up an error context stack
+ * entry that will accomplish that.  Usage pattern:
+ *
+ *		declare a local variable "ScannerCallbackState scbstate"
+ *		...
+ *		setup_scanner_errposition_callback(&scbstate, yyscanner, location);
+ *		call function that might throw error;
+ *		cancel_scanner_errposition_callback(&scbstate);
+ */
+void
+setup_scanner_errposition_callback(ScannerCallbackState *scbstate,
+								   core_yyscan_t yyscanner,
+								   int location)
+{
+	/* Setup error traceback support for ereport() */
+	scbstate->yyscanner = yyscanner;
+	scbstate->location = location;
+	scbstate->errcallback.callback = scb_error_callback;
+	scbstate->errcallback.arg = (void *) scbstate;
+	scbstate->errcallback.previous = error_context_stack;
+	error_context_stack = &scbstate->errcallback;
+}
+
+/*
+ * Cancel a previously-set-up errposition callback.
+ */
+void
+cancel_scanner_errposition_callback(ScannerCallbackState *scbstate)
+{
+	/* Pop the error context stack */
+	error_context_stack = scbstate->errcallback.previous;
+}
+
+/*
  * scanner_yyerror
  *		Report a lexer or grammar error.
  *
@@ -1226,19 +1322,21 @@ process_integer_literal(const char *token, YYSTYPE *lval)
 static void
 addunicode(pg_wchar c, core_yyscan_t yyscanner)
 {
-	char		buf[8];
+	ScannerCallbackState scbstate;
+	char		buf[MAX_UNICODE_EQUIVALENT_STRING + 1];
 
 	/* See also check_unicode_value() in parser.c */
 	if (c == 0 || c > 0x10FFFF)
 		yyerror("invalid Unicode escape value");
-	if (c > 0x7F)
-	{
-		if (GetDatabaseEncoding() != PG_UTF8)
-			yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
-		yyextra->saw_non_ascii = true;
-	}
-	unicode_to_utf8(c, (unsigned char *) buf);
-	addlit(buf, pg_mblen(buf), yyscanner);
+
+	/*
+	 * We expect that pg_unicode_to_server() will complain about any
+	 * unconvertible code point, so we don't have to set saw_non_ascii.
+ */ + setup_scanner_errposition_callback(&scbstate, yyscanner, *(yylloc)); + pg_unicode_to_server(c, (unsigned char *) buf); + cancel_scanner_errposition_callback(&scbstate); + addlit(buf, strlen(buf), yyscanner); } static unsigned char diff --git a/src/backend/utils/adt/jsonpath_scan.l b/src/backend/utils/adt/jsonpath_scan.l index 70681b7..be0a2cf 100644 --- a/src/backend/utils/adt/jsonpath_scan.l +++ b/src/backend/utils/adt/jsonpath_scan.l @@ -486,13 +486,6 @@ hexval(char c) static void addUnicodeChar(int ch) { - /* - * For UTF8, replace the escape sequence by the actual - * utf8 character in lex->strval. Do this also for other - * encodings if the escape designates an ASCII character, - * otherwise raise an error. - */ - if (ch == 0) { /* We can't allow this, since our TEXT type doesn't */ @@ -501,40 +494,20 @@ addUnicodeChar(int ch) errmsg("unsupported Unicode escape sequence"), errdetail("\\u0000 cannot be converted to text."))); } - else if (GetDatabaseEncoding() == PG_UTF8) - { - char utf8str[5]; - int utf8len; - - unicode_to_utf8(ch, (unsigned char *) utf8str); - utf8len = pg_utf_mblen((unsigned char *) utf8str); - addstring(false, utf8str, utf8len); - } - else if (ch <= 0x007f) - { - /* - * This is the only way to designate things like a - * form feed character in JSON, so it's useful in all - * encodings. - */ - addchar(false, (char) ch); - } else { - ereport(ERROR, - (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION), - errmsg("invalid input syntax for type %s", "jsonpath"), - errdetail("Unicode escape values cannot be used for code " - "point values above 007F when the server encoding " - "is not UTF8."))); + char cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1]; + + pg_unicode_to_server(ch, (unsigned char *) cbuf); + addstring(false, cbuf, strlen(cbuf)); } } -/* Add unicode character and process its hi surrogate */ +/* Add unicode character, processing any surrogate pairs */ static void addUnicode(int ch, int *hi_surrogate) { - if (ch >= 0xd800 && ch <= 0xdbff) + if (is_utf16_surrogate_first(ch)) { if (*hi_surrogate != -1) ereport(ERROR, @@ -542,10 +515,10 @@ addUnicode(int ch, int *hi_surrogate) errmsg("invalid input syntax for type %s", "jsonpath"), errdetail("Unicode high surrogate must not follow " "a high surrogate."))); - *hi_surrogate = (ch & 0x3ff) << 10; + *hi_surrogate = ch; return; } - else if (ch >= 0xdc00 && ch <= 0xdfff) + else if (is_utf16_surrogate_second(ch)) { if (*hi_surrogate == -1) ereport(ERROR, @@ -553,7 +526,7 @@ addUnicode(int ch, int *hi_surrogate) errmsg("invalid input syntax for type %s", "jsonpath"), errdetail("Unicode low surrogate must follow a high " "surrogate."))); - ch = 0x10000 + *hi_surrogate + (ch & 0x3ff); + ch = surrogate_pair_to_codepoint(*hi_surrogate, ch); *hi_surrogate = -1; } else if (*hi_surrogate != -1) diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 3808c30..a2d2a0b 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -2086,26 +2086,6 @@ map_sql_identifier_to_xml_name(const char *ident, bool fully_escaped, /* - * Map a Unicode codepoint into the current server encoding. 
- */ -static char * -unicode_to_sqlchar(pg_wchar c) -{ - char utf8string[8]; /* need room for trailing zero */ - char *result; - - memset(utf8string, 0, sizeof(utf8string)); - unicode_to_utf8(c, (unsigned char *) utf8string); - - result = pg_any_to_server(utf8string, strlen(utf8string), PG_UTF8); - /* if pg_any_to_server didn't strdup, we must */ - if (result == utf8string) - result = pstrdup(result); - return result; -} - - -/* * Map XML name to SQL identifier; see SQL/XML:2008 section 9.3. */ char * @@ -2125,10 +2105,12 @@ map_xml_name_to_sql_identifier(const char *name) && isxdigit((unsigned char) *(p + 5)) && *(p + 6) == '_') { + char cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1]; unsigned int u; sscanf(p + 2, "%X", &u); - appendStringInfoString(&buf, unicode_to_sqlchar(u)); + pg_unicode_to_server(u, (unsigned char *) cbuf); + appendStringInfoString(&buf, cbuf); p += 6; } else diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c index 86787bc..f1c539e 100644 --- a/src/backend/utils/mb/mbutils.c +++ b/src/backend/utils/mb/mbutils.c @@ -68,6 +68,13 @@ static FmgrInfo *ToServerConvProc = NULL; static FmgrInfo *ToClientConvProc = NULL; /* + * This variable stores the conversion function to convert from UTF-8 + * to the server encoding. It's NULL if the server encoding *is* UTF-8, + * or if we lack a conversion function for this. + */ +static FmgrInfo *Utf8ToServerConvProc = NULL; + +/* * These variables track the currently-selected encodings. */ static const pg_enc2name *ClientEncoding = &pg_enc2name_tbl[PG_SQL_ASCII]; @@ -273,6 +280,8 @@ SetClientEncoding(int encoding) void InitializeClientEncoding(void) { + int current_server_encoding; + Assert(!backend_startup_complete); backend_startup_complete = true; @@ -289,6 +298,35 @@ InitializeClientEncoding(void) pg_enc2name_tbl[pending_client_encoding].name, GetDatabaseEncodingName()))); } + + /* + * Also look up the UTF8-to-server conversion function if needed. Since + * the server encoding is fixed within any one backend process, we don't + * have to do this more than once. + */ + current_server_encoding = GetDatabaseEncoding(); + if (current_server_encoding != PG_UTF8 && + current_server_encoding != PG_SQL_ASCII) + { + Oid utf8_to_server_proc; + + Assert(IsTransactionState()); + utf8_to_server_proc = + FindDefaultConversionProc(PG_UTF8, + current_server_encoding); + /* If there's no such conversion, just leave the pointer as NULL */ + if (OidIsValid(utf8_to_server_proc)) + { + FmgrInfo *finfo; + + finfo = (FmgrInfo *) MemoryContextAlloc(TopMemoryContext, + sizeof(FmgrInfo)); + fmgr_info_cxt(utf8_to_server_proc, finfo, + TopMemoryContext); + /* Set Utf8ToServerConvProc only after data is fully valid */ + Utf8ToServerConvProc = finfo; + } + } } /* @@ -752,6 +790,73 @@ perform_default_encoding_conversion(const char *src, int len, return result; } +/* + * Convert a single Unicode code point into a string in the server encoding. + * + * The code point given by "c" is converted and stored at *s, which must + * have at least MAX_UNICODE_EQUIVALENT_STRING+1 bytes available. + * The output will have a trailing '\0'. Throws error if the conversion + * cannot be performed. + * + * Note that this relies on having previously looked up any required + * conversion function. That's partly for speed but mostly because the parser + * may call this outside any transaction, or in an aborted transaction. 
+ */ +void +pg_unicode_to_server(pg_wchar c, unsigned char *s) +{ + unsigned char c_as_utf8[MAX_MULTIBYTE_CHAR_LEN + 1]; + int c_as_utf8_len; + int server_encoding; + + /* + * Complain if invalid Unicode code point. The choice of errcode here is + * debatable, but really our caller should have checked this anyway. + */ + if (c == 0 || c > 0x10FFFF) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("invalid Unicode code point"))); + + /* Otherwise, if it's in ASCII range, conversion is trivial */ + if (c <= 0x7F) + { + s[0] = (unsigned char) c; + s[1] = '\0'; + return; + } + + /* If the server encoding is UTF-8, we just need to reformat the code */ + server_encoding = GetDatabaseEncoding(); + if (server_encoding == PG_UTF8) + { + unicode_to_utf8(c, s); + s[pg_utf_mblen(s)] = '\0'; + return; + } + + /* For all other cases, we must have a conversion function available */ + if (Utf8ToServerConvProc == NULL) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("conversion between %s and %s is not supported", + pg_enc2name_tbl[PG_UTF8].name, + GetDatabaseEncodingName()))); + + /* Construct UTF-8 source string */ + unicode_to_utf8(c, c_as_utf8); + c_as_utf8_len = pg_utf_mblen(c_as_utf8); + c_as_utf8[c_as_utf8_len] = '\0'; + + /* Convert, or throw error if we can't */ + FunctionCall5(Utf8ToServerConvProc, + Int32GetDatum(PG_UTF8), + Int32GetDatum(server_encoding), + CStringGetDatum(c_as_utf8), + CStringGetDatum(s), + Int32GetDatum(c_as_utf8_len)); +} + /* convert a multibyte string to a wchar */ int diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c index f08a03c..7df231c 100644 --- a/src/common/jsonapi.c +++ b/src/common/jsonapi.c @@ -744,21 +744,21 @@ json_lex_string(JsonLexContext *lex) } if (lex->strval != NULL) { - char utf8str[5]; - int utf8len; - - if (ch >= 0xd800 && ch <= 0xdbff) + /* + * Combine surrogate pairs. + */ + if (is_utf16_surrogate_first(ch)) { if (hi_surrogate != -1) return JSON_UNICODE_HIGH_SURROGATE; - hi_surrogate = (ch & 0x3ff) << 10; + hi_surrogate = ch; continue; } - else if (ch >= 0xdc00 && ch <= 0xdfff) + else if (is_utf16_surrogate_second(ch)) { if (hi_surrogate == -1) return JSON_UNICODE_LOW_SURROGATE; - ch = 0x10000 + hi_surrogate + (ch & 0x3ff); + ch = surrogate_pair_to_codepoint(hi_surrogate, ch); hi_surrogate = -1; } @@ -766,35 +766,52 @@ json_lex_string(JsonLexContext *lex) return JSON_UNICODE_LOW_SURROGATE; /* - * For UTF8, replace the escape sequence by the actual - * utf8 character in lex->strval. Do this also for other - * encodings if the escape designates an ASCII character, - * otherwise raise an error. + * Reject invalid cases. We can't have a value above + * 0xFFFF here (since we only accepted 4 hex digits + * above), so no need to test for out-of-range chars. */ - if (ch == 0) { /* We can't allow this, since our TEXT type doesn't */ return JSON_UNICODE_CODE_POINT_ZERO; } - else if (lex->input_encoding == PG_UTF8) + + /* + * Add the represented character to lex->strval. In the + * backend, we can let pg_unicode_to_server() handle any + * required character set conversion; in frontend, we can + * only deal with trivial conversions. + * + * Note: pg_unicode_to_server() will throw an error for a + * conversion failure, rather than returning a failure + * indication. That seems OK. 
+	 */
+#ifndef FRONTEND
+					{
+						char		cbuf[MAX_UNICODE_EQUIVALENT_STRING + 1];
+
+						pg_unicode_to_server(ch, (unsigned char *) cbuf);
+						appendStringInfoString(lex->strval, cbuf);
+					}
+#else
+					if (lex->input_encoding == PG_UTF8)
 					{
+						/* OK, we can map the code point to UTF8 easily */
+						char		utf8str[5];
+						int			utf8len;
+
 						unicode_to_utf8(ch, (unsigned char *) utf8str);
 						utf8len = pg_utf_mblen((unsigned char *) utf8str);
 						appendBinaryStringInfo(lex->strval, utf8str, utf8len);
 					}
 					else if (ch <= 0x007f)
 					{
-						/*
-						 * This is the only way to designate things like a
-						 * form feed character in JSON, so it's useful in all
-						 * encodings.
-						 */
+						/* The ASCII range is the same in all encodings */
 						appendStringInfoChar(lex->strval, (char) ch);
 					}
 					else
 						return JSON_UNICODE_HIGH_ESCAPE;
-
+#endif							/* FRONTEND */
 				}
 			}
 			else if (lex->strval != NULL)
@@ -1083,7 +1100,8 @@ json_errdetail(JsonParseErrorType error, JsonLexContext *lex)
 		case JSON_UNICODE_ESCAPE_FORMAT:
 			return _("\"\\u\" must be followed by four hexadecimal digits.");
 		case JSON_UNICODE_HIGH_ESCAPE:
-			return _("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.");
+			/* note: this case is only reachable in frontend not backend */
+			return _("Unicode escape values cannot be used for code point values above 007F when the encoding is not UTF8.");
 		case JSON_UNICODE_HIGH_SURROGATE:
 			return _("Unicode high surrogate must not follow a high surrogate.");
 		case JSON_UNICODE_LOW_SURROGATE:
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index b8892ef..1394e3d 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -316,6 +316,15 @@ typedef enum pg_enc
 #define MAX_CONVERSION_GROWTH  4
 
 /*
+ * Maximum byte length of the string equivalent to any one Unicode code point,
+ * in any backend encoding.  The current value assumes that a 4-byte UTF-8
+ * character might expand by MAX_CONVERSION_GROWTH, which is a huge
+ * overestimate.  But in current usage we don't allocate large multiples of
+ * this, so there's little point in being stingy.
+ */
+#define MAX_UNICODE_EQUIVALENT_STRING	16
+
+/*
  * Table for mapping an encoding number to official encoding name and
  * possibly other subsidiary data.  Be careful to check encoding number
  * before accessing a table entry!
@@ -603,6 +612,8 @@ extern char *pg_server_to_client(const char *s, int len);
 extern char *pg_any_to_server(const char *s, int len, int encoding);
 extern char *pg_server_to_any(const char *s, int len, int encoding);
 
+extern void pg_unicode_to_server(pg_wchar c, unsigned char *s);
+
 extern unsigned short BIG5toCNS(unsigned short big5, unsigned char *lc);
 extern unsigned short CNStoBIG5(unsigned short cns, unsigned char lc);
 
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 7a0e5e5..a27352a 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -99,9 +99,13 @@ typedef struct core_yy_extra_type
 	int			literallen;		/* actual current string length */
 	int			literalalloc;	/* current allocated buffer size */
 
+	/*
+	 * Random assorted scanner state.
+	 */
 	int			state_before_str_stop;	/* start cond.
before end quote */ int xcdepth; /* depth of nesting in slash-star comments */ char *dolqstart; /* current $foo$ quote start string */ + YYLTYPE save_yylloc; /* one-element stack for PUSH_YYLLOC() */ /* first part of UTF16 surrogate pair for Unicode escapes */ int32 utf16_first_part; @@ -116,6 +120,14 @@ typedef struct core_yy_extra_type */ typedef void *core_yyscan_t; +/* Support for scanner_errposition_callback function */ +typedef struct ScannerCallbackState +{ + core_yyscan_t yyscanner; + int location; + ErrorContextCallback errcallback; +} ScannerCallbackState; + /* Constant data exported from parser/scan.l */ extern PGDLLIMPORT const uint16 ScanKeywordTokens[]; @@ -129,6 +141,10 @@ extern void scanner_finish(core_yyscan_t yyscanner); extern int core_yylex(core_YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner); extern int scanner_errposition(int location, core_yyscan_t yyscanner); +extern void setup_scanner_errposition_callback(ScannerCallbackState *scbstate, + core_yyscan_t yyscanner, + int location); +extern void cancel_scanner_errposition_callback(ScannerCallbackState *scbstate); extern void scanner_yyerror(const char *message, core_yyscan_t yyscanner) pg_attribute_noreturn(); #endif /* SCANNER_H */ diff --git a/src/test/regress/expected/json_encoding.out b/src/test/regress/expected/json_encoding.out index d8d34f4..f343f74 100644 --- a/src/test/regress/expected/json_encoding.out +++ b/src/test/regress/expected/json_encoding.out @@ -1,4 +1,19 @@ +-- -- encoding-sensitive tests for json and jsonb +-- +-- We provide expected-results files for UTF8 (json_encoding.out) +-- and for SQL_ASCII (json_encoding_1.out). Skip otherwise. +SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII') + AS skip_test \gset +\if :skip_test +\quit +\endif +SELECT getdatabaseencoding(); -- just to label the results files + getdatabaseencoding +--------------------- + UTF8 +(1 row) + -- first json -- basic unicode input SELECT '"\u"'::json; -- ERROR, incomplete escape diff --git a/src/test/regress/expected/json_encoding_1.out b/src/test/regress/expected/json_encoding_1.out index 79ed78e..e2fc131 100644 --- a/src/test/regress/expected/json_encoding_1.out +++ b/src/test/regress/expected/json_encoding_1.out @@ -1,4 +1,19 @@ +-- -- encoding-sensitive tests for json and jsonb +-- +-- We provide expected-results files for UTF8 (json_encoding.out) +-- and for SQL_ASCII (json_encoding_1.out). Skip otherwise. +SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII') + AS skip_test \gset +\if :skip_test +\quit +\endif +SELECT getdatabaseencoding(); -- just to label the results files + getdatabaseencoding +--------------------- + SQL_ASCII +(1 row) + -- first json -- basic unicode input SELECT '"\u"'::json; -- ERROR, incomplete escape @@ -33,9 +48,7 @@ SELECT '"\uaBcD"'::json; -- OK, uppercase and lower case both OK -- handling of unicode surrogate pairs select json '{ "a": "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct_in_utf8; -ERROR: unsupported Unicode escape sequence -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. -CONTEXT: JSON data, line 1: { "a":... +ERROR: conversion between UTF8 and SQL_ASCII is not supported select json '{ "a": "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row ERROR: invalid input syntax for type json DETAIL: Unicode high surrogate must not follow a high surrogate. 
@@ -84,9 +97,7 @@ select json '{ "a": "null \\u0000 escape" }' as not_an_escape;
 (1 row)
 
 select json '{ "a": "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
-ERROR:  unsupported Unicode escape sequence
-DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
-CONTEXT:  JSON data, line 1: { "a":...
+ERROR:  conversion between UTF8 and SQL_ASCII is not supported
 select json '{ "a": "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
  correct_everywhere 
 --------------------
@@ -144,18 +155,14 @@ CONTEXT:  JSON data, line 1: ...
 -- use octet_length here so we don't get an odd unicode char in the
 -- output
 SELECT octet_length('"\uaBcD"'::jsonb::text); -- OK, uppercase and lower case both OK
-ERROR:  unsupported Unicode escape sequence
+ERROR:  conversion between UTF8 and SQL_ASCII is not supported
 LINE 1: SELECT octet_length('"\uaBcD"'::jsonb::text);
                             ^
-DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
-CONTEXT:  JSON data, line 1: ...
 -- handling of unicode surrogate pairs
 SELECT octet_length((jsonb '{ "a": "\ud83d\ude04\ud83d\udc36" }' -> 'a')::text) AS correct_in_utf8;
-ERROR:  unsupported Unicode escape sequence
+ERROR:  conversion between UTF8 and SQL_ASCII is not supported
 LINE 1: SELECT octet_length((jsonb '{ "a": "\ud83d\ude04\ud83d\udc3...
                                    ^
-DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
-CONTEXT:  JSON data, line 1: { "a":...
 SELECT jsonb '{ "a": "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row
 ERROR:  invalid input syntax for type json
 LINE 1: SELECT jsonb '{ "a": "\ud83d\ud83d" }' -> 'a';
@@ -182,11 +189,9 @@ DETAIL:  Unicode low surrogate must follow a high surrogate.
 CONTEXT:  JSON data, line 1: { "a":...
 -- handling of simple unicode escapes
 SELECT jsonb '{ "a": "the Copyright \u00a9 sign" }' as correct_in_utf8;
-ERROR:  unsupported Unicode escape sequence
+ERROR:  conversion between UTF8 and SQL_ASCII is not supported
 LINE 1: SELECT jsonb '{ "a": "the Copyright \u00a9 sign" }' as corr...
                      ^
-DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
-CONTEXT:  JSON data, line 1: { "a":...
 SELECT jsonb '{ "a": "dollar \u0024 character" }' as correct_everywhere;
      correct_everywhere      
 -----------------------------
@@ -212,11 +217,9 @@ SELECT jsonb '{ "a": "null \\u0000 escape" }' as not_an_escape;
 (1 row)
 
 SELECT jsonb '{ "a": "the Copyright \u00a9 sign" }' ->> 'a' as correct_in_utf8;
-ERROR:  unsupported Unicode escape sequence
+ERROR:  conversion between UTF8 and SQL_ASCII is not supported
 LINE 1: SELECT jsonb '{ "a": "the Copyright \u00a9 sign" }' ->> 'a'...
                      ^
-DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
-CONTEXT:  JSON data, line 1: { "a":...
 SELECT jsonb '{ "a": "dollar \u0024 character" }' ->> 'a' as correct_everywhere;
  correct_everywhere 
 --------------------
diff --git a/src/test/regress/expected/json_encoding_2.out b/src/test/regress/expected/json_encoding_2.out
new file mode 100644
index 0000000..4fc8f02
--- /dev/null
+++ b/src/test/regress/expected/json_encoding_2.out
@@ -0,0 +1,9 @@
+--
+-- encoding-sensitive tests for json and jsonb
+--
+-- We provide expected-results files for UTF8 (json_encoding.out)
+-- and for SQL_ASCII (json_encoding_1.out).  Skip otherwise.
+SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII')
+       AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/jsonpath_encoding.out b/src/test/regress/expected/jsonpath_encoding.out
index ecffe09..7cbfb6a 100644
--- a/src/test/regress/expected/jsonpath_encoding.out
+++ b/src/test/regress/expected/jsonpath_encoding.out
@@ -1,4 +1,19 @@
+--
 -- encoding-sensitive tests for jsonpath
+--
+-- We provide expected-results files for UTF8 (jsonpath_encoding.out)
+-- and for SQL_ASCII (jsonpath_encoding_1.out).  Skip otherwise.
+SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII')
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+SELECT getdatabaseencoding();  -- just to label the results files
+ getdatabaseencoding 
+---------------------
+ UTF8
+(1 row)
+
 -- checks for double-quoted values
 -- basic unicode input
 SELECT '"\u"'::jsonpath;		-- ERROR, incomplete escape
diff --git a/src/test/regress/expected/jsonpath_encoding_1.out b/src/test/regress/expected/jsonpath_encoding_1.out
index c8cc217..005136c 100644
--- a/src/test/regress/expected/jsonpath_encoding_1.out
+++ b/src/test/regress/expected/jsonpath_encoding_1.out
@@ -1,4 +1,19 @@
+--
 -- encoding-sensitive tests for jsonpath
+--
+-- We provide expected-results files for UTF8 (jsonpath_encoding.out)
+-- and for SQL_ASCII (jsonpath_encoding_1.out).  Skip otherwise.
+SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII')
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+SELECT getdatabaseencoding();  -- just to label the results files
+ getdatabaseencoding 
+---------------------
+ SQL_ASCII
+(1 row)
+
 -- checks for double-quoted values
 -- basic unicode input
 SELECT '"\u"'::jsonpath;		-- ERROR, incomplete escape
@@ -19,16 +34,14 @@ LINE 1: SELECT '"\u0000"'::jsonpath;
                ^
 DETAIL:  \u0000 cannot be converted to text.
 SELECT '"\uaBcD"'::jsonpath;	-- OK, uppercase and lower case both OK
-ERROR:  invalid input syntax for type jsonpath
+ERROR:  conversion between UTF8 and SQL_ASCII is not supported
 LINE 1: SELECT '"\uaBcD"'::jsonpath;
                ^
-DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
 -- handling of unicode surrogate pairs
 select '"\ud83d\ude04\ud83d\udc36"'::jsonpath as correct_in_utf8;
-ERROR:  invalid input syntax for type jsonpath
+ERROR:  conversion between UTF8 and SQL_ASCII is not supported
 LINE 1: select '"\ud83d\ude04\ud83d\udc36"'::jsonpath as correct_in_...
                ^
-DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
 select '"\ud83d\ud83d"'::jsonpath; -- 2 high surrogates in a row
 ERROR:  invalid input syntax for type jsonpath
 LINE 1: select '"\ud83d\ud83d"'::jsonpath;
@@ -51,10 +64,9 @@ LINE 1: select '"\ude04X"'::jsonpath;
                ^
 DETAIL:  Unicode low surrogate must follow a high surrogate.
 --handling of simple unicode escapes
 select '"the Copyright \u00a9 sign"'::jsonpath as correct_in_utf8;
-ERROR:  invalid input syntax for type jsonpath
+ERROR:  conversion between UTF8 and SQL_ASCII is not supported
 LINE 1: select '"the Copyright \u00a9 sign"'::jsonpath as correct_in...
                ^
-DETAIL:  Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8.
 select '"dollar \u0024 character"'::jsonpath as correct_everywhere;
  correct_everywhere  
 ---------------------
@@ -98,16 +110,14 @@ LINE 1: SELECT '$."\u0000"'::jsonpath;
                ^
 DETAIL:  \u0000 cannot be converted to text.
SELECT '$."\uaBcD"'::jsonpath; -- OK, uppercase and lower case both OK -ERROR: invalid input syntax for type jsonpath +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: SELECT '$."\uaBcD"'::jsonpath; ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. -- handling of unicode surrogate pairs select '$."\ud83d\ude04\ud83d\udc36"'::jsonpath as correct_in_utf8; -ERROR: invalid input syntax for type jsonpath +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: select '$."\ud83d\ude04\ud83d\udc36"'::jsonpath as correct_i... ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. select '$."\ud83d\ud83d"'::jsonpath; -- 2 high surrogates in a row ERROR: invalid input syntax for type jsonpath LINE 1: select '$."\ud83d\ud83d"'::jsonpath; @@ -130,10 +140,9 @@ LINE 1: select '$."\ude04X"'::jsonpath; DETAIL: Unicode low surrogate must follow a high surrogate. --handling of simple unicode escapes select '$."the Copyright \u00a9 sign"'::jsonpath as correct_in_utf8; -ERROR: invalid input syntax for type jsonpath +ERROR: conversion between UTF8 and SQL_ASCII is not supported LINE 1: select '$."the Copyright \u00a9 sign"'::jsonpath as correct_... ^ -DETAIL: Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8. select '$."dollar \u0024 character"'::jsonpath as correct_everywhere; correct_everywhere ------------------------ diff --git a/src/test/regress/expected/jsonpath_encoding_2.out b/src/test/regress/expected/jsonpath_encoding_2.out new file mode 100644 index 0000000..bb71bfe --- /dev/null +++ b/src/test/regress/expected/jsonpath_encoding_2.out @@ -0,0 +1,9 @@ +-- +-- encoding-sensitive tests for jsonpath +-- +-- We provide expected-results files for UTF8 (jsonpath_encoding.out) +-- and for SQL_ASCII (jsonpath_encoding_1.out). Skip otherwise. +SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII') + AS skip_test \gset +\if :skip_test +\quit diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out index 60cb861..6c4443a 100644 --- a/src/test/regress/expected/strings.out +++ b/src/test/regress/expected/strings.out @@ -35,6 +35,12 @@ SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*'; dat\+000061 (1 row) +SELECT U&'a\\b' AS "a\b"; + a\b +----- + a\b +(1 row) + SELECT U&' \' UESCAPE '!' AS "tricky"; tricky -------- @@ -48,13 +54,15 @@ SELECT 'tricky' AS U&"\" UESCAPE '!'; (1 row) SELECT U&'wrong: \061'; -ERROR: invalid Unicode escape value +ERROR: invalid Unicode escape LINE 1: SELECT U&'wrong: \061'; ^ +HINT: Unicode escapes must be \XXXX or \+XXXXXX. SELECT U&'wrong: \+0061'; -ERROR: invalid Unicode escape value +ERROR: invalid Unicode escape LINE 1: SELECT U&'wrong: \+0061'; ^ +HINT: Unicode escapes must be \XXXX or \+XXXXXX. 
 SELECT U&'wrong: +0061' UESCAPE +;
 ERROR:  UESCAPE must be followed by a simple string literal at or near "+"
 LINE 1: SELECT U&'wrong: +0061' UESCAPE +;
@@ -63,6 +71,77 @@ SELECT U&'wrong: +0061' UESCAPE '+';
 ERROR:  invalid Unicode escape character at or near "'+'"
 LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
                                         ^
+SELECT U&'wrong: \db99';
+ERROR:  invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99';
+                         ^
+SELECT U&'wrong: \db99xy';
+ERROR:  invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99xy';
+                         ^
+SELECT U&'wrong: \db99\\';
+ERROR:  invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99\\';
+                         ^
+SELECT U&'wrong: \db99\0061';
+ERROR:  invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \db99\0061';
+                         ^
+SELECT U&'wrong: \+00db99\+000061';
+ERROR:  invalid Unicode surrogate pair
+LINE 1: SELECT U&'wrong: \+00db99\+000061';
+                         ^
+SELECT U&'wrong: \+2FFFFF';
+ERROR:  invalid Unicode escape value
+LINE 1: SELECT U&'wrong: \+2FFFFF';
+                         ^
+-- while we're here, check the same cases in E-style literals
+SELECT E'd\u0061t\U00000061' AS "data";
+ data 
+------
+ data
+(1 row)
+
+SELECT E'a\\b' AS "a\b";
+ a\b 
+-----
+ a\b
+(1 row)
+
+SELECT E'wrong: \u061';
+ERROR:  invalid Unicode escape
+LINE 1: SELECT E'wrong: \u061';
+                        ^
+HINT:  Unicode escapes must be \uXXXX or \UXXXXXXXX.
+SELECT E'wrong: \U0061';
+ERROR:  invalid Unicode escape
+LINE 1: SELECT E'wrong: \U0061';
+                        ^
+HINT:  Unicode escapes must be \uXXXX or \UXXXXXXXX.
+SELECT E'wrong: \udb99';
+ERROR:  invalid Unicode surrogate pair at or near "'"
+LINE 1: SELECT E'wrong: \udb99';
+                              ^
+SELECT E'wrong: \udb99xy';
+ERROR:  invalid Unicode surrogate pair at or near "x"
+LINE 1: SELECT E'wrong: \udb99xy';
+                              ^
+SELECT E'wrong: \udb99\\';
+ERROR:  invalid Unicode surrogate pair at or near "\"
+LINE 1: SELECT E'wrong: \udb99\\';
+                              ^
+SELECT E'wrong: \udb99\u0061';
+ERROR:  invalid Unicode surrogate pair at or near "\u0061"
+LINE 1: SELECT E'wrong: \udb99\u0061';
+                              ^
+SELECT E'wrong: \U0000db99\U00000061';
+ERROR:  invalid Unicode surrogate pair at or near "\U00000061"
+LINE 1: SELECT E'wrong: \U0000db99\U00000061';
+                                  ^
+SELECT E'wrong: \U002FFFFF';
+ERROR:  invalid Unicode escape value at or near "\U002FFFFF"
+LINE 1: SELECT E'wrong: \U002FFFFF';
+                        ^
 SET standard_conforming_strings TO off;
 SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
 ERROR:  unsafe use of string constant with Unicode escapes
diff --git a/src/test/regress/sql/json_encoding.sql b/src/test/regress/sql/json_encoding.sql
index 87a2d56..d7fac69 100644
--- a/src/test/regress/sql/json_encoding.sql
+++ b/src/test/regress/sql/json_encoding.sql
@@ -1,5 +1,16 @@
-
+--
 -- encoding-sensitive tests for json and jsonb
+--
+
+-- We provide expected-results files for UTF8 (json_encoding.out)
+-- and for SQL_ASCII (json_encoding_1.out).  Skip otherwise.
+SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII')
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+SELECT getdatabaseencoding();  -- just to label the results files
 
 -- first json
 
diff --git a/src/test/regress/sql/jsonpath_encoding.sql b/src/test/regress/sql/jsonpath_encoding.sql
index 3a23b72..55d9e30 100644
--- a/src/test/regress/sql/jsonpath_encoding.sql
+++ b/src/test/regress/sql/jsonpath_encoding.sql
@@ -1,5 +1,16 @@
-
+--
 -- encoding-sensitive tests for jsonpath
+--
+
+-- We provide expected-results files for UTF8 (jsonpath_encoding.out)
+-- and for SQL_ASCII (jsonpath_encoding_1.out).  Skip otherwise.
+SELECT getdatabaseencoding() NOT IN ('UTF8', 'SQL_ASCII')
+       AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+SELECT getdatabaseencoding();  -- just to label the results files
 
 -- checks for double-quoted values
 
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index c5cd151..3e28cd1 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -21,6 +21,7 @@ SET standard_conforming_strings TO on;
 
 SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
 SELECT U&'d!0061t\+000061' UESCAPE '!' AS U&"d*0061t\+000061" UESCAPE '*';
+SELECT U&'a\\b' AS "a\b";
 SELECT U&' \' UESCAPE '!' AS "tricky";
 SELECT 'tricky' AS U&"\" UESCAPE '!';
 
@@ -30,6 +31,25 @@ SELECT U&'wrong: \+0061';
 SELECT U&'wrong: +0061' UESCAPE +;
 SELECT U&'wrong: +0061' UESCAPE '+';
 
+SELECT U&'wrong: \db99';
+SELECT U&'wrong: \db99xy';
+SELECT U&'wrong: \db99\\';
+SELECT U&'wrong: \db99\0061';
+SELECT U&'wrong: \+00db99\+000061';
+SELECT U&'wrong: \+2FFFFF';
+
+-- while we're here, check the same cases in E-style literals
+SELECT E'd\u0061t\U00000061' AS "data";
+SELECT E'a\\b' AS "a\b";
+SELECT E'wrong: \u061';
+SELECT E'wrong: \U0061';
+SELECT E'wrong: \udb99';
+SELECT E'wrong: \udb99xy';
+SELECT E'wrong: \udb99\\';
+SELECT E'wrong: \udb99\u0061';
+SELECT E'wrong: \U0000db99\U00000061';
+SELECT E'wrong: \U002FFFFF';
+
 SET standard_conforming_strings TO off;
 
 SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
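
To make the intended behavior concrete, here is a quick sketch of what
the patch enables (my illustration, not part of the patch or its
regression tests; assume a database created with ENCODING 'LATIN1'):

SELECT U&'caf\00E9' AS word;
-- Before: rejected, because escapes above \007F required a UTF8 server
-- encoding.  Now: U+00E9 is converted to the LATIN1 byte 0xE9, so this
-- returns 'café'.

SELECT U&'\4E09';
-- U+4E09 has no LATIN1 equivalent, so this now fails with an encoding
-- conversion error instead of the old blanket syntax restriction.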
On Mon, Feb 24, 2020 at 11:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I see this patch got sideswiped by the recent refactoring of JSON
> lexing.  Here's an attempt at fixing it up.  Since the frontend
> code isn't going to have access to encoding conversion facilities,
> this creates a difference between frontend and backend handling
> of JSON Unicode escapes, which is mildly annoying but probably
> isn't going to bother anyone in the real world.  Outside of
> jsonapi.c, there are no changes from v2.

For the record, as far as JSON goes, I think I'm responsible for the
current set of restrictions, and I'm not attached to them.  I believe
I was uncertain of my ability to implement anything better than what
we have now, and also slightly unclear on what the semantics ought to
be.  I'm happy to see it improved, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Feb 25, 2020 at 1:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > [ unicode-escapes-with-other-server-encodings-2.patch ]
>
> I see this patch got sideswiped by the recent refactoring of JSON
> lexing.  Here's an attempt at fixing it up.  Since the frontend
> code isn't going to have access to encoding conversion facilities,
> this creates a difference between frontend and backend handling
> of JSON Unicode escapes, which is mildly annoying but probably
> isn't going to bother anyone in the real world.  Outside of
> jsonapi.c, there are no changes from v2.

With v3, I successfully converted escapes in strings, json, and
jsonpath expressions using a database with EUC-KR encoding.

Then I ran a raw parsing microbenchmark with ASCII unicode escapes in
UTF-8 to verify there is no significant regression.  I also tried the
same with EUC-KR, even though that's not really apples-to-apples since
it doesn't work on HEAD.  It seems to give the same numbers.  (Median
of 3, done 3 times with postmaster restart in between.)

master, UTF-8 ascii:    1.390s  1.405s  1.406s
v3, UTF-8 ascii:        1.396s  1.388s  1.390s
v3, EUC-KR non-ascii:   1.382s  1.401s  1.394s

Not this patch's job perhaps, but now that check_unicode_value() only
depends on the input, maybe it can be put into pg_wchar.h with the
other static inline helper functions?  That test is duplicated in
addunicode() and pg_unicode_to_server().  Maybe:

static inline bool
codepoint_is_valid(pg_wchar c)
{
	return (c > 0 && c <= 0x10FFFF);
}

Maybe Chapman has a use case in mind he can test with?  Barring that,
the patch seems ready for commit.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
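
For anyone wanting to repeat that sort of timing, one crude recipe (a
sketch of my own, not necessarily the harness used above): put a
statement dominated by escape lexing into a pgbench script and compare
builds.

-- bench.sql: parse time is dominated by scanning the Unicode escapes
SELECT U&'\0064\0061\0074\0061\0064\0061\0074\0061' AS s;

-- then run against each build, e.g.:
--   pgbench -n -f bench.sql -t 100000 mydb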
John Naylor <john.naylor@2ndquadrant.com> writes:
> Not this patch's job perhaps, but now that check_unicode_value() only
> depends on the input, maybe it can be put into pg_wchar.h with the
> other static inline helper functions?  That test is duplicated in
> addunicode() and pg_unicode_to_server().  Maybe:

> static inline bool
> codepoint_is_valid(pg_wchar c)
> {
> 	return (c > 0 && c <= 0x10FFFF);
> }

Seems reasonable, done.

> Maybe Chapman has a use case in mind he can test with?  Barring that,
> the patch seems ready for commit.

I went ahead and pushed this, just to get it out of my queue.
Chapman's certainly welcome to kibitz some more of course.

			regards, tom lane
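
To see the committed behavior from SQL, a small sketch (mine, assuming
a LATIN1 database rather than the SQL_ASCII one used in the alternate
expected files):

-- a JSON escape whose character exists in LATIN1 is now converted:
SELECT json '{ "a": "the Copyright \u00a9 sign" }' ->> 'a';
-- returns the copyright sign, stored as the LATIN1 byte 0xA9

-- an escape with no LATIN1 equivalent still fails, but with an
-- encoding conversion error rather than the old blanket refusal:
SELECT json '{ "a": "\u4e09" }' ->> 'a';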
On 3/6/20 2:19 PM, Tom Lane wrote:
>> Maybe Chapman has a use case in mind he can test with?  Barring that,
>> the patch seems ready for commit.
>
> I went ahead and pushed this, just to get it out of my queue.
> Chapman's certainly welcome to kibitz some more of course.

Sorry, yeah, I don't think I had any kibitzing to do.

My use case was for an automated SQL generator to confidently emit
Unicode-escaped forms with few required assumptions about the database
they'll be loaded in, subject of course to the natural limitation that
its encoding contain the characters being used, but not to arbitrary
other limits.

And unless I misunderstand the patch, it accomplishes that, thereby
depriving me of stuff to kibitz about.

Regards,
-Chap
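
As a sketch of that generator pattern (the table name here is
hypothetical), the emitter can escape every non-ASCII character and
leave conversion, or rejection, to the server:

INSERT INTO greetings (msg) VALUES (U&'Gr\00FC\00DFe');
-- accepted under UTF8, LATIN1, or any other server encoding that
-- contains U+00FC and U+00DF; elsewhere it fails with a clean
-- encoding conversion error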