Re: Unicode escapes in literals - Mailing list pgsql-hackers
From | Peter Eisentraut |
---|---|
Subject | Re: Unicode escapes in literals |
Date | |
Msg-id | 4905F515.3020208@gmx.net Whole thread Raw |
In response to | Unicode escapes in literals (Peter Eisentraut <peter_e@gmx.net>) |
List | pgsql-hackers |
I wrote: > SQL has the following escape syntax for it: > > U&'special character: \xxxx' [ UESCAPE '\' ] Here is an in-progress patch for this. It still needs updates in the psql scanner and possibly other scanners. But the server-side functionality works. Index: doc/src/sgml/syntax.sgml =================================================================== RCS file: /cvsroot/pgsql/doc/src/sgml/syntax.sgml,v retrieving revision 1.123 diff -u -3 -p -c -r1.123 syntax.sgml *** doc/src/sgml/syntax.sgml 26 Jun 2008 22:24:42 -0000 1.123 --- doc/src/sgml/syntax.sgml 27 Oct 2008 16:54:26 -0000 *************** UPDATE "my_table" SET "a" = 5; *** 190,195 **** --- 190,247 ---- </para> <para> + A variant of quoted identifiers allows including escaped Unicode + characters identified by their code points. This variant starts + with <literal>U&</literal> (upper or lower case U followed by + ampersand) immediately before the opening double quote, without + any spaces in between, for example <literal>U&"foo"</literal>. + (Note that this creates an ambiguity with the + operator <literal>&</literal>. Use spaces around the operator to + avoid this problem.) Inside the quotes, Unicode characters can be + specified in escaped form by writing a backslash followed by the + four-digit hexadecimal code point number or alternatively a + backslash followed by a plus sign followed by a six-digt + hexadecimal code point number. For example, the + identifier <literal>"data"</literal> could be written as + <programlisting> + U&"d\0061t\0061" + </programlisting> + or equivalently + <programlisting> + U&"d\+000061t\+000061" + </programlisting> + The following less trivial example writes the Russian + word <quote>slon</quote> (elephant) in Cyrillic letters: + <programlisting> + U&"\0441\043B\043E\043D" + </programlisting> + </para> + + <para> + If a different escape character than backslash is desired, it can + be specified using the <literal>UESCAPE</literal> clause after the + string, for example: + <programlisting> + U&"d!0061t!0061" UESCAPE '!' + </programlisting> + The escape character can be any single character other than a + hexadecimal digit, the plus sign, a single quote, a double quote, + or a whitespace character. Note that the escape character is + written in single quotes, not double quotes. + </para> + + <para> + To include the escape character in the identifier literally, write + it twice. + </para> + + <para> + The Unicode escape syntax works only when the server encoding is + UTF8. When other server encodings are used, only code points in + the ASCII range (up to <literal>\007F</literal>) can be specified. + </para> + + <para> Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case. For example, the identifiers <literal>FOO</literal>, <literal>foo</literal>, and *************** UPDATE "my_table" SET "a" = 5; *** 245,251 **** write two adjacent single quotes, e.g. <literal>'Dianne''s horse'</literal>. Note that this is <emphasis>not</> the same as a double-quote ! character (<literal>"</>). </para> <para> --- 297,303 ---- write two adjacent single quotes, e.g. <literal>'Dianne''s horse'</literal>. Note that this is <emphasis>not</> the same as a double-quote ! character (<literal>"</>). <!-- font-lock sanity: " --> </para> <para> *************** SELECT 'foo' 'bar'; *** 269,282 **** by <acronym>SQL</acronym>; <productname>PostgreSQL</productname> is following the standard.) </para> - <para> <indexterm> <primary>escape string syntax</primary> </indexterm> <indexterm> <primary>backslash escapes</primary> </indexterm> <productname>PostgreSQL</productname> also accepts <quote>escape</> string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter --- 321,339 ---- by <acronym>SQL</acronym>; <productname>PostgreSQL</productname> is following the standard.) </para> + </sect3> + + <sect3 id="sql-syntax-strings-escape"> + <title>String Constants with C-Style Escapes</title> <indexterm> <primary>escape string syntax</primary> </indexterm> <indexterm> <primary>backslash escapes</primary> </indexterm> + + <para> <productname>PostgreSQL</productname> also accepts <quote>escape</> string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter *************** SELECT 'foo' 'bar'; *** 287,293 **** Within an escape string, a backslash character (<literal>\</>) begins a C-like <firstterm>backslash escape</> sequence, in which the combination of backslash and following character(s) represent a special byte ! value: <table id="sql-backslash-table"> <title>Backslash Escape Sequences</title> --- 344,351 ---- Within an escape string, a backslash character (<literal>\</>) begins a C-like <firstterm>backslash escape</> sequence, in which the combination of backslash and following character(s) represent a special byte ! value, shown in <xref linkend="sql-backslash-table"> ! </para> <table id="sql-backslash-table"> <title>Backslash Escape Sequences</title> *************** SELECT 'foo' 'bar'; *** 341,354 **** </tgroup> </table> ! It is your responsibility that the byte sequences you create are ! valid characters in the server character set encoding. Any other character following a backslash is taken literally. Thus, to include a backslash character, write two backslashes (<literal>\\</>). Also, a single quote can be included in an escape string by writing <literal>\'</literal>, in addition to the normal way of <literal>''</>. </para> <caution> <para> If the configuration parameter --- 399,422 ---- </tgroup> </table> ! <para> ! Any other character following a backslash is taken literally. Thus, to include a backslash character, write two backslashes (<literal>\\</>). Also, a single quote can be included in an escape string by writing <literal>\'</literal>, in addition to the normal way of <literal>''</>. </para> + <para> + It is your responsibility that the byte sequences you create are + valid characters in the server character set encoding. When the + server encoding is UTF-8, then the alternative Unicode escape + syntax, explained in <xref linkend="sql-syntax-strings-uescape">, + should be used instead. (The alternative would be doing the + UTF-8 encoding by hand and writing out the bytes, which would be + very cumbersome.) + </para> + <caution> <para> If the configuration parameter *************** SELECT 'foo' 'bar'; *** 379,384 **** --- 447,509 ---- </para> </sect3> + <sect3 id="sql-syntax-strings-uescape"> + <title>String Constants with Unicode Escapes</title> + + <para> + <productname>PostgreSQL</productname> also supports another type + of escape syntax for strings that allows specifying arbitrary + Unicode characters by code point. A Unicode escape string + constant starts with <literal>U&</literal> (upper or lower case + letter U followed by ampersand) immediately before the opening + quote, without any spaces in between, for + example <literal>U&'foo'</literal>. (Note that this creates an + ambiguity with the operator <literal>&</literal>. Use spaces + around the operator to avoid this problem.) Inside the quotes, + Unicode characters can be specified in escaped form by writing a + backslash followed by the four-digit hexadecimal code point + number or alternatively a backslash followed by a plus sign + followed by a six-digt hexadecimal code point number. For + example, the string <literal>'data'</literal> could be written as + <programlisting> + U&'d\0061t\0061' + </programlisting> + or equivalently + <programlisting> + U&'d\+000061t\+000061' + </programlisting> + The following less trivial example writes the Russian + word <quote>slon</quote> (elephant) in Cyrillic letters: + <programlisting> + U&'\0441\043B\043E\043D' + </programlisting> + </para> + + <para> + If a different escape character than backslash is desired, it can + be specified using the <literal>UESCAPE</literal> clause after + the string, for example: + <programlisting> + U&'d!0061t!0061' UESCAPE '!' + </programlisting> + The escape character can be any single character other than a + hexadecimal digit, the plus sign, a single quote, a double quote, + or a whitespace character. + </para> + + <para> + The Unicode escape syntax works only when the server encoding is + UTF8. When other server encodings are used, only code points in + the ASCII range (up to <literal>\007F</literal>) can be + specified. + </para> + + <para> + To include the escape character in the string literally, write it + twice. + </para> + </sect3> + <sect3 id="sql-syntax-dollar-quoting"> <title>Dollar-Quoted String Constants</title> Index: src/backend/parser/scan.l =================================================================== RCS file: /cvsroot/pgsql/src/backend/parser/scan.l,v retrieving revision 1.146 diff -u -3 -p -c -r1.146 scan.l *** src/backend/parser/scan.l 1 Sep 2008 20:42:45 -0000 1.146 --- src/backend/parser/scan.l 27 Oct 2008 16:54:27 -0000 *************** static int literalalloc; /* current all *** 76,81 **** --- 76,82 ---- static void addlit(char *ytext, int yleng); static void addlitchar(unsigned char ychar); static char *litbufdup(void); + static char *litbuf_udeescape(unsigned char escape); #define lexer_errposition() scanner_errposition(yylloc) *************** static unsigned char unescape_single_cha *** 125,130 **** --- 126,133 ---- * <xq> standard quoted strings * <xe> extended quoted strings (support backslash escape sequences) * <xdolq> $foo$ quoted strings + * <xui> quoted identifier with Unicode escapes + * <xus> quoted string with Unicode escapes */ %x xb *************** static unsigned char unescape_single_cha *** 134,139 **** --- 137,144 ---- %x xe %x xq %x xdolq + %x xui + %x xus /* * In order to make the world safe for Windows and Mac clients as well as *************** xdstop {dquote} *** 244,249 **** --- 249,273 ---- xddouble {dquote}{dquote} xdinside [^"]+ + /* Unicode escapes */ + uescape [uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote} + /* error rule to avoid backup */ + uescapefail ("-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]) + + /* Quoted identifier with Unicode escapes */ + xuistart [uU]&{dquote} + xuistop1 {dquote}{whitespace}*{uescapefail}? + xuistop2 {dquote}{whitespace}*{uescape} + + /* Quoted string with Unicode escapes */ + xusstart [uU]&{quote} + xusstop1 {quote}{whitespace}*{uescapefail}? + xusstop2 {quote}{whitespace}*{uescape} + + /* error rule to avoid backup */ + xufailed [uU]& + + /* C-style comments * * The "extended comment" syntax closely resembles allowable operator syntax. *************** other . *** 444,449 **** --- 468,478 ---- BEGIN(xe); startlit(); } + {xusstart} { + SET_YYLLOC(); + BEGIN(xus); + startlit(); + } <xq,xe>{quotestop} | <xq,xe>{quotefail} { yyless(1); *************** other . *** 456,465 **** yylval.str = litbufdup(); return SCONST; } ! <xq,xe>{xqdouble} { addlitchar('\''); } ! <xq>{xqinside} { addlit(yytext, yyleng); } <xe>{xeinside} { --- 485,506 ---- yylval.str = litbufdup(); return SCONST; } ! <xus>{xusstop1} { ! /* throw back all but the quote */ ! yyless(1); ! BEGIN(INITIAL); ! yylval.str = litbuf_udeescape('\\'); ! return SCONST; ! } ! <xus>{xusstop2} { ! BEGIN(INITIAL); ! yylval.str = litbuf_udeescape(yytext[yyleng-2]); ! return SCONST; ! } ! <xq,xe,xus>{xqdouble} { addlitchar('\''); } ! <xq,xus>{xqinside} { addlit(yytext, yyleng); } <xe>{xeinside} { *************** other . *** 496,509 **** if (IS_HIGHBIT_SET(c)) saw_high_bit = true; } ! <xq,xe>{quotecontinue} { /* ignore */ } <xe>. { /* This is only needed for \ just before EOF */ addlitchar(yytext[0]); } ! <xq,xe><<EOF>> { yyerror("unterminated quoted string"); } {dolqdelim} { SET_YYLLOC(); --- 537,550 ---- if (IS_HIGHBIT_SET(c)) saw_high_bit = true; } ! <xq,xe,xus>{quotecontinue} { /* ignore */ } <xe>. { /* This is only needed for \ just before EOF */ addlitchar(yytext[0]); } ! <xq,xe,xus><<EOF>> { yyerror("unterminated quoted string"); } {dolqdelim} { SET_YYLLOC(); *************** other . *** 553,558 **** --- 594,604 ---- BEGIN(xd); startlit(); } + {xuistart} { + SET_YYLLOC(); + BEGIN(xui); + startlit(); + } <xd>{xdstop} { char *ident; *************** other . *** 565,577 **** yylval.str = ident; return IDENT; } ! <xd>{xddouble} { addlitchar('"'); } ! <xd>{xdinside} { addlit(yytext, yyleng); } ! <xd><<EOF>> { yyerror("unterminated quoted identifier"); } {typecast} { SET_YYLLOC(); --- 611,656 ---- yylval.str = ident; return IDENT; } ! <xui>{xuistop1} { ! char *ident; ! ! BEGIN(INITIAL); ! if (literallen == 0) ! yyerror("zero-length delimited identifier"); ! ident = litbuf_udeescape('\\'); ! if (literallen >= NAMEDATALEN) ! truncate_identifier(ident, literallen, true); ! yylval.str = ident; ! /* throw back all but the quote */ ! yyless(1); ! return IDENT; ! } ! <xui>{xuistop2} { ! char *ident; ! ! BEGIN(INITIAL); ! if (literallen == 0) ! yyerror("zero-length delimited identifier"); ! ident = litbuf_udeescape(yytext[yyleng - 2]); ! if (literallen >= NAMEDATALEN) ! truncate_identifier(ident, literallen, true); ! yylval.str = ident; ! return IDENT; ! } ! <xd,xui>{xddouble} { addlitchar('"'); } ! <xd,xui>{xdinside} { addlit(yytext, yyleng); } ! <xd,xui><<EOF>> { yyerror("unterminated quoted identifier"); } ! ! {xufailed} { ! /* throw back all but the initial u/U */ ! yyless(1); ! /* and treat it as {other} */ ! return yytext[0]; ! } {typecast} { SET_YYLLOC(); *************** litbufdup(void) *** 908,913 **** --- 987,1082 ---- return new; } + static int + hexval(unsigned char c) + { + if (c >= '0' && c <= '9') + return c - '0'; + if (c >= 'a' && c <= 'f') + return c - 'a' + 0xA; + if (c >= 'A' && c <= 'F') + return c - 'A' + 0xA; + elog(ERROR, "invalid hexadecimal digit"); + return 0; /* not reached */ + } + + static void + check_unicode_value(pg_wchar c, char * loc) + { + if (GetDatabaseEncoding() == PG_UTF8) + return; + + if (c > 0x7F) + { + yylloc += (char *) loc - literalbuf + 3; /* 3 for U&" */ + yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is notUTF8"); + } + } + + static char * + litbuf_udeescape(unsigned char escape) + { + char *new; + char *in, *out; + + if (isxdigit(escape) + || escape == '+' + || escape == '\'' + || escape == '"' + || scanner_isspace(escape)) + yyerror("invalid Unicode escape character"); + + /* + * This relies on the subtle assumption that a UTF-8 expansion + * cannot be longer than its escaped representation. + */ + new = palloc(literallen + 1); + + in = literalbuf; + out = new; + while (*in) + { + if (in[0] == escape) + { + if (in[1] == escape) + { + *out++ = escape; + in += 2; + } + else if (isxdigit(in[1]) && isxdigit(in[2]) && isxdigit(in[3]) && isxdigit(in[4])) + { + pg_wchar unicode = hexval(in[1]) * 16*16*16 + hexval(in[2]) * 16*16 + hexval(in[3]) * 16 + hexval(in[4]); + check_unicode_value(unicode, in); + unicode_to_utf8(unicode, (unsigned char *) out); + in += 5; + out += pg_mblen(out); + } + else if (in[1] == '+' + && isxdigit(in[2]) && isxdigit(in[3]) + && isxdigit(in[4]) && isxdigit(in[5]) + && isxdigit(in[6]) && isxdigit(in[7])) + { + pg_wchar unicode = hexval(in[2]) * 16*16*16*16*16 + hexval(in[3]) * 16*16*16*16 + hexval(in[4]) * 16*16*16 + + hexval(in[5]) * 16*16 + hexval(in[6]) * 16 + hexval(in[7]); + check_unicode_value(unicode, in); + unicode_to_utf8(unicode, (unsigned char *) out); + in += 8; + out += pg_mblen(out); + } + else + { + yylloc += in - literalbuf + 3; /* 3 for U&" */ + yyerror("invalid Unicode escape value"); + } + } + else + *out++ = *in++; + } + + *out = '\0'; + pg_verifymbstr(new, out - new, false); + return new; + } static unsigned char unescape_single_char(unsigned char c) Index: src/backend/utils/adt/xml.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/adt/xml.c,v retrieving revision 1.79 diff -u -3 -p -c -r1.79 xml.c *** src/backend/utils/adt/xml.c 14 Oct 2008 17:12:33 -0000 1.79 --- src/backend/utils/adt/xml.c 27 Oct 2008 16:54:27 -0000 *************** unicode_to_sqlchar(pg_wchar c) *** 1497,1524 **** { static unsigned char utf8string[5]; /* need trailing zero */ ! if (c <= 0x7F) ! { ! utf8string[0] = c; ! } ! else if (c <= 0x7FF) ! { ! utf8string[0] = 0xC0 | ((c >> 6) & 0x1F); ! utf8string[1] = 0x80 | (c & 0x3F); ! } ! else if (c <= 0xFFFF) ! { ! utf8string[0] = 0xE0 | ((c >> 12) & 0x0F); ! utf8string[1] = 0x80 | ((c >> 6) & 0x3F); ! utf8string[2] = 0x80 | (c & 0x3F); ! } ! else ! { ! utf8string[0] = 0xF0 | ((c >> 18) & 0x07); ! utf8string[1] = 0x80 | ((c >> 12) & 0x3F); ! utf8string[2] = 0x80 | ((c >> 6) & 0x3F); ! utf8string[3] = 0x80 | (c & 0x3F); ! } return (char *) pg_do_encoding_conversion(utf8string, pg_mblen((char *) utf8string), --- 1497,1503 ---- { static unsigned char utf8string[5]; /* need trailing zero */ ! unicode_to_utf8(c, utf8string); return (char *) pg_do_encoding_conversion(utf8string, pg_mblen((char *) utf8string), Index: src/backend/utils/mb/wchar.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/utils/mb/wchar.c,v retrieving revision 1.66 diff -u -3 -p -c -r1.66 wchar.c *** src/backend/utils/mb/wchar.c 15 Nov 2007 21:14:40 -0000 1.66 --- src/backend/utils/mb/wchar.c 27 Oct 2008 16:54:27 -0000 *************** pg_utf2wchar_with_len(const unsigned cha *** 419,424 **** --- 419,459 ---- return cnt; } + + /* + * Map a Unicode codepoint to UTF-8. utf8string must have 4 bytes of + * space allocated. + */ + unsigned char * + unicode_to_utf8(pg_wchar c, unsigned char *utf8string) + { + if (c <= 0x7F) + { + utf8string[0] = c; + } + else if (c <= 0x7FF) + { + utf8string[0] = 0xC0 | ((c >> 6) & 0x1F); + utf8string[1] = 0x80 | (c & 0x3F); + } + else if (c <= 0xFFFF) + { + utf8string[0] = 0xE0 | ((c >> 12) & 0x0F); + utf8string[1] = 0x80 | ((c >> 6) & 0x3F); + utf8string[2] = 0x80 | (c & 0x3F); + } + else + { + utf8string[0] = 0xF0 | ((c >> 18) & 0x07); + utf8string[1] = 0x80 | ((c >> 12) & 0x3F); + utf8string[2] = 0x80 | ((c >> 6) & 0x3F); + utf8string[3] = 0x80 | (c & 0x3F); + } + + return utf8string; + } + + /* * Return the byte length of a UTF8 character pointed to by s * Index: src/include/mb/pg_wchar.h =================================================================== RCS file: /cvsroot/pgsql/src/include/mb/pg_wchar.h,v retrieving revision 1.79 diff -u -3 -p -c -r1.79 pg_wchar.h *** src/include/mb/pg_wchar.h 18 Jun 2008 18:42:54 -0000 1.79 --- src/include/mb/pg_wchar.h 27 Oct 2008 16:54:27 -0000 *************** extern const char *GetDatabaseEncodingNa *** 380,385 **** --- 380,386 ---- extern int pg_valid_client_encoding(const char *name); extern int pg_valid_server_encoding(const char *name); + extern unsigned char *unicode_to_utf8(pg_wchar c, unsigned char *utf8string); extern int pg_utf_mblen(const unsigned char *); extern unsigned char *pg_do_encoding_conversion(unsigned char *src, int len, int src_encoding,
pgsql-hackers by date: