Re: benchmarking Flex practices - Mailing list pgsql-hackers

From Tom Lane
Subject Re: benchmarking Flex practices
Date
Msg-id 31686.1574722301@sss.pgh.pa.us
In response to Re: benchmarking Flex practices  (John Naylor <john.naylor@2ndquadrant.com>)
Responses Re: benchmarking Flex practices  (John Naylor <john.naylor@2ndquadrant.com>)
List pgsql-hackers
[ My apologies for being so slow to get back to this ]

John Naylor <john.naylor@2ndquadrant.com> writes:
> Now that I think of it, the regression in v7 was largely due to the
> fact that the parser has to call the lexer 3 times per string in this
> case, and that's going to be slower no matter what we do.

Ah, of course.  I'm not too fussed about the performance of queries with
an explicit UESCAPE clause, as that seems like a very minority use-case.
What we do want to pay attention to is not regressing for plain
identifiers/strings, and to a lesser extent the U& cases without UESCAPE.

> Inlining hexval() and friends seems to have helped somewhat for
> unicode escapes, but I'd have to profile to improve that further.
> However, v8 has regressed from v7 enough with both simple strings and
> the information schema that it's a noticeable regression from HEAD.
> I'm guessing getting rid of the "Uescape" production is to blame, but
> I haven't tried reverting just that one piece. Since inlining the
> rules didn't seem to help with the precedence hacks, it seems like the
> separate production was a better way. Thoughts?

I have duplicated your performance tests here, and get more or less
the same results (see below).  I agree that the performance of the
v8 patch isn't really where we want to be --- and it also seems
rather invasive to gram.y, and hence error-prone.  (If we do it
like that, I bet my bottom dollar that somebody would soon commit
a patch that adds a production using IDENT not Ident, and it'd take
a long time to notice.)

It struck me though that there's another solution we haven't discussed,
and that's to make the token lookahead filter in parser.c do the work
of converting UIDENT [UESCAPE SCONST] to IDENT, and similarly for the
string case.  I pursued that to the extent of developing the attached
incomplete patch ("v9"), which looks reasonable from a performance
standpoint.  I get these results with tests using the drive_parser
function:

information_schema

HEAD    3447.674 ms, 3433.498 ms, 3422.407 ms
v6    3381.851 ms, 3442.478 ms, 3402.629 ms
v7    3525.865 ms, 3441.038 ms, 3473.488 ms
v8    3567.640 ms, 3488.417 ms, 3556.544 ms
v9    3456.360 ms, 3403.635 ms, 3418.787 ms

pgbench str

HEAD    4414.046 ms, 4376.222 ms, 4356.468 ms
v6    4304.582 ms, 4245.534 ms, 4263.562 ms
v7    4395.815 ms, 4398.381 ms, 4460.304 ms
v8    4475.706 ms, 4466.665 ms, 4471.048 ms
v9    4392.473 ms, 4316.549 ms, 4318.472 ms

pgbench unicode

HEAD    4959.000 ms, 4921.751 ms, 4945.069 ms
v6    4856.998 ms, 4802.996 ms, 4855.486 ms
v7    5057.199 ms, 4948.342 ms, 4956.614 ms
v8    5008.090 ms, 4963.641 ms, 4983.576 ms
v9    4809.227 ms, 4767.355 ms, 4741.641 ms

pgbench uesc

HEAD    5114.401 ms, 5235.764 ms, 5200.567 ms
v6    5030.156 ms, 5083.398 ms, 4986.974 ms
v7    5915.508 ms, 5953.135 ms, 5929.775 ms
v8    5678.810 ms, 5665.239 ms, 5645.696 ms
v9    5648.965 ms, 5601.592 ms, 5600.480 ms

(A note about what we're looking at: on my machine, after using cpupower
to lock down the CPU frequency, and taskset to bind everything to one
CPU socket, I can get numbers that are very repeatable, to 0.1% or so
... until I restart the postmaster, and then I get different but equally
repeatable numbers.  The difference can be several percent, which is a lot
of noise compared to what we're looking for.  I believe the explanation is
that kernel ASLR has loaded the backend executable at some different
addresses and so there are different cache-line-boundary effects.  While
I could lock that down too by disabling ASLR, the result would be to
overemphasize chance effects of a particular set of cache line boundaries.
So I prefer to run all the tests over again after restarting the
postmaster, a few times, and then look at the overall set of results to
see what things look like.  Each number quoted above is median-of-three
tests within a single postmaster run.)
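
To make the summarizing step concrete, here is a minimal sketch in C of a median-of-three helper plus a relative-change calculation of the kind one might use when comparing a patched build against HEAD. The function names median3 and pct_change are hypothetical; the note above does not say how the medians were actually computed.

```c
#include <assert.h>

/* Median of three timings: the middle value once the three are ordered. */
static double
median3(double a, double b, double c)
{
    if ((a <= b && b <= c) || (c <= b && b <= a))
        return b;
    if ((b <= a && a <= c) || (c <= a && a <= b))
        return a;
    return c;
}

/*
 * Relative change of "patched" against "base", in percent.
 * Negative means the patched build was faster.
 */
static double
pct_change(double base, double patched)
{
    assert(base != 0.0);        /* a zero baseline would be meaningless */
    return (patched - base) / base * 100.0;
}
```

Since the per-restart variation can be several percent, a pct_change() result from a single postmaster run is only meaningful when set against the same comparison from the other runs.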

Anyway, my conclusion is that the attached patch is at least as fast
as today's HEAD; it's not as fast as v6, but on the other hand it's
an even smaller postmaster executable, so there's something to be said
for that:

$ size postg*
   text    data     bss     dec     hex filename
7478138   57928  203360 7739426  761822 postgres.head
7271218   57928  203360 7532506  72efda postgres.v6
7275810   57928  203360 7537098  7301ca postgres.v7
7276978   57928  203360 7538266  73065a postgres.v8
7266274   57928  203360 7527562  72dc8a postgres.v9

I based this on your v7 not v8; not sure if there's anything you
want to salvage from v8.

Generally, I'm pretty happy with this approach: it touches gram.y
hardly at all, and it removes just about all of the complexity from
scan.l.  I'm happier about dropping the support code into parser.c
than the other choices we've discussed.
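
As a rough illustration of the idea, and not the code in the attached patch, the following self-contained C sketch folds UIDENT [UESCAPE SCONST] into a single IDENT in a filter sitting between a toy lexer and its consumer. Every name here (TokType, Token, TokStream, raw_next, filtered_next) is hypothetical, the "lexer" is just an array of pre-made tokens, and the actual de-escaping is left out; the filter only reports which escape character would apply.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
    TOK_EOF, TOK_IDENT, TOK_UIDENT, TOK_UESCAPE, TOK_SCONST, TOK_ERROR
} TokType;

typedef struct
{
    TokType     type;
    const char *str;
} Token;

typedef struct
{
    const Token *toks;          /* array terminated by a TOK_EOF entry */
    size_t      pos;
} TokStream;

/* Stand-in for the raw lexer: hand out the next token, sticking at EOF. */
static Token
raw_next(TokStream *ts)
{
    Token       t;

    assert(ts != NULL);
    t = ts->toks[ts->pos];
    if (t.type != TOK_EOF)
        ts->pos++;
    return t;
}

/*
 * Filter layer: return the next token, folding UIDENT [UESCAPE SCONST]
 * into a plain IDENT.  For a folded token, *escape receives the escape
 * character to use (backslash when no UESCAPE clause follows).
 */
static Token
filtered_next(TokStream *ts, char *escape)
{
    Token       cur = raw_next(ts);

    if (cur.type != TOK_UIDENT)
        return cur;             /* no lookahead needed */

    *escape = '\\';
    if (ts->toks[ts->pos].type == TOK_UESCAPE)  /* peek one token ahead */
    {
        Token       esc;

        ts->pos++;              /* consume UESCAPE */
        esc = raw_next(ts);     /* third token, which must be SCONST */
        if (esc.type != TOK_SCONST || strlen(esc.str) != 1)
        {
            cur.type = TOK_ERROR;
            return cur;
        }
        *escape = esc.str[0];
    }
    cur.type = TOK_IDENT;
    return cur;
}
```

The grammar-side consumer in this sketch only ever sees IDENT, which is the property that lets gram.y stay nearly untouched; the string-constant case would follow the same shape with SCONST as the result.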

There's still undone work here, though:

* I did not touch psql.  Probably your patch is fine for that.

* I did not do more with ecpg than get it to compile, using the
same hacks as in your v7.  It still fails its regression tests,
but now the reason is that what we've done in parser/parser.c
needs to be transposed into the identical functionality in
ecpg/preproc/parser.c.  Or at least some kind of functionality
there.  A problem with this approach is that it presumes we can
reduce a UIDENT sequence to a plain IDENT, but to do so we need
assumptions about the target encoding, and I'm not sure that
ecpg should make any such assumptions.  Maybe ecpg should just
reject all cases that produce non-ASCII identifiers?  (Probably
it could be made to do something smarter with more work, but
it's not clear to me that it's worth the trouble.)

* I haven't convinced myself either way as to whether it'd be
better to factor out the code duplicated between the UIDENT
and UCONST cases in base_yylex.
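
For the ecpg item above, if the "just reject non-ASCII results" route were taken, the check itself could be as small as the following sketch. The name ident_is_ascii is hypothetical, and where ecpg would call it (and what error it would raise) is left open here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if every byte of the de-escaped identifier is 7-bit ASCII,
 * in which case no assumption about the target encoding is needed.
 */
static bool
ident_is_ascii(const char *ident)
{
    const unsigned char *p;

    assert(ident != NULL);
    for (p = (const unsigned char *) ident; *p != '\0'; p++)
    {
        if (*p > 0x7F)
            return false;
    }
    return true;
}
```

A caller would run this on the result of the Unicode conversion and report an error when it returns false, rather than trying to guess an encoding.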

If this seems like a reasonable approach to you, please fill in
the missing psql and ecpg bits.

            regards, tom lane

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c508684..1f10340 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -601,7 +601,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
  * DOT_DOT is unused in the core SQL grammar, and so will always provoke
  * parse errors.  It is needed by PL/pgSQL.
  */
-%token <str>    IDENT FCONST SCONST BCONST XCONST Op
+%token <str>    IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
 %token <ival>    ICONST PARAM
 %token            TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token            LESS_EQUALS GREATER_EQUALS NOT_EQUALS
@@ -691,7 +691,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
     TREAT TRIGGER TRIM TRUE_P
     TRUNCATE TRUSTED TYPE_P TYPES_P

-    UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
+    UESCAPE UNBOUNDED UNCOMMITTED UNENCRYPTED UNION UNIQUE UNKNOWN UNLISTEN UNLOGGED
     UNTIL UPDATE USER USING

     VACUUM VALID VALIDATE VALIDATOR VALUE_P VALUES VARCHAR VARIADIC VARYING
@@ -15374,6 +15374,7 @@ unreserved_keyword:
             | TRUSTED
             | TYPE_P
             | TYPES_P
+            | UESCAPE
             | UNBOUNDED
             | UNCOMMITTED
             | UNENCRYPTED
diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index 4c0c258..e64f701 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -23,6 +23,12 @@

 #include "parser/gramparse.h"
 #include "parser/parser.h"
+#include "parser/scansup.h"
+#include "mb/pg_wchar.h"
+
+static bool check_uescapechar(unsigned char escape);
+static char *str_udeescape(char escape, char *str, int position,
+                           core_yyscan_t yyscanner);


 /*
@@ -75,6 +81,10 @@ raw_parser(const char *str)
  * scanner backtrack, which would cost more performance than this filter
  * layer does.
  *
+ * We also use this filter to convert UIDENT and UCONST sequences into
+ * plain IDENT and SCONST tokens.  While that could be handled by additional
+ * productions in the main grammar, it's more efficient to do it like this.
+ *
  * The filter also provides a convenient place to translate between
  * the core_YYSTYPE and YYSTYPE representations (which are really the
  * same thing anyway, but notationally they're different).
@@ -104,7 +114,7 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
      * If this token isn't one that requires lookahead, just return it.  If it
      * does, determine the token length.  (We could get that via strlen(), but
      * since we have such a small set of possibilities, hardwiring seems
-     * feasible and more efficient.)
+     * feasible and more efficient --- at least for the fixed-length cases.)
      */
     switch (cur_token)
     {
@@ -117,6 +127,10 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
         case WITH:
             cur_token_length = 4;
             break;
+        case UIDENT:
+        case UCONST:
+            cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp);
+            break;
         default:
             return cur_token;
     }
@@ -190,7 +204,311 @@ base_yylex(YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner)
                     break;
             }
             break;
+
+        case UIDENT:
+            /* Look ahead for UESCAPE */
+            if (next_token == UESCAPE)
+            {
+                /* Yup, so get third token, which had better be SCONST */
+                const char *escstr;
+
+                /* Again save and restore *llocp */
+                cur_yylloc = *llocp;
+
+                /* Get third token */
+                next_token = core_yylex(&(yyextra->lookahead_yylval),
+                                        llocp, yyscanner);
+
+                /* If we throw error here, it will point to third token */
+                if (next_token != SCONST)
+                    scanner_yyerror("UESCAPE must be followed by a simple string literal",
+                                    yyscanner);
+
+                escstr = yyextra->lookahead_yylval.str;
+                if (strlen(escstr) != 1 || !check_uescapechar(escstr[0]))
+                    scanner_yyerror("invalid Unicode escape character",
+                                    yyscanner);
+
+                /* Now restore *llocp; errors will point to first token */
+                *llocp = cur_yylloc;
+
+                /* Apply Unicode conversion */
+                lvalp->core_yystype.str =
+                    str_udeescape(escstr[0],
+                                  lvalp->core_yystype.str,
+                                  *llocp,
+                                  yyscanner);
+
+                /*
+                 * We don't need to un-revert truncation of UESCAPE.  What we
+                 * do want to do is clear have_lookahead, thereby consuming
+                 * all three tokens.
+                 */
+                yyextra->have_lookahead = false;
+            }
+            else
+            {
+                /* No UESCAPE, so convert using default escape character */
+                lvalp->core_yystype.str =
+                    str_udeescape('\\',
+                                  lvalp->core_yystype.str,
+                                  *llocp,
+                                  yyscanner);
+            }
+            /* It's an identifier, so truncate as appropriate */
+            truncate_identifier(lvalp->core_yystype.str,
+                                strlen(lvalp->core_yystype.str),
+                                true);
+            cur_token = IDENT;
+            break;
+
+        case UCONST:
+            /* Look ahead for UESCAPE */
+            if (next_token == UESCAPE)
+            {
+                /* Yup, so get third token, which had better be SCONST */
+                const char *escstr;
+
+                /* Again save and restore *llocp */
+                cur_yylloc = *llocp;
+
+                /* Get third token */
+                next_token = core_yylex(&(yyextra->lookahead_yylval),
+                                        llocp, yyscanner);
+
+                /* If we throw error here, it will point to third token */
+                if (next_token != SCONST)
+                    scanner_yyerror("UESCAPE must be followed by a simple string literal",
+                                    yyscanner);
+
+                escstr = yyextra->lookahead_yylval.str;
+                if (strlen(escstr) != 1 || !check_uescapechar(escstr[0]))
+                    scanner_yyerror("invalid Unicode escape character",
+                                    yyscanner);
+
+                /* Now restore *llocp; errors will point to first token */
+                *llocp = cur_yylloc;
+
+                /* Apply Unicode conversion */
+                lvalp->core_yystype.str =
+                    str_udeescape(escstr[0],
+                                  lvalp->core_yystype.str,
+                                  *llocp,
+                                  yyscanner);
+
+                /*
+                 * We don't need to un-revert truncation of UESCAPE.  What we
+                 * do want to do is clear have_lookahead, thereby consuming
+                 * all three tokens.
+                 */
+                yyextra->have_lookahead = false;
+            }
+            else
+            {
+                /* No UESCAPE, so convert using default escape character */
+                lvalp->core_yystype.str =
+                    str_udeescape('\\',
+                                  lvalp->core_yystype.str,
+                                  *llocp,
+                                  yyscanner);
+            }
+            cur_token = SCONST;
+            break;
     }

     return cur_token;
 }
+
+/* convert hex digit (caller should have verified that) to value */
+static unsigned int
+hexval(unsigned char c)
+{
+    if (c >= '0' && c <= '9')
+        return c - '0';
+    if (c >= 'a' && c <= 'f')
+        return c - 'a' + 0xA;
+    if (c >= 'A' && c <= 'F')
+        return c - 'A' + 0xA;
+    elog(ERROR, "invalid hexadecimal digit");
+    return 0;                    /* not reached */
+}
+
+/* is Unicode code point acceptable in database's encoding? */
+static void
+check_unicode_value(pg_wchar c, int pos, core_yyscan_t yyscanner)
+{
+    /* See also addunicode() in scan.l */
+    if (c == 0 || c > 0x10FFFF)
+        ereport(ERROR,
+                (errcode(ERRCODE_SYNTAX_ERROR),
+                 errmsg("invalid Unicode escape value"),
+                 scanner_errposition(pos, yyscanner)));
+
+    if (c > 0x7F && GetDatabaseEncoding() != PG_UTF8)
+        ereport(ERROR,
+                (errcode(ERRCODE_SYNTAX_ERROR),
+                 errmsg("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8"),
+                 scanner_errposition(pos, yyscanner)));
+}
+
+/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
+static bool
+check_uescapechar(unsigned char escape)
+{
+    if (isxdigit(escape)
+        || escape == '+'
+        || escape == '\''
+        || escape == '"'
+        || scanner_isspace(escape))
+        return false;
+    else
+        return true;
+}
+
+/* Process Unicode escapes in "str", producing a palloc'd plain string */
+static char *
+str_udeescape(char escape, char *str, int position,
+              core_yyscan_t yyscanner)
+{
+    char       *new,
+               *in,
+               *out;
+    int            str_length;
+    pg_wchar    pair_first = 0;
+
+    str_length = strlen(str);
+
+    /*
+     * This relies on the subtle assumption that a UTF-8 expansion cannot be
+     * longer than its escaped representation.
+     */
+    new = palloc(str_length + 1);
+
+    in = str;
+    out = new;
+    while (*in)
+    {
+        if (in[0] == escape)
+        {
+            if (in[1] == escape)
+            {
+                if (pair_first)
+                    goto invalid_pair;
+                *out++ = escape;
+                in += 2;
+            }
+            else if (isxdigit((unsigned char) in[1]) &&
+                     isxdigit((unsigned char) in[2]) &&
+                     isxdigit((unsigned char) in[3]) &&
+                     isxdigit((unsigned char) in[4]))
+            {
+                pg_wchar    unicode;
+
+                unicode = (hexval(in[1]) << 12) +
+                    (hexval(in[2]) << 8) +
+                    (hexval(in[3]) << 4) +
+                    hexval(in[4]);
+                check_unicode_value(unicode,
+                                    position + in - str + 3,    /* 3 for U&" */
+                                    yyscanner);
+                if (pair_first)
+                {
+                    if (is_utf16_surrogate_second(unicode))
+                    {
+                        unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+                        pair_first = 0;
+                    }
+                    else
+                        goto invalid_pair;
+                }
+                else if (is_utf16_surrogate_second(unicode))
+                    goto invalid_pair;
+
+                if (is_utf16_surrogate_first(unicode))
+                    pair_first = unicode;
+                else
+                {
+                    unicode_to_utf8(unicode, (unsigned char *) out);
+                    out += pg_mblen(out);
+                }
+                in += 5;
+            }
+            else if (in[1] == '+' &&
+                     isxdigit((unsigned char) in[2]) &&
+                     isxdigit((unsigned char) in[3]) &&
+                     isxdigit((unsigned char) in[4]) &&
+                     isxdigit((unsigned char) in[5]) &&
+                     isxdigit((unsigned char) in[6]) &&
+                     isxdigit((unsigned char) in[7]))
+            {
+                pg_wchar    unicode;
+
+                unicode = (hexval(in[2]) << 20) +
+                    (hexval(in[3]) << 16) +
+                    (hexval(in[4]) << 12) +
+                    (hexval(in[5]) << 8) +
+                    (hexval(in[6]) << 4) +
+                    hexval(in[7]);
+                check_unicode_value(unicode,
+                                    position + in - str + 3,    /* 3 for U&" */
+                                    yyscanner);
+                if (pair_first)
+                {
+                    if (is_utf16_surrogate_second(unicode))
+                    {
+                        unicode = surrogate_pair_to_codepoint(pair_first, unicode);
+                        pair_first = 0;
+                    }
+                    else
+                        goto invalid_pair;
+                }
+                else if (is_utf16_surrogate_second(unicode))
+                    goto invalid_pair;
+
+                if (is_utf16_surrogate_first(unicode))
+                    pair_first = unicode;
+                else
+                {
+                    unicode_to_utf8(unicode, (unsigned char *) out);
+                    out += pg_mblen(out);
+                }
+                in += 8;
+            }
+            else
+                ereport(ERROR,
+                        (errcode(ERRCODE_SYNTAX_ERROR),
+                         errmsg("invalid Unicode escape value"),
+                         scanner_errposition(position + in - str + 3,    /* 3 for U&" */
+                                             yyscanner)));
+        }
+        else
+        {
+            if (pair_first)
+                goto invalid_pair;
+
+            *out++ = *in++;
+        }
+    }
+
+    /* unfinished surrogate pair? */
+    if (pair_first)
+        goto invalid_pair;
+
+    *out = '\0';
+
+    /*
+     * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
+     * codes; but it's probably not worth the trouble, since this isn't likely
+     * to be a performance-critical path.
+     */
+    pg_verifymbstr(new, out - new, false);
+    return new;
+
+invalid_pair:
+    ereport(ERROR,
+            (errcode(ERRCODE_SYNTAX_ERROR),
+             errmsg("invalid Unicode surrogate pair"),
+             scanner_errposition(position + in - str + 3,    /* 3 for U&" */
+                                 yyscanner)));
+    return NULL;                /* keep compiler quiet */
+}
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae85..a96af2c 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -110,14 +110,9 @@ const uint16 ScanKeywordTokens[] = {
 static void addlit(char *ytext, int yleng, core_yyscan_t yyscanner);
 static void addlitchar(unsigned char ychar, core_yyscan_t yyscanner);
 static char *litbufdup(core_yyscan_t yyscanner);
-static char *litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner);
 static unsigned char unescape_single_char(unsigned char c, core_yyscan_t yyscanner);
 static int    process_integer_literal(const char *token, YYSTYPE *lval);
-static bool is_utf16_surrogate_first(pg_wchar c);
-static bool is_utf16_surrogate_second(pg_wchar c);
-static pg_wchar surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second);
 static void addunicode(pg_wchar c, yyscan_t yyscanner);
-static bool check_uescapechar(unsigned char escape);

 #define yyerror(msg)  scanner_yyerror(msg, yyscanner)

@@ -168,12 +163,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -185,12 +179,11 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
 %x xeu

 /*
@@ -231,19 +224,18 @@ special_whitespace        ({space}+|{comment}{newline})
 horiz_whitespace        ({horiz_space}|{comment})
 whitespace_with_newline    ({horiz_whitespace}*{newline}{special_whitespace}*)

+quote            '
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue    {whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote            '
-quotestop        {quote}{whitespace}*
-quotecontinue    {quote}{whitespace_with_newline}{quote}
-quotefail        {quote}{whitespace}*"-"
+quotecontinuefail    {whitespace}*"-"?

 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -304,21 +296,12 @@ xdstop            {dquote}
 xddouble        {dquote}{dquote}
 xdinside        [^"]+

-/* Unicode escapes */
-uescape            [uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
-/* error rule to avoid backup */
-uescapefail        [uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
-
 /* Quoted identifier with Unicode escapes */
 xuistart        [uU]&{dquote}

 /* Quoted string with Unicode escapes */
 xusstart        [uU]&{quote}

-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1        {uescapefail}?
-xustop2        {uescape}
-
 /* error rule to avoid backup */
 xufailed        [uU]&

@@ -476,21 +459,10 @@ other            .
                     startlit();
                     addlitchar('b', yyscanner);
                 }
-<xb>{quotestop}    |
-<xb>{quotefail} {
-                    yyless(1);
-                    BEGIN(INITIAL);
-                    yylval->str = litbufdup(yyscanner);
-                    return BCONST;
-                }
 <xh>{xhinside}    |
 <xb>{xbinside}    {
                     addlit(yytext, yyleng, yyscanner);
                 }
-<xh>{quotecontinue}    |
-<xb>{quotecontinue}    {
-                    /* ignore */
-                }
 <xb><<EOF>>        { yyerror("unterminated bit string literal"); }

 {xhstart}        {
@@ -505,13 +477,6 @@ other            .
                     startlit();
                     addlitchar('x', yyscanner);
                 }
-<xh>{quotestop}    |
-<xh>{quotefail} {
-                    yyless(1);
-                    BEGIN(INITIAL);
-                    yylval->str = litbufdup(yyscanner);
-                    return XCONST;
-                }
 <xh><<EOF>>        { yyerror("unterminated hexadecimal string literal"); }

 {xnstart}        {
@@ -568,53 +533,67 @@ other            .
                     BEGIN(xus);
                     startlit();
                 }
-<xq,xe>{quotestop}    |
-<xq,xe>{quotefail} {
-                    yyless(1);
-                    BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
                     /*
-                     * check that the data remains valid if it might have been
-                     * made invalid by unescaping any chars.
+                     * When we are scanning a quoted string and see an end
+                     * quote, we must look ahead for a possible continuation.
+                     * If we don't see one, we know the end quote was in fact
+                     * the end of the string.  To reduce the lexer table size,
+                     * we use a single "xqs" state to do the lookahead for all
+                     * types of strings.
                      */
-                    if (yyextra->saw_non_ascii)
-                        pg_verifymbstr(yyextra->literalbuf,
-                                       yyextra->literallen,
-                                       false);
-                    yylval->str = litbufdup(yyscanner);
-                    return SCONST;
-                }
-<xus>{quotestop} |
-<xus>{quotefail} {
-                    /* throw back all but the quote */
-                    yyless(1);
-                    /* xusend state looks for possible UESCAPE */
-                    BEGIN(xusend);
+                    yyextra->state_before_str_stop = YYSTATE;
+                    BEGIN(xqs);
                 }
-<xusend>{whitespace} {
-                    /* stay in xusend state over whitespace */
+<xqs>{quotecontinue} {
+                    /*
+                     * Found a quote continuation, so return to the in-quote
+                     * state and continue scanning the literal.
+                     */
+                    BEGIN(yyextra->state_before_str_stop);
                 }
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
-                    /* no UESCAPE after the quote, throw back everything */
+<xqs>{quotecontinuefail} |
+<xqs><<EOF>> |
+<xqs>{other}    {
+                    /*
+                     * Failed to see a quote continuation.  Throw back
+                     * everything after the end quote, and handle the string
+                     * according to the state we were in previously.
+                     */
                     yyless(0);
                     BEGIN(INITIAL);
-                    yylval->str = litbuf_udeescape('\\', yyscanner);
-                    return SCONST;
-                }
-<xusend>{xustop2} {
-                    /* found UESCAPE after the end quote */
-                    BEGIN(INITIAL);
-                    if (!check_uescapechar(yytext[yyleng - 2]))
+
+                    switch (yyextra->state_before_str_stop)
                     {
-                        SET_YYLLOC();
-                        ADVANCE_YYLLOC(yyleng - 2);
-                        yyerror("invalid Unicode escape character");
+                        case xb:
+                            yylval->str = litbufdup(yyscanner);
+                            return BCONST;
+                        case xh:
+                            yylval->str = litbufdup(yyscanner);
+                            return XCONST;
+                        case xq:
+                            /* fallthrough */
+                        case xe:
+                            /*
+                             * Check that the data remains valid if it
+                             * might have been made invalid by unescaping
+                             * any chars.
+                             */
+                            if (yyextra->saw_non_ascii)
+                                pg_verifymbstr(yyextra->literalbuf,
+                                               yyextra->literallen,
+                                               false);
+                            yylval->str = litbufdup(yyscanner);
+                            return SCONST;
+                        case xus:
+                            yylval->str = litbufdup(yyscanner);
+                            return UCONST;
+                        default:
+                            yyerror("unhandled previous state in xqs");
                     }
-                    yylval->str = litbuf_udeescape(yytext[yyleng - 2],
-                                                   yyscanner);
-                    return SCONST;
                 }
+
 <xq,xe,xus>{xqdouble} {
                     addlitchar('\'', yyscanner);
                 }
@@ -693,9 +672,6 @@ other            .
                     if (c == '\0' || IS_HIGHBIT_SET(c))
                         yyextra->saw_non_ascii = true;
                 }
-<xq,xe,xus>{quotecontinue} {
-                    /* ignore */
-                }
 <xe>.            {
                     /* This is only needed for \ just before EOF */
                     addlitchar(yytext[0], yyscanner);
@@ -770,53 +746,14 @@ other            .
                     return IDENT;
                 }
 <xui>{dquote} {
-                    yyless(1);
-                    /* xuiend state looks for possible UESCAPE */
-                    BEGIN(xuiend);
-                }
-<xuiend>{whitespace} {
-                    /* stay in xuiend state over whitespace */
-                }
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
-                    /* no UESCAPE after the quote, throw back everything */
-                    char       *ident;
-                    int            identlen;
-
-                    yyless(0);
-
-                    BEGIN(INITIAL);
                     if (yyextra->literallen == 0)
                         yyerror("zero-length delimited identifier");
-                    ident = litbuf_udeescape('\\', yyscanner);
-                    identlen = strlen(ident);
-                    if (identlen >= NAMEDATALEN)
-                        truncate_identifier(ident, identlen, true);
-                    yylval->str = ident;
-                    return IDENT;
-                }
-<xuiend>{xustop2}    {
-                    /* found UESCAPE after the end quote */
-                    char       *ident;
-                    int            identlen;

                     BEGIN(INITIAL);
-                    if (yyextra->literallen == 0)
-                        yyerror("zero-length delimited identifier");
-                    if (!check_uescapechar(yytext[yyleng - 2]))
-                    {
-                        SET_YYLLOC();
-                        ADVANCE_YYLLOC(yyleng - 2);
-                        yyerror("invalid Unicode escape character");
-                    }
-                    ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
-                    identlen = strlen(ident);
-                    if (identlen >= NAMEDATALEN)
-                        truncate_identifier(ident, identlen, true);
-                    yylval->str = ident;
-                    return IDENT;
+                    yylval->str = litbufdup(yyscanner);
+                    return UIDENT;
                 }
+
 <xd,xui>{xddouble}    {
                     addlitchar('"', yyscanner);
                 }
@@ -1288,55 +1225,12 @@ process_integer_literal(const char *token, YYSTYPE *lval)
     return ICONST;
 }

-static unsigned int
-hexval(unsigned char c)
-{
-    if (c >= '0' && c <= '9')
-        return c - '0';
-    if (c >= 'a' && c <= 'f')
-        return c - 'a' + 0xA;
-    if (c >= 'A' && c <= 'F')
-        return c - 'A' + 0xA;
-    elog(ERROR, "invalid hexadecimal digit");
-    return 0;                    /* not reached */
-}
-
-static void
-check_unicode_value(pg_wchar c, char *loc, core_yyscan_t yyscanner)
-{
-    if (GetDatabaseEncoding() == PG_UTF8)
-        return;
-
-    if (c > 0x7F)
-    {
-        ADVANCE_YYLLOC(loc - yyextra->literalbuf + 3);    /* 3 for U&" */
-        yyerror("Unicode escape values cannot be used for code point values above 007F when the server encoding is not UTF8");
-    }
-}
-
-static bool
-is_utf16_surrogate_first(pg_wchar c)
-{
-    return (c >= 0xD800 && c <= 0xDBFF);
-}
-
-static bool
-is_utf16_surrogate_second(pg_wchar c)
-{
-    return (c >= 0xDC00 && c <= 0xDFFF);
-}
-
-static pg_wchar
-surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
-{
-    return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
-}
-
 static void
 addunicode(pg_wchar c, core_yyscan_t yyscanner)
 {
     char        buf[8];

+    /* See also check_unicode_value() in parser.c */
     if (c == 0 || c > 0x10FFFF)
         yyerror("invalid Unicode escape value");
     if (c > 0x7F)
@@ -1349,172 +1243,6 @@ addunicode(pg_wchar c, core_yyscan_t yyscanner)
     addlit(buf, pg_mblen(buf), yyscanner);
 }

-/* is 'escape' acceptable as Unicode escape character (UESCAPE syntax) ? */
-static bool
-check_uescapechar(unsigned char escape)
-{
-    if (isxdigit(escape)
-        || escape == '+'
-        || escape == '\''
-        || escape == '"'
-        || scanner_isspace(escape))
-    {
-        return false;
-    }
-    else
-        return true;
-}
-
-/* like litbufdup, but handle unicode escapes */
-static char *
-litbuf_udeescape(unsigned char escape, core_yyscan_t yyscanner)
-{
-    char       *new;
-    char       *litbuf,
-               *in,
-               *out;
-    pg_wchar    pair_first = 0;
-
-    /* Make literalbuf null-terminated to simplify the scanning loop */
-    litbuf = yyextra->literalbuf;
-    litbuf[yyextra->literallen] = '\0';
-
-    /*
-     * This relies on the subtle assumption that a UTF-8 expansion cannot be
-     * longer than its escaped representation.
-     */
-    new = palloc(yyextra->literallen + 1);
-
-    in = litbuf;
-    out = new;
-    while (*in)
-    {
-        if (in[0] == escape)
-        {
-            if (in[1] == escape)
-            {
-                if (pair_first)
-                {
-                    ADVANCE_YYLLOC(in - litbuf + 3);    /* 3 for U&" */
-                    yyerror("invalid Unicode surrogate pair");
-                }
-                *out++ = escape;
-                in += 2;
-            }
-            else if (isxdigit((unsigned char) in[1]) &&
-                     isxdigit((unsigned char) in[2]) &&
-                     isxdigit((unsigned char) in[3]) &&
-                     isxdigit((unsigned char) in[4]))
-            {
-                pg_wchar    unicode;
-
-                unicode = (hexval(in[1]) << 12) +
-                    (hexval(in[2]) << 8) +
-                    (hexval(in[3]) << 4) +
-                    hexval(in[4]);
-                check_unicode_value(unicode, in, yyscanner);
-                if (pair_first)
-                {
-                    if (is_utf16_surrogate_second(unicode))
-                    {
-                        unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-                        pair_first = 0;
-                    }
-                    else
-                    {
-                        ADVANCE_YYLLOC(in - litbuf + 3);        /* 3 for U&" */
-                        yyerror("invalid Unicode surrogate pair");
-                    }
-                }
-                else if (is_utf16_surrogate_second(unicode))
-                    yyerror("invalid Unicode surrogate pair");
-
-                if (is_utf16_surrogate_first(unicode))
-                    pair_first = unicode;
-                else
-                {
-                    unicode_to_utf8(unicode, (unsigned char *) out);
-                    out += pg_mblen(out);
-                }
-                in += 5;
-            }
-            else if (in[1] == '+' &&
-                     isxdigit((unsigned char) in[2]) &&
-                     isxdigit((unsigned char) in[3]) &&
-                     isxdigit((unsigned char) in[4]) &&
-                     isxdigit((unsigned char) in[5]) &&
-                     isxdigit((unsigned char) in[6]) &&
-                     isxdigit((unsigned char) in[7]))
-            {
-                pg_wchar    unicode;
-
-                unicode = (hexval(in[2]) << 20) +
-                    (hexval(in[3]) << 16) +
-                    (hexval(in[4]) << 12) +
-                    (hexval(in[5]) << 8) +
-                    (hexval(in[6]) << 4) +
-                    hexval(in[7]);
-                check_unicode_value(unicode, in, yyscanner);
-                if (pair_first)
-                {
-                    if (is_utf16_surrogate_second(unicode))
-                    {
-                        unicode = surrogate_pair_to_codepoint(pair_first, unicode);
-                        pair_first = 0;
-                    }
-                    else
-                    {
-                        ADVANCE_YYLLOC(in - litbuf + 3);        /* 3 for U&" */
-                        yyerror("invalid Unicode surrogate pair");
-                    }
-                }
-                else if (is_utf16_surrogate_second(unicode))
-                    yyerror("invalid Unicode surrogate pair");
-
-                if (is_utf16_surrogate_first(unicode))
-                    pair_first = unicode;
-                else
-                {
-                    unicode_to_utf8(unicode, (unsigned char *) out);
-                    out += pg_mblen(out);
-                }
-                in += 8;
-            }
-            else
-            {
-                ADVANCE_YYLLOC(in - litbuf + 3);        /* 3 for U&" */
-                yyerror("invalid Unicode escape value");
-            }
-        }
-        else
-        {
-            if (pair_first)
-            {
-                ADVANCE_YYLLOC(in - litbuf + 3);        /* 3 for U&" */
-                yyerror("invalid Unicode surrogate pair");
-            }
-            *out++ = *in++;
-        }
-    }
-
-    /* unfinished surrogate pair? */
-    if (pair_first)
-    {
-        ADVANCE_YYLLOC(in - litbuf + 3);                /* 3 for U&" */
-        yyerror("invalid Unicode surrogate pair");
-    }
-
-    *out = '\0';
-
-    /*
-     * We could skip pg_verifymbstr if we didn't process any non-7-bit-ASCII
-     * codes; but it's probably not worth the trouble, since this isn't likely
-     * to be a performance-critical path.
-     */
-    pg_verifymbstr(new, out - new, false);
-    return new;
-}
-
 static unsigned char
 unescape_single_char(unsigned char c, core_yyscan_t yyscanner)
 {
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 3e3e6c4..0c4cb9c 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -509,6 +509,27 @@ typedef uint32 (*utf_local_conversion_func) (uint32 code);


 /*
+ * Some handy functions for Unicode-specific tests.
+ */
+static inline bool
+is_utf16_surrogate_first(pg_wchar c)
+{
+    return (c >= 0xD800 && c <= 0xDBFF);
+}
+
+static inline bool
+is_utf16_surrogate_second(pg_wchar c)
+{
+    return (c >= 0xDC00 && c <= 0xDFFF);
+}
+
+static inline pg_wchar
+surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
+{
+    return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
+}
+
+/*
  * These functions are considered part of libpq's exported API and
  * are also declared in libpq-fe.h.
  */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 00ace84..5893d31 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -416,6 +416,7 @@ PG_KEYWORD("truncate", TRUNCATE, UNRESERVED_KEYWORD)
 PG_KEYWORD("trusted", TRUSTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("type", TYPE_P, UNRESERVED_KEYWORD)
 PG_KEYWORD("types", TYPES_P, UNRESERVED_KEYWORD)
+PG_KEYWORD("uescape", UESCAPE, UNRESERVED_KEYWORD)
 PG_KEYWORD("unbounded", UNBOUNDED, UNRESERVED_KEYWORD)
 PG_KEYWORD("uncommitted", UNCOMMITTED, UNRESERVED_KEYWORD)
 PG_KEYWORD("unencrypted", UNENCRYPTED, UNRESERVED_KEYWORD)
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 731a2bd..571d5e2 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -48,7 +48,7 @@ typedef union core_YYSTYPE
  * However, those are not defined in this file, because bison insists on
  * defining them for itself.  The token codes used by the core scanner are
  * the ASCII characters plus these:
- *    %token <str>    IDENT FCONST SCONST BCONST XCONST Op
+ *    %token <str>    IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
  *    %token <ival>    ICONST PARAM
  *    %token            TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
  *    %token            LESS_EQUALS GREATER_EQUALS NOT_EQUALS
@@ -99,6 +99,7 @@ typedef struct core_yy_extra_type
     int            literallen;        /* actual current string length */
     int            literalalloc;    /* current allocated buffer size */

+    int            state_before_str_stop;    /* start cond. before end quote */
     int            xcdepth;        /* depth of nesting in slash-star comments */
     char       *dolqstart;        /* current $foo$ quote start string */

diff --git a/src/interfaces/ecpg/preproc/ecpg.tokens b/src/interfaces/ecpg/preproc/ecpg.tokens
index 1d613af..749a914 100644
--- a/src/interfaces/ecpg/preproc/ecpg.tokens
+++ b/src/interfaces/ecpg/preproc/ecpg.tokens
@@ -24,4 +24,4 @@
                 S_TYPEDEF

 %token CSTRING CVARIABLE CPP_LINE IP
-%token DOLCONST ECONST NCONST UCONST UIDENT
+%token DOLCONST ECONST NCONST
diff --git a/src/interfaces/ecpg/preproc/ecpg.trailer b/src/interfaces/ecpg/preproc/ecpg.trailer
index f58b41e..efad0c0 100644
--- a/src/interfaces/ecpg/preproc/ecpg.trailer
+++ b/src/interfaces/ecpg/preproc/ecpg.trailer
@@ -1750,7 +1750,6 @@ ecpg_sconst:
             $$[strlen($1)+3]='\0';
             free($1);
         }
-        | UCONST    { $$ = $1; }
         | DOLCONST    { $$ = $1; }
         ;

@@ -1758,7 +1757,6 @@ ecpg_xconst:    XCONST        { $$ = make_name(); } ;

 ecpg_ident:    IDENT        { $$ = make_name(); }
         | CSTRING    { $$ = make3_str(mm_strdup("\""), $1, mm_strdup("\"")); }
-        | UIDENT    { $$ = $1; }
         ;

 quoted_ident_stringvar: name
diff --git a/src/interfaces/ecpg/preproc/parse.pl b/src/interfaces/ecpg/preproc/parse.pl
index 3619706..dc40b29 100644
--- a/src/interfaces/ecpg/preproc/parse.pl
+++ b/src/interfaces/ecpg/preproc/parse.pl
@@ -218,8 +218,8 @@ sub main
                 if ($a eq 'IDENT' && $prior eq '%nonassoc')
                 {

-                    # add two more tokens to the list
-                    $str = $str . "\n%nonassoc CSTRING\n%nonassoc UIDENT";
+                    # add one more token to the list
+                    $str = $str . "\n%nonassoc CSTRING";
                 }
                 $prior = $a;
             }
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 454071a..3cdf928 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -232,7 +232,7 @@ static    void            check_raise_parameters(PLpgSQL_stmt_raise *stmt);
  * Some of these are not directly referenced in this file, but they must be
  * here anyway.
  */
-%token <str>    IDENT FCONST SCONST BCONST XCONST Op
+%token <str>    IDENT UIDENT FCONST SCONST UCONST BCONST XCONST Op
 %token <ival>    ICONST PARAM
 %token            TYPECAST DOT_DOT COLON_EQUALS EQUALS_GREATER
 %token            LESS_EQUALS GREATER_EQUALS NOT_EQUALS
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index 6d96843..0716e4f 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -48,17 +48,17 @@ SELECT 'tricky' AS U&"\" UESCAPE '!';
 (1 row)

 SELECT U&'wrong: \061';
-ERROR:  invalid Unicode escape value at or near "\061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \061';
                          ^
 SELECT U&'wrong: \+0061';
-ERROR:  invalid Unicode escape value at or near "\+0061'"
+ERROR:  invalid Unicode escape value
 LINE 1: SELECT U&'wrong: \+0061';
                          ^
 SELECT U&'wrong: +0061' UESCAPE '+';
-ERROR:  invalid Unicode escape character at or near "+'"
+ERROR:  invalid Unicode escape character at or near "'+'"
 LINE 1: SELECT U&'wrong: +0061' UESCAPE '+';
-                                         ^
+                                        ^
 SET standard_conforming_strings TO off;
 SELECT U&'d\0061t\+000061' AS U&"d\0061t\+000061";
 ERROR:  unsafe use of string constant with Unicode escapes
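
For anyone who wants to sanity-check the surrogate arithmetic in the helpers that the patch moves into pg_wchar.h, here is a minimal standalone sketch. The three helper bodies are copied from the patch; the local pg_wchar typedef is an assumption made so the file builds outside the PostgreSQL tree, and decode_utf16_pair() is a hypothetical wrapper that is not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: local stand-in for PostgreSQL's pg_wchar type. */
typedef uint32_t pg_wchar;

/* High (leading) surrogate: U+D800..U+DBFF */
static inline bool
is_utf16_surrogate_first(pg_wchar c)
{
    return (c >= 0xD800 && c <= 0xDBFF);
}

/* Low (trailing) surrogate: U+DC00..U+DFFF */
static inline bool
is_utf16_surrogate_second(pg_wchar c)
{
    return (c >= 0xDC00 && c <= 0xDFFF);
}

/* Combine the low 10 bits of each half and add the 0x10000 offset. */
static inline pg_wchar
surrogate_pair_to_codepoint(pg_wchar first, pg_wchar second)
{
    return ((first & 0x3FF) << 10) + 0x10000 + (second & 0x3FF);
}

/*
 * Hypothetical wrapper, not in the patch: validate the ordering of the
 * two halves before combining them.  Returns false for an invalid pair
 * and leaves *result untouched in that case.
 */
static bool
decode_utf16_pair(pg_wchar first, pg_wchar second, pg_wchar *result)
{
    if (!is_utf16_surrogate_first(first) ||
        !is_utf16_surrogate_second(second))
        return false;
    *result = surrogate_pair_to_codepoint(first, second);
    return true;
}
```

Under these assumptions, decode_utf16_pair(0xD83D, 0xDE00, &cp) stores 0x1F600 in cp, while the same two values in the opposite order are rejected, which corresponds to the "invalid Unicode surrogate pair" error paths in the removed litbuf_udeescape().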
