Re: benchmarking Flex practices - Mailing list pgsql-hackers

From Tom Lane
Subject Re: benchmarking Flex practices
Date
Msg-id 26965.1562703338@sss.pgh.pa.us
Whole thread Raw
In response to Re: benchmarking Flex practices  (John Naylor <john.naylor@2ndquadrant.com>)
Responses Re: benchmarking Flex practices  (John Naylor <john.naylor@2ndquadrant.com>)
List pgsql-hackers
John Naylor <john.naylor@2ndquadrant.com> writes:
> [ v4 patches for trimming lexer table size ]

I reviewed this and it looks pretty solid.  One gripe I have is
that I think it's best to limit backup-prevention tokens such as
quotecontinuefail so that they match only exact prefixes of their
"success" tokens.  This seems clearer to me, and in at least some cases
it can save a few flex states.  The attached v5 patch does it like that
and gets us down to 22331 states (from 23696).  In some places it looks
like you did that to avoid writing an explicit "{other}" match rule for
an exclusive state, but I think it's better for readability and
separation of concerns to go ahead and have those explicit rules
(and it seems to make no difference table-size-wise).

I also made some cosmetic changes (mostly improving comments) and
smashed the patch series down to 1 patch, because I preferred to
review it that way and we're not really going to commit these
separately.

I did a little bit of portability testing, to the extent of verifying
that the oldest and newest Flex versions I have handy (2.5.33 and 2.6.4)
agree on the table size change and get through regression tests.  So
I think we should be good from that end.

We still need to propagate these changes into the psql and ecpg lexers,
but I assume you were waiting to agree on the core patch before touching
those.  If you're good with the changes I made here, have at it.

            regards, tom lane

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index e1cae85..899da09 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -168,12 +168,14 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
  *  <xd> delimited identifiers (double-quoted identifiers)
  *  <xh> hexadecimal numeric string
  *  <xq> standard quoted strings
+ *  <xqs> quote stop (detect continued strings)
  *  <xe> extended quoted strings (support backslash escape sequences)
  *  <xdolq> $foo$ quoted strings
  *  <xui> quoted identifier with Unicode escapes
- *  <xuiend> end of a quoted identifier with Unicode escapes, UESCAPE can follow
  *  <xus> quoted string with Unicode escapes
- *  <xusend> end of a quoted string with Unicode escapes, UESCAPE can follow
+ *  <xuend> end of a quoted string or identifier with Unicode escapes,
+ *    UESCAPE can follow
+ *  <xuchar> expecting escape character literal after UESCAPE
  *  <xeu> Unicode surrogate pair in extended quoted string
  *
  * Remember to add an <<EOF>> case whenever you add a new exclusive state!
@@ -185,12 +187,13 @@ extern void core_yyset_column(int column_no, yyscan_t yyscanner);
 %x xd
 %x xh
 %x xq
+%x xqs
 %x xe
 %x xdolq
 %x xui
-%x xuiend
 %x xus
-%x xusend
+%x xuend
+%x xuchar
 %x xeu

 /*
@@ -231,19 +234,18 @@ special_whitespace        ({space}+|{comment}{newline})
 horiz_whitespace        ({horiz_space}|{comment})
 whitespace_with_newline    ({horiz_whitespace}*{newline}{special_whitespace}*)

+quote            '
+/* If we see {quote} then {quotecontinue}, the quoted string continues */
+quotecontinue    {whitespace_with_newline}{quote}
+
 /*
- * To ensure that {quotecontinue} can be scanned without having to back up
- * if the full pattern isn't matched, we include trailing whitespace in
- * {quotestop}.  This matches all cases where {quotecontinue} fails to match,
- * except for {quote} followed by whitespace and just one "-" (not two,
- * which would start a {comment}).  To cover that we have {quotefail}.
- * The actions for {quotestop} and {quotefail} must throw back characters
- * beyond the quote proper.
+ * {quotecontinuefail} is needed to avoid lexer backup when we fail to match
+ * {quotecontinue}.  It might seem that this could just be {whitespace}*,
+ * but if there's a dash after {whitespace_with_newline}, it must be consumed
+ * to see if there's another dash --- which would start a {comment} and thus
+ * allow continuation of the {quotecontinue} token.
  */
-quote            '
-quotestop        {quote}{whitespace}*
-quotecontinue    {quote}{whitespace_with_newline}{quote}
-quotefail        {quote}{whitespace}*"-"
+quotecontinuefail    {whitespace}*"-"?

 /* Bit string
  * It is tempting to scan the string for only those characters
@@ -304,10 +306,15 @@ xdstop            {dquote}
 xddouble        {dquote}{dquote}
 xdinside        [^"]+

-/* Unicode escapes */
-uescape            [uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']{quote}
+/* Optional UESCAPE after a quoted string or identifier with Unicode escapes */
+uescape            [uU][eE][sS][cC][aA][pP][eE]
+/* error rule to avoid backup */
+uescapefail        [uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+
+/* escape character literal */
+uescchar        {quote}[^']{quote}
 /* error rule to avoid backup */
-uescapefail
[uU][eE][sS][cC][aA][pP][eE]{whitespace}*"-"|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}[^']|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*{quote}|[uU][eE][sS][cC][aA][pP][eE]{whitespace}*|[uU][eE][sS][cC][aA][pP]|[uU][eE][sS][cC][aA]|[uU][eE][sS][cC]|[uU][eE][sS]|[uU][eE]|[uU]
+uesccharfail    {quote}[^']|{quote}

 /* Quoted identifier with Unicode escapes */
 xuistart        [uU]&{dquote}
@@ -315,10 +322,6 @@ xuistart        [uU]&{dquote}
 /* Quoted string with Unicode escapes */
 xusstart        [uU]&{quote}

-/* Optional UESCAPE after a quoted string or identifier with Unicode escapes. */
-xustop1        {uescapefail}?
-xustop2        {uescape}
-
 /* error rule to avoid backup */
 xufailed        [uU]&

@@ -476,21 +479,10 @@ other            .
                     startlit();
                     addlitchar('b', yyscanner);
                 }
-<xb>{quotestop}    |
-<xb>{quotefail} {
-                    yyless(1);
-                    BEGIN(INITIAL);
-                    yylval->str = litbufdup(yyscanner);
-                    return BCONST;
-                }
 <xh>{xhinside}    |
 <xb>{xbinside}    {
                     addlit(yytext, yyleng, yyscanner);
                 }
-<xh>{quotecontinue}    |
-<xb>{quotecontinue}    {
-                    /* ignore */
-                }
 <xb><<EOF>>        { yyerror("unterminated bit string literal"); }

 {xhstart}        {
@@ -505,13 +497,6 @@ other            .
                     startlit();
                     addlitchar('x', yyscanner);
                 }
-<xh>{quotestop}    |
-<xh>{quotefail} {
-                    yyless(1);
-                    BEGIN(INITIAL);
-                    yylval->str = litbufdup(yyscanner);
-                    return XCONST;
-                }
 <xh><<EOF>>        { yyerror("unterminated hexadecimal string literal"); }

 {xnstart}        {
@@ -568,53 +553,71 @@ other            .
                     BEGIN(xus);
                     startlit();
                 }
-<xq,xe>{quotestop}    |
-<xq,xe>{quotefail} {
-                    yyless(1);
-                    BEGIN(INITIAL);
+
+<xb,xh,xq,xe,xus>{quote} {
                     /*
-                     * check that the data remains valid if it might have been
-                     * made invalid by unescaping any chars.
+                     * When we are scanning a quoted string and see an end
+                     * quote, we must look ahead for a possible continuation.
+                     * If we don't see one, we know the end quote was in fact
+                     * the end of the string.  To reduce the lexer table size,
+                     * we use a single "xqs" state to do the lookahead for all
+                     * types of strings.
                      */
-                    if (yyextra->saw_non_ascii)
-                        pg_verifymbstr(yyextra->literalbuf,
-                                       yyextra->literallen,
-                                       false);
-                    yylval->str = litbufdup(yyscanner);
-                    return SCONST;
-                }
-<xus>{quotestop} |
-<xus>{quotefail} {
-                    /* throw back all but the quote */
-                    yyless(1);
-                    /* xusend state looks for possible UESCAPE */
-                    BEGIN(xusend);
+                    yyextra->state_before_quote_stop = YYSTATE;
+                    BEGIN(xqs);
                 }
-<xusend>{whitespace} {
-                    /* stay in xusend state over whitespace */
+<xqs>{quotecontinue} {
+                    /*
+                     * Found a quote continuation, so return to the in-quote
+                     * state and continue scanning the literal.
+                     */
+                    BEGIN(yyextra->state_before_quote_stop);
                 }
-<xusend><<EOF>> |
-<xusend>{other} |
-<xusend>{xustop1} {
-                    /* no UESCAPE after the quote, throw back everything */
+<xqs>{quotecontinuefail} |
+<xqs>{other} |
+<xqs><<EOF>> {
+                    /*
+                     * Failed to see a quote continuation.  Throw back
+                     * everything after the end quote, and handle the string
+                     * according to the state we were in previously.
+                     */
                     yyless(0);
-                    BEGIN(INITIAL);
-                    yylval->str = litbuf_udeescape('\\', yyscanner);
-                    return SCONST;
-                }
-<xusend>{xustop2} {
-                    /* found UESCAPE after the end quote */
-                    BEGIN(INITIAL);
-                    if (!check_uescapechar(yytext[yyleng - 2]))
+
+                    switch (yyextra->state_before_quote_stop)
                     {
-                        SET_YYLLOC();
-                        ADVANCE_YYLLOC(yyleng - 2);
-                        yyerror("invalid Unicode escape character");
+                        case xb:
+                            BEGIN(INITIAL);
+                            yylval->str = litbufdup(yyscanner);
+                            return BCONST;
+                        case xh:
+                            BEGIN(INITIAL);
+                            yylval->str = litbufdup(yyscanner);
+                            return XCONST;
+                        case xe:
+                            /* fallthrough */
+                        case xq:
+                            BEGIN(INITIAL);
+
+                            /*
+                             * Check that the data remains valid if it
+                             * might have been made invalid by unescaping
+                             * any chars.
+                             */
+                            if (yyextra->saw_non_ascii)
+                                pg_verifymbstr(yyextra->literalbuf,
+                                               yyextra->literallen,
+                                               false);
+                            yylval->str = litbufdup(yyscanner);
+                            return SCONST;
+                        case xus:
+                            /* xuend state looks for possible UESCAPE */
+                            BEGIN(xuend);
+                            break;
+                        default:
+                            yyerror("unhandled previous state in xqs");
                     }
-                    yylval->str = litbuf_udeescape(yytext[yyleng - 2],
-                                                   yyscanner);
-                    return SCONST;
                 }
+
 <xq,xe,xus>{xqdouble} {
                     addlitchar('\'', yyscanner);
                 }
@@ -693,9 +696,6 @@ other            .
                     if (c == '\0' || IS_HIGHBIT_SET(c))
                         yyextra->saw_non_ascii = true;
                 }
-<xq,xe,xus>{quotecontinue} {
-                    /* ignore */
-                }
 <xe>.            {
                     /* This is only needed for \ just before EOF */
                     addlitchar(yytext[0], yyscanner);
@@ -770,53 +770,89 @@ other            .
                     return IDENT;
                 }
 <xui>{dquote} {
-                    yyless(1);
-                    /* xuiend state looks for possible UESCAPE */
-                    BEGIN(xuiend);
+                    /* xuend state looks for possible UESCAPE */
+                    yyextra->state_before_quote_stop = YYSTATE;
+                    BEGIN(xuend);
                 }
-<xuiend>{whitespace} {
-                    /* stay in xuiend state over whitespace */
+
+<xuend,xuchar>{whitespace} {
+                    /* stay in xuend or xuchar state over whitespace */
                 }
-<xuiend><<EOF>> |
-<xuiend>{other} |
-<xuiend>{xustop1} {
+<xuend>{uescapefail} |
+<xuend>{other} |
+<xuend><<EOF>> {
                     /* no UESCAPE after the quote, throw back everything */
-                    char       *ident;
-                    int            identlen;
-
                     yyless(0);

-                    BEGIN(INITIAL);
-                    if (yyextra->literallen == 0)
-                        yyerror("zero-length delimited identifier");
-                    ident = litbuf_udeescape('\\', yyscanner);
-                    identlen = strlen(ident);
-                    if (identlen >= NAMEDATALEN)
-                        truncate_identifier(ident, identlen, true);
-                    yylval->str = ident;
-                    return IDENT;
+                    if (yyextra->state_before_quote_stop == xus)
+                    {
+                        BEGIN(INITIAL);
+                        yylval->str = litbuf_udeescape('\\', yyscanner);
+                        return SCONST;
+                    }
+                    else if (yyextra->state_before_quote_stop == xui)
+                    {
+                        char       *ident;
+                        int            identlen;
+
+                        BEGIN(INITIAL);
+                        if (yyextra->literallen == 0)
+                            yyerror("zero-length delimited identifier");
+                        ident = litbuf_udeescape('\\', yyscanner);
+                        identlen = strlen(ident);
+                        if (identlen >= NAMEDATALEN)
+                            truncate_identifier(ident, identlen, true);
+                        yylval->str = ident;
+                        return IDENT;
+                    }
+                    else
+                        yyerror("unhandled previous state in xuend");
                 }
-<xuiend>{xustop2}    {
+<xuend>{uescape} {
                     /* found UESCAPE after the end quote */
-                    char       *ident;
-                    int            identlen;
-
-                    BEGIN(INITIAL);
-                    if (yyextra->literallen == 0)
-                        yyerror("zero-length delimited identifier");
+                    BEGIN(xuchar);
+                }
+<xuchar>{uescchar} {
+                    /* found escape character literal after UESCAPE */
                     if (!check_uescapechar(yytext[yyleng - 2]))
                     {
                         SET_YYLLOC();
                         ADVANCE_YYLLOC(yyleng - 2);
                         yyerror("invalid Unicode escape character");
                     }
-                    ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
-                    identlen = strlen(ident);
-                    if (identlen >= NAMEDATALEN)
-                        truncate_identifier(ident, identlen, true);
-                    yylval->str = ident;
-                    return IDENT;
+
+                    if (yyextra->state_before_quote_stop == xus)
+                    {
+                        BEGIN(INITIAL);
+                        yylval->str = litbuf_udeescape(yytext[yyleng - 2],
+                                                       yyscanner);
+                        return SCONST;
+                    }
+                    else if (yyextra->state_before_quote_stop == xui)
+                    {
+                        char       *ident;
+                        int            identlen;
+
+                        BEGIN(INITIAL);
+                        if (yyextra->literallen == 0)
+                            yyerror("zero-length delimited identifier");
+                        ident = litbuf_udeescape(yytext[yyleng - 2], yyscanner);
+                        identlen = strlen(ident);
+                        if (identlen >= NAMEDATALEN)
+                            truncate_identifier(ident, identlen, true);
+                        yylval->str = ident;
+                        return IDENT;
+                    }
+                    else
+                        yyerror("unhandled previous state in xuchar");
+                }
+<xuchar>{uesccharfail} |
+<xuchar>{other} |
+<xuchar><<EOF>> {
+                    SET_YYLLOC();
+                    yyerror("missing or invalid Unicode escape character");
                 }
+
 <xd,xui>{xddouble}    {
                     addlitchar('"', yyscanner);
                 }
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 731a2bd..72c2a28 100644
--- a/src/include/parser/scanner.h
+++ b/src/include/parser/scanner.h
@@ -99,6 +99,7 @@ typedef struct core_yy_extra_type
     int            literallen;        /* actual current string length */
     int            literalalloc;    /* current allocated buffer size */

+    int            state_before_quote_stop;    /* start cond. before end quote */
     int            xcdepth;        /* depth of nesting in slash-star comments */
     char       *dolqstart;        /* current $foo$ quote start string */


pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and KeyManagement Service (KMS)
Next
From: Joe Conway
Date:
Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and KeyManagement Service (KMS)