Re: Speeding up the Postgres lexer - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Speeding up the Postgres lexer
Date
Msg-id 200505231642.j4NGgA315728@candle.pha.pa.us
Whole thread Raw
In response to Speeding up the Postgres lexer  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
This seems fine.  I don't think the lexer changes enough for us to have
issues with new cases.  I think adding some comments to explain why we
are doing it is enough, and perhaps a test case that can be reproduced
later for testing.

---------------------------------------------------------------------------

Tom Lane wrote:
> I was doing some profiles recently that showed that on simple statements
> (like INSERT with a few column values) the basic parser (flex/bison) was
> taking up a noticeable percentage of the total CPU time.  We already
> have done some things to speed up the lexer, like use -CF option for
> large-but-fast tables.  I read in the flex manual about how you can make
> it faster yet if you can avoid the need for backup states.  It's not
> terribly well documented, but I eventually worked out how to do it, and
> find that it seems to make the lexer about a third faster.  Basically
> what you have to do is provide alternate rules that allow the lexer to
> match *some* rule on any prefix of a complete token.  For instance the
> rule for {real} implies that given
>     123.456E+
> if the next character is not a digit, then the {real} rule fails, and
> the lexer has to backtrack to
>     123.456
> which it can recognize as a {decimal}.  You can add a rule that does
> match in this situation, throws back the two unwanted characters with
> yyless(), and returns the number as the {decimal} rule would do.
> The next call will carry on lexing the "E" as an identifier, etc.
> (Note: this is going to end up being a syntax error because the SQL
> grammar never allows a number to immediately precede an identifier;
> but that's not the lexer's concern, and so I don't think that we should
> make the extra rule throw an error immediately.)
> 
> What I'm wondering is whether this is really worth doing or not.
> There are currently just two parts of the lexer rules that are affected
> --- the {real} rule illustrated above, and the rules that allow quoted
> strings to be split across lines as the SQL spec requires.  But the
> patches are still pretty ugly, and what's really annoying is that there
> doesn't seem to be any way to get flex to complain if someone later
> makes a change that breaks the no-backup-cases property again.
> 
> A prototype patch (completely undocumented :-() is attached.  Any
> thoughts about whether it's worth pressing forward with?
> 
>             regards, tom lane
> 
> *** src/backend/parser/scan.l.orig    Fri Mar 11 21:15:20 2005
> --- src/backend/parser/scan.l    Mon May 23 01:05:54 2005
> ***************
> *** 148,163 ****
>    * validate the contents.
>    */
>   xbstart            [bB]{quote}
> ! xbstop            {quote}
>   xbinside        [^']*
>   xbcat            {quote}{whitespace_with_newline}{quote}
>   
>   /* Hexadecimal number
>    */
>   xhstart            [xX]{quote}
> ! xhstop            {quote}
>   xhinside        [^']*
>   xhcat            {quote}{whitespace_with_newline}{quote}
>   
>   /* National character
>    */
> --- 148,165 ----
>    * validate the contents.
>    */
>   xbstart            [bB]{quote}
> ! xbstop            {quote}{whitespace}*
>   xbinside        [^']*
>   xbcat            {quote}{whitespace_with_newline}{quote}
> + xbconfused        {quote}{whitespace}*"-"
>   
>   /* Hexadecimal number
>    */
>   xhstart            [xX]{quote}
> ! xhstop            {quote}{whitespace}*
>   xhinside        [^']*
>   xhcat            {quote}{whitespace_with_newline}{quote}
> + xhconfused        {quote}{whitespace}*"-"
>   
>   /* National character
>    */
> ***************
> *** 169,180 ****
>    */
>   quote            '
>   xqstart            {quote}
> ! xqstop            {quote}
>   xqdouble        {quote}{quote}
>   xqinside        [^\\']+
>   xqescape        [\\][^0-7]
>   xqoctesc        [\\][0-7]{1,3}
>   xqcat            {quote}{whitespace_with_newline}{quote}
>   
>   /* $foo$ style quotes ("dollar quoting")
>    * The quoted string starts with $foo$ where "foo" is an optional string
> --- 171,183 ----
>    */
>   quote            '
>   xqstart            {quote}
> ! xqstop            {quote}{whitespace}*
>   xqdouble        {quote}{quote}
>   xqinside        [^\\']+
>   xqescape        [\\][^0-7]
>   xqoctesc        [\\][0-7]{1,3}
>   xqcat            {quote}{whitespace_with_newline}{quote}
> + xqconfused        {quote}{whitespace}*"-"
>   
>   /* $foo$ style quotes ("dollar quoting")
>    * The quoted string starts with $foo$ where "foo" is an optional string
> ***************
> *** 185,190 ****
> --- 188,194 ----
>   dolq_start        [A-Za-z\200-\377_]
>   dolq_cont        [A-Za-z\200-\377_0-9]
>   dolqdelim        \$({dolq_start}{dolq_cont}*)?\$
> + dolqfailed        \${dolq_start}{dolq_cont}*
>   dolqinside        [^$]+
>   
>   /* Double quote
> ***************
> *** 247,253 ****
>   
>   integer            {digit}+
>   decimal            (({digit}*\.{digit}+)|({digit}+\.{digit}*))
> ! real            ((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+))
>   
>   param            \${integer}
>   
> --- 251,259 ----
>   
>   integer            {digit}+
>   decimal            (({digit}*\.{digit}+)|({digit}+\.{digit}*))
> ! real            ({integer}|{decimal})[Ee][-+]?{digit}+
> ! realfail1        ({integer}|{decimal})[Ee]
> ! realfail2        ({integer}|{decimal})[Ee][-+]
>   
>   param            \${integer}
>   
> ***************
> *** 310,315 ****
> --- 316,325 ----
>                       /* ignore */
>                   }
>   
> + <xc>\*+            {
> +                     /* ignore */
> +                 }
> + 
>   <xc><<EOF>>        { yyerror("unterminated /* comment"); }
>   
>   {xbstart}        {
> ***************
> *** 329,334 ****
> --- 339,350 ----
>                       yylval.str = litbufdup();
>                       return BCONST;
>                   }
> + <xb>{xbconfused} {
> +                     yyless(yyleng-1);
> +                     BEGIN(INITIAL);
> +                     yylval.str = litbufdup();
> +                     return BCONST;
> +                 }
>   <xh>{xhinside}    |
>   <xb>{xbinside}    {
>                       addlit(yytext, yyleng);
> ***************
> *** 356,361 ****
> --- 372,383 ----
>                       yylval.str = litbufdup();
>                       return XCONST;
>                   }
> + <xh>{xhconfused} {
> +                     yyless(yyleng-1);
> +                     BEGIN(INITIAL);
> +                     yylval.str = litbufdup();
> +                     return XCONST;
> +                 }
>   <xh><<EOF>>        { yyerror("unterminated hexadecimal string literal"); }
>   
>   {xnstart}        {
> ***************
> *** 385,390 ****
> --- 407,418 ----
>                       yylval.str = litbufdup();
>                       return SCONST;
>                   }
> + <xq>{xqconfused} {
> +                     yyless(yyleng-1);
> +                     BEGIN(INITIAL);
> +                     yylval.str = litbufdup();
> +                     return SCONST;
> +                 }
>   <xq>{xqdouble}  {
>                       addlitchar('\'');
>                   }
> ***************
> *** 413,418 ****
> --- 441,447 ----
>                       BEGIN(xdolq);
>                       startlit();
>                   }
> + {dolqfailed}    { yyerror("invalid token"); }
>   <xdolq>{dolqdelim} {
>                       if (strcmp(yytext, dolqstart) == 0)
>                       {
> ***************
> *** 435,440 ****
> --- 464,472 ----
>   <xdolq>{dolqinside} {
>                       addlit(yytext, yyleng);
>                   }
> + <xdolq>{dolqfailed} {
> +                     addlit(yytext, yyleng);
> +                 }
>   <xdolq>.        {
>                       /* This is only needed for $ inside the quoted text */
>                       addlitchar(yytext[0]);
> ***************
> *** 576,582 ****
>                       yylval.str = pstrdup(yytext);
>                       return FCONST;
>                   }
> ! 
>   
>   {identifier}    {
>                       const ScanKeyword *keyword;
> --- 608,623 ----
>                       yylval.str = pstrdup(yytext);
>                       return FCONST;
>                   }
> ! {realfail1}        {
> !                     yyless(yyleng-1);
> !                     yylval.str = pstrdup(yytext);
> !                     return FCONST;
> !                 }
> ! {realfail2}        {
> !                     yyless(yyleng-2);
> !                     yylval.str = pstrdup(yytext);
> !                     return FCONST;
> !                 }
>   
>   {identifier}    {
>                       const ScanKeyword *keyword;
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
>                http://archives.postgresql.org
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Speeding up the Postgres lexer
Next
From: Stef
Date:
Subject: Obtaining Firing Statement clause in (pl/perlu) Trigger Function