Speeding up the Postgres lexer - Mailing list pgsql-hackers

I was doing some profiles recently that showed that on simple statements
(like INSERT with a few column values) the basic parser (flex/bison) was
taking up a noticeable percentage of the total CPU time.  We already
have done some things to speed up the lexer, like use -CF option for
large-but-fast tables.  I read in the flex manual about how you can make
it faster yet if you can avoid the need for backup states.  It's not
terribly well documented, but I eventually worked out how to do it, and
find that it seems to make the lexer about a third faster.  Basically
what you have to do is provide alternate rules that allow the lexer to
match *some* rule on any prefix of a complete token.  For instance the
rule for {real} implies that given123.456E+
if the next character is not a digit, then the {real} rule fails, and
the lexer has to backtrack to123.456
which it can recognize as a {decimal}.  You can add a rule that does
match in this situation, throws back the two unwanted characters with
yyless(), and returns the number as the {decimal} rule would do.
The next call will carry on lexing the "E" as an identifier, etc.
(Note: this is going to end up being a syntax error because the SQL
grammar never allows a number to immediately precede an identifier;
but that's not the lexer's concern, and so I don't think that we should
make the extra rule throw an error immediately.)

What I'm wondering is whether this is really worth doing or not.
There are currently just two parts of the lexer rules that are affected
--- the {real} rule illustrated above, and the rules that allow quoted
strings to be split across lines as the SQL spec requires.  But the
patches are still pretty ugly, and what's really annoying is that there
doesn't seem to be any way to get flex to complain if someone later
makes a change that breaks the no-backup-cases property again.

A prototype patch (completely undocumented :-() is attached.  Any
thoughts about whether it's worth pressing forward with?
        regards, tom lane

*** src/backend/parser/scan.l.orig    Fri Mar 11 21:15:20 2005
--- src/backend/parser/scan.l    Mon May 23 01:05:54 2005
***************
*** 148,163 ****  * validate the contents.  */ xbstart            [bB]{quote}
! xbstop            {quote} xbinside        [^']* xbcat            {quote}{whitespace_with_newline}{quote}  /*
Hexadecimalnumber  */ xhstart            [xX]{quote}
 
! xhstop            {quote} xhinside        [^']* xhcat            {quote}{whitespace_with_newline}{quote}  /* National
character */
 
--- 148,165 ----  * validate the contents.  */ xbstart            [bB]{quote}
! xbstop            {quote}{whitespace}* xbinside        [^']* xbcat
{quote}{whitespace_with_newline}{quote}
+ xbconfused        {quote}{whitespace}*"-"  /* Hexadecimal number  */ xhstart            [xX]{quote}
! xhstop            {quote}{whitespace}* xhinside        [^']* xhcat
{quote}{whitespace_with_newline}{quote}
+ xhconfused        {quote}{whitespace}*"-"  /* National character  */
***************
*** 169,180 ****  */ quote            ' xqstart            {quote}
! xqstop            {quote} xqdouble        {quote}{quote} xqinside        [^\\']+ xqescape        [\\][^0-7] xqoctesc
     [\\][0-7]{1,3} xqcat            {quote}{whitespace_with_newline}{quote}  /* $foo$ style quotes ("dollar quoting")
*The quoted string starts with $foo$ where "foo" is an optional string
 
--- 171,183 ----  */ quote            ' xqstart            {quote}
! xqstop            {quote}{whitespace}* xqdouble        {quote}{quote} xqinside        [^\\']+ xqescape
[\\][^0-7]xqoctesc        [\\][0-7]{1,3} xqcat            {quote}{whitespace_with_newline}{quote}
 
+ xqconfused        {quote}{whitespace}*"-"  /* $foo$ style quotes ("dollar quoting")  * The quoted string starts with
$foo$where "foo" is an optional string
 
***************
*** 185,190 ****
--- 188,194 ---- dolq_start        [A-Za-z\200-\377_] dolq_cont        [A-Za-z\200-\377_0-9] dolqdelim
\$({dolq_start}{dolq_cont}*)?\$
+ dolqfailed        \${dolq_start}{dolq_cont}* dolqinside        [^$]+  /* Double quote
***************
*** 247,253 ****  integer            {digit}+ decimal            (({digit}*\.{digit}+)|({digit}+\.{digit}*))
! real            ((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+))  param
\${integer}
 
--- 251,259 ----  integer            {digit}+ decimal            (({digit}*\.{digit}+)|({digit}+\.{digit}*))
! real            ({integer}|{decimal})[Ee][-+]?{digit}+
! realfail1        ({integer}|{decimal})[Ee]
! realfail2        ({integer}|{decimal})[Ee][-+]  param            \${integer} 
***************
*** 310,315 ****
--- 316,325 ----                     /* ignore */                 } 
+ <xc>\*+            {
+                     /* ignore */
+                 }
+  <xc><<EOF>>        { yyerror("unterminated /* comment"); }  {xbstart}        {
***************
*** 329,334 ****
--- 339,350 ----                     yylval.str = litbufdup();                     return BCONST;                 }
+ <xb>{xbconfused} {
+                     yyless(yyleng-1);
+                     BEGIN(INITIAL);
+                     yylval.str = litbufdup();
+                     return BCONST;
+                 } <xh>{xhinside}    | <xb>{xbinside}    {                     addlit(yytext, yyleng);
***************
*** 356,361 ****
--- 372,383 ----                     yylval.str = litbufdup();                     return XCONST;                 }
+ <xh>{xhconfused} {
+                     yyless(yyleng-1);
+                     BEGIN(INITIAL);
+                     yylval.str = litbufdup();
+                     return XCONST;
+                 } <xh><<EOF>>        { yyerror("unterminated hexadecimal string literal"); }  {xnstart}        {
***************
*** 385,390 ****
--- 407,418 ----                     yylval.str = litbufdup();                     return SCONST;                 }
+ <xq>{xqconfused} {
+                     yyless(yyleng-1);
+                     BEGIN(INITIAL);
+                     yylval.str = litbufdup();
+                     return SCONST;
+                 } <xq>{xqdouble}  {                     addlitchar('\'');                 }
***************
*** 413,418 ****
--- 441,447 ----                     BEGIN(xdolq);                     startlit();                 }
+ {dolqfailed}    { yyerror("invalid token"); } <xdolq>{dolqdelim} {                     if (strcmp(yytext, dolqstart)
==0)                     {
 
***************
*** 435,440 ****
--- 464,472 ---- <xdolq>{dolqinside} {                     addlit(yytext, yyleng);                 }
+ <xdolq>{dolqfailed} {
+                     addlit(yytext, yyleng);
+                 } <xdolq>.        {                     /* This is only needed for $ inside the quoted text */
            addlitchar(yytext[0]);
 
***************
*** 576,582 ****                     yylval.str = pstrdup(yytext);                     return FCONST;
}
!   {identifier}    {                     const ScanKeyword *keyword;
--- 608,623 ----                     yylval.str = pstrdup(yytext);                     return FCONST;
}
! {realfail1}        {
!                     yyless(yyleng-1);
!                     yylval.str = pstrdup(yytext);
!                     return FCONST;
!                 }
! {realfail2}        {
!                     yyless(yyleng-2);
!                     yylval.str = pstrdup(yytext);
!                     return FCONST;
!                 }  {identifier}    {                     const ScanKeyword *keyword;


pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: inet increment w/ int8
Next
From: Bruce Momjian
Date:
Subject: Re: Speeding up the Postgres lexer