Thread: Speeding up the Postgres lexer

Speeding up the Postgres lexer

From
Tom Lane
Date:
I was doing some profiles recently that showed that on simple statements
(like INSERT with a few column values) the basic parser (flex/bison) was
taking up a noticeable percentage of the total CPU time.  We already
have done some things to speed up the lexer, like use -CF option for
large-but-fast tables.  I read in the flex manual about how you can make
it faster yet if you can avoid the need for backup states.  It's not
terribly well documented, but I eventually worked out how to do it, and
find that it seems to make the lexer about a third faster.  Basically
what you have to do is provide alternate rules that allow the lexer to
match *some* rule on any prefix of a complete token.  For instance the
rule for {real} implies that given
	123.456E+
if the next character is not a digit, then the {real} rule fails, and
the lexer has to backtrack to
	123.456
which it can recognize as a {decimal}.  You can add a rule that does
match in this situation, throws back the two unwanted characters with
yyless(), and returns the number as the {decimal} rule would do.
The next call will carry on lexing the "E" as an identifier, etc.
(Note: this is going to end up being a syntax error because the SQL
grammar never allows a number to immediately precede an identifier;
but that's not the lexer's concern, and so I don't think that we should
make the extra rule throw an error immediately.)
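Distilled down, the trick for {real} looks roughly like this (a sketch only; the complete rules are in the attached patch):

```lex
integer         {digit}+
decimal         (({digit}*\.{digit}+)|({digit}+\.{digit}*))
real            ({integer}|{decimal})[Ee][-+]?{digit}+
/* prefixes of {real} that would otherwise force the lexer to back up */
realfail1       ({integer}|{decimal})[Ee]
realfail2       ({integer}|{decimal})[Ee][-+]

%%
{realfail1}     {
                    /* throw back the "E", return the number by itself */
                    yyless(yyleng - 1);
                    yylval.str = pstrdup(yytext);
                    return FCONST;
                }
{realfail2}     {
                    /* throw back the "E+" or "E-" */
                    yyless(yyleng - 2);
                    yylval.str = pstrdup(yytext);
                    return FCONST;
                }
```

Note that yyless() must be called before yytext is copied, since it adjusts yytext/yyleng to cover only the retained characters.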

What I'm wondering is whether this is really worth doing or not.
There are currently just two parts of the lexer rules that are affected
--- the {real} rule illustrated above, and the rules that allow quoted
strings to be split across lines as the SQL spec requires.  But the
patches are still pretty ugly, and what's really annoying is that there
doesn't seem to be any way to get flex to complain if someone later
makes a change that breaks the no-backup-cases property again.

A prototype patch (completely undocumented :-() is attached.  Any
thoughts about whether it's worth pressing forward with?
        regards, tom lane

*** src/backend/parser/scan.l.orig    Fri Mar 11 21:15:20 2005
--- src/backend/parser/scan.l    Mon May 23 01:05:54 2005
***************
*** 148,163 ****
   * validate the contents.
   */
  xbstart            [bB]{quote}
! xbstop            {quote}
  xbinside        [^']*
  xbcat            {quote}{whitespace_with_newline}{quote}
  
  /* Hexadecimal number
   */
  xhstart            [xX]{quote}
! xhstop            {quote}
  xhinside        [^']*
  xhcat            {quote}{whitespace_with_newline}{quote}
  
  /* National character
   */
--- 148,165 ----
   * validate the contents.
   */
  xbstart            [bB]{quote}
! xbstop            {quote}{whitespace}*
  xbinside        [^']*
  xbcat            {quote}{whitespace_with_newline}{quote}
+ xbconfused        {quote}{whitespace}*"-"
  
  /* Hexadecimal number
   */
  xhstart            [xX]{quote}
! xhstop            {quote}{whitespace}*
  xhinside        [^']*
  xhcat            {quote}{whitespace_with_newline}{quote}
+ xhconfused        {quote}{whitespace}*"-"
  
  /* National character
   */
***************
*** 169,180 ****
   */
  quote            '
  xqstart            {quote}
! xqstop            {quote}
  xqdouble        {quote}{quote}
  xqinside        [^\\']+
  xqescape        [\\][^0-7]
  xqoctesc        [\\][0-7]{1,3}
  xqcat            {quote}{whitespace_with_newline}{quote}
  
  /* $foo$ style quotes ("dollar quoting")
   * The quoted string starts with $foo$ where "foo" is an optional string
--- 171,183 ----
   */
  quote            '
  xqstart            {quote}
! xqstop            {quote}{whitespace}*
  xqdouble        {quote}{quote}
  xqinside        [^\\']+
  xqescape        [\\][^0-7]
  xqoctesc        [\\][0-7]{1,3}
  xqcat            {quote}{whitespace_with_newline}{quote}
+ xqconfused        {quote}{whitespace}*"-"
  
  /* $foo$ style quotes ("dollar quoting")
   * The quoted string starts with $foo$ where "foo" is an optional string
***************
*** 185,190 ****
--- 188,194 ----
  dolq_start        [A-Za-z\200-\377_]
  dolq_cont        [A-Za-z\200-\377_0-9]
  dolqdelim        \$({dolq_start}{dolq_cont}*)?\$
+ dolqfailed        \${dolq_start}{dolq_cont}*
  dolqinside        [^$]+
  
  /* Double quote
***************
*** 247,253 ****
  
  integer            {digit}+
  decimal            (({digit}*\.{digit}+)|({digit}+\.{digit}*))
! real            ((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+))
  
  param            \${integer}
  
--- 251,259 ----
  
  integer            {digit}+
  decimal            (({digit}*\.{digit}+)|({digit}+\.{digit}*))
! real            ({integer}|{decimal})[Ee][-+]?{digit}+
! realfail1        ({integer}|{decimal})[Ee]
! realfail2        ({integer}|{decimal})[Ee][-+]
  
  param            \${integer}
  
***************
*** 310,315 ****
--- 316,325 ----
                      /* ignore */
                  }
  
+ <xc>\*+            {
+                     /* ignore */
+                 }
+ 
  <xc><<EOF>>        { yyerror("unterminated /* comment"); }
  
  {xbstart}        {
***************
*** 329,334 ****
--- 339,350 ----
                      yylval.str = litbufdup();
                      return BCONST;
                  }
+ <xb>{xbconfused} {
+                     yyless(yyleng-1);
+                     BEGIN(INITIAL);
+                     yylval.str = litbufdup();
+                     return BCONST;
+                 }
  <xh>{xhinside}    |
  <xb>{xbinside}    {
                      addlit(yytext, yyleng);
***************
*** 356,361 ****
--- 372,383 ----
                      yylval.str = litbufdup();
                      return XCONST;
                  }
+ <xh>{xhconfused} {
+                     yyless(yyleng-1);
+                     BEGIN(INITIAL);
+                     yylval.str = litbufdup();
+                     return XCONST;
+                 }
  <xh><<EOF>>        { yyerror("unterminated hexadecimal string literal"); }
  
  {xnstart}        {
***************
*** 385,390 ****
--- 407,418 ----
                      yylval.str = litbufdup();
                      return SCONST;
                  }
+ <xq>{xqconfused} {
+                     yyless(yyleng-1);
+                     BEGIN(INITIAL);
+                     yylval.str = litbufdup();
+                     return SCONST;
+                 }
  <xq>{xqdouble}  {
                      addlitchar('\'');
                  }
***************
*** 413,418 ****
--- 441,447 ----
                      BEGIN(xdolq);
                      startlit();
                  }
+ {dolqfailed}    { yyerror("invalid token"); }
  <xdolq>{dolqdelim} {
                      if (strcmp(yytext, dolqstart) == 0)
                      {
***************
*** 435,440 ****
--- 464,472 ----
  <xdolq>{dolqinside} {
                      addlit(yytext, yyleng);
                  }
+ <xdolq>{dolqfailed} {
+                     addlit(yytext, yyleng);
+                 }
  <xdolq>.        {
                      /* This is only needed for $ inside the quoted text */
                      addlitchar(yytext[0]);
***************
*** 576,582 ****
                      yylval.str = pstrdup(yytext);
                      return FCONST;
                  }
! 
  
  {identifier}    {
                      const ScanKeyword *keyword;
--- 608,623 ----
                      yylval.str = pstrdup(yytext);
                      return FCONST;
                  }
! {realfail1}        {
!                     yyless(yyleng-1);
!                     yylval.str = pstrdup(yytext);
!                     return FCONST;
!                 }
! {realfail2}        {
!                     yyless(yyleng-2);
!                     yylval.str = pstrdup(yytext);
!                     return FCONST;
!                 }
  
  {identifier}    {
                      const ScanKeyword *keyword;


Re: Speeding up the Postgres lexer

From
Bruce Momjian
Date:
This seems fine.  I don't think the lexer changes enough for us to have
issues with new cases.  I think adding some comments to explain why we
are doing it is enough, and perhaps a test case that can be reproduced
later for testing.

---------------------------------------------------------------------------

Tom Lane wrote:
> I was doing some profiles recently that showed that on simple statements
> (like INSERT with a few column values) the basic parser (flex/bison) was
> taking up a noticeable percentage of the total CPU time.  We already
> have done some things to speed up the lexer, like use -CF option for
> large-but-fast tables.  I read in the flex manual about how you can make
> it faster yet if you can avoid the need for backup states.  It's not
> terribly well documented, but I eventually worked out how to do it, and
> find that it seems to make the lexer about a third faster.  Basically
> what you have to do is provide alternate rules that allow the lexer to
> match *some* rule on any prefix of a complete token.  For instance the
> rule for {real} implies that given
>     123.456E+
> if the next character is not a digit, then the {real} rule fails, and
> the lexer has to backtrack to
>     123.456
> which it can recognize as a {decimal}.  You can add a rule that does
> match in this situation, throws back the two unwanted characters with
> yyless(), and returns the number as the {decimal} rule would do.
> The next call will carry on lexing the "E" as an identifier, etc.
> (Note: this is going to end up being a syntax error because the SQL
> grammar never allows a number to immediately precede an identifier;
> but that's not the lexer's concern, and so I don't think that we should
> make the extra rule throw an error immediately.)
> 
> What I'm wondering is whether this is really worth doing or not.
> There are currently just two parts of the lexer rules that are affected
> --- the {real} rule illustrated above, and the rules that allow quoted
> strings to be split across lines as the SQL spec requires.  But the
> patches are still pretty ugly, and what's really annoying is that there
> doesn't seem to be any way to get flex to complain if someone later
> makes a change that breaks the no-backup-cases property again.
> 
> A prototype patch (completely undocumented :-() is attached.  Any
> thoughts about whether it's worth pressing forward with?
> 
>             regards, tom lane
> 
> *** src/backend/parser/scan.l.orig    Fri Mar 11 21:15:20 2005
> --- src/backend/parser/scan.l    Mon May 23 01:05:54 2005
> ***************
> *** 148,163 ****
>    * validate the contents.
>    */
>   xbstart            [bB]{quote}
> ! xbstop            {quote}
>   xbinside        [^']*
>   xbcat            {quote}{whitespace_with_newline}{quote}
>   
>   /* Hexadecimal number
>    */
>   xhstart            [xX]{quote}
> ! xhstop            {quote}
>   xhinside        [^']*
>   xhcat            {quote}{whitespace_with_newline}{quote}
>   
>   /* National character
>    */
> --- 148,165 ----
>    * validate the contents.
>    */
>   xbstart            [bB]{quote}
> ! xbstop            {quote}{whitespace}*
>   xbinside        [^']*
>   xbcat            {quote}{whitespace_with_newline}{quote}
> + xbconfused        {quote}{whitespace}*"-"
>   
>   /* Hexadecimal number
>    */
>   xhstart            [xX]{quote}
> ! xhstop            {quote}{whitespace}*
>   xhinside        [^']*
>   xhcat            {quote}{whitespace_with_newline}{quote}
> + xhconfused        {quote}{whitespace}*"-"
>   
>   /* National character
>    */
> ***************
> *** 169,180 ****
>    */
>   quote            '
>   xqstart            {quote}
> ! xqstop            {quote}
>   xqdouble        {quote}{quote}
>   xqinside        [^\\']+
>   xqescape        [\\][^0-7]
>   xqoctesc        [\\][0-7]{1,3}
>   xqcat            {quote}{whitespace_with_newline}{quote}
>   
>   /* $foo$ style quotes ("dollar quoting")
>    * The quoted string starts with $foo$ where "foo" is an optional string
> --- 171,183 ----
>    */
>   quote            '
>   xqstart            {quote}
> ! xqstop            {quote}{whitespace}*
>   xqdouble        {quote}{quote}
>   xqinside        [^\\']+
>   xqescape        [\\][^0-7]
>   xqoctesc        [\\][0-7]{1,3}
>   xqcat            {quote}{whitespace_with_newline}{quote}
> + xqconfused        {quote}{whitespace}*"-"
>   
>   /* $foo$ style quotes ("dollar quoting")
>    * The quoted string starts with $foo$ where "foo" is an optional string
> ***************
> *** 185,190 ****
> --- 188,194 ----
>   dolq_start        [A-Za-z\200-\377_]
>   dolq_cont        [A-Za-z\200-\377_0-9]
>   dolqdelim        \$({dolq_start}{dolq_cont}*)?\$
> + dolqfailed        \${dolq_start}{dolq_cont}*
>   dolqinside        [^$]+
>   
>   /* Double quote
> ***************
> *** 247,253 ****
>   
>   integer            {digit}+
>   decimal            (({digit}*\.{digit}+)|({digit}+\.{digit}*))
> ! real            ((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+))
>   
>   param            \${integer}
>   
> --- 251,259 ----
>   
>   integer            {digit}+
>   decimal            (({digit}*\.{digit}+)|({digit}+\.{digit}*))
> ! real            ({integer}|{decimal})[Ee][-+]?{digit}+
> ! realfail1        ({integer}|{decimal})[Ee]
> ! realfail2        ({integer}|{decimal})[Ee][-+]
>   
>   param            \${integer}
>   
> ***************
> *** 310,315 ****
> --- 316,325 ----
>                       /* ignore */
>                   }
>   
> + <xc>\*+            {
> +                     /* ignore */
> +                 }
> + 
>   <xc><<EOF>>        { yyerror("unterminated /* comment"); }
>   
>   {xbstart}        {
> ***************
> *** 329,334 ****
> --- 339,350 ----
>                       yylval.str = litbufdup();
>                       return BCONST;
>                   }
> + <xb>{xbconfused} {
> +                     yyless(yyleng-1);
> +                     BEGIN(INITIAL);
> +                     yylval.str = litbufdup();
> +                     return BCONST;
> +                 }
>   <xh>{xhinside}    |
>   <xb>{xbinside}    {
>                       addlit(yytext, yyleng);
> ***************
> *** 356,361 ****
> --- 372,383 ----
>                       yylval.str = litbufdup();
>                       return XCONST;
>                   }
> + <xh>{xhconfused} {
> +                     yyless(yyleng-1);
> +                     BEGIN(INITIAL);
> +                     yylval.str = litbufdup();
> +                     return XCONST;
> +                 }
>   <xh><<EOF>>        { yyerror("unterminated hexadecimal string literal"); }
>   
>   {xnstart}        {
> ***************
> *** 385,390 ****
> --- 407,418 ----
>                       yylval.str = litbufdup();
>                       return SCONST;
>                   }
> + <xq>{xqconfused} {
> +                     yyless(yyleng-1);
> +                     BEGIN(INITIAL);
> +                     yylval.str = litbufdup();
> +                     return SCONST;
> +                 }
>   <xq>{xqdouble}  {
>                       addlitchar('\'');
>                   }
> ***************
> *** 413,418 ****
> --- 441,447 ----
>                       BEGIN(xdolq);
>                       startlit();
>                   }
> + {dolqfailed}    { yyerror("invalid token"); }
>   <xdolq>{dolqdelim} {
>                       if (strcmp(yytext, dolqstart) == 0)
>                       {
> ***************
> *** 435,440 ****
> --- 464,472 ----
>   <xdolq>{dolqinside} {
>                       addlit(yytext, yyleng);
>                   }
> + <xdolq>{dolqfailed} {
> +                     addlit(yytext, yyleng);
> +                 }
>   <xdolq>.        {
>                       /* This is only needed for $ inside the quoted text */
>                       addlitchar(yytext[0]);
> ***************
> *** 576,582 ****
>                       yylval.str = pstrdup(yytext);
>                       return FCONST;
>                   }
> ! 
>   
>   {identifier}    {
>                       const ScanKeyword *keyword;
> --- 608,623 ----
>                       yylval.str = pstrdup(yytext);
>                       return FCONST;
>                   }
> ! {realfail1}        {
> !                     yyless(yyleng-1);
> !                     yylval.str = pstrdup(yytext);
> !                     return FCONST;
> !                 }
> ! {realfail2}        {
> !                     yyless(yyleng-2);
> !                     yylval.str = pstrdup(yytext);
> !                     return FCONST;
> !                 }
>   
>   {identifier}    {
>                       const ScanKeyword *keyword;
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
>                http://archives.postgresql.org
> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Speeding up the Postgres lexer

From
Simon Riggs
Date:
On Mon, 2005-05-23 at 12:31 -0400, Tom Lane wrote:
> doesn't seem to be any way to get flex to complain if someone later
> makes a change that breaks the no-backup-cases property again.

After some digging: flex has a -b option that will generate a file called
lex.backup if any backup states exist. The file is said to contain
information that would help you remove those backup states.

It seems straightforward to test for the existence of that file in the
build process? Or perhaps add a special test option, --enable-flextest,
to perform the check during the build.

Best Regards, Simon Riggs



Re: Speeding up the Postgres lexer

From
Andrew Dunstan
Date:

Tom Lane wrote:

>[snip - flex is slowed down by backtracking  - how to fix ]
>
>What I'm wondering is whether this is really worth doing or not.
>There are currently just two parts of the lexer rules that are affected
>--- the {real} rule illustrated above, and the rules that allow quoted
>strings to be split across lines as the SQL spec requires.  But the
>patches are still pretty ugly, and what's really annoying is that there
>doesn't seem to be any way to get flex to complain if someone later
>makes a change that breaks the no-backup-cases property again.
>
>  
>


I would be more concerned if there were not reasonable alternatives to 
many bulk parsed inserts (COPY, prepared statement).

But I do think it's worth it, even so ... not all client interfaces 
support prepared statements (notoriously PHP, although I understand KL 
has sent patches to fix that) and not all inserts are suitable for COPY.

cheers

andrew


Re: Speeding up the Postgres lexer

From
Christopher Kings-Lynne
Date:
> What I'm wondering is whether this is really worth doing or not.
> There are currently just two parts of the lexer rules that are affected
> --- the {real} rule illustrated above, and the rules that allow quoted
> strings to be split across lines as the SQL spec requires.  But the
> patches are still pretty ugly, and what's really annoying is that there
> doesn't seem to be any way to get flex to complain if someone later
> makes a change that breaks the no-backup-cases property again.

I was just thinking that if there's not a switch, it's prone to error again.

However, the lexer isn't touched anywhere near as much as the grammar is,
right?  So just put a large comment/warning/reminder at the top to test
for non-backup states.

I'm definitely in favour of a 1/3 speedup of the lexer.

Chris


Re: Speeding up the Postgres lexer

From
Christopher Kings-Lynne
Date:
> But I do think it's worth it, even so ... not all client interfaces 
> support prepared statements (notoriously PHP, although I understand KL 
> has sent patches to fix that) and not all inserts are suitable for COPY.

There is now pg_prepare/pg_execute/pg_query_params in PHP, however you 
could always have just used straight SQL PREPARE and EXECUTE commands.

Chris


Re: Speeding up the Postgres lexer

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2005-05-23 at 12:31 -0400, Tom Lane wrote:
>> doesn't seem to be any way to get flex to complain if someone later
>> makes a change that breaks the no-backup-cases property again.

> After some digging: flex has a -b option that will generate a file called
> lex.backup if any backup states exist. The file is said to contain
> information that would help you remove those backup states.

Right, reading that file is how I learned to fix it ...

> It seems straightforward to test for the existence of that file in the
> build process? Or perhaps add a special test state --enable-flextest to
> perform the test during the build.

Well, the problem is that the "success" state isn't that the file
doesn't exist or is empty or anything so easy as that.  What you
want is for it to say
	No backing up.

and it seems a tad too ugly to code such a test into the makefiles.
I'd not want to bet that the success message is spelled the same way
in every flex version, or is locale-independent.
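
Concretely, the check would have to be something like this sketch (the
file contents below are synthetic; the exact wording of flex's messages
is precisely the fragile part):

```shell
# flex -b writes its findings to lex.backup; a scanner with no backup
# states yields a file whose sole content is the line "No backing up."
check_lex_backup() {
    [ "$(cat "$1")" = "No backing up." ]
}

# Demonstrate against synthetic files (real lex.backup output varies):
printf 'No backing up.\n' > clean.backup
printf '5 backing-up (non-accepting) states.\n' > dirty.backup

check_lex_backup clean.backup && echo "clean scanner: ok"
check_lex_backup dirty.backup || echo "dirty scanner: backup states detected"
```

That string comparison against flex's output is exactly the kind of thing
that seems too fragile to hard-wire into the makefiles.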

Based on the comments so far in this thread, I'll go ahead and commit
the patch, with some comments attached of course --- in particular a big
head comment to run flex with -b and see that lex.backup says something
to this effect.
        regards, tom lane


Re: Speeding up the Postgres lexer

From
Christopher Kings-Lynne
Date:
> Based on the comments so far in this thread, I'll go ahead and commit
> the patch, with some comments attached of course --- in particular a big
> head comment to run flex with -b and see that lex.backup says something
> to this effect.

Add it to the release check-list.

Chris


Re: Speeding up the Postgres lexer

From
Tom Lane
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:
> However, the lexer isn't touched anywhere near as much as the grammar is 
>   right?

Yeah --- if you look at the CVS history, changes that affect the flex
rules (and not just the text of the C-code actions) are really rare
these days.  If there were a lot of churn I'd think this proposal was
completely impractical, but there's not much anymore.
        regards, tom lane


Re: Speeding up the Postgres lexer

From
Simon Riggs
Date:
On Mon, 2005-05-23 at 22:43 -0400, Tom Lane wrote:
> Based on the comments so far in this thread, I'll go ahead and commit
> the patch, with some comments attached of course --- in particular a big
> head comment to run flex with -b and see that lex.backup says something
> to this effect.

flex counts the number of times it needs to back up in a variable called
num_backing_up. It tests this to see whether to generate code
differently. ISTM you could have a -x switch to make it generate an
error if num_backing_up > 0.

Would we use the -x switch if we had it? If so I'll submit a patch here
first, then on to GNU. It'll take a while before it comes back round to
us though.

Best Regards, Simon Riggs



Re: Speeding up the Postgres lexer

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Would we use the -x switch if we had it?

Dunno.  Depending on such a thing would require depending on a new flex
version, and seeing that the flex guys haven't put out a new release
since the badly broken 2.5.31 more than 2 years ago, I wouldn't hold
my breath waiting for one we can use.
        regards, tom lane


Re: Speeding up the Postgres lexer

From
Neil Conway
Date:
Tom Lane wrote:
> Dunno.  Depending on such a thing would require depending on a new flex
> version, and seeing that the flex guys haven't put out a new release
> since the badly broken 2.5.31 more than 2 years ago, I wouldn't hold
> my breath waiting for one we can use.

It should be easy enough to check for the presence of the flag at 
configure-time and only use it if flex implements it. Prior to reading 
Simon's message I was actually thinking of hacking up the same thing :), 
so I agree it is probably worth doing.

-Neil


Re: Speeding up the Postgres lexer

From
Simon Riggs
Date:
On Wed, 2005-05-25 at 11:28 +1000, Neil Conway wrote:
> Tom Lane wrote:
> > Dunno.  Depending on such a thing would require depending on a new flex
> > version, and seeing that the flex guys haven't put out a new release
> > since the badly broken 2.5.31 more than 2 years ago, I wouldn't hold
> > my breath waiting for one we can use.
> 
> It should be easy enough to check for the presence of the flag at 
> configure-time and only use it if flex implements it. Prior to reading 
> Simon's message I was actually thinking of hacking up the same thing :), 
> so I agree it is probably worth doing.

Neil,

I'll be doing this after the beta freeze is announced. If you wanted to
do this before then, I wouldn't stop you.

Best Regards, Simon Riggs