Thread: Speeding up the Postgres lexer
I was doing some profiles recently that showed that on simple statements (like INSERT with a few column values) the basic parser (flex/bison) was taking up a noticeable percentage of the total CPU time. We already have done some things to speed up the lexer, like use -CF option for large-but-fast tables. I read in the flex manual about how you can make it faster yet if you can avoid the need for backup states. It's not terribly well documented, but I eventually worked out how to do it, and find that it seems to make the lexer about a third faster. Basically what you have to do is provide alternate rules that allow the lexer to match *some* rule on any prefix of a complete token. For instance the rule for {real} implies that given123.456E+ if the next character is not a digit, then the {real} rule fails, and the lexer has to backtrack to123.456 which it can recognize as a {decimal}. You can add a rule that does match in this situation, throws back the two unwanted characters with yyless(), and returns the number as the {decimal} rule would do. The next call will carry on lexing the "E" as an identifier, etc. (Note: this is going to end up being a syntax error because the SQL grammar never allows a number to immediately precede an identifier; but that's not the lexer's concern, and so I don't think that we should make the extra rule throw an error immediately.) What I'm wondering is whether this is really worth doing or not. There are currently just two parts of the lexer rules that are affected --- the {real} rule illustrated above, and the rules that allow quoted strings to be split across lines as the SQL spec requires. But the patches are still pretty ugly, and what's really annoying is that there doesn't seem to be any way to get flex to complain if someone later makes a change that breaks the no-backup-cases property again. A prototype patch (completely undocumented :-() is attached. Any thoughts about whether it's worth pressing forward with? regards, tom lane *** src/backend/parser/scan.l.orig Fri Mar 11 21:15:20 2005 --- src/backend/parser/scan.l Mon May 23 01:05:54 2005 *************** *** 148,163 **** * validate the contents. */ xbstart [bB]{quote} ! xbstop {quote} xbinside [^']* xbcat {quote}{whitespace_with_newline}{quote} /* Hexadecimalnumber */ xhstart [xX]{quote} ! xhstop {quote} xhinside [^']* xhcat {quote}{whitespace_with_newline}{quote} /* National character */ --- 148,165 ---- * validate the contents. */ xbstart [bB]{quote} ! xbstop {quote}{whitespace}* xbinside [^']* xbcat {quote}{whitespace_with_newline}{quote} + xbconfused {quote}{whitespace}*"-" /* Hexadecimal number */ xhstart [xX]{quote} ! xhstop {quote}{whitespace}* xhinside [^']* xhcat {quote}{whitespace_with_newline}{quote} + xhconfused {quote}{whitespace}*"-" /* National character */ *************** *** 169,180 **** */ quote ' xqstart {quote} ! xqstop {quote} xqdouble {quote}{quote} xqinside [^\\']+ xqescape [\\][^0-7] xqoctesc [\\][0-7]{1,3} xqcat {quote}{whitespace_with_newline}{quote} /* $foo$ style quotes ("dollar quoting") *The quoted string starts with $foo$ where "foo" is an optional string --- 171,183 ---- */ quote ' xqstart {quote} ! xqstop {quote}{whitespace}* xqdouble {quote}{quote} xqinside [^\\']+ xqescape [\\][^0-7]xqoctesc [\\][0-7]{1,3} xqcat {quote}{whitespace_with_newline}{quote} + xqconfused {quote}{whitespace}*"-" /* $foo$ style quotes ("dollar quoting") * The quoted string starts with $foo$where "foo" is an optional string *************** *** 185,190 **** --- 188,194 ---- dolq_start [A-Za-z\200-\377_] dolq_cont [A-Za-z\200-\377_0-9] dolqdelim \$({dolq_start}{dolq_cont}*)?\$ + dolqfailed \${dolq_start}{dolq_cont}* dolqinside [^$]+ /* Double quote *************** *** 247,253 **** integer {digit}+ decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*)) ! real ((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+)) param \${integer} --- 251,259 ---- integer {digit}+ decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*)) ! real ({integer}|{decimal})[Ee][-+]?{digit}+ ! realfail1 ({integer}|{decimal})[Ee] ! realfail2 ({integer}|{decimal})[Ee][-+] param \${integer} *************** *** 310,315 **** --- 316,325 ---- /* ignore */ } + <xc>\*+ { + /* ignore */ + } + <xc><<EOF>> { yyerror("unterminated /* comment"); } {xbstart} { *************** *** 329,334 **** --- 339,350 ---- yylval.str = litbufdup(); return BCONST; } + <xb>{xbconfused} { + yyless(yyleng-1); + BEGIN(INITIAL); + yylval.str = litbufdup(); + return BCONST; + } <xh>{xhinside} | <xb>{xbinside} { addlit(yytext, yyleng); *************** *** 356,361 **** --- 372,383 ---- yylval.str = litbufdup(); return XCONST; } + <xh>{xhconfused} { + yyless(yyleng-1); + BEGIN(INITIAL); + yylval.str = litbufdup(); + return XCONST; + } <xh><<EOF>> { yyerror("unterminated hexadecimal string literal"); } {xnstart} { *************** *** 385,390 **** --- 407,418 ---- yylval.str = litbufdup(); return SCONST; } + <xq>{xqconfused} { + yyless(yyleng-1); + BEGIN(INITIAL); + yylval.str = litbufdup(); + return SCONST; + } <xq>{xqdouble} { addlitchar('\''); } *************** *** 413,418 **** --- 441,447 ---- BEGIN(xdolq); startlit(); } + {dolqfailed} { yyerror("invalid token"); } <xdolq>{dolqdelim} { if (strcmp(yytext, dolqstart) ==0) { *************** *** 435,440 **** --- 464,472 ---- <xdolq>{dolqinside} { addlit(yytext, yyleng); } + <xdolq>{dolqfailed} { + addlit(yytext, yyleng); + } <xdolq>. { /* This is only needed for $ inside the quoted text */ addlitchar(yytext[0]); *************** *** 576,582 **** yylval.str = pstrdup(yytext); return FCONST; } ! {identifier} { const ScanKeyword *keyword; --- 608,623 ---- yylval.str = pstrdup(yytext); return FCONST; } ! {realfail1} { ! yyless(yyleng-1); ! yylval.str = pstrdup(yytext); ! return FCONST; ! } ! {realfail2} { ! yyless(yyleng-2); ! yylval.str = pstrdup(yytext); ! return FCONST; ! } {identifier} { const ScanKeyword *keyword;
This seems fine. I don't think the lexer changes enough for us to have issues with new cases. I think adding some comments to explain why we are doing it is enough, and perhaps a test case that can be reproduced later for testing. --------------------------------------------------------------------------- Tom Lane wrote: > I was doing some profiles recently that showed that on simple statements > (like INSERT with a few column values) the basic parser (flex/bison) was > taking up a noticeable percentage of the total CPU time. We already > have done some things to speed up the lexer, like use -CF option for > large-but-fast tables. I read in the flex manual about how you can make > it faster yet if you can avoid the need for backup states. It's not > terribly well documented, but I eventually worked out how to do it, and > find that it seems to make the lexer about a third faster. Basically > what you have to do is provide alternate rules that allow the lexer to > match *some* rule on any prefix of a complete token. For instance the > rule for {real} implies that given > 123.456E+ > if the next character is not a digit, then the {real} rule fails, and > the lexer has to backtrack to > 123.456 > which it can recognize as a {decimal}. You can add a rule that does > match in this situation, throws back the two unwanted characters with > yyless(), and returns the number as the {decimal} rule would do. > The next call will carry on lexing the "E" as an identifier, etc. > (Note: this is going to end up being a syntax error because the SQL > grammar never allows a number to immediately precede an identifier; > but that's not the lexer's concern, and so I don't think that we should > make the extra rule throw an error immediately.) > > What I'm wondering is whether this is really worth doing or not. > There are currently just two parts of the lexer rules that are affected > --- the {real} rule illustrated above, and the rules that allow quoted > strings to be split across lines as the SQL spec requires. But the > patches are still pretty ugly, and what's really annoying is that there > doesn't seem to be any way to get flex to complain if someone later > makes a change that breaks the no-backup-cases property again. > > A prototype patch (completely undocumented :-() is attached. Any > thoughts about whether it's worth pressing forward with? > > regards, tom lane > > *** src/backend/parser/scan.l.orig Fri Mar 11 21:15:20 2005 > --- src/backend/parser/scan.l Mon May 23 01:05:54 2005 > *************** > *** 148,163 **** > * validate the contents. > */ > xbstart [bB]{quote} > ! xbstop {quote} > xbinside [^']* > xbcat {quote}{whitespace_with_newline}{quote} > > /* Hexadecimal number > */ > xhstart [xX]{quote} > ! xhstop {quote} > xhinside [^']* > xhcat {quote}{whitespace_with_newline}{quote} > > /* National character > */ > --- 148,165 ---- > * validate the contents. > */ > xbstart [bB]{quote} > ! xbstop {quote}{whitespace}* > xbinside [^']* > xbcat {quote}{whitespace_with_newline}{quote} > + xbconfused {quote}{whitespace}*"-" > > /* Hexadecimal number > */ > xhstart [xX]{quote} > ! xhstop {quote}{whitespace}* > xhinside [^']* > xhcat {quote}{whitespace_with_newline}{quote} > + xhconfused {quote}{whitespace}*"-" > > /* National character > */ > *************** > *** 169,180 **** > */ > quote ' > xqstart {quote} > ! xqstop {quote} > xqdouble {quote}{quote} > xqinside [^\\']+ > xqescape [\\][^0-7] > xqoctesc [\\][0-7]{1,3} > xqcat {quote}{whitespace_with_newline}{quote} > > /* $foo$ style quotes ("dollar quoting") > * The quoted string starts with $foo$ where "foo" is an optional string > --- 171,183 ---- > */ > quote ' > xqstart {quote} > ! xqstop {quote}{whitespace}* > xqdouble {quote}{quote} > xqinside [^\\']+ > xqescape [\\][^0-7] > xqoctesc [\\][0-7]{1,3} > xqcat {quote}{whitespace_with_newline}{quote} > + xqconfused {quote}{whitespace}*"-" > > /* $foo$ style quotes ("dollar quoting") > * The quoted string starts with $foo$ where "foo" is an optional string > *************** > *** 185,190 **** > --- 188,194 ---- > dolq_start [A-Za-z\200-\377_] > dolq_cont [A-Za-z\200-\377_0-9] > dolqdelim \$({dolq_start}{dolq_cont}*)?\$ > + dolqfailed \${dolq_start}{dolq_cont}* > dolqinside [^$]+ > > /* Double quote > *************** > *** 247,253 **** > > integer {digit}+ > decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*)) > ! real ((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+)) > > param \${integer} > > --- 251,259 ---- > > integer {digit}+ > decimal (({digit}*\.{digit}+)|({digit}+\.{digit}*)) > ! real ({integer}|{decimal})[Ee][-+]?{digit}+ > ! realfail1 ({integer}|{decimal})[Ee] > ! realfail2 ({integer}|{decimal})[Ee][-+] > > param \${integer} > > *************** > *** 310,315 **** > --- 316,325 ---- > /* ignore */ > } > > + <xc>\*+ { > + /* ignore */ > + } > + > <xc><<EOF>> { yyerror("unterminated /* comment"); } > > {xbstart} { > *************** > *** 329,334 **** > --- 339,350 ---- > yylval.str = litbufdup(); > return BCONST; > } > + <xb>{xbconfused} { > + yyless(yyleng-1); > + BEGIN(INITIAL); > + yylval.str = litbufdup(); > + return BCONST; > + } > <xh>{xhinside} | > <xb>{xbinside} { > addlit(yytext, yyleng); > *************** > *** 356,361 **** > --- 372,383 ---- > yylval.str = litbufdup(); > return XCONST; > } > + <xh>{xhconfused} { > + yyless(yyleng-1); > + BEGIN(INITIAL); > + yylval.str = litbufdup(); > + return XCONST; > + } > <xh><<EOF>> { yyerror("unterminated hexadecimal string literal"); } > > {xnstart} { > *************** > *** 385,390 **** > --- 407,418 ---- > yylval.str = litbufdup(); > return SCONST; > } > + <xq>{xqconfused} { > + yyless(yyleng-1); > + BEGIN(INITIAL); > + yylval.str = litbufdup(); > + return SCONST; > + } > <xq>{xqdouble} { > addlitchar('\''); > } > *************** > *** 413,418 **** > --- 441,447 ---- > BEGIN(xdolq); > startlit(); > } > + {dolqfailed} { yyerror("invalid token"); } > <xdolq>{dolqdelim} { > if (strcmp(yytext, dolqstart) == 0) > { > *************** > *** 435,440 **** > --- 464,472 ---- > <xdolq>{dolqinside} { > addlit(yytext, yyleng); > } > + <xdolq>{dolqfailed} { > + addlit(yytext, yyleng); > + } > <xdolq>. { > /* This is only needed for $ inside the quoted text */ > addlitchar(yytext[0]); > *************** > *** 576,582 **** > yylval.str = pstrdup(yytext); > return FCONST; > } > ! > > {identifier} { > const ScanKeyword *keyword; > --- 608,623 ---- > yylval.str = pstrdup(yytext); > return FCONST; > } > ! {realfail1} { > ! yyless(yyleng-1); > ! yylval.str = pstrdup(yytext); > ! return FCONST; > ! } > ! {realfail2} { > ! yyless(yyleng-2); > ! yylval.str = pstrdup(yytext); > ! return FCONST; > ! } > > {identifier} { > const ScanKeyword *keyword; > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Mon, 2005-05-23 at 12:31 -0400, Tom Lane wrote: > doesn't seem to be any way to get flex to complain if someone later > makes a change that breaks the no-backup-cases property again. After some digging, there is a -b option will generate a file called lex.backup if any backup-states exist. The file is said to contain information that would help you remove backup-states. It seems straightforward to test for the existence of that file in the build process? Or perhaps add a special test state --enable-flextest to perform the test during the build. Best Regards, Simon Riggs
Tom Lane wrote: >[snip - flex is slowed down by backtracking - how to fix ] > >What I'm wondering is whether this is really worth doing or not. >There are currently just two parts of the lexer rules that are affected >--- the {real} rule illustrated above, and the rules that allow quoted >strings to be split across lines as the SQL spec requires. But the >patches are still pretty ugly, and what's really annoying is that there >doesn't seem to be any way to get flex to complain if someone later >makes a change that breaks the no-backup-cases property again. > > > I would be more concerned if there were not reasonable alternatives to many bulk parsed inserts (COPY, prepared statement). But I do think it's worth it, even so ... not all client interfaces support prepared statements (notoriously PHP, although I understand KL has sent patches to fix that) and not all inserts are suitable for COPY. cheers andrew
> What I'm wondering is whether this is really worth doing or not. > There are currently just two parts of the lexer rules that are affected > --- the {real} rule illustrated above, and the rules that allow quoted > strings to be split across lines as the SQL spec requires. But the > patches are still pretty ugly, and what's really annoying is that there > doesn't seem to be any way to get flex to complain if someone later > makes a change that breaks the no-backup-cases property again. I was just thinking that if there's not a switch, it's prone to error again. However, the lexer isn't touched anywhere near as much as the grammar is right? So just put a large comment/warning/reminderat the top to test for non-backup states. I'm definitely in favour of a 1/3 speedup of the lexer. Chris
> But I do think it's worth it, even so ... not all client interfaces > support prepared statements (notoriously PHP, although I understand KL > has sent patches to fix that) and not all inserts are suitable for COPY. There is now pg_prepare/pg_execute/pg_query_params in PHP, however you could always have just used straight SQL PREPARE and EXECUTE commands. Chris
Simon Riggs <simon@2ndquadrant.com> writes: > On Mon, 2005-05-23 at 12:31 -0400, Tom Lane wrote: >> doesn't seem to be any way to get flex to complain if someone later >> makes a change that breaks the no-backup-cases property again. > After some digging, there is a -b option will generate a file called > lex.backup if any backup-states exist. The file is said to contain > information that would help you remove backup-states. Right, reading that file is how I learned to fix it ... > It seems straightforward to test for the existence of that file in the > build process? Or perhaps add a special test state --enable-flextest to > perform the test during the build. Well, the problem is that the "success" state isn't that the file doesn't exist or is empty or anything so easy as that. What you want is for it to say No backing up. and it seems a tad too ugly to code such a test into the makefiles. I'd not want to bet that the success message is spelled the same way in every flex version, or is locale-independent. Based on the comments so far in this thread, I'll go ahead and commit the patch, with some comments attached of course --- in particular a big head comment to run flex with -b and see that lex.backup says something to this effect. regards, tom lane
> Based on the comments so far in this thread, I'll go ahead and commit > the patch, with some comments attached of course --- in particular a big > head comment to run flex with -b and see that lex.backup says something > to this effect. Add it to the release check-list. Chris
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes: > However, the lexer isn't touched anywhere near as much as the grammar is > right? Yeah --- if you look at the CVS history, changes that affect the flex rules (and not just the text of the C-code actions) are really rare these days. If there were a lot of churn I'd think this proposal was completely impractical, but there's not much anymore. regards, tom lane
On Mon, 2005-05-23 at 22:43 -0400, Tom Lane wrote: > Based on the comments so far in this thread, I'll go ahead and commit > the patch, with some comments attached of course --- in particular a big > head comment to run flex with -b and see that lex.backup says something > to this effect. flex counts the number of times it needs to back up in a variable called num_backing_up. It tests this to see whether to generate code differently. ISTM you could have a -x switch to make it generate an error if num_backing_up > 0. Would we use the -x switch if we had it? If so I'll submit a patch here first, then on to GNU. It'll take a while before it comes back round to us though. Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > Would we use the -x switch if we had it? Dunno. Depending on such a thing would require depending on a new flex version, and seeing that the flex guys haven't put out a new release since the badly broken 2.5.31 more than 2 years ago, I wouldn't hold my breath waiting for one we can use. regards, tom lane
Tom Lane wrote: > Dunno. Depending on such a thing would require depending on a new flex > version, and seeing that the flex guys haven't put out a new release > since the badly broken 2.5.31 more than 2 years ago, I wouldn't hold > my breath waiting for one we can use. It should be easy enough to check for the presence of the flag at configure-time and only use it if flex implements it. Prior to reading Simon's message I was actually thinking of hacking up the same thing :), so I agree it is probably worth doing. -Neil
On Wed, 2005-05-25 at 11:28 +1000, Neil Conway wrote: > Tom Lane wrote: > > Dunno. Depending on such a thing would require depending on a new flex > > version, and seeing that the flex guys haven't put out a new release > > since the badly broken 2.5.31 more than 2 years ago, I wouldn't hold > > my breath waiting for one we can use. > > It should be easy enough to check for the presence of the flag at > configure-time and only use it if flex implements it. Prior to reading > Simon's message I was actually thinking of hacking up the same thing :), > so I agree it is probably worth doing. Neil, I'll be doing this after the beta freeze is announced. If you wanted to do this before then, I wouldn't stop you. Best Regards, Simon Riggs