Re: Speeding up the Postgres lexer - Mailing list pgsql-hackers
| From | Bruce Momjian |
|---|---|
| Subject | Re: Speeding up the Postgres lexer |
| Date | |
| Msg-id | 200505231642.j4NGgA315728@candle.pha.pa.us |
| In response to | Speeding up the Postgres lexer (Tom Lane &lt;tgl@sss.pgh.pa.us&gt;) |
| List | pgsql-hackers |
This seems fine.  I don't think the lexer changes often enough for us to
have issues with new cases.  I think adding some comments to explain why
we are doing it is enough, and perhaps a test case that can be reproduced
later for testing.

---------------------------------------------------------------------------

Tom Lane wrote:
> I was doing some profiles recently that showed that on simple statements
> (like INSERT with a few column values) the basic parser (flex/bison) was
> taking up a noticeable percentage of the total CPU time.  We already
> have done some things to speed up the lexer, like use -CF option for
> large-but-fast tables.  I read in the flex manual about how you can make
> it faster yet if you can avoid the need for backup states.  It's not
> terribly well documented, but I eventually worked out how to do it, and
> find that it seems to make the lexer about a third faster.  Basically
> what you have to do is provide alternate rules that allow the lexer to
> match *some* rule on any prefix of a complete token.  For instance the
> rule for {real} implies that given
> 	123.456E+
> if the next character is not a digit, then the {real} rule fails, and
> the lexer has to backtrack to
> 	123.456
> which it can recognize as a {decimal}.  You can add a rule that does
> match in this situation, throws back the two unwanted characters with
> yyless(), and returns the number as the {decimal} rule would do.
> The next call will carry on lexing the "E" as an identifier, etc.
> (Note: this is going to end up being a syntax error because the SQL
> grammar never allows a number to immediately precede an identifier;
> but that's not the lexer's concern, and so I don't think that we should
> make the extra rule throw an error immediately.)
>
> What I'm wondering is whether this is really worth doing or not.
> There are currently just two parts of the lexer rules that are affected
> --- the {real} rule illustrated above, and the rules that allow quoted
> strings to be split across lines as the SQL spec requires.  But the
> patches are still pretty ugly, and what's really annoying is that there
> doesn't seem to be any way to get flex to complain if someone later
> makes a change that breaks the no-backup-cases property again.
>
> A prototype patch (completely undocumented :-() is attached.  Any
> thoughts about whether it's worth pressing forward with?
>
> 			regards, tom lane
>
> *** src/backend/parser/scan.l.orig	Fri Mar 11 21:15:20 2005
> --- src/backend/parser/scan.l	Mon May 23 01:05:54 2005
> ***************
> *** 148,163 ****
>    * validate the contents.
>    */
>   xbstart		[bB]{quote}
> ! xbstop		{quote}
>   xbinside	[^']*
>   xbcat		{quote}{whitespace_with_newline}{quote}
>
>   /* Hexadecimal number
>    */
>   xhstart		[xX]{quote}
> ! xhstop		{quote}
>   xhinside	[^']*
>   xhcat		{quote}{whitespace_with_newline}{quote}
>
>   /* National character
>    */
> --- 148,165 ----
>    * validate the contents.
>    */
>   xbstart		[bB]{quote}
> ! xbstop		{quote}{whitespace}*
>   xbinside	[^']*
>   xbcat		{quote}{whitespace_with_newline}{quote}
> + xbconfused	{quote}{whitespace}*"-"
>
>   /* Hexadecimal number
>    */
>   xhstart		[xX]{quote}
> ! xhstop		{quote}{whitespace}*
>   xhinside	[^']*
>   xhcat		{quote}{whitespace_with_newline}{quote}
> + xhconfused	{quote}{whitespace}*"-"
>
>   /* National character
>    */
> ***************
> *** 169,180 ****
>    */
>   quote		'
>   xqstart		{quote}
> ! xqstop		{quote}
>   xqdouble	{quote}{quote}
>   xqinside	[^\\']+
>   xqescape	[\\][^0-7]
>   xqoctesc	[\\][0-7]{1,3}
>   xqcat		{quote}{whitespace_with_newline}{quote}
>
>   /* $foo$ style quotes ("dollar quoting")
>    * The quoted string starts with $foo$ where "foo" is an optional string
> --- 171,183 ----
>    */
>   quote		'
>   xqstart		{quote}
> ! xqstop		{quote}{whitespace}*
>   xqdouble	{quote}{quote}
>   xqinside	[^\\']+
>   xqescape	[\\][^0-7]
>   xqoctesc	[\\][0-7]{1,3}
>   xqcat		{quote}{whitespace_with_newline}{quote}
> + xqconfused	{quote}{whitespace}*"-"
>
>   /* $foo$ style quotes ("dollar quoting")
>    * The quoted string starts with $foo$ where "foo" is an optional string
> ***************
> *** 185,190 ****
> --- 188,194 ----
>   dolq_start	[A-Za-z\200-\377_]
>   dolq_cont	[A-Za-z\200-\377_0-9]
>   dolqdelim	\$({dolq_start}{dolq_cont}*)?\$
> + dolqfailed	\${dolq_start}{dolq_cont}*
>   dolqinside	[^$]+
>
>   /* Double quote
> ***************
> *** 247,253 ****
>
>   integer		{digit}+
>   decimal		(({digit}*\.{digit}+)|({digit}+\.{digit}*))
> ! real		((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+))
>
>   param		\${integer}
>
> --- 251,259 ----
>
>   integer		{digit}+
>   decimal		(({digit}*\.{digit}+)|({digit}+\.{digit}*))
> ! real		({integer}|{decimal})[Ee][-+]?{digit}+
> ! realfail1	({integer}|{decimal})[Ee]
> ! realfail2	({integer}|{decimal})[Ee][-+]
>
>   param		\${integer}
>
> ***************
> *** 310,315 ****
> --- 316,325 ----
>   					/* ignore */
>   				}
>
> + <xc>\*+			{
> + 					/* ignore */
> + 				}
> +
>   <xc><<EOF>>		{ yyerror("unterminated /* comment"); }
>
>   {xbstart}		{
> ***************
> *** 329,334 ****
> --- 339,350 ----
>   					yylval.str = litbufdup();
>   					return BCONST;
>   				}
> + <xb>{xbconfused}	{
> + 					yyless(yyleng-1);
> + 					BEGIN(INITIAL);
> + 					yylval.str = litbufdup();
> + 					return BCONST;
> + 				}
>   <xh>{xhinside}	|
>   <xb>{xbinside}	{
>   					addlit(yytext, yyleng);
> ***************
> *** 356,361 ****
> --- 372,383 ----
>   					yylval.str = litbufdup();
>   					return XCONST;
>   				}
> + <xh>{xhconfused}	{
> + 					yyless(yyleng-1);
> + 					BEGIN(INITIAL);
> + 					yylval.str = litbufdup();
> + 					return XCONST;
> + 				}
>   <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
>
>   {xnstart}		{
> ***************
> *** 385,390 ****
> --- 407,418 ----
>   					yylval.str = litbufdup();
>   					return SCONST;
>   				}
> + <xq>{xqconfused}	{
> + 					yyless(yyleng-1);
> + 					BEGIN(INITIAL);
> + 					yylval.str = litbufdup();
> + 					return SCONST;
> + 				}
>   <xq>{xqdouble}	{
>   					addlitchar('\'');
>   				}
> ***************
> *** 413,418 ****
> --- 441,447 ----
>   					BEGIN(xdolq);
>   					startlit();
>   				}
> + {dolqfailed}	{ yyerror("invalid token"); }
>   <xdolq>{dolqdelim}	{
>   					if (strcmp(yytext, dolqstart) == 0)
>   					{
> ***************
> *** 435,440 ****
> --- 464,472 ----
>   <xdolq>{dolqinside}	{
>   					addlit(yytext, yyleng);
>   				}
> + <xdolq>{dolqfailed}	{
> + 					addlit(yytext, yyleng);
> + 				}
>   <xdolq>.		{
>   					/* This is only needed for $ inside the quoted text */
>   					addlitchar(yytext[0]);
> ***************
> *** 576,582 ****
>   					yylval.str = pstrdup(yytext);
>   					return FCONST;
>   				}
> !
>
>   {identifier}	{
>   					const ScanKeyword *keyword;
> --- 608,623 ----
>   					yylval.str = pstrdup(yytext);
>   					return FCONST;
>   				}
> ! {realfail1}		{
> ! 					yyless(yyleng-1);
> ! 					yylval.str = pstrdup(yytext);
> ! 					return FCONST;
> ! 				}
> ! {realfail2}		{
> ! 					yyless(yyleng-2);
> ! 					yylval.str = pstrdup(yytext);
> ! 					return FCONST;
> ! 				}
>
>   {identifier}	{
>   					const ScanKeyword *keyword;
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
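The realfail trick in the patch can be illustrated outside of flex. Below is a small Python sketch (hypothetical; not the PostgreSQL scanner, and the pattern and token names only loosely mirror the patch): a full `{real}` is tried first, then a `realfail`-style prefix rule matches a number with a dangling `E`/`E+` and hands the unwanted characters back, the same effect `yyless()` achieves in the patched rules:

```python
import re

# Patterns loosely mirroring the patched flex definitions.  REAL requires a
# complete exponent; REALFAIL covers the prefixes ("1E", "1E+") that would
# otherwise force flex to back up to a shorter match.
REAL     = re.compile(r'(\d*\.\d+|\d+\.\d*|\d+)[Ee][-+]?\d+')
REALFAIL = re.compile(r'(\d*\.\d+|\d+\.\d*|\d+)[Ee][-+]?')   # realfail1/realfail2
NUMBER   = re.compile(r'\d*\.\d+|\d+\.\d*|\d+')
IDENT    = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def tokenize(s):
    tokens, i = [], 0
    while i < len(s):
        if s[i].isspace():
            i += 1
        elif m := REAL.match(s, i):
            tokens.append(('FCONST', m.group()))
            i = m.end()
        elif m := REALFAIL.match(s, i):
            # Emulate yyless(): return only the number part and hand the
            # trailing "E" / "E+" back to be scanned again on the next pass.
            tokens.append(('FCONST', m.group(1)))
            i += len(m.group(1))
        elif m := NUMBER.match(s, i):
            tokens.append(('FCONST', m.group()))
            i = m.end()
        elif m := IDENT.match(s, i):
            tokens.append(('IDENT', m.group()))
            i = m.end()
        else:
            tokens.append(('OP', s[i]))
            i += 1
    return tokens
```

Given `123.456E+`, the sketch emits the number, then rescans `E` as an identifier and `+` as an operator; as Tom notes, rejecting that sequence is the grammar's job, not the lexer's.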