Speeding up the Postgres lexer - Mailing list pgsql-hackers
From:       Tom Lane
Subject:    Speeding up the Postgres lexer
Date:
Msg-id:     8652.1116865895@sss.pgh.pa.us
Responses:  Re: Speeding up the Postgres lexer (4 replies)
List:       pgsql-hackers
I was doing some profiles recently that showed that on simple statements (like INSERT with a few column values) the basic parser (flex/bison) was taking up a noticeable percentage of the total CPU time.

We have already done some things to speed up the lexer, like using the -CF option for large-but-fast tables. I read in the flex manual about how you can make it faster yet if you can avoid the need for backup states. It's not terribly well documented, but I eventually worked out how to do it, and find that it seems to make the lexer about a third faster.

Basically what you have to do is provide alternate rules that allow the lexer to match *some* rule on any prefix of a complete token. For instance, the rule for {real} implies that given

	123.456E+

if the next character is not a digit, then the {real} rule fails, and the lexer has to backtrack to

	123.456

which it can recognize as a {decimal}. You can add a rule that does match in this situation, throws back the two unwanted characters with yyless(), and returns the number as the {decimal} rule would do. The next call will carry on lexing the "E" as an identifier, etc. (Note: this is going to end up being a syntax error because the SQL grammar never allows a number to immediately precede an identifier; but that's not the lexer's concern, and so I don't think that we should make the extra rule throw an error immediately.)

What I'm wondering is whether this is really worth doing or not. There are currently just two parts of the lexer rules that are affected --- the {real} rule illustrated above, and the rules that allow quoted strings to be split across lines as the SQL spec requires. But the patches are still pretty ugly, and what's really annoying is that there doesn't seem to be any way to get flex to complain if someone later makes a change that breaks the no-backup-cases property again.

A prototype patch (completely undocumented :-() is attached. Any thoughts about whether it's worth pressing forward with?

			regards, tom lane


*** src/backend/parser/scan.l.orig	Fri Mar 11 21:15:20 2005
--- src/backend/parser/scan.l	Mon May 23 01:05:54 2005
***************
*** 148,163 ****
   * validate the contents.
   */
  xbstart			[bB]{quote}
! xbstop			{quote}
  xbinside		[^']*
  xbcat			{quote}{whitespace_with_newline}{quote}
  
  /* Hexadecimal number
   */
  xhstart			[xX]{quote}
! xhstop			{quote}
  xhinside		[^']*
  xhcat			{quote}{whitespace_with_newline}{quote}
  
  /* National character
--- 148,165 ----
   * validate the contents.
   */
  xbstart			[bB]{quote}
! xbstop			{quote}{whitespace}*
  xbinside		[^']*
  xbcat			{quote}{whitespace_with_newline}{quote}
+ xbconfused		{quote}{whitespace}*"-"
  
  /* Hexadecimal number
   */
  xhstart			[xX]{quote}
! xhstop			{quote}{whitespace}*
  xhinside		[^']*
  xhcat			{quote}{whitespace_with_newline}{quote}
+ xhconfused		{quote}{whitespace}*"-"
  
  /* National character
***************
*** 169,180 ****
   */
  quote			'
  xqstart			{quote}
! xqstop			{quote}
  xqdouble		{quote}{quote}
  xqinside		[^\\']+
  xqescape		[\\][^0-7]
  xqoctesc		[\\][0-7]{1,3}
  xqcat			{quote}{whitespace_with_newline}{quote}
  
  /* $foo$ style quotes ("dollar quoting")
   * The quoted string starts with $foo$ where "foo" is an optional string
--- 171,183 ----
   */
  quote			'
  xqstart			{quote}
! xqstop			{quote}{whitespace}*
  xqdouble		{quote}{quote}
  xqinside		[^\\']+
  xqescape		[\\][^0-7]
  xqoctesc		[\\][0-7]{1,3}
  xqcat			{quote}{whitespace_with_newline}{quote}
+ xqconfused		{quote}{whitespace}*"-"
  
  /* $foo$ style quotes ("dollar quoting")
   * The quoted string starts with $foo$ where "foo" is an optional string
***************
*** 185,190 ****
--- 188,194 ----
  dolq_start		[A-Za-z\200-\377_]
  dolq_cont		[A-Za-z\200-\377_0-9]
  dolqdelim		\$({dolq_start}{dolq_cont}*)?\$
+ dolqfailed		\${dolq_start}{dolq_cont}*
  dolqinside		[^$]+
  
  /* Double quote
***************
*** 247,253 ****
  
  integer			{digit}+
  decimal			(({digit}*\.{digit}+)|({digit}+\.{digit}*))
! real			((({digit}*\.{digit}+)|({digit}+\.{digit}*)|({digit}+))([Ee][-+]?{digit}+))
  
  param			\${integer}
  
--- 251,259 ----
  
  integer			{digit}+
  decimal			(({digit}*\.{digit}+)|({digit}+\.{digit}*))
! real			({integer}|{decimal})[Ee][-+]?{digit}+
! realfail1		({integer}|{decimal})[Ee]
! realfail2		({integer}|{decimal})[Ee][-+]
  
  param			\${integer}
  
***************
*** 310,315 ****
--- 316,325 ----
  					/* ignore */
  				}
  
+ <xc>\*+			{
+ 					/* ignore */
+ 				}
+ 
  <xc><<EOF>>		{ yyerror("unterminated /* comment"); }
  
  {xbstart}		{
***************
*** 329,334 ****
--- 339,350 ----
  					yylval.str = litbufdup();
  					return BCONST;
  				}
+ <xb>{xbconfused}	{
+ 					yyless(yyleng-1);
+ 					BEGIN(INITIAL);
+ 					yylval.str = litbufdup();
+ 					return BCONST;
+ 				}
  <xh>{xhinside}	|
  <xb>{xbinside}	{
  					addlit(yytext, yyleng);
***************
*** 356,361 ****
--- 372,383 ----
  					yylval.str = litbufdup();
  					return XCONST;
  				}
+ <xh>{xhconfused}	{
+ 					yyless(yyleng-1);
+ 					BEGIN(INITIAL);
+ 					yylval.str = litbufdup();
+ 					return XCONST;
+ 				}
  <xh><<EOF>>		{ yyerror("unterminated hexadecimal string literal"); }
  
  {xnstart}		{
***************
*** 385,390 ****
--- 407,418 ----
  					yylval.str = litbufdup();
  					return SCONST;
  				}
+ <xq>{xqconfused}	{
+ 					yyless(yyleng-1);
+ 					BEGIN(INITIAL);
+ 					yylval.str = litbufdup();
+ 					return SCONST;
+ 				}
  <xq>{xqdouble}	{
  					addlitchar('\'');
  				}
***************
*** 413,418 ****
--- 441,447 ----
  					BEGIN(xdolq);
  					startlit();
  				}
+ {dolqfailed}	{ yyerror("invalid token"); }
  <xdolq>{dolqdelim} {
  					if (strcmp(yytext, dolqstart) == 0)
  					{
***************
*** 435,440 ****
--- 464,472 ----
  <xdolq>{dolqinside} {
  					addlit(yytext, yyleng);
  				}
+ <xdolq>{dolqfailed} {
+ 					addlit(yytext, yyleng);
+ 				}
  <xdolq>.		{
  					/* This is only needed for $ inside the quoted text */
  					addlitchar(yytext[0]);
***************
*** 576,582 ****
  					yylval.str = pstrdup(yytext);
  					return FCONST;
  				}
! {identifier}	{
  					const ScanKeyword *keyword;
--- 608,623 ----
  					yylval.str = pstrdup(yytext);
  					return FCONST;
  				}
! {realfail1}		{
! 					yyless(yyleng-1);
! 					yylval.str = pstrdup(yytext);
! 					return FCONST;
! 				}
! {realfail2}		{
! 					yyless(yyleng-2);
! 					yylval.str = pstrdup(yytext);
! 					return FCONST;
! 				}
  {identifier}	{
  					const ScanKeyword *keyword;
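
For readers who want to experiment with the no-backup trick outside the Postgres tree, here is a minimal, self-contained flex sketch of the same idea. It is not part of the attached patch: the file name, the token printing, and the main() driver are invented for illustration, and only the {real}/{realfail} definitions mirror the ones in the patch. Building it with flex's -b option writes a lex.backup report, which should state that no backing up is needed while the {realfail} rules are present, and should flag backup states again if they are removed.

	/*
	 * backup_sketch.l -- minimal illustration of avoiding flex backup states
	 * (hypothetical file name; not part of the attached patch).
	 *
	 * Build and check for backup states, assuming flex and cc are installed:
	 *     flex -b -o backup_sketch.c backup_sketch.l
	 *     cc -o backup_sketch backup_sketch.c
	 *     cat lex.backup
	 */
	%{
	#include <stdio.h>
	%}
	%option noyywrap nodefault

	digit		[0-9]
	integer		{digit}+
	decimal		(({digit}*\.{digit}+)|({digit}+\.{digit}*))
	real		({integer}|{decimal})[Ee][-+]?{digit}+
	/* Prefixes of {real} that would otherwise force the scanner to back up */
	realfail1	({integer}|{decimal})[Ee]
	realfail2	({integer}|{decimal})[Ee][-+]

	%%

	{real}		{ printf("REAL    %s\n", yytext); }
	{decimal}	{ printf("DECIMAL %s\n", yytext); }
	{integer}	{ printf("INTEGER %s\n", yytext); }
	{realfail1}	{
				/* e.g. "123E": give back the "E" and report the number,
				 * as {integer}/{decimal} would have done after a backtrack */
				yyless(yyleng - 1);
				printf("NUMBER  %s\n", yytext);
			}
	{realfail2}	{
				/* e.g. "123E+": give back the two unwanted characters */
				yyless(yyleng - 2);
				printf("NUMBER  %s\n", yytext);
			}
	[A-Za-z_]+	{ printf("IDENT   %s\n", yytext); }
	[ \t\n]+	{ /* skip whitespace */ }
	.		{ printf("CHAR    %s\n", yytext); }

	%%

	int
	main(void)
	{
		yylex();		/* scan stdin until EOF */
		return 0;
	}

Feeding it the example from the message should print something like

	$ echo '123.456E+ foo' | ./backup_sketch
	NUMBER  123.456
	IDENT   E
	CHAR    +
	IDENT   foo

i.e. the number comes back intact and the stray "E" is picked up as an identifier on the next call, without the scanner ever backing up.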