Have we out-grown Flex? - Mailing list pgsql-hackers
From | Peter Geoghegan
---|---
Subject | Have we out-grown Flex?
Date |
Msg-id | CAEYLb_VN38OSY4Mhq9ZxsbdnsNoBzynHBjMNkNmN=bKMHYjaJQ@mail.gmail.com
Responses | Re: Have we out-grown Flex? (×5)
List | pgsql-hackers
Quite apart from the practical difficulties that we have with Flex (the fact that the authors are non-responsive and possibly retired, that annoying compiler warning, and the fact that we are forced to maintain our own Windows binaries of 2.5.35), it has some notable disadvantages.

I am aware that the MySQL people use their own hand-coded lexical analyzer, sql_lex.cc, which provides a yacc interface while avoiding any lexical analyzer generator whatsoever. They can't have done this just for fun, and no doubt it goes some way towards explaining their continued performance advantage for very simple queries. I have heard people complain, more than once, that Postgres parser overhead is unreasonably high for "pgbench -S" style use-cases. I suspect that we pay a high price for our table-based lexical analyser's use of in-memory transition tables, a price that the MySQL guys do not pay (a toy illustration of the difference appears at the end of this mail). It seems exceedingly likely that our lexer is far more sophisticated than theirs, which probably makes it infeasible to emulate their approach, and certainly makes it unattractive.

However, I think there might be a better way. There is an LGPL-licensed project named Quex (hey, GNU bison is GPL), which can generate C (and C++) code, and I am in contact with its primary author, Frank-Rene Schäfer:

http://quex.sourceforge.net/

While it fits the bill as a replacement for Flex if ever we were compelled to drop Flex, it also has a number of interesting advantages over Flex that warrant further investigation. As Wikipedia puts it:

"Quex uses traditional steps of Thompson construction to create non-deterministic finite automatons from regular expressions, conversion to a deterministic finite automaton and then Hopcroft optimization to reduce the number of states to a minimum. Those mechanisms, though, have been adapted to deal with character sets rather than single characters. By means of this the calculation time can be significantly reduced. Since the Unicode character set consists of many more code points than plain ASCII, those optimizations are necessary in order to produce lexical analysers in a reasonable amount of time."

(A second toy sketch of that character-set point also appears at the end of this mail.)

Apparently, in practical terms, reductions in total lexing time of more than 1/2, and even as high as 2/3, have been observed when switching from Flex. Quex uses Flex's syntax for describing regular expressions, which makes a conversion research project seem like a worthwhile undertaking; I am told that it would probably take less than a month, and perhaps as little as two weeks, given the broad compatibility of Flex and Quex. The undertaking has the enthusiastic support of Frank-Rene, which is a refreshing contrast to the complete inaccessibility of the Flex developers and the dead silence on the flex users' mailing list.

I'm not suggesting that adopting this wouldn't have to be carefully considered from many angles, of which the performance of the resulting lexer is only one, but at the same time I dare say that this approach is the "low-hanging fruit" of optimising for the use-case where parser overhead is unreasonably high.

Am I barking up the wrong tree? At the very least, hopefully these observations will provoke some thought - Flex/Lex is not the only tool for generating yacc-compatible lexical analysers, and it may not be the best one for our current needs.

-- 
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
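
To make the transition-table point a little more concrete, here is a toy sketch contrasting a table-driven scanner loop of the sort Flex generates with a directly coded one of the sort a hand-written lexer (or, reportedly, a Quex-generated one) boils down to. The token rules, state names and functions below are invented for illustration and have nothing to do with the real scan.l:

/*
 * Toy comparison: table-driven DFA versus directly coded scanner.
 * The token language (identifiers and integers only) is deliberately
 * trivial and is not the PostgreSQL scanner.
 */
#include <ctype.h>
#include <stdio.h>

enum { S_START, S_IDENT, S_NUM, S_DONE, N_STATES };
enum { C_ALPHA, C_DIGIT, C_OTHER, N_CLASSES };

/*
 * Table-driven: the automaton lives in data.  A real Flex-generated
 * scanner maps each input byte through an equivalence-class table and
 * then a transition table, so the hot loop performs data-dependent
 * memory loads for every byte.
 */
static const int trans[N_STATES][N_CLASSES] = {
	/* S_START */ {S_IDENT, S_NUM, S_DONE},
	/* S_IDENT */ {S_IDENT, S_IDENT, S_DONE},
	/* S_NUM   */ {S_DONE, S_NUM, S_DONE},
	/* S_DONE  */ {S_DONE, S_DONE, S_DONE},
};

static int
char_class(int c)
{
	if (isalpha(c) || c == '_')
		return C_ALPHA;
	if (isdigit(c))
		return C_DIGIT;
	return C_OTHER;
}

/* Length of the token starting at p, found via the transition table. */
static int
table_driven(const char *p)
{
	int		state = S_START;
	int		len = 0;

	while (p[len] != '\0')
	{
		int		next = trans[state][char_class((unsigned char) p[len])];

		if (next == S_DONE)
			break;
		state = next;
		len++;
	}
	return len;
}

/*
 * Directly coded: the same automaton, but the transitions are compiled
 * into branches; there is no transition table to fetch from memory.
 */
static int
direct_coded(const char *p)
{
	int		len = 0;

	if (isalpha((unsigned char) p[0]) || p[0] == '_')
	{
		len = 1;
		while (isalnum((unsigned char) p[len]) || p[len] == '_')
			len++;
	}
	else if (isdigit((unsigned char) p[0]))
	{
		len = 1;
		while (isdigit((unsigned char) p[len]))
			len++;
	}
	return len;
}

int
main(void)
{
	const char *input = "pgbench";

	printf("table-driven: %d, direct-coded: %d\n",
		   table_driven(input), direct_coded(input));
	return 0;
}

Whether that difference is really what hurts us in the "pgbench -S" case is exactly the kind of thing the research project would have to measure.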
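
And on the "character sets rather than single characters" point from the quoted paragraph: the practical consequence is that a transition out of a DFA state can be stored as a handful of code-point intervals rather than as one entry per code point, which is what keeps generation time and table size sane once you go beyond ASCII. Again, the state and intervals below are invented for illustration and taken from no real lexer:

/*
 * Sketch: one DFA state's outgoing transitions stored as code-point
 * intervals.  The intervals and target states are invented.
 */
#include <stdio.h>
#include <stdint.h>

typedef struct
{
	uint32_t	lo;			/* first code point (inclusive) */
	uint32_t	hi;			/* last code point (inclusive) */
	int			target;		/* next state for code points in [lo, hi] */
} Edge;

/*
 * Hypothetical "inside an identifier" state: digits, ASCII letters,
 * underscore and everything at or above U+0080 continue the identifier.
 * Five intervals stand in for over a million possible code points.
 */
static const Edge ident_edges[] = {
	{'0', '9', 1},
	{'A', 'Z', 1},
	{'_', '_', 1},
	{'a', 'z', 1},
	{0x0080, 0x10FFFF, 1},
};

static int
next_state(const Edge *edges, int nedges, uint32_t cp)
{
	/*
	 * A linear scan is fine for a handful of intervals; a generated
	 * lexer would more likely emit a branch tree or a binary search.
	 */
	for (int i = 0; i < nedges; i++)
	{
		if (cp >= edges[i].lo && cp <= edges[i].hi)
			return edges[i].target;
	}
	return -1;					/* no edge: the token ends here */
}

int
main(void)
{
	uint32_t	probes[] = {'x', '7', ' ', 0x00E9 /* e-acute */};
	int			nedges = (int) (sizeof(ident_edges) / sizeof(ident_edges[0]));

	for (int i = 0; i < 4; i++)
		printf("U+%04X -> state %d\n", (unsigned) probes[i],
			   next_state(ident_edges, nedges, probes[i]));
	return 0;
}

That compactness is where the quoted claim about producing Unicode-aware lexers in a reasonable amount of time comes from.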