Have we out-grown Flex? - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Have we out-grown Flex?
Date
Msg-id CAEYLb_VN38OSY4Mhq9ZxsbdnsNoBzynHBjMNkNmN=bKMHYjaJQ@mail.gmail.com
Whole thread Raw
Responses Re: Have we out-grown Flex?
Re: Have we out-grown Flex?
Re: Have we out-grown Flex?
Re: Have we out-grown Flex?
Re: Have we out-grown Flex?
List pgsql-hackers
Quite apart from the practical difficulties that we have with Flex
(the fact that the authors are non-responsive and possibly retired,
that annoying compiler warning, and the fact that we are forced to
maintain our own Windows binaries of 2.5.35), it has some notable
disadvantages. I am aware that the MySQL people use their own
hand-coded lexical analyzer named sql_lex.cc, which provides a yacc
interface, while avoiding using any lexical analyzer generator
whatsoever. They can't have done this just for fun, and no doubt this
goes some way to explaining their continued performance advantage for
very simple queries. I have heard people complain about Postgres
parser overhead for "pgbench -S" style use-cases where it is
unreasonably high, and I've heard them do so more than once.

I suspect that we pay a high price for our table-based lexical
analyser's use of a table data structure to maintain transition
information, and the MySQL guys do not. It seems exceedingly likely
that our lexer is way more sophisticated than theirs, which likely
makes it infeasible to emulate their approach, and certainly makes it
unattractive. However, I think there might be a better way. There is
an LGPL-licensed project named Quex (hey, GNU bison is GPL), which can
generate C (and C++) code, and I am in contact with its primary
author, Frank-Rene Schäfer.

http://quex.sourceforge.net/

While it fits the bill as a replacement for Flex if ever we were
compelled to drop Flex, it also has a number of interesting advantages
over it that warrant further investigation. As Wikipedia puts it:

"Quex uses traditional steps of Thompson construction to create
non-deterministic finite automatons from regular expressions,
conversion to a deterministic finite automaton and then Hopcroft
optimization to reduce the number of states to a minimum. Those
mechanisms, though, have been adapted to deal with character sets
rather than single characters. By means of this the calculation time
can be significantly reduced. Since the Unicode character set consists
of many more code points than plain ASCII, those optimizations are
necessary in order to produce lexical analysers in a reasonable amount
of time."

Apparently, in practical terms, reductions of more than 1/2 and even
as high as 2/3 in total lexing time have been observed when switching
from Flex, and Quex uses the syntax of Flex for its description of
regular expressions, which makes a conversion research project seem
like a worthwhile undertaking. I am told that this will probably take
less than a month, and perhaps as little as two weeks given the broad
compatibility of Flex and Quex. This undertaking has the enthusiastic
support of Frank-Rene, which is in refreshing contrast to the complete
inaccessibility of the Flex developers and the dead silence on the
flex users' mailing list.

I'm certainly not suggesting that this isn't something that, in order
to be adopted, will have to be carefully considered from many angles,
of which the performance of the resulting lexer is only one, but at
the same time I dare say that this approach is the "low-hanging fruit"
of optimising for the use-case where parser overhead is unreasonably
high. Am I barking up the wrong tree? At the very least, hopefully
these simple observations will provoke thought - Flex/Lex is not the
only tool for generating yacc-compatible lexical analysers, and it may
not be the best one for our current needs.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


pgsql-hackers by date:

Previous
From: Hannu Krosing
Date:
Subject: Re: JSON in 9.2 - Could we have just one to_json() function instead of two separate versions ?
Next
From: Peter Geoghegan
Date:
Subject: Re: proposal: additional error fields