Re: scanner/parser minimization - Mailing list pgsql-hackers

From Greg Stark
Subject Re: scanner/parser minimization
Date
Msg-id CAM-w4HOECQzDCUUOjBfTtpW-LG+S-U2aOtYNy0PsO6eKqNAx_Q@mail.gmail.com
In response to Re: scanner/parser minimization  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: scanner/parser minimization  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Regarding yytransition, I think the problem is that we're using flex to
implement keyword recognition, which is not what it's usually used for.
Usually people use flex to handle lexical syntax such as quoting and
numeric formats, and flex treats all identifiers as equivalent.
Then the last step in the scanner for identifiers is to look up the
identifier in a hash table and return the keyword token if it's a
keyword. That would massively simplify the scanner tables.

This hash table can be heavily optimized because it's a static lookup.
cf. http://www.gnu.org/software/gperf/

In theory this is more expensive, since it needs to do a strcmp in
addition to scanning the identifier to determine where the token
ends. But I suspect that in practice the smaller tables might outweigh
that cost.
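
To make the idea concrete, here is a minimal sketch of the lookup step,
assuming the scanner has already matched a generic identifier. The token
codes, the three-entry keyword table, and the name classify_identifier
are all hypothetical, and a sorted table with bsearch() stands in for the
perfect hash that gperf would generate:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes; a real grammar would get these from bison. */
enum token { IDENT = 258, SELECT_KW, FROM_KW, WHERE_KW };

struct keyword
{
    const char *name;
    enum token  tok;
};

/* Illustrative table only. Must stay sorted by name for bsearch(). */
static const struct keyword keywords[] = {
    { "from",   FROM_KW },
    { "select", SELECT_KW },
    { "where",  WHERE_KW },
};

static int cmp_keyword(const void *key, const void *elem)
{
    return strcmp((const char *) key,
                  ((const struct keyword *) elem)->name);
}

/*
 * Called once the scanner has found the end of an identifier.
 * Returns the keyword's token if the text is a keyword, else IDENT.
 */
enum token classify_identifier(const char *text, size_t len)
{
    char        folded[64];
    const struct keyword *kw;
    size_t      i;

    assert(text != NULL);

    /* Anything longer than the buffer cannot be one of our keywords. */
    if (len >= sizeof(folded))
        return IDENT;

    /* Fold to lower case so that keyword matching is case-insensitive. */
    for (i = 0; i < len; i++)
        folded[i] = (char) tolower((unsigned char) text[i]);
    folded[len] = '\0';

    kw = bsearch(folded, keywords,
                 sizeof(keywords) / sizeof(keywords[0]),
                 sizeof(keywords[0]), cmp_keyword);
    return kw ? kw->tok : IDENT;
}
```

The flex rule for identifiers would then just return
classify_identifier(yytext, yyleng), so the scanner tables need only one
identifier pattern instead of one pattern per keyword.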


On Thu, Feb 28, 2013 at 9:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I believe however that it's possible to extract an idea of which
> tokens the parser believes it can see next at any given parse state.
> (I've seen code for this somewhere on the net, but am too lazy to go
> searching for it again right now.)  So we could imagine a rule along
> the lines of "if IDENT is allowed as a next token, and $KEYWORD is
> not, then return IDENT not the keyword's own token".

That's a pretty cool idea. I'm afraid it might be kind of slow to
produce that list and load it into a hash table for every ident token
though. I suppose if you can request it only if the ident is a keyword
of some type then it wouldn't actually kick in often.
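
A rough sketch of that rule, assuming the parser can hand back an array of
the tokens it would accept next. The token codes and the names
token_allowed and resolve_keyword are hypothetical, and how the expected
list is obtained from the parser is left out entirely:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token codes; a real grammar would get these from bison. */
enum token { IDENT = 258, SELECT_KW, FROM_KW, WHERE_KW };

/* Linear search of the list of tokens the parser would accept next. */
static bool token_allowed(const enum token *expected, size_t n,
                          enum token tok)
{
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (expected[i] == tok)
            return true;
    }
    return false;
}

/*
 * Meant to be called only when the identifier matched a keyword, so
 * the expected-token list is needed only in that (rarer) case.
 * If IDENT is allowed next and the keyword is not, demote to IDENT.
 */
enum token resolve_keyword(enum token kw, const enum token *expected,
                           size_t n)
{
    assert(expected != NULL || n == 0);

    if (kw != IDENT &&
        !token_allowed(expected, n, kw) &&
        token_allowed(expected, n, IDENT))
        return IDENT;
    return kw;
}
```

With only a handful of expected tokens per state, a linear scan like this
may be cheap enough that no per-token hash table is needed, though that
is a guess and not something I have measured.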

That would also imply we could simply use IDENT everywhere we
currently have col_name_keyword or type_function_name. Merely having
a rule that accepts a keyword at some point would implicitly stop that
keyword from being accepted as an unquoted IDENT at that location.
That might make the documentation a bit trickier, and it would make it
harder for users to make their code forward-compatible. It would also
make it harder for hackers to notice when we've accidentally narrowed
the set of allowable identifiers in more cases than they expect.

-- 
greg


