small Bison optimization: remove single character literal tokens - Mailing list pgsql-hackers

From John Naylor
Subject small Bison optimization: remove single character literal tokens
Date
Msg-id CACPNZCvG3rYfw-L-b9eaxtdDvgkOwz4U3TFriUVBOhKcuOBRkw@mail.gmail.com
Whole thread Raw
Responses Re: small Bison optimization: remove single character literal tokens
List pgsql-hackers
Hi all,

Traditionally, when Bison generates the header file, it starts
numbering named tokens at 258, so that numbers below 256 can be used
for character literals. Then, during parsing, the external token
number or character literal is mapped to Bison's internal token number
via the yytranslate[] array.

The newly-released Bison 3.5 has the option "%define api.token.raw",
which causes Bison to write out the same ("raw") token numbers it
would use internally, and thus skip building the yytranslate[] array
as well as the code to access it. To make use of this, there cannot be
any single character literals in the grammar, otherwise Bison will
refuse to build.

Attached is a draft patch to make the core grammar forward-compatible
with this option by using FULL_NAMES for all valid single character
tokens. Benchmarking raw parsing with "%define api.token.raw" enabled
shows ~1.7-1.9% improvement compared with not setting the option. Not
much, but doing one less array access per token reduces cache
pollution and saves a few kB of binary size as well.

It'll be years before Bison 3.5 is common in the wild, but not doing a
useless array access seems like an obvious thing to aspire towards, so
we may as well start now.

Despite the simplicity of the basic idea, there are some subtleties to consider:

Since the PL/pgSQL grammar shares the core scanner, all the single
char tokens had to change there as well. Likewise, if "%define
api.token.raw" is set in one grammar, it must be set in both. I've
added a comment to that effect.

For ECPG, it would work well enough to change over to named tokens
only for the single chars used in the core grammar, leaving just '{'
and '}' as ECPG-only literals. However, I thought it better to be
consistent in that regard, so I made these named tokens as well. This
adds a wrinkle in that ECPG now needs one {self} pattern to match its
own single character tokens and also a test for membership in the set
of {self} chars in the core scanner, needed for the {operator} rule.
It seems the simplest thing is to add '{' and '}' to the core's {self}
pattern, making the two cases the same. That would preclude ever using
those characters as operators, but I think it's unlikely anyone would
want to use them as such. If that's undesirable, that can be worked
around with additional complexity.

Previously, {dolqfailed} returned '$', which is not a valid token in
either the core or PL/pgSQL grammar that I can see. I don't see how
this can be anything but a syntax error, so I changed it to throw the
error within the scanner with "unterminated dollar quote" for the
message, although it doesn't seem all that helpful. I added a test for
this to demonstrate, but maybe there's a better place for it.

Maintenance-wise, it would be a bit tricky to enforce the new
convention as a "rule", since there is currently no easy way to test
if any new grammar additions add any single character literals unless
someone were to build with "%define api.token.raw". Perhaps that's not
a huge deal, but it's something to consider. More concerning is if
someone were to absent-mindedly add a rule to the scanner with "return
yytext[0]", since that would now be interpreted as the wrong token,
causing hard-to-diagnose bugs. There's no obvious way to catch that
independently from how Bison is configured. Maybe we could have the
Makefile grep the scanner for likely expressions that return a single
character.

The bulk of the patch was created mechanically, so comments referring
to single char literals have changed as well, leading to some
inconsistencies. I left it like this, pending bike-shedding.

Speaking of bike-shedding, I didn't give a whole lot of thought
towards the new token names, and just used what came to mind first.
OCTOTHORPE is probably better known as NUMBER_SIGN, for example.

I will add this to the next commitfest.

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: Improvement to psql's connection defaults
Next
From: Daniel Gustafsson
Date:
Subject: Re: Online checksums patch - once again