small Bison optimization: remove single character literal tokens - Mailing list pgsql-hackers
From | John Naylor |
---|---|
Subject | small Bison optimization: remove single character literal tokens |
Date | |
Msg-id | CACPNZCvG3rYfw-L-b9eaxtdDvgkOwz4U3TFriUVBOhKcuOBRkw@mail.gmail.com Whole thread Raw |
Responses |
Re: small Bison optimization: remove single character literal tokens
|
List | pgsql-hackers |
Hi all, Traditionally, when Bison generates the header file, it starts numbering named tokens at 258, so that numbers below 256 can be used for character literals. Then, during parsing, the external token number or character literal is mapped to Bison's internal token number via the yytranslate[] array. The newly-released Bison 3.5 has the option "%define api.token.raw", which causes Bison to write out the same ("raw") token numbers it would use internally, and thus skip building the yytranslate[] array as well as the code to access it. To make use of this, there cannot be any single character literals in the grammar, otherwise Bison will refuse to build. Attached is a draft patch to make the core grammar forward-compatible with this option by using FULL_NAMES for all valid single character tokens. Benchmarking raw parsing with "%define api.token.raw" enabled shows ~1.7-1.9% improvement compared with not setting the option. Not much, but doing one less array access per token reduces cache pollution and saves a few kB of binary size as well. It'll be years before Bison 3.5 is common in the wild, but not doing a useless array access seems like an obvious thing to aspire towards, so we may as well start now. Despite the simplicity of the basic idea, there are some subtleties to consider: Since the PL/pgSQL grammar shares the core scanner, all the single char tokens had to change there as well. Likewise, if "%define api.token.raw" is set in one grammar, it must be set in both. I've added a comment to that effect. For ECPG, it would work well enough to change over to named tokens only for the single chars used in the core grammar, leaving just '{' and '}' as ECPG-only literals. However, I thought it better to be consistent in that regard, so I made these named tokens as well. This adds a wrinkle in that ECPG now needs one {self} pattern to match its own single character tokens and also a test for membership in the set of {self} chars in the core scanner, needed for the {operator} rule. It seems the simplest thing is to add '{' and '}' to the core's {self} pattern, making the two cases the same. That would preclude ever using those characters as operators, but I think it's unlikely anyone would want to use them as such. If that's undesirable, that can be worked around with additional complexity. Previously, {dolqfailed} returned '$', which is not a valid token in either the core or PL/pgSQL grammar that I can see. I don't see how this can be anything but a syntax error, so I changed it to throw the error within the scanner with "unterminated dollar quote" for the message, although it doesn't seem all that helpful. I added a test for this to demonstrate, but maybe there's a better place for it. Maintenance-wise, it would be a bit tricky to enforce the new convention as a "rule", since there is currently no easy way to test if any new grammar additions add any single character literals unless someone were to build with "%define api.token.raw". Perhaps that's not a huge deal, but it's something to consider. More concerning is if someone were to absent-mindedly add a rule to the scanner with "return yytext[0]", since that would now be interpreted as the wrong token, causing hard-to-diagnose bugs. There's no obvious way to catch that independently from how Bison is configured. Maybe we could have the Makefile grep the scanner for likely expressions that return a single character. The bulk of the patch was created mechanically, so comments referring to single char literals have changed as well, leading to some inconsistencies. I left it like this, pending bike-shedding. Speaking of bike-shedding, I didn't give a whole lot of thought towards the new token names, and just used what came to mind first. OCTOTHORPE is probably better known as NUMBER_SIGN, for example. I will add this to the next commitfest. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
pgsql-hackers by date: