Hook for extensible parsing. - Mailing list pgsql-hackers
From | Julien Rouhaud |
---|---|
Subject | Hook for extensible parsing. |
Date | |
Msg-id | 20210501072458.adqjoaqnmhg4l34l@nol |
List | pgsql-hackers |
Hi,

Being able to extend the core parser has been requested multiple times, and AFAICT all previous attempts were rejected not because this isn't wanted but because the proposed implementations required plugins to reimplement all of the core grammar with their own changes, as bison-generated parsers aren't extensible.

I'd like to propose an alternative approach, which is to allow multiple parsers to coexist, and let third-party parsers optionally fall back on the core parser. I'm sending this now as a follow-up of [1] and to avoid duplicated efforts, as multiple people are interested in that topic.

Obviously, since this is only about parsing, all modules can only implement some kind of syntactic sugar, as they have to produce valid parsetrees, but this could be a first step to later allow custom nodes and let plugins implement e.g. new UTILITY commands.

So, this approach should allow different custom parser implementations:

1. implement only a few new commands on top of the core grammar. For instance, an extension could add support for CREATE [PHYSICAL | LOGICAL] REPLICATION SLOT and rewrite that to a SelectStmt on top of the existing function, or add a CREATE HYPOTHETICAL INDEX, which would internally add a new option in IndexStmt->options, to be intercepted in ProcessUtility and bypass its execution with the extension approach instead.

2. implement a totally different grammar for a different language. In case of error, just silently fall back to the core parser (or another hook) so both parsers can still be used. Any language could be parsed as long as you can produce a valid postgres parsetree.

3. implement a superset of the core grammar and replace the core parser entirely. This could arguably be done like the 1st case, but the idea is to avoid possibly parsing the same input string twice, or to forbid the core parser if that's somehow wanted.

I'm attaching some POC patches that implement this approach to start a discussion.
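To make the fallback idea in cases 1 and 2 concrete, here is a toy model in Python. It is not PostgreSQL code and not what the patches do internally: `core_parser`, `slot_parser` and `parse` are hypothetical names, parse trees are reduced to tagged tuples, and the exact `CREATE PHYSICAL REPLICATION SLOT <name>` syntax is an assumption made for the example. The only real identifier is the existing function `pg_create_physical_replication_slot`.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a raw parse tree: (node tag, SQL text).
ParseTree = Tuple[str, str]
Hook = Callable[[str], Optional[List[ParseTree]]]


def core_parser(query: str) -> List[ParseTree]:
    """Stand-in for the core grammar: only accepts 'SELECT ...'."""
    words = query.split()
    if not words or words[0].upper() != "SELECT":
        raise SyntaxError("syntax error (core grammar)")
    return [("SelectStmt", query.strip())]


def slot_parser(query: str) -> Optional[List[ParseTree]]:
    """Case 1: recognise one extra command and rewrite it as a SELECT.

    Returns None when the input is not for this parser, so that the
    caller can silently fall back to another hook or the core parser.
    """
    words = query.strip().rstrip(";").split()
    prefix = [w.upper() for w in words[:4]]
    if prefix == ["CREATE", "PHYSICAL", "REPLICATION", "SLOT"] and len(words) == 5:
        # Toy rewrite only: no quoting or escaping of the slot name.
        sql = f"SELECT pg_create_physical_replication_slot('{words[4]}')"
        return [("SelectStmt", sql)]
    return None


def parse(query: str, hooks: List[Hook]) -> List[ParseTree]:
    """Try each custom parser in turn, then fall back to the core one."""
    for hook in hooks:
        tree = hook(query)
        if tree is not None:
            return tree
    return core_parser(query)
```

Note that in this model a query the custom parser doesn't recognise is parsed twice (once by the hook, once by the core parser), which is exactly the overhead the 3rd case tries to avoid.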
I split the infrastructure part in 2 patches to make it easier to review, and I'm also adding 2 other patches with a small parser implementation to be able to test the infrastructure. Here are some more details on the patches and implementation details:

0001 simply adds a parser hook, which is called instead of raw_parser. This is enough to make multiple parsers coexist with one exception: multi-statement query strings. If multiple statements are provided, then all of them will be parsed using the same grammar, which obviously won't work if they are written for different grammars.

0002 implements a lame "sqlol" parser, based on LOLCODE syntax, with only the ability to produce a "select [col, ] col FROM table" parsetree, for testing purposes. I chose it to ensure that everything works properly even with a totally different grammar that has different keywords, and which doesn't even end statements with a semicolon but with a plain keyword.

0003 is where the real modifications are done to allow a multi-statement string to be parsed using different grammars. It implements a new MODE_SINGLE_QUERY mode, which is used when a parser_hook is present. In that case, pg_parse_query() will only parse part of the query string and loop until everything is parsed (or some error happens).

pg_parse_query() will instruct plugins to parse one query at a time. They're free to ignore that mode if they want to implement the 3rd mode. If so, they should either return multiple RawStmt, a single RawStmt with a stmt_len of 0 or strlen(query_string), or error out. Otherwise, they will implement either mode 1 or 2, and they should always return a List containing a single RawStmt with a properly set stmt_len, even if the underlying statement is NULL. This is required to properly skip valid strings that don't contain a statement, and pg_parse_query() will skip any RawStmt that doesn't contain an underlying statement.
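The MODE_SINGLE_QUERY contract described for 0003 can be sketched as the following toy loop, again in Python rather than C. This is only a model of the behaviour described above, not the patch itself: `parse_one` and `parse_query` are hypothetical names, the `RawStmt` class only mirrors the fields mentioned here, and the "a statement ends at the next semicolon" rule is an assumption standing in for a real grammar.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RawStmt:
    stmt: Optional[str]  # None models "valid input, but no statement"
    stmt_location: int   # offset of the statement in the whole string
    stmt_len: int        # number of characters consumed, ';' included


def parse_one(query: str, offset: int) -> List[RawStmt]:
    """Toy single-query parser: a statement ends at the next ';'.

    The whole string is passed in, plus the offset where parsing
    starts, so that locations stay relative to the original input.
    """
    end = query.find(";", offset)
    end = len(query) if end == -1 else end + 1  # keep the trailing ';'
    text = query[offset:end].rstrip(";").strip()
    return [RawStmt(text or None, offset, end - offset)]


def parse_query(query: str) -> List[RawStmt]:
    """Model of the loop: parse one statement at a time until done."""
    result: List[RawStmt] = []
    offset = 0
    while offset < len(query):
        stmts = parse_one(query, offset)
        # A parser implementing the 3rd mode may ignore single-query
        # mode: several statements, or one covering the whole string.
        if len(stmts) > 1 or stmts[0].stmt_len in (0, len(query)):
            return result + [s for s in stmts if s.stmt is not None]
        raw = stmts[0]
        if raw.stmt is not None:  # skip input with no statement in it
            result.append(raw)
        offset += raw.stmt_len    # resume right after what was consumed
    return result
```

In this model a real implementation would call the parser hook (with a fallback to the core parser) where `parse_one` is called, so each statement of a multi-statement string could be handled by a different grammar.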
0003 also teaches the core parser to do the same, by optionally starting to parse somewhere in the input string and stopping once a valid statement is found.

Note that the whole input string is provided to the parsers in order to report the correct cursor position, so all tokens can get a correct location. This means that the raw_parser() signature needs an additional offset to know where the parsing should start.

Finally, 0004 modifies the sqlol parser to implement the MODE_SINGLE_QUERY mode, adds grammar for creating views and adds some regression tests to validate proper parsing and error location reporting with multi-statement input strings.

As far as I can tell it's all working as expected but I may have missed some use cases. The regression tests still work with the additional parser configured. The only difference is for pg_stat_statements, as in MODE_SINGLE_QUERY the trailing semicolon has to be included in the statement, since other grammars may understand semicolons differently.

The obvious drawback is that it can cause overhead, as the same input can be parsed multiple times. This could be avoided with plugins implementing a GUC to enable/disable their parser, so that it's only active by default for some users/databases, or has to be enabled interactively by the client app.

Also, the error messages can be unhelpful for cases 1 and 2. If the custom parser doesn't error out, the syntax errors will be raised by the core parser based on the core grammar, which will likely point out an unrelated problem. Some of that can be avoided by letting the custom parsers raise errors when they know for sure that they're parsing what they're supposed to parse (there's an example of that in the sqlol parser for qualified_name parsing, as it can only happen once some specific keywords have already matched). For the rest of the errors, the only option I can think of is another GUC to let custom parsers always raise an error (or raise a warning) to help people debug their queries.
I'll park this patch in the next commitfest so it can be discussed when pg15 development starts.

[1]: https://www.postgresql.org/message-id/20210315164336.ak32whndsxna5mjf@nol