Hook for extensible parsing. - Mailing list pgsql-hackers

From Julien Rouhaud
Subject Hook for extensible parsing.
Msg-id 20210501072458.adqjoaqnmhg4l34l@nol
List pgsql-hackers
Hi,

Being able to extend the core parser has been requested multiple times, and
AFAICT all previous attempts were rejected not because this isn't wanted but
because the proposed implementations required plugins to reimplement all of the
core grammar with their own changes, as bison-generated parsers aren't
extensible.

I'd like to propose an alternative approach, which is to allow multiple parsers
to coexist, and let third-party parsers optionally fall back on the core
parser.  I'm sending this now as a follow-up to [1] and to avoid duplicated
effort, as multiple people are interested in that topic.

Obviously, since this is only about parsing, such modules can only implement
some kind of syntactic sugar, as they have to produce valid parsetrees, but
this could be a first step to later allow custom nodes and let plugins
implement e.g. new UTILITY commands.

So, this approach should allow different custom parser implementations:

1 implement only a few new commands on top of the core grammar.  For instance,
  an extension could add support for CREATE [PHYSICAL | LOGICAL] REPLICATION
  SLOT and rewrite that to a SelectStmt on top of the existing function, or add
  a CREATE HYPOTHETICAL INDEX, which would internally add a new option in
  IndexStmt->options, to be intercepted in ProcessUtility so that the extension
  handles the statement instead of the normal execution.

2 implement a totally different grammar for a different language.  In case of
  error, just silently fall back to the core parser (or another hook) so both
  parsers can still be used.  Any language could be parsed as long as you can
  produce a valid postgres parsetree.

3 implement a superset of the core grammar and replace the core parser
  entirely.  This could arguably be done like the 1st case, but the idea is to
  avoid possibly parsing the same input string twice, or to forbid the core
  parser if that's somehow wanted.


I'm attaching some POC patches that implement this approach to start a
discussion.  I split the infrastructure part into 2 patches to make it easier
to review, and I'm also adding 2 other patches with a small parser
implementation to be able to test the infrastructure.  Here are some more
details on the patches and their implementation:

0001 simply adds a parser hook, which is called instead of raw_parser.  This is
enough to make multiple parsers coexist with one exception: multi-statement
query strings.  If multiple statements are provided, then all of them will be
parsed using the same grammar, which obviously won't work if they are written
for different grammars.
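
To illustrate the chaining idea behind 0001 (and the silent fallback of case
2), here is a minimal standalone sketch.  It is not the code from the patch:
every name in it (ToyStmt, toy_parser_hook, toy_sqlol_parser, ...) is
hypothetical, and the "parsers" only look at the first keyword.  It assumes a
hook that reports failure through its return value, whereas a real plugin
would have to deal with the backend's error handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a raw parse tree (a real hook would return RawStmt nodes). */
typedef struct ToyStmt
{
    const char *parsed_by;      /* which parser accepted the input */
    const char *text;           /* the query string that was accepted */
} ToyStmt;

/* Hypothetical hook signature: returns 1 on success, 0 on syntax error. */
typedef int (*toy_parser_hook_type) (const char *query, ToyStmt *result);

/* Stand-in for the core parser: accepts anything starting with "select". */
static int
toy_core_parser(const char *query, ToyStmt *result)
{
    if (strncmp(query, "select", 6) != 0)
        return 0;
    result->parsed_by = "core";
    result->text = query;
    return 1;
}

/* NULL means that no plugin is installed. */
static toy_parser_hook_type toy_parser_hook = NULL;

/* Hook that was installed before the plugin was loaded, if any. */
static toy_parser_hook_type prev_parser_hook = NULL;

/* Plugin parser: accepts "HAI ...", otherwise silently falls back. */
static int
toy_sqlol_parser(const char *query, ToyStmt *result)
{
    if (strncmp(query, "HAI", 3) == 0)
    {
        result->parsed_by = "sqlol";
        result->text = query;
        return 1;
    }
    if (prev_parser_hook)
        return prev_parser_hook(query, result);
    return toy_core_parser(query, result);
}

/* What a module load function would do: chain onto the existing hook. */
static void
toy_sqlol_install(void)
{
    prev_parser_hook = toy_parser_hook;
    toy_parser_hook = toy_sqlol_parser;
}

/* Entry point: the hook, when set, is called instead of the core parser. */
static int
toy_parse(const char *query, ToyStmt *result)
{
    assert(query != NULL && result != NULL);
    if (toy_parser_hook)
        return toy_parser_hook(query, result);
    return toy_core_parser(query, result);
}
```

Once toy_sqlol_install() has run, toy_parse() accepts both "HAI ..." and
"select ..." input, each handled by its own parser, but a single call still
uses one grammar for the whole string, which is the multi-statement limitation
described above.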

0002 implements a lame "sqlol" parser, based on LOLCODE syntax, with only the
ability to produce a "select [col, ] col FROM table" parsetree, for testing
purposes.  I chose it to ensure that everything works properly even with a
totally different grammar that has different keywords and doesn't even end
statements with a semicolon but with a plain keyword.

0003 is where the real modifications are done to allow multi-statement strings
to be parsed using different grammars.  It implements a new MODE_SINGLE_QUERY
mode, which is used when a parser_hook is present.  In that case,
pg_parse_query() will only parse part of the query string and loop until
everything is parsed (or some error happens).

pg_parse_query() will instruct plugins to parse one query at a time.  They're
free to ignore that mode if they want to implement the 3rd case.  If so, they
should either return multiple RawStmt, a single RawStmt with a stmt_len of 0 or
strlen(query_string), or error out.  Otherwise, they will implement either case
1 or 2, and they should always return a List containing a single RawStmt with a
properly set stmt_len, even if the underlying statement is NULL.  This is
required to properly skip valid strings that don't contain a statement, and
pg_parse_query() will skip any RawStmt that doesn't contain an underlying
statement.

It also teaches the core parser to do the same, by optionally starting to parse
somewhere in the input string and stopping once a valid statement is found.

Note that the whole input string is provided to the parsers in order to report
the correct cursor position, so all tokens can get a correct location.  This
means that the raw_parser() signature needs an additional offset to know where
the parsing should start.
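
The loop described for 0003 can be modelled roughly as follows.  This is a
standalone sketch under simplifying assumptions, not the patch's
pg_parse_query(): ToyRawStmt, toy_parse_one() and toy_parse_all() are
hypothetical names, the only grammar rule is "a statement ends at a semicolon
or at end of input", and whitespace-only text stands in for a RawStmt whose
underlying statement is NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for RawStmt: only the location fields matter here. */
typedef struct ToyRawStmt
{
    int         has_stmt;       /* 0 models a NULL underlying statement */
    size_t      stmt_location;  /* start offset within the whole input */
    size_t      stmt_len;       /* bytes consumed, terminator included */
} ToyRawStmt;

/*
 * Toy single-query parser.  It receives the whole string plus a start
 * offset, so every location it reports is relative to the full input.
 */
static ToyRawStmt
toy_parse_one(const char *query, size_t offset)
{
    ToyRawStmt  stmt;
    size_t      pos = offset;

    stmt.has_stmt = 0;
    stmt.stmt_location = offset;

    while (query[pos] != '\0' && query[pos] != ';')
    {
        if (query[pos] != ' ' && query[pos] != '\n')
            stmt.has_stmt = 1;  /* found something other than whitespace */
        pos++;
    }
    if (query[pos] == ';')
        pos++;                  /* the semicolon is part of the statement */

    stmt.stmt_len = pos - offset;
    return stmt;
}

/*
 * Driver loop: parse one statement at a time, skip empty ones, and advance
 * by stmt_len until the whole input is consumed.  Returns the number of
 * statements stored in "out".
 */
static size_t
toy_parse_all(const char *query, ToyRawStmt *out, size_t max_out)
{
    size_t      total = strlen(query);
    size_t      offset = 0;
    size_t      n = 0;

    while (offset < total && n < max_out)
    {
        ToyRawStmt  stmt = toy_parse_one(query, offset);

        assert(stmt.stmt_len > 0);  /* each call must make progress */
        if (stmt.has_stmt)
            out[n++] = stmt;
        offset += stmt.stmt_len;
    }
    return n;
}
```

In this model a different parser could be picked on each iteration, which is
what allows statements written for different grammars to share one input
string, and the absolute offsets are what keep error positions meaningful.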

Finally, 0004 modifies the sqlol parser to implement the MODE_SINGLE_QUERY
mode, adds grammar for creating views and adds some regression tests to
validate proper parsing and error location reporting with multi-statement input
strings.

As far as I can tell it's all working as expected but I may have missed some
use cases.  The regression tests still work with the additional parser
configured.  The only difference is for pg_stat_statements, as in
MODE_SINGLE_QUERY the trailing semicolon has to be included in the statement,
since other grammars may understand semicolons differently.

The obvious drawback is that it can cause overhead as the same input can be
parsed multiple times.  This could be avoided with plugins implementing a GUC
to enable/disable their parser, so that it's only active by default for some
users/databases, or has to be enabled interactively by the client app.
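
As a rough illustration of that GUC idea, the standalone sketch below gates a
toy custom parser behind a boolean flag.  All names are hypothetical and the
flag is a plain variable; a real extension would presumably register it from
its load function with DefineCustomBoolVariable().  The counter only exists to
make the avoided work visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a boolean GUC owned by the plugin; off by default. */
static bool toy_sqlol_enabled = false;

/* Counts how many times the custom grammar actually ran. */
static int  toy_custom_attempts = 0;

/* Stand-in for the core parser: accepts anything starting with "select". */
static bool
toy_core_accepts(const char *query)
{
    return strncmp(query, "select", 6) == 0;
}

/* Stand-in for the custom parser: accepts anything starting with "HAI". */
static bool
toy_custom_accepts(const char *query)
{
    toy_custom_attempts++;
    return strncmp(query, "HAI", 3) == 0;
}

/*
 * Gated hook: when the flag is off the custom grammar is never tried, so
 * the input is parsed only once, by the core parser.
 */
static bool
toy_gated_parse(const char *query)
{
    assert(query != NULL);
    if (toy_sqlol_enabled && toy_custom_accepts(query))
        return true;
    return toy_core_accepts(query);
}
```

With the flag off, core queries pay no extra parsing cost; with it on, a core
query that the custom grammar rejects is still parsed twice, which is the
overhead mentioned above.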

Also, the error messages can be unhelpful for cases 1 and 2.  If the custom
parser doesn't error out, the syntax errors will be raised by the core parser
based on the core grammar, which will likely point out an unrelated problem.
Some of that can be avoided by letting the custom parsers raise errors when
they know for sure they're parsing what they're supposed to parse (there's an
example of that in the sqlol parser for qualified_name parsing, as it can only
happen once some specific keywords have already matched).  For the rest of the
errors, the only option I can think of is another GUC to let custom parsers
always raise an error (or raise a warning) to help people debug their queries.

I'll park this patch in the next commitfest so it can be discussed when pg15
development starts.

[1]: https://www.postgresql.org/message-id/20210315164336.ak32whndsxna5mjf@nol
