Thread: Hook for extensible parsing.
Hi,

Being able to extend the core parser has been requested multiple times, and AFAICT all previous attempts were rejected not because this isn't wanted, but because the proposed implementations required plugins to reimplement all of the core grammar with their own changes, as bison-generated parsers aren't extensible.

I'd like to propose an alternative approach, which is to allow multiple parsers to coexist, and let third-party parsers optionally fall back on the core parser. I'm sending this now as a follow-up of [1] and to avoid duplicated effort, as multiple people are interested in that topic.

Obviously, since this is only about parsing, all modules can only implement some kind of syntactic sugar, as they have to produce valid parsetrees, but this could be a first step to later allow custom nodes and let plugins implement e.g. new UTILITY commands.

So, this approach should allow different custom parser implementations:

1) Implement only a few new commands on top of the core grammar. For instance, an extension could add support for CREATE [PHYSICAL | LOGICAL] REPLICATION SLOT and rewrite that to a SelectStmt on top of the existing function, or add a CREATE HYPOTHETICAL INDEX, which would internally add a new option in IndexStmt->options, to be intercepted in processUtility so the extension can bypass its execution and handle it itself.

2) Implement a totally different grammar for a different language. In case of error, just silently fall back to the core parser (or another hook) so both parsers can still be used. Any language could be parsed as long as you can produce a valid postgres parsetree.

3) Implement a superset of the core grammar and replace the core parser entirely. This could arguably be done like the 1st case, but the idea is to avoid possibly parsing the same input string twice, or to forbid the core parser if that's somehow wanted.

I'm attaching some POC patches that implement this approach to start a discussion. I split the infrastructure part into 2 patches to make it easier to review, and I'm also adding 2 other patches with a small parser implementation to be able to test the infrastructure. Here are some more details on the patches and implementation:

0001 simply adds a parser hook, which is called instead of raw_parser. This is enough to make multiple parsers coexist, with one exception: multi-statement query strings. If multiple statements are provided, then all of them will be parsed using the same grammar, which obviously won't work if they are written for different grammars.

0002 implements a lame "sqlol" parser, based on LOLCODE syntax, with only the ability to produce "SELECT [col, ] col FROM table" parsetrees, for testing purposes. I chose it to ensure that everything works properly even with a totally different grammar that has different keywords and doesn't even end statements with a semicolon but with a plain keyword.

0003 is where the real modifications are done to allow a multi-statement string to be parsed using different grammars. It implements a new MODE_SINGLE_QUERY mode, which is used when a parser_hook is present. In that case, pg_parse_query() will only parse part of the query string and loop until everything is parsed (or some error happens). pg_parse_query() will instruct plugins to parse one query at a time. They're free to ignore that mode if they want to implement the 3rd case above; if so, they should either return multiple RawStmt, a single RawStmt with a stmt_len of 0 or strlen(query_string), or error out.

Otherwise, they will implement either case 1 or 2, and they should always return a List containing a single RawStmt with a properly set stmt_len, even if the underlying statement is NULL. This is required to properly skip valid strings that don't contain a statement, and pg_parse_query() will skip any RawStmt that doesn't contain an underlying statement. It also teaches the core parser to do the same, by optionally starting the parse somewhere in the input string and stopping once a valid statement is found. Note that the whole input string is provided to the parsers in order to report correct cursor positions, so all tokens can get a correct location. This means that raw_parser()'s signature needs an additional offset to indicate where parsing should start.

Finally, 0004 modifies the sqlol parser to implement the MODE_SINGLE_QUERY mode, adds grammar for creating views, and adds some regression tests to validate proper parsing and error location reporting with multi-statement input strings.

As far as I can tell it's all working as expected, but I may have missed some use cases. The regression tests still pass with the additional parser configured. The only difference is for pg_stat_statements, as in MODE_SINGLE_QUERY the trailing semicolon has to be included in the statement, since other grammars may treat semicolons differently.

The obvious drawback is that this can cause overhead, as the same input can be parsed multiple times. This could be avoided by plugins providing a GUC to enable/disable their parser, so it's only active by default for some users/databases, or has to be enabled interactively by the client application.

Also, the error messages can be unhelpful for cases 1 and 2. If the custom parser doesn't error out, syntax errors will be raised by the core parser based on the core grammar, which will likely point at an unrelated problem. Some of that can be avoided by letting custom parsers raise errors once they know for sure they're parsing what they're supposed to parse (there's an example of that in the sqlol parser for qualified_name parsing, as it can only be reached once some specific keywords have already matched). For the rest of the errors, the only option I can think of is another GUC to let custom parsers always raise an error (or a warning) to help people debug their queries.

I'll park this patch in the next commitfest so it can be discussed when pg15 development starts.

[1]: https://www.postgresql.org/message-id/20210315164336.ak32whndsxna5mjf@nol
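To make the proposal a bit more concrete, here is a minimal sketch of what installing such a hook from an extension could look like. The hook name comes from the patches, but the exact type and signature (raw_parser()-like, plus the start offset added by 0003) are my reading of the description above rather than a committed API, and my_try_parse() is a hypothetical helper:

#include "postgres.h"

#include "fmgr.h"
#include "nodes/pg_list.h"
#include "parser/parser.h"

PG_MODULE_MAGIC;

/* Assumed hook shape: like raw_parser(), plus the start offset added by 0003. */
typedef List *(*parser_hook_type) (const char *str, RawParseMode mode, int start_offset);

extern PGDLLIMPORT parser_hook_type parser_hook;	/* provided by 0001 (assumed) */

static parser_hook_type prev_parser_hook = NULL;

/*
 * Hypothetical placeholder for the extension's own grammar; a real module
 * would run its bison/flex (or hand-written) parser here and return NIL
 * when the input doesn't match.
 */
static List *
my_try_parse(const char *str, int start_offset)
{
	return NIL;
}

static List *
my_parser_hook(const char *str, RawParseMode mode, int start_offset)
{
	/* Try the extension's grammar first. */
	List	   *result = my_try_parse(str, start_offset);

	if (result != NIL)
		return result;

	/* Fall back to the previous hook, or to the core grammar. */
	if (prev_parser_hook)
		return prev_parser_hook(str, mode, start_offset);
	return raw_parser(str, mode);		/* offset handling omitted in this sketch */
}

void
_PG_init(void)
{
	prev_parser_hook = parser_hook;
	parser_hook = my_parser_hook;
}

The important part is the fallback: an extension that only adds syntactic sugar (case 1 or 2) simply returns whatever the previous hook or the core parser produces when its own grammar doesn't match.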
Attachment
On Sat, May 01, 2021 at 03:24:58PM +0800, Julien Rouhaud wrote:
>
> I'm attaching some POC patches that implement this approach to start a discussion.

I just noticed that the cfbot fails with the v1 patch. Attached v2 that should fix that.
Attachment
On Sun, Jun 06, 2021 at 02:50:19PM +0800, Julien Rouhaud wrote:
> On Sat, May 01, 2021 at 03:24:58PM +0800, Julien Rouhaud wrote:
> >
> > I'm attaching some POC patches that implement this approach to start a discussion.
>
> I just noticed that the cfbot fails with the v1 patch. Attached v2 that should fix that.

The cfbot then revealed a missing dependency in the makefile used to generate the contrib parser, which makes "make check-world" fail without a previous "make -C contrib". Thanks a lot to Thomas Munro for getting me the logfile from the failed cfbot run and the fix!
Attachment
On Tue, Jun 08, 2021 at 12:16:48PM +0800, Julien Rouhaud wrote:
> On Sun, Jun 06, 2021 at 02:50:19PM +0800, Julien Rouhaud wrote:
> > On Sat, May 01, 2021 at 03:24:58PM +0800, Julien Rouhaud wrote:
> > >
> > > I'm attaching some POC patches that implement this approach to start a discussion.

The regression tests weren't stable, v4 fixes that.
Attachment
On Sat, Jun 12, 2021 at 4:29 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> I'd like to propose an alternative approach, which is to allow multiple parsers to coexist, and let third-party parsers optionally fallback on the core parsers. I'm sending this now as a follow-up of [1] and to avoid duplicated efforts, as multiple people are interested in that topic.

The patches all build properly and pass all regression tests.

> pg_parse_query() will instruct plugins to parse a query at a time. They're free to ignore that mode if they want to implement the 3rd mode. If so, they should either return multiple RawStmt, a single RawStmt with a 0 or strlen(query_string) stmt_len, or error out. Otherwise, they will implement either mode 1 or 2, and they should always return a List containing a single RawStmt with properly set stmt_len, even if the underlying statement is NULL. This is required to properly skip valid strings that don't contain a statement, and pg_parse_query() will skip RawStmt that don't contain an underlying statement.

Wouldn't we want to only loop through the individual statements if parser_hook exists? The current patch seems to go through the new code path regardless of the hook being grabbed.
Thanks for the review Jim!

On Wed, Jul 7, 2021 at 3:26 AM Jim Mlodgenski <jimmy76@gmail.com> wrote:
>
> On Sat, Jun 12, 2021 at 4:29 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> The patches all build properly and pass all regression tests.

Note that the cfbot reports a compilation error on Windows. That's in the grammar extension part, so I'm not really interested in trying to fix it for now, as it's mostly a quick POC to demonstrate how one could implement a different grammar and validate that everything works as expected. Also, if this patch is eventually committed and having some code to exercise the hook is wanted, it would probably be better to have a very naive parser (based on a few strcmp() calls or something like that) to validate the behavior rather than a real parser.

> > pg_parse_query() will instruct plugins to parse a query at a time. They're free to ignore that mode if they want to implement the 3rd mode. If so, they should either return multiple RawStmt, a single RawStmt with a 0 or strlen(query_string) stmt_len, or error out. Otherwise, they will implement either mode 1 or 2, and they should always return a List containing a single RawStmt with properly set stmt_len, even if the underlying statement is NULL. This is required to properly skip valid strings that don't contain a statement, and pg_parse_query() will skip RawStmt that don't contain an underlying statement.
>
> Wouldn't we want to only loop through the individual statements if parser_hook exists? The current patch seems to go through the new code path regardless of the hook being grabbed.

I did think about it, but I eventually chose to write it this way. Having a different code path for the no-hook situation won't make the with-hook code any easier (it would only remove a check for the hook in a few places that have 2 or 3 other checks already). On the other hand, having a single code path avoids some (minimal) code duplication, and also ensures that the main loop is actively tested even without the hook being set. That's not 100% coverage, but it's better than nothing. Performance-wise, it shouldn't make any noticeable difference for the no-hook case.
On Wed, Jul 7, 2021 at 5:26 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Also, if this patch is eventually committed and having some code to exercise the hook is wanted, it would probably be better to have a very naive parser (based on a few strcmp() calls or something like that) to validate the behavior rather than a real parser.
>

The test module is very useful to show how to use the hook, but it isn't very useful to the general user like most other things in contrib. It probably fits better in src/test/modules.
On Wed, Jul 7, 2021 at 8:45 PM Jim Mlodgenski <jimmy76@gmail.com> wrote:
>
> The test module is very useful to show how to use the hook, but it isn't very useful to the general user like most other things in contrib. It probably fits better in src/test/modules.

I agree that it's not useful at all to eventually have it as a contrib module, but it's somewhat convenient at this stage to be able to easily test the hook, possibly with different behavior. But as I said, if there's an agreement on the approach and the implementation, I don't think that it would make sense to keep it even in src/test/modules. A full bison parser, even with a limited grammar, will be about 99% noise when it comes to demonstrating how the hook is supposed to work, which basically is having a "single query" parser or a "full input string" parser. I'm not even convinced that flex/bison will be the preferred choice for someone who wants to implement a custom parser.

I tried to add really thorough comments in the various parts of the patch to make it clear how to do that and how the system will react depending on what a hook does. I also added some protection to catch inconsistent hook implementations. I think that's the best way to help external parser authors implement what they want, and I'll be happy to improve the comments if necessary. But if eventually people would like to have a real parser in the tree, for testing or guidance, I will of course take care of doing the required changes and moving the demo parser to src/test/modules.
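As a rough illustration of the kind of naive, strcmp()-based test parser mentioned above, something along these lines would be enough to exercise the hook without a flex/bison grammar. The hook type is the same assumed shape as in the earlier sketch, and the MEOW command is made up for this example; everything else uses the ordinary raw_parser() API:

#include "postgres.h"

#include "nodes/pg_list.h"
#include "parser/parser.h"

/* Assumed hook shape, as in the earlier sketch. */
typedef List *(*parser_hook_type) (const char *str, RawParseMode mode, int start_offset);

/*
 * Toy parser: recognize the made-up command "MEOW" and rewrite it to a
 * plain SELECT, otherwise hand the string back to the core grammar.
 * Multi-statement strings and offsets are ignored to keep the sketch short.
 */
static List *
naive_test_parser(const char *str, RawParseMode mode, int start_offset)
{
	if (pg_strcasecmp(str, "MEOW") == 0 || pg_strcasecmp(str, "MEOW;") == 0)
		return raw_parser("SELECT 'meow'", RAW_PARSE_DEFAULT);

	return raw_parser(str, mode);
}

Such a hook would make it easy to check the fallback and multi-statement behavior from a regression test without maintaining a second grammar.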
On Sat, Jun 12, 2021 at 1:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Tue, Jun 08, 2021 at 12:16:48PM +0800, Julien Rouhaud wrote:
> > On Sun, Jun 06, 2021 at 02:50:19PM +0800, Julien Rouhaud wrote:
> > > On Sat, May 01, 2021 at 03:24:58PM +0800, Julien Rouhaud wrote:
> > > >
> > > > I'm attaching some POC patches that implement this approach to start a discussion.
>
> The regression tests weren't stable, v4 fixes that.

1) CFBOT showed the following compilation errors in windows:

"C:\projects\postgresql\pgsql.sln" (default target) (1) ->
"C:\projects\postgresql\sqlol.vcxproj" (default target) (69) ->
(ClCompile target) ->
  c1 : fatal error C1083: Cannot open source file: 'contrib/sqlol/sqlol_gram.c': No such file or directory [C:\projects\postgresql\sqlol.vcxproj]
  c:\projects\postgresql\contrib\sqlol\sqlol_gramparse.h(25): fatal error C1083: Cannot open include file: 'sqlol_gram.h': No such file or directory (contrib/sqlol/sqlol.c) [C:\projects\postgresql\sqlol.vcxproj]
  c:\projects\postgresql\contrib\sqlol\sqlol_gramparse.h(25): fatal error C1083: Cannot open include file: 'sqlol_gram.h': No such file or directory (contrib/sqlol/sqlol_keywords.c) [C:\projects\postgresql\sqlol.vcxproj]
  c1 : fatal error C1083: Cannot open source file: 'contrib/sqlol/sqlol_scan.c': No such file or directory [C:\projects\postgresql\sqlol.vcxproj]

    0 Warning(s)
    4 Error(s)

Time Elapsed 00:05:40.23

2) There was one small whitespace error with the patch:

git am v4-0002-Add-a-sqlol-parser.patch
Applying: Add a sqlol parser.
.git/rebase-apply/patch:818: new blank line at EOF.
+
warning: 1 line adds whitespace errors.

Regards,
Vignesh
On Thu, Jul 22, 2021 at 12:01:34PM +0530, vignesh C wrote:
>
> 1) CFBOT showed the following compilation errors in windows:

Thanks for looking at it. I'm aware of this issue on windows, but as mentioned in the thread the new contrib is there to demonstrate how the new infrastructure works. If there were some interest in pushing the patch, I don't think that we would add a full bison parser, whether it's in contrib or test modules.

So unless there's a clear indication from a committer that we would want to integrate such a parser, or if someone is interested in reviewing the patch and only has a windows machine, I don't plan to spend time trying to fix a windows-only problem for something that will disappear anyway.

> 2) There was one small whitespace error with the patch:
> git am v4-0002-Add-a-sqlol-parser.patch
> Applying: Add a sqlol parser.
> .git/rebase-apply/patch:818: new blank line at EOF.
> +
> warning: 1 line adds whitespace errors.

Indeed, there's a trailing empty line in contrib/sqlol/sqlol_keywords.c. I fixed it locally, but as I said this module is most certainly going to disappear, so I'm not sending an updated patch right now.
On Thu, Jul 22, 2021 at 03:04:19PM +0800, Julien Rouhaud wrote:
> On Thu, Jul 22, 2021 at 12:01:34PM +0530, vignesh C wrote:
> >
> > 1) CFBOT showed the following compilation errors in windows:
>
> Thanks for looking at it. I'm aware of this issue on windows, but as mentioned in the thread the new contrib is there to demonstrate how the new infrastructure works. If there were some interest in pushing the patch, I don't think that we would add a full bison parser, whether it's in contrib or test modules.
>
> So unless there's a clear indication from a committer that we would want to integrate such a parser, or if someone is interested in reviewing the patch and only has a windows machine, I don't plan to spend time trying to fix a windows-only problem for something that will disappear anyway.

I'm not sure what changed in the Windows part of the cfbot, but somehow it's not hitting any compilation error anymore and all the tests are now green.
v5 attached, fixing conflict with 639a86e36a (Remove Value node struct)
Attachment
On Sat, 1 May 2021 at 08:24, Julien Rouhaud <rjuju123@gmail.com> wrote:
> Being able to extend core parser has been requested multiple times, and AFAICT all previous attempts were rejected not because this isn't wanted but because the proposed implementations required plugins to reimplement all of the core grammar with their own changes, as bison generated parsers aren't extensible.
>
> I'd like to propose an alternative approach, which is to allow multiple parsers to coexist, and let third-party parsers optionally fallback on the core parsers.

Yes, that approach has been discussed by many people, most recently around the idea to create a fast-path grammar to make the most frequently used SQL parse faster.

> 0002 implements a lame "sqlol" parser, based on LOLCODE syntax, with only the ability to produce "select [col, ] col FROM table" parsetree, for testing purpose. I chose it to ensure that everything works properly even with a totally different grammar that has different keywords, which doesn't even ends statements with a semicolon but a plain keyword.

The general rule has always been that we don't just put hooks in, we always require an in-core use for those hooks. I was reminded of that myself recently.

What we need is something in core that actually makes use of this. The reason for that is not politics, but a simple test of whether the feature makes sense AND includes all required bells and whistles to be useful in the real world.

Core doesn't need a LOL parser and I don't think we should commit that.

If we do this, I think it should have CREATE LANGUAGE support, so that each plugin can be seen as an in-core object and allow security around which users can execute which language types, allow us to switch between languages and have default languages for specific users or databases.

-- 
Simon Riggs
http://www.EnterpriseDB.com/
On Wed, Sep 15, 2021 at 9:25 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>
> The general rule has always been that we don't just put hooks in, we always require an in-core use for those hooks. I was reminded of that myself recently.
>
That's not historically what has happened. There are several hooks with no in-core use, such as emit_log_hook and ExplainOneQuery_hook. The recent openssl_tls_init_hook only has a usage in src/test/modules.

> What we need is something in core that actually makes use of this. The reason for that is not politics, but a simple test of whether the feature makes sense AND includes all required bells and whistles to be useful in the real world.
>
Agreed. There should be something in src/test/modules to exercise this, but probably more to flesh things out. Maybe extending adminpack to use this so that, if enabled, it can use syntax like:

FILE READ 'foo.txt'

> Core doesn't need a LOL parser and I don't think we should commit that.
>
+1

> If we do this, I think it should have CREATE LANGUAGE support, so that each plugin can be seen as an in-core object and allow security around which users can execute which language types, allow us to switch between languages and have default languages for specific users or databases.
>
This hook allows extension developers to supplement syntax in addition to adding a whole new language, allowing the extension to appear more native to the end user. For example, pglogical could use this to add syntax to do a CREATE NODE instead of calling the function create_node. Adding CREATE LANGUAGE support around this would just be for a narrow set of use cases where a new language is added.
On Wed, Sep 15, 2021 at 10:14 PM Jim Mlodgenski <jimmy76@gmail.com> wrote:
>
> On Wed, Sep 15, 2021 at 9:25 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
> >
> > The general rule has always been that we don't just put hooks in, we always require an in-core use for those hooks. I was reminded of that myself recently.
> >
> That's not historically what has happened. There are several hooks with no in-core use, such as emit_log_hook and ExplainOneQuery_hook. The recent openssl_tls_init_hook only has a usage in src/test/modules.

Yes, I also think that it's not a strict requirement that all hooks have a caller in core, even if it's obviously better when that's the case.

> > What we need is something in core that actually makes use of this. The reason for that is not politics, but a simple test of whether the feature makes sense AND includes all required bells and whistles to be useful in the real world.
> >
> Agreed. There should be something in src/test/modules to exercise this, but probably more to flesh things out. Maybe extending adminpack to use this so that, if enabled, it can use syntax like:
> FILE READ 'foo.txt'

For this hook, maintaining a real alternative parser seems like way too much trouble to justify an in-core user. The fact that many people have asked for such a feature over the years should be enough to justify the use case. We could try to invent some artificial need like the one you suggest for adminpack, but it also feels like a waste of resources.

As far as I'm concerned, a naive strcmp-based parser in src/test/modules should be enough to validate that the hook is working; there's no need for more. In any case, if the only requirement for it to be committed is to write a real parser, whether in contrib or src/test/modules, I'll be happy to do it.

> > Core doesn't need a LOL parser and I don't think we should commit that.
> >
> +1

I entirely agree, and I repeatedly mentioned in this thread that I did *not* want to add this parser in core. The only purpose of patches 0002 and 0004 is to make the third-party bison-based parser requirements less abstract, and demonstrate that this approach can successfully make two *radically different* parsers cohabit.

> > If we do this, I think it should have CREATE LANGUAGE support, so that each plugin can be seen as an in-core object and allow security around which users can execute which language types, allow us to switch between languages and have default languages for specific users or databases.
> >
> This hook allows extension developers to supplement syntax in addition to adding a whole new language, allowing the extension to appear more native to the end user. For example, pglogical could use this to add syntax to do a CREATE NODE instead of calling the function create_node. Adding CREATE LANGUAGE support around this would just be for a narrow set of use cases where a new language is added.

Yes, this hook can be used to implement multiple things, as I mentioned in my initial email. Additionally, if this is eventually committed, I'd like to add support for a CREATE HYPOTHETICAL INDEX grammar in hypopg. Such a parser would only support one command (that extends an existing one), so it can't really be called a language. Of course it would be better to have the core parser accept CREATE [ HYPOTHETICAL ] INDEX and set up a flag so that third-party modules can intercept this utility command, but until that happens I could provide that syntactic sugar for my users, as long as I'm motivated enough to write this parser.

Also, a hook-based approach is still compatible with per-database / per-role configuration. It can be done either via a specific session_preload_libraries setting, or via a custom GUC if for some reason the module requires being in shared_preload_libraries.
On Wed, Sep 15, 2021 at 02:25:17PM +0100, Simon Riggs wrote:
> On Sat, 1 May 2021 at 08:24, Julien Rouhaud <rjuju123@gmail.com> wrote:
> > Being able to extend core parser has been requested multiple times, and AFAICT all previous attempts were rejected not because this isn't wanted but because the proposed implementations required plugins to reimplement all of the core grammar with their own changes, as bison generated parsers aren't extensible.
> >
> > I'd like to propose an alternative approach, which is to allow multiple parsers to coexist, and let third-party parsers optionally fallback on the core parsers.
>
> Yes, that approach has been discussed by many people, most recently around the idea to create a fast-path grammar to make the most frequently used SQL parse faster.
>
> The general rule has always been that we don't just put hooks in, we always require an in-core use for those hooks. I was reminded of that myself recently.
>
> What we need is something in core that actually makes use of this. The reason for that is not politics, but a simple test of whether the feature makes sense AND includes all required bells and whistles to be useful in the real world.
>
> Core doesn't need a LOL parser and I don't think we should commit that.

It doesn't, but it very likely needs something people can use when they create a new table AM, and that we should use the hook in core to implement the heap* table AM to make sure the thing is working at DDL time.

> If we do this, I think it should have CREATE LANGUAGE support, so that each plugin can be seen as an in-core object and allow security around which users can execute which language types, allow us to switch between languages and have default languages for specific users or databases.

That's a great idea, but I must be missing something important as it relates to parser hooks. Could you connect those a little more explicitly?

Best,
David.

* It's not actually a heap in the sense that the term is normally used in computing. I'd love to find out how it got to have this name and document same so others aren't also left wondering.

-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, Sep 15, 2021 at 16:55, Julien Rouhaud <rjuju123@gmail.com> wrote:
> [...]
>
> Yes, this hook can be used to implement multiple things, as I mentioned in my initial email. Additionally, if this is eventually committed, I'd like to add support for a CREATE HYPOTHETICAL INDEX grammar in hypopg. Such a parser would only support one command (that extends an existing one), so it can't really be called a language. Of course it would be better to have the core parser accept CREATE [ HYPOTHETICAL ] INDEX and set up a flag so that third-party modules can intercept this utility command, but until that happens I could provide that syntactic sugar for my users, as long as I'm motivated enough to write this parser.

There were nice stream databases, but they ended because maintaining a fork is too expensive. And without direct SQL (without the possibility of enhancing the parser), commands based on a function-call API were not as readable and flexible as SQL. Sometimes we really don't want to replace PostgreSQL, but just enhance the main interface for extensions.

> Also, a hook-based approach is still compatible with per-database / per-role configuration. It can be done either via a specific session_preload_libraries setting, or via a custom GUC if for some reason the module requires being in shared_preload_libraries.
Jim Mlodgenski <jimmy76@gmail.com> writes:
> On Wed, Sep 15, 2021 at 9:25 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>> The general rule has always been that we don't just put hooks in, we always require an in-core use for those hooks. I was reminded of that myself recently.

> That's not historically what has happened. There are several hooks with no in core use such as emit_log_hook and ExplainOneQuery_hook.

Yeah. I think the proper expectation is that there be a sufficiently worked-out example to convince us that the proposed hooks have real-world usefulness, and are not missing any basic requirements to make them do something useful. Whether the example ends up in our tree is a case-by-case decision.

In the case at hand, what's troubling me is that I don't see any particular use in merely substituting a new bison grammar, if it still has to produce parse trees that the rest of the system will understand. Yeah, you could make some very simple surface-syntax changes that way, but it doesn't seem like you could do anything interesting (like, say, support Oracle-style outer join syntax). AFAICS, to get to a useful feature, you'd then need to invent an extensible Node system (which'd be hugely invasive if it's feasible at all), and then probably more things on top of that. So I'm not convinced that you've demonstrated any real-world usefulness.

			regards, tom lane
On Wed, Sep 15, 2021 at 11:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> In the case at hand, what's troubling me is that I don't see any particular use in merely substituting a new bison grammar, if it still has to produce parse trees that the rest of the system will understand. Yeah, you could make some very simple surface-syntax changes that way, but it doesn't seem like you could do anything interesting (like, say, support Oracle-style outer join syntax). AFAICS, to get to a useful feature, you'd then need to invent an extensible Node system (which'd be hugely invasive if it's feasible at all), and then probably more things on top of that. So I'm not convinced that you've demonstrated any real-world usefulness.

I agree that this patchset can only implement syntactic sugar, nothing more (although for utility commands you can do a bit more than that). But that's already something people can use, mostly for migration-to-postgres use cases probably.

I'm not sure why you couldn't implement an Oracle-style outer join with such a hook? The requirement is that the parser can't leak any node that the rest of the system doesn't know about, but you can do what you want inside the parser. And as far as I can see we already have an extensible node since bcac23de73b, so it seems to me that there's enough infrastructure to handle this kind of use case.

The main downside is that you'll have to make a first pass to transform your "custom raw statement" into a valid RawStmt in your parser, and the system will do another one to transform it into a Query. But apart from that it should work. Am I missing something?
Julien Rouhaud <rjuju123@gmail.com> writes:
> I'm not sure why you couldn't implement an Oracle-style outer join with such a hook?

Try it.

> The requirement is that the parser can't leak any node that the rest of the system doesn't know about, but you can do what you want inside the parser.

That's not what the patch actually does, though. It only replaces the grammar, not semantic analysis. So you couldn't associate the (+)-decorated WHERE clause with the appropriate join. (And no, I will not accept that it's okay to perform catalog lookups in the grammar to get around that. See comment at the head of gram.y.)

In general, I'm having a hard time believing that anything very interesting can be done at only the grammar level without changing the parse analysis phase. That's not unrelated to the restriction that the grammar can't do catalog accesses. Maybe with some fundamental restructuring, we could get around that issue ... but this patch isn't doing any fundamental restructuring, it's just putting a hook where it's easy to do so. We've often found that such hooks aren't as useful as they initially seem.

			regards, tom lane
On Thu, Sep 16, 2021 at 12:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > The requirement is that the parser can't leak any node that the rest of the system doesn't know about, but you can do what you want inside the parser.
>
> That's not what the patch actually does, though. It only replaces the grammar, not semantic analysis. So you couldn't associate the (+)-decorated WHERE clause with the appropriate join. (And no, I will not accept that it's okay to perform catalog lookups in the grammar to get around that. See comment at the head of gram.y.)

I never said that one should do catalog lookups for that? What I said is that you can do a specific semantic analysis pass in the hook if you know that you can have extensible nodes in your parsetree, and you can do that with this hook unless I'm missing something?

Yes, that's not ideal, but I don't see how it can be worse than writing some middleware that parses the query and rewrites it to postgres-style SQL on the fly so that postgres can parse it again. I'm also not sure how the semantic analysis could be made generally extensible, if possible at all, so that's the best I can propose. If that approach is a deal breaker then fine, I can accept it.
On Thu, Sep 16, 2021 at 1:23 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Thu, Sep 16, 2021 at 12:57 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > That's not what the patch actually does, though. It only replaces the grammar, not semantic analysis. So you couldn't associate the (+)-decorated WHERE clause with the appropriate join. (And no, I will not accept that it's okay to perform catalog lookups in the grammar to get around that. See comment at the head of gram.y.)
>
> I never said that one should do catalog lookups for that? What I said is that you can do a specific semantic analysis pass in the hook if you know that you can have extensible nodes in your parsetree, and you can do that with this hook unless I'm missing something?

Ah, now that I think more about it, I think you're talking about unqualified fields? I was naively assuming that those wouldn't be allowed by Oracle, but I guess that was wishful thinking.
Hi,

On 2021-09-15 12:57:00 -0400, Tom Lane wrote:
> That's not what the patch actually does, though. It only replaces the grammar, not semantic analysis. So you couldn't associate the (+)-decorated WHERE clause with the appropriate join. (And no, I will not accept that it's okay to perform catalog lookups in the grammar to get around that. See comment at the head of gram.y.)
>
> In general, I'm having a hard time believing that anything very interesting can be done at only the grammar level without changing the parse analysis phase. That's not unrelated to the restriction that the grammar can't do catalog accesses. Maybe with some fundamental restructuring, we could get around that issue ... but this patch isn't doing any fundamental restructuring, it's just putting a hook where it's easy to do so. We've often found that such hooks aren't as useful as they initially seem.

Agreed - it doesn't make sense to me to have a hook that only replaces raw parsing, without also hooking into parse-analysis. ISTM that the least a patchset going for a parser hook would have to do is to do sufficient restructuring so that one could hook together into both raw parsing and analysis. It could still be two callbacks, but perhaps we'd ensure that they're both set.

Greetings,

Andres Freund
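For readers trying to picture the pairing Andres suggests, the two callbacks could be declared along the lines below. The names are invented here, and the analysis signature simply mirrors parse_analyze() as it exists in PostgreSQL 14, so treat this as a sketch of the idea rather than a proposed API:

#include "postgres.h"

#include "nodes/parsenodes.h"
#include "parser/parser.h"
#include "utils/queryenvironment.h"

/* Raw parsing: turn a query string into a list of RawStmt. */
typedef List *(*raw_parser_hook_type) (const char *str, RawParseMode mode);

/* Parse analysis: turn one RawStmt into a Query. */
typedef Query *(*parse_analyze_hook_type) (RawStmt *parseTree,
										   const char *sourceText,
										   Oid *paramTypes, int numParams,
										   QueryEnvironment *queryEnv);

/* The suggestion above: installing one of these without the other would be rejected. */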
On Wed, Sep 15, 2021 at 3:55 PM Andres Freund <andres@anarazel.de> wrote:
> On 2021-09-15 12:57:00 -0400, Tom Lane wrote:
> > That's not what the patch actually does, though. It only replaces the grammar, not semantic analysis. So you couldn't associate the (+)-decorated WHERE clause with the appropriate join. (And no, I will not accept that it's okay to perform catalog lookups in the grammar to get around that. See comment at the head of gram.y.)
> > In general, I'm having a hard time believing that anything very interesting can be done at only the grammar level without changing the parse analysis phase. That's not unrelated to the restriction that the grammar can't do catalog accesses. Maybe with some fundamental restructuring, we could get around that issue ... but this patch isn't doing any fundamental restructuring, it's just putting a hook where it's easy to do so. We've often found that such hooks aren't as useful as they initially seem.
>
> Agreed - it doesn't make sense to me to have a hook that only replaces raw parsing, without also hooking into parse-analysis. ISTM that the least a patchset going for a parser hook would have to do is to do sufficient restructuring so that one could hook together into both raw parsing and analysis. It could still be two callbacks, but perhaps we'd ensure that they're both set.
This is a bad example as it doesn't require semantic analysis from Postgres. While most of the tools out there tend to do simple replacement, this can be done within a custom parser by simply walking its own AST, evaluating join conditions against the expression, and rewriting the join accordingly. Or, do you have an example that couldn't be done this way within a custom parser?
Jonah H. Harris
Hi,

On 2021-09-15 16:35:53 -0400, Jonah H. Harris wrote:
> On Wed, Sep 15, 2021 at 3:55 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-09-15 12:57:00 -0400, Tom Lane wrote:
> > Agreed - it doesn't make sense to me to have a hook that only replaces raw parsing, without also hooking into parse-analysis. ISTM that the least a patchset going for a parser hook would have to do is to do sufficient restructuring so that one could hook together into both raw parsing and analysis. It could still be two callbacks, but perhaps we'd ensure that they're both set.
>
> This is a bad example as it doesn't require semantic analysis from Postgres.

"it"? I assume you mean a different type of join? If so, I'm highly doubtful - without semantic analysis you can't really handle column references.

> While most of the tools out there tend to do simple replacement, this can be done within a custom parser by simply walking its own AST, evaluating join conditions against the expression, and rewriting the join accordingly. Or, do you have an example that couldn't be done this way within a custom parser?

You cannot just "evaluate conditions" in a raw parse tree... You don't even know what things are functions, columns etc, nor to what relation a column belongs.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> Agreed - it doesn't make sense to me to have a hook that only replaces raw parsing, without also hooking into parse-analysis. ISTM that the least a patchset going for a parser hook would have to do is to do sufficient restructuring so that one could hook together into both raw parsing and analysis. It could still be two callbacks, but perhaps we'd ensure that they're both set.

The other problem here is that a simple call-this-instead-of-that top-level hook doesn't seem all that useful anyway, because it leaves you with the task of duplicating a huge amount of functionality that you're then going to make some tweaks within. That's already an issue when you're just thinking about the grammar, and if you have to buy into it for parse analysis too, I doubt that it's going to be very practical. If, say, you'd like to support some weird function that requires special parsing and analysis rules, I don't see how you get that out of this without first duplicating a very large fraction of src/backend/parser/.

(As a comparison point, we do have a top-level hook for replacing the planner; but I have never heard of anyone actually doing so. There are people using that hook to *wrap* the planner with some before-and-after processing, which is quite a different thing.)

I don't have any better ideas to offer :-( ... but I very much fear that the approach proposed here is a dead end.

			regards, tom lane
Hi,

On 2021-09-15 16:51:37 -0400, Tom Lane wrote:
> The other problem here is that a simple call-this-instead-of-that top-level hook doesn't seem all that useful anyway, because it leaves you with the task of duplicating a huge amount of functionality that you're then going to make some tweaks within. That's already an issue when you're just thinking about the grammar, and if you have to buy into it for parse analysis too, I doubt that it's going to be very practical. If, say, you'd like to support some weird function that requires special parsing and analysis rules, I don't see how you get that out of this without first duplicating a very large fraction of src/backend/parser/.

We do have a small amount of infrastructure around this - the hackery that plpgsql uses. That's not going to help you with everything, but I think it should be enough to recognize e.g. additional top-level statements. Obviously not enough to intercept parsing deeper into a statement, but at least something.

And parse analysis for some types of things will be doable with the current infrastructure, by e.g. handling the new top-level statement in the hook, and then passing the buck to the normal parse analysis for e.g. expressions in that. Obviously not going to get you that far...

> (As a comparison point, we do have a top-level hook for replacing the planner; but I have never heard of anyone actually doing so. There are people using that hook to *wrap* the planner with some before-and-after processing, which is quite a different thing.)

Citus IIRC has some paths that do not end up calling into the standard planner, but only for a few simplistic cases.

> I don't have any better ideas to offer :-( ... but I very much fear that the approach proposed here is a dead end.

I unfortunately don't see a good way forward without changing the way we do parsing on a more fundamental level :(.

Greetings,

Andres Freund
On Thu, Sep 16, 2021 at 5:40 AM Andres Freund <andres@anarazel.de> wrote:
>
> > I don't have any better ideas to offer :-( ... but I very much fear that the approach proposed here is a dead end.
>
> I unfortunately don't see a good way forward without changing the way we do parsing on a more fundamental level :(.

Using the ExtensibleNode infrastructure, I can see two ways to try to leverage that.

The first one would be to require modules to wrap their RawStmt->stmt in an ExtensibleNode if they want to do anything that requires semantic analysis, and handle such a node in transformStmt() with another hook. I think it would allow modules to do pretty much anything, at the cost of walking the stmt twice and duplicating possibly huge amounts of analyze.c and friends.

The other one would be to allow the parser to leak ExtensibleNode in the middle of the RawStmt and catch them in the transform* functions, with e.g. some generic transformExtensibleNode(pstate, node, some_identifier...) (the identifier giving both the general transform action and some secondary info, like ParseExprKind for expressions). This would avoid the downsides of the first approach, but would require calling this new hook in a bunch of places.

Or we could combine both approaches so that the most common use cases, like transformExprRecurse(), would be easily handled, while more exotic cases would have to go the hard way. Parser authors could still ask for adding a new call to this new hook to ease their work in the next major version.

Would any of that be a reasonable approach?
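For context, the ExtensibleNode machinery the first approach would rely on already exists in core (nodes/extensible.h, added by bcac23de73b). Below is a minimal sketch of what a custom statement node could look like; the PglCreateNodeStmt name is made up for illustration, the serialization callbacks are deliberately simplistic, and whether such a node would be accepted as a RawStmt->stmt is exactly the open question in this thread:

#include "postgres.h"

#include "fmgr.h"
#include "lib/stringinfo.h"
#include "nodes/extensible.h"

PG_MODULE_MAGIC;

#define PGL_CREATE_NODE_STMT_NAME "PglCreateNodeStmt"

/* Hypothetical custom statement, wrapped in an ExtensibleNode. */
typedef struct PglCreateNodeStmt
{
	ExtensibleNode xnode;
	char	   *node_name;
} PglCreateNodeStmt;

static void
pgl_node_copy(ExtensibleNode *newnode, const ExtensibleNode *oldnode)
{
	PglCreateNodeStmt *dst = (PglCreateNodeStmt *) newnode;
	const PglCreateNodeStmt *src = (const PglCreateNodeStmt *) oldnode;

	dst->node_name = src->node_name ? pstrdup(src->node_name) : NULL;
}

static bool
pgl_node_equal(const ExtensibleNode *a, const ExtensibleNode *b)
{
	const PglCreateNodeStmt *na = (const PglCreateNodeStmt *) a;
	const PglCreateNodeStmt *nb = (const PglCreateNodeStmt *) b;

	if (na->node_name == NULL || nb->node_name == NULL)
		return na->node_name == nb->node_name;
	return strcmp(na->node_name, nb->node_name) == 0;
}

static void
pgl_node_out(StringInfo str, const ExtensibleNode *node)
{
	const PglCreateNodeStmt *n = (const PglCreateNodeStmt *) node;

	/* A real implementation must quote/escape the value properly. */
	appendStringInfo(str, " :node_name %s",
					 n->node_name ? n->node_name : "<>");
}

static void
pgl_node_read(ExtensibleNode *node)
{
	elog(ERROR, "%s cannot be read back in this sketch",
		 PGL_CREATE_NODE_STMT_NAME);
}

static const ExtensibleNodeMethods pgl_node_methods = {
	.extnodename = PGL_CREATE_NODE_STMT_NAME,
	.node_size = sizeof(PglCreateNodeStmt),
	.nodeCopy = pgl_node_copy,
	.nodeEqual = pgl_node_equal,
	.nodeOut = pgl_node_out,
	.nodeRead = pgl_node_read,
};

void
_PG_init(void)
{
	RegisterExtensibleNodeMethods(&pgl_node_methods);
}

A custom parser could then allocate such a node (roughly, newNode(sizeof(PglCreateNodeStmt), T_ExtensibleNode) with extnodename set) and hand it back inside a RawStmt; the missing piece discussed here is what core should do with it afterwards.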
On Thu, 16 Sept 2021 at 05:33, Julien Rouhaud <rjuju123@gmail.com> wrote:
> Would any of that be a reasonable approach?

The way I summarize all of the above is that
1) nobody is fundamentally opposed to the idea
2) we just need to find real-world example(s) and show that any associated in-core patch provides all that is needed in a clean way, since that point is currently in-doubt by senior committers.

So what is needed is some actual prototypes that explore this. I guess that means they have to be open source, but those examples could be under a different licence, as long as the in-core patch is clearly a project submission to PostgreSQL.

I presume a few real-world examples could be:
* Grammar extensions to support additional syntax for Greenplum, Citus, XL
* A grammar that adds commands for an extension, such as pglogical (Jim's example)
* A strict SQL Standard grammar/parser
* GQL implementation

-- 
Simon Riggs
http://www.EnterpriseDB.com/
On Thu, Sep 23, 2021 at 07:37:27AM +0100, Simon Riggs wrote:
> On Thu, 16 Sept 2021 at 05:33, Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> > Would any of that be a reasonable approach?
>
> The way I summarize all of the above is that
> 1) nobody is fundamentally opposed to the idea
> 2) we just need to find real-world example(s) and show that any associated in-core patch provides all that is needed in a clean way, since that point is currently in-doubt by senior committers.
>
> So what is needed is some actual prototypes that explore this. I guess that means they have to be open source, but those examples could be under a different licence, as long as the in-core patch is clearly a project submission to PostgreSQL.
>
> I presume a few real-world examples could be:
> * Grammar extensions to support additional syntax for Greenplum, Citus, XL
> * A grammar that adds commands for an extension, such as pglogical (Jim's example)
> * A strict SQL Standard grammar/parser
> * GQL implementation

As I mentioned, there's at least one use case that would work with that approach that I will be happy to code in hypopg, which is an open source project. As a quick prototype, here's a basic overview of how I can use this hook to implement a CREATE HYPOTHETICAL INDEX command:

rjuju=# LOAD 'hypopg';
LOAD
rjuju=# create hypothetical index meh on t1(id);
CREATE INDEX
rjuju=# explain select * from t1 where id = 1;
                                    QUERY PLAN
--------------------------------------------------------------------------------
 Index Scan using "<13543>btree_t1_id" on t1  (cost=0.04..8.05 rows=1 width=13)
   Index Cond: (id = 1)
(2 rows)

rjuju=# \d t1
                 Table "public.t1"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 id     | integer |           |          |
 val    | text    |           |          |

My POC's grammar is only like:

CREATE HYPOTHETICAL INDEX opt_index_name ON relation_expr '(' index_params ')'
	{
		IndexStmt  *n = makeNode(IndexStmt);

		n->idxname = $4;
		n->relation = $6;
		n->accessMethod = DEFAULT_INDEX_TYPE;
		n->indexParams = $8;
		n->options = list_make1(makeDefElem("hypothetical", NULL, -1));
		$$ = (Node *) n;
	}

as I'm not willing to import the whole CREATE INDEX grammar for now for a patch that may be rejected. I can however publish this POC if that helps.

Note that once my parser returns this parse tree, all I need to do is intercept IndexStmt containing this option in a utility_hook and run my code rather than a plain DefineIndex(), which works as intended as I showed above.

One could easily imagine similar usage to extend existing commands, like implementing a new syntax on top of CREATE TABLE to implement an automatic partition creation grammar (which would return multiple CreateStmt), or even a partition manager. I'm not an expert in other RDBMS syntax, but maybe you could use such a hook to implement SQL Server or mysql syntax, which use at least different quoting rules. Maybe Amazon people could confirm that as it looks like they implemented an SQL Server parser using a similar hook?

So yes you can't create new commands or implement grammars that require additional semantic analysis with this hook, but I think that there are still real use cases that can be implemented using only a different parser.
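To make the "intercept it in a utility_hook" part concrete, here is a hedged sketch of what that interception could look like on the extension side. The hook signature follows PostgreSQL 14, but hypo_handle_index() is a hypothetical placeholder and none of this is taken from hypopg's actual code:

#include "postgres.h"

#include "fmgr.h"
#include "nodes/params.h"
#include "nodes/parsenodes.h"
#include "nodes/plannodes.h"
#include "tcop/cmdtag.h"
#include "tcop/dest.h"
#include "tcop/utility.h"
#include "utils/queryenvironment.h"

PG_MODULE_MAGIC;

static ProcessUtility_hook_type prev_ProcessUtility = NULL;

/* Hypothetical placeholder: hypopg would register the index in its own structures. */
static void
hypo_handle_index(IndexStmt *stmt)
{
	elog(NOTICE, "would create hypothetical index \"%s\"",
		 stmt->idxname ? stmt->idxname : "(unnamed)");
}

/* Does this IndexStmt carry the "hypothetical" option added by the custom grammar? */
static bool
index_is_hypothetical(IndexStmt *stmt)
{
	ListCell   *lc;

	foreach(lc, stmt->options)
	{
		DefElem    *defel = lfirst_node(DefElem, lc);

		if (strcmp(defel->defname, "hypothetical") == 0)
			return true;
	}
	return false;
}

static void
hypo_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
					bool readOnlyTree, ProcessUtilityContext context,
					ParamListInfo params, QueryEnvironment *queryEnv,
					DestReceiver *dest, QueryCompletion *qc)
{
	Node	   *parsetree = pstmt->utilityStmt;

	if (IsA(parsetree, IndexStmt) &&
		index_is_hypothetical((IndexStmt *) parsetree))
	{
		/* Bypass DefineIndex() entirely. */
		hypo_handle_index((IndexStmt *) parsetree);
		return;
	}

	if (prev_ProcessUtility)
		prev_ProcessUtility(pstmt, queryString, readOnlyTree, context,
							params, queryEnv, dest, qc);
	else
		standard_ProcessUtility(pstmt, queryString, readOnlyTree, context,
								params, queryEnv, dest, qc);
}

void
_PG_init(void)
{
	prev_ProcessUtility = ProcessUtility_hook;
	ProcessUtility_hook = hypo_ProcessUtility;
}

The point of the sketch is only to show why returning a plain IndexStmt with one extra DefElem is enough for an existing utility hook to take over; the parser hook itself adds nothing beyond the grammar shown above.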
Julien Rouhaud <rjuju123@gmail.com> writes:
> My POC's grammar is only like:
> CREATE HYPOTHETICAL INDEX opt_index_name ON relation_expr '(' index_params ')'
> 	{
> 		IndexStmt  *n = makeNode(IndexStmt);
> 		n->idxname = $4;
> 		n->relation = $6;
> 		n->accessMethod = DEFAULT_INDEX_TYPE;
> 		n->indexParams = $8;
> 		n->options = list_make1(makeDefElem("hypothetical", NULL, -1));
> 		$$ = (Node *) n;
> 	}

I'm not too impressed by this example, because there seems little reason why you couldn't just define "hypothetical" as an index_param option, and not need to touch the grammar at all.

> as I'm not willing to import the whole CREATE INDEX grammar for now for a patch that may be rejected.

The fact that that's so daunting seems to me to be a perfect illustration of the problems with this concept. Doing anything interesting with a hook like this will create a maintenance nightmare, because you'll have to duplicate (and track every change in) large swaths of gram.y. To the extent that you fail to, say, match every detail of the core's expression grammar, you'll be creating a crappy user experience.

> that once my parser returns this parse tree, all I need to do is intercept IndexStmt containing this option in a utility_hook and run my code rather than a plain DefineIndex(), which works as intended as I showed above.

And I'm even less impressed by the idea of half a dozen extensions each adding its own overhead to the parser and also to ProcessUtility so that they can process statements in this klugy, highly-restricted way.

I do have sympathy for the idea that extensions would like to define their own statement types. I just don't see a practical way to do it in our existing parser infrastructure. This patch certainly doesn't offer that.

			regards, tom lane
On Thu, Sep 23, 2021 at 10:21:20AM -0400, Tom Lane wrote:
>
> I do have sympathy for the idea that extensions would like to define their own statement types. I just don't see a practical way to do it in our existing parser infrastructure. This patch certainly doesn't offer that.

Allowing extensions to define their own (utility) statement type is just a matter of allowing ExtensibleNode as a top-level statement. As far as I can see, the only change required for that is to give those a specific command tag in CreateCommandTag(), since transformStmt() defaults to emitting a utility command. You can then easily intercept such a statement in the utility hook and fetch your custom struct.

I could do that, but I'm assuming that you still wouldn't be satisfied, as a custom parser would still be needed, which may or may not require copy/pasting chunks of the core grammar.

If so, do you have any suggestion for an approach you would accept?
Hi,

On Fri, Sep 24, 2021 at 02:33:59PM +0800, Julien Rouhaud wrote:
> On Thu, Sep 23, 2021 at 10:21:20AM -0400, Tom Lane wrote:
> >
> > I do have sympathy for the idea that extensions would like to define their own statement types. I just don't see a practical way to do it in our existing parser infrastructure. This patch certainly doesn't offer that.
>
> Allowing extensions to define their own (utility) statement type is just a matter of allowing ExtensibleNode as a top-level statement. As far as I can see, the only change required for that is to give those a specific command tag in CreateCommandTag(), since transformStmt() defaults to emitting a utility command. You can then easily intercept such a statement in the utility hook and fetch your custom struct.
>
> I could do that, but I'm assuming that you still wouldn't be satisfied, as a custom parser would still be needed, which may or may not require copy/pasting chunks of the core grammar.
>
> If so, do you have any suggestion for an approach you would accept?

Given the total lack of answers on the various improvements I suggested, I'm assuming that no one is interested in that feature, so I'm marking it as Rejected.