Thread: [PATCH] Add --syntax to postgres for SQL syntax checking
Hello!

Inspired by Simon Riggs' keynote talk at PGConf.eu 2023 sharing a list of ideas for PostgreSQL (https://riggs.business/blog/f/postgresql-todo-2023), I have crafted a quick patch to do SQL syntax validation. It is also heavily inspired by the "ruby -c" command, useful to check the syntax of Ruby programs without executing them.

For now, to keep it simple and to open discussion, I have added a new "--syntax" option to the "postgres" command, since it is currently the only one with the needed parser dependency (at least per my understanding). I tried to add this to psql or a separate pg_syntax command, but the parser is not exposed in "postgres_fe.h", and including the backend in those tools would most likely not make sense. Also, since the syntax can vary per backend, it makes sense to keep it there.

It expects input on STDIN, prints an error if any, and prints a summary message (Valid SQL/Invalid SQL). On valid input it exits with 0 (success), otherwise it exits with 1 (error). A simplified sketch of the core check follows at the end of this mail.

Example usage:

$ echo "SELECT 1" | src/backend/postgres --syntax
Valid SQL

$ echo "SELECT 1abc" | src/backend/postgres --syntax
ERROR: trailing junk after numeric literal at or near "1a" at character 8
Invalid SQL

$ cat ../src/test/regress/sql/alter_operator.sql | src/backend/postgres --syntax
Valid SQL

$ cat ../src/test/regress/sql/advisory_lock.sql | src/backend/postgres --syntax
ERROR: syntax error at or near "\" at character 99
Invalid SQL

This could be useful for manual script checks, automated script checks and code editor integrations. Note that it just parses the SQL; it doesn't detect any "runtime" problems like nonexistent table names, etc.

I have various ideas around this (like exposing similar functionality directly in SQL using a custom function like pg_check_syntax), but I would like to get some feedback first.

What do you think?

PS: I wasn't able to find any automated tests for the "postgres" command to enhance; are there any?

PS2: The patch can be found at https://github.com/simi/postgres/pull/8 as well.
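Here is the promised simplified, untested sketch of the core of the check (not the actual patch; it assumes the two-argument raw_parser() of recent releases and a backend-style environment where error handling and memory contexts are already set up, and it omits the option parsing and STDIN handling):

#include "postgres.h"
#include "parser/parser.h"		/* raw_parser(), RAW_PARSE_DEFAULT */

static bool
check_sql_syntax(const char *sql)
{
	MemoryContext oldcontext = CurrentMemoryContext;
	bool		valid = true;

	PG_TRY();
	{
		/* run only the raw grammar -- no catalog access, no parse analysis */
		(void) raw_parser(sql, RAW_PARSE_DEFAULT);
	}
	PG_CATCH();
	{
		/* report the ERROR, then recover so the summary can still be printed */
		MemoryContextSwitchTo(oldcontext);
		EmitErrorReport();
		FlushErrorState();
		valid = false;
	}
	PG_END_TRY();

	puts(valid ? "Valid SQL" : "Invalid SQL");
	return valid;
}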
On Fri, 2023-12-15 at 13:21 +0100, Josef Šimánek wrote:
> Inspired by Simon Riggs' keynote talk at PGConf.eu 2023 sharing a list of ideas for PostgreSQL (https://riggs.business/blog/f/postgresql-todo-2023), I have crafted a quick patch to do SQL syntax validation.
>
> What do you think?

I like the idea. But what will happen if the SQL statement references tables or other objects, since we have no database?

Yours,
Laurenz Albe
On Fri, Dec 15, 2023 at 14:09 Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Fri, 2023-12-15 at 13:21 +0100, Josef Šimánek wrote:
> Inspired by Simon Riggs' keynote talk at PGConf.eu 2023 sharing a list
> of ideas for PostgreSQL
> (https://riggs.business/blog/f/postgresql-todo-2023), I have crafted a
> quick patch to do SQL syntax validation.
>
> What do you think?
I like the idea. But what will happen if the SQL statement references
tables or other objects, since we have no database?
It just checks that the identifier is valid from the parser's perspective, e.g. that it is a valid table name.
Yours,
Laurenz Albe
On Fri, Dec 15, 2023 at 14:14 Josef Šimánek <josef.simanek@gmail.com> wrote:
On Fri, Dec 15, 2023 at 14:09 Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Fri, 2023-12-15 at 13:21 +0100, Josef Šimánek wrote:
> Inspired by Simon Riggs' keynote talk at PGConf.eu 2023 sharing a list
> of ideas for PostgreSQL
> (https://riggs.business/blog/f/postgresql-todo-2023), I have crafted a
> quick patch to do SQL syntax validation.
>
> What do you think?
I like the idea. But what will happen if the SQL statement references
tables or other objects, since we have no database?

It just checks that the identifier is valid from the parser's perspective, e.g. that it is a valid table name.
There can be two levels, like plpgsql or like plpgsql_check.
Regards
Pavel
Yours,
Laurenz Albe
Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Fri, 2023-12-15 at 13:21 +0100, Josef Šimánek wrote:
>> Inspired by Simon Riggs' keynote talk at PGConf.eu 2023 sharing a list of ideas for PostgreSQL (https://riggs.business/blog/f/postgresql-todo-2023), I have crafted a quick patch to do SQL syntax validation.
>>
>> What do you think?

> I like the idea. But what will happen if the SQL statement references tables or other objects, since we have no database?

This seems like a fairly useless wart to me. What does it do that you can't do better with existing facilities (psql etc)?

In the big picture a command line switch in the postgres executable doesn't feel like the right place for this. There's no good reason to assume that the server executable will be installed where you want this capability; not to mention the possibility of version skew between that executable and whatever installation you're actually running on.

Another thing I don't like is that this exposes to the user what ought to be purely an implementation detail, namely the division of labor between gram.y (raw_parser()) and the rest of the parser. There are checks that a user would probably see as "syntax checks" that don't happen in gram.y, and conversely there are some things we reject there that seem more like semantic than syntax issues.

regards, tom lane
On Fri, Dec 15, 2023 at 15:50 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
> > On Fri, 2023-12-15 at 13:21 +0100, Josef Šimánek wrote:
> >> Inspired by Simon Riggs' keynote talk at PGConf.eu 2023 sharing a list of ideas for PostgreSQL (https://riggs.business/blog/f/postgresql-todo-2023), I have crafted a quick patch to do SQL syntax validation.
> >>
> >> What do you think?
>
> > I like the idea. But what will happen if the SQL statement references tables or other objects, since we have no database?
>
> This seems like a fairly useless wart to me. What does it do that you can't do better with existing facilities (psql etc)?

Per my experience, this is mostly useful during development to catch syntax errors super early. For SELECT/INSERT/UPDATE/DELETE queries, it is usually enough to prepend the query with EXPLAIN and run it. But EXPLAIN doesn't support all SQL, like DDL statements. Let's say I have a long SQL script I'm working on and there is a typo in the middle, like ALTERR instead of ALTER. Is there any simple way to detect this without actually running the statement in psql or other existing facilities? This check could be simply combined with editor capabilities to run on each SQL file save, giving quick feedback on this kind of mistake for a great developer experience.

> In the big picture a command line switch in the postgres executable doesn't feel like the right place for this. There's no good reason to assume that the server executable will be installed where you want this capability; not to mention the possibility of version skew between that executable and whatever installation you're actually running on.

This is mostly intended for SQL developers and CI systems, where the PostgreSQL backend is usually installed and in the actual version needed. I agree postgres is not the best place (even though it partially makes sense to me), but as I mentioned, I wasn't able to craft a quick patch with a better place to put this in. What would you recommend? A separate executable like pg_syntaxcheck?

> Another thing I don't like is that this exposes to the user what ought to be purely an implementation detail, namely the division of labor between gram.y (raw_parser()) and the rest of the parser. There are checks that a user would probably see as "syntax checks" that don't happen in gram.y, and conversely there are some things we reject there that seem more like semantic than syntax issues.

I have no big insight into SQL parsing. Can you please share examples of the given concerns? Is there anything better than raw_parser() for this purpose?

> regards, tom lane
On Fri, Dec 15, 2023 at 8:05 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
On Fri, Dec 15, 2023 at 15:50 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
> > On Fri, 2023-12-15 at 13:21 +0100, Josef Šimánek wrote:
> >> Inspired by Simon Riggs' keynote talk at PGConf.eu 2023 sharing a list
> >> of ideas for PostgreSQL
> >> (https://riggs.business/blog/f/postgresql-todo-2023), I have crafted a
> >> quick patch to do SQL syntax validation.
> >>
> >> What do you think?
>
> > I like the idea. But what will happen if the SQL statement references
> > tables or other objects, since we have no database?
>
> This seems like a fairly useless wart to me. What does it do that
> you can't do better with existing facilities (psql etc)?
Per my experience, this is mostly useful during development to catch
syntax errors super early.
I'd suggest helping this project become production capable, since it is already trying to integrate with the development ecosystem you describe here: https://github.com/supabase/postgres_lsp
I agree that developing this as a new executable for the purpose is needed in order to best conform to existing conventions.
David J.
On Fri, Dec 15, 2023 at 16:16 David G. Johnston <david.g.johnston@gmail.com> wrote:
>
> On Fri, Dec 15, 2023 at 8:05 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
>>
>> On Fri, Dec 15, 2023 at 15:50 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >
>> > Laurenz Albe <laurenz.albe@cybertec.at> writes:
>> > > On Fri, 2023-12-15 at 13:21 +0100, Josef Šimánek wrote:
>> > >> Inspired by Simon Riggs' keynote talk at PGConf.eu 2023 sharing a list of ideas for PostgreSQL (https://riggs.business/blog/f/postgresql-todo-2023), I have crafted a quick patch to do SQL syntax validation.
>> > >>
>> > >> What do you think?
>> >
>> > > I like the idea. But what will happen if the SQL statement references tables or other objects, since we have no database?
>> >
>> > This seems like a fairly useless wart to me. What does it do that you can't do better with existing facilities (psql etc)?
>>
>> Per my experience, this is mostly useful during development to catch syntax errors super early.
>
> I'd suggest helping this project get production capable since it already is trying to integrate with the development ecosystem you describe here.
>
> https://github.com/supabase/postgres_lsp

Indeed, LSP is one of the targets of this feature. Currently it is using https://github.com/pganalyze/libpg_query under the hood, probably because of the same reasons I have described (the parser is not available in the public APIs of postgres_fe.h or libpq). Exposing basic parser capabilities in the postgres binary itself, and also at the SQL level via a pg_check_syntax function, could prevent the need to "hack" the PG parser to be accessible outside of the server binary (a rough sketch of what I mean follows below my reply).

> I agree that developing this as a new executable for the purpose is needed in order to best conform to existing conventions.
>
> David J.
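To illustrate the pg_check_syntax idea, here is a rough, untested sketch only -- no such function exists today, and it assumes the backend's fmgr infrastructure plus the two-argument raw_parser() of recent releases:

#include "postgres.h"
#include "fmgr.h"
#include "parser/parser.h"		/* raw_parser(), RAW_PARSE_DEFAULT */
#include "utils/builtins.h"		/* text_to_cstring() */

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_check_syntax);

/*
 * pg_check_syntax(text) returns boolean
 *
 * Runs the argument through raw_parser() only; parse analysis and catalog
 * lookups are deliberately skipped, so e.g. nonexistent table names still
 * pass.
 */
Datum
pg_check_syntax(PG_FUNCTION_ARGS)
{
	char	   *query = text_to_cstring(PG_GETARG_TEXT_PP(0));
	MemoryContext oldcontext = CurrentMemoryContext;
	bool		ok = true;

	PG_TRY();
	{
		(void) raw_parser(query, RAW_PARSE_DEFAULT);
	}
	PG_CATCH();
	{
		/* swallow the syntax error and report failure instead */
		MemoryContextSwitchTo(oldcontext);
		FlushErrorState();
		ok = false;
	}
	PG_END_TRY();

	PG_RETURN_BOOL(ok);
}

With something like that in place, SELECT pg_check_syntax('SELECT 1abc') could return false without touching any catalogs.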
On Fri, Dec 15, 2023 at 8:20 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
(parser is not available
in public APIs of postgres_fe.h or libpq).
What about building a "libpg" that exposes and exports some public APIs for the parser? We can include a reference CLI implementation for basic usage of the functionality while leaving the actual language server project outside of core.
David J.
On Fri, Dec 15, 2023 at 16:32 David G. Johnston <david.g.johnston@gmail.com> wrote:
>
> On Fri, Dec 15, 2023 at 8:20 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
>>
>> (parser is not available in public APIs of postgres_fe.h or libpq).
>
> What about building "libpg" that exposes and exports some public APIs for the parser? We can include a reference CLI implementation for basic usage of the functionality while leaving the actual language server project outside of core.

Language server (LSP) can just benefit from this feature, but it doesn't cover all possibilities since LSP is meant for one purpose -> run in developer's code editor. Actual syntax check is more generic, able to cover CI checks and more. I would not couple this feature and LSP, LSP can just benefit from it (and it is usually built in a way that uses other tools and packs them into LSP). Exposing this kind of SQL check doesn't mean something LSP related being implemented in core. LSP can just benefit from this.

Exposing the parser to libpg seems like a good idea, but I'm not sure how simply that could be done, since during my journey I have found out there are a lot of dependencies which are not present in usual frontend code per my understanding, like memory contexts, and removal of those (decoupling) would be a huge project IMHO.

> David J.
On 12/15/23 16:38, Josef Šimánek wrote:
> On Fri, Dec 15, 2023 at 16:32 David G. Johnston <david.g.johnston@gmail.com> wrote:
>>
>> On Fri, Dec 15, 2023 at 8:20 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
>>>
>>> (parser is not available in public APIs of postgres_fe.h or libpq).
>>
>> What about building "libpg" that exposes and exports some public APIs for the parser? We can include a reference CLI implementation for basic usage of the functionality while leaving the actual language server project outside of core.
>
> Language server (LSP) can just benefit from this feature, but it doesn't cover all possibilities since LSP is meant for one purpose -> run in developer's code editor. Actual syntax check is more generic, able to cover CI checks and more. I would not couple this feature and LSP, LSP can just benefit from it (and it is usually built in a way that uses other tools and packs them into LSP). Exposing this kind of SQL check doesn't mean something LSP related being implemented in core. LSP can just benefit from this.

I don't know enough about LSP to have a good idea how to implement this for PG, but my assumption would be we'd have some sort of library exposing "parser" to frontend tools, and also an in-core binary using that library (say, src/bin/pg_syntax_check). And LSP would use that parser library too ...

I think there's about 0% chance we'd add this to the "postgres" binary.

> Exposing the parser to libpg seems like a good idea, but I'm not sure how simply that could be done, since during my journey I have found out there are a lot of dependencies which are not present in usual frontend code per my understanding, like memory contexts, and removal of those (decoupling) would be a huge project IMHO.

You're right the grammar/parser expects a lot of backend infrastructure, so making it available to frontend is going to be challenging. But I doubt there's a better way than starting with gram.y and either removing or adding the missing pieces (maybe only a mock version of it).

I'm not a bison expert, but considering your goal seems to be a basic syntax check, maybe you could simply rip out most of the bits depending on backend stuff, or maybe replace them with some trivial no-op code?

But as Tom mentioned, the question is how far gram.y gets you. There's plenty of ereport(ERROR) calls in src/backend/parser/*.c our users might easily consider as parse errors ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 25, 2024 at 23:24 Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> On 12/15/23 16:38, Josef Šimánek wrote:
> > On Fri, Dec 15, 2023 at 16:32 David G. Johnston <david.g.johnston@gmail.com> wrote:
> >>
> >> On Fri, Dec 15, 2023 at 8:20 AM Josef Šimánek <josef.simanek@gmail.com> wrote:
> >>>
> >>> (parser is not available in public APIs of postgres_fe.h or libpq).
> >>
> >> What about building "libpg" that exposes and exports some public APIs for the parser? We can include a reference CLI implementation for basic usage of the functionality while leaving the actual language server project outside of core.
> >
> > Language server (LSP) can just benefit from this feature, but it doesn't cover all possibilities since LSP is meant for one purpose -> run in developer's code editor. Actual syntax check is more generic, able to cover CI checks and more. I would not couple this feature and LSP, LSP can just benefit from it (and it is usually built in a way that uses other tools and packs them into LSP). Exposing this kind of SQL check doesn't mean something LSP related being implemented in core. LSP can just benefit from this.
>
> I don't know enough about LSP to have a good idea how to implement this for PG, but my assumption would be we'd have some sort of library exposing "parser" to frontend tools, and also an in-core binary using that library (say, src/bin/pg_syntax_check). And LSP would use that parser library too ...
>
> I think there's about 0% chance we'd add this to the "postgres" binary.

Exposing parser to frontend tools makes no sense to me, and even if it did, it would be a huge project not worth doing just because of a syntax check. I can update the patch to prepare a new binary, but still on the backend side. This syntax check should be equivalent to running a server locally, running a query, and caring only about whether the parsing part finished successfully. In my thinking, this belongs with the backend tools.

> > Exposing the parser to libpg seems like a good idea, but I'm not sure how simply that could be done, since during my journey I have found out there are a lot of dependencies which are not present in usual frontend code per my understanding, like memory contexts, and removal of those (decoupling) would be a huge project IMHO.
>
> You're right the grammar/parser expects a lot of backend infrastructure, so making it available to frontend is going to be challenging. But I doubt there's a better way than starting with gram.y and either removing or adding the missing pieces (maybe only a mock version of it).
>
> I'm not a bison expert, but considering your goal seems to be a basic syntax check, maybe you could simply rip out most of the bits depending on backend stuff, or maybe replace them with some trivial no-op code?
>
> But as Tom mentioned, the question is how far gram.y gets you. There's plenty of ereport(ERROR) calls in src/backend/parser/*.c our users might easily consider as parse errors ...
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
On Sun, 25 Feb 2024 at 23:34, Josef Šimánek <josef.simanek@gmail.com> wrote:
> Exposing parser to frontend tools makes no sense to me

Not everyone seems to agree with that, it's actually already done by Lukas from pganalyze: https://github.com/pganalyze/libpg_query
On Mon, Feb 26, 2024 at 8:20 Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>
> On Sun, 25 Feb 2024 at 23:34, Josef Šimánek <josef.simanek@gmail.com> wrote:
> > Exposing parser to frontend tools makes no sense to me
>
> Not everyone seems to agree with that, it's actually already done by Lukas from pganalyze: https://github.com/pganalyze/libpg_query

I did a quick look. That's clearly amazing work, but it is not the parser being exposed to the frontend (refactored). It is (per my understanding) backend code isolated to the minimum needed to be able to parse a query. It is still bound to an individual backend version and to the backend source code. And it is close to my effort (I was about to start with a simpler version not providing tokens, just the result), but instead of copying files from the backend into a separate project and shaving them down to a minimum, it would provide the same tool with the backend utils directly.
On Sun, Feb 25, 2024 at 5:24 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> I think there's about 0% chance we'd add this to "postgres" binary.

Several people have taken this position, so I'm going to mark https://commitfest.postgresql.org/48/4704/ as Rejected.

My own opinion is that syntax checking is a useful thing to expose, but I don't believe that this is a useful way to expose it. I think this comment that Tom made upthread is right on target:

# Another thing I don't like is that this exposes to the user what ought
# to be purely an implementation detail, namely the division of labor
# between gram.y (raw_parser()) and the rest of the parser. There are
# checks that a user would probably see as "syntax checks" that don't
# happen in gram.y, and conversely there are some things we reject there
# that seem more like semantic than syntax issues.

I think that what that means in practice is that, while this patch may seem to give reasonable results in simple tests, as soon as you try to do slightly more complicated things with it, it's going to give weird results, either failing to flag things that you'd expect it to flag, or flagging things you'd expect it not to flag. Fixing that would be either impossible or a huge amount of work depending on your point of view. If you take the point of view that we need to adjust things so that the raw parser reports all the things that ought to be reported by a tool like this and none of the things that it shouldn't, then it's probably just a huge amount of work. If you take the point of view that what goes into the raw parser and what goes into parse analysis ought to be an implementation decision untethered to what a tool like this ought to report, then fixing the problems would be impossible, or at least, it would amount to throwing away this patch and starting over. I think the latter point of view, which Tom has already taken, would be the more prevalent view among hackers by far, but even if the former view prevailed, who is going to do all that work?

I strongly suspect that doing something useful in this area requires about two orders of magnitude more code than are included in this patch, and a completely different design. If it actually worked well to do something this simple, somebody probably would have done it already. In fact, there are already tools out there for validation, like https://github.com/okbob/plpgsql_check for example. That tool doesn't do exactly the same thing as this patch is trying to do, but it does do other kinds of validation, and it's 19k lines of code, vs. the 45 lines of code in this patch, which I think reinforces the point that you need to do something much more complicated than this to create real value.

Also, the patch as proposed, besides being 45 lines, also has zero lines of comments, no test cases, no documentation, and doesn't follow the PostgreSQL coding standards. I'm not saying that to be mean, nor am I suggesting that Josef should go fix that stuff. It's perfectly reasonable to propose a small patch without a lot of research to see what people think -- but now we have the answer to that question: people think it won't work. So Josef should now decide to either give up, or try a new approach, or if he's really sure that all the feedback that has been given so far is completely wrong, he can try to demonstrate that the patch does all kinds of wonderful things with very few disadvantages. But I think if he does that last, he's going to find that it's not really possible, because I'm pretty sure that Tom is right.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, May 15, 2024 at 19:43 Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Feb 25, 2024 at 5:24 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > I think there's about 0% chance we'd add this to "postgres" binary.
>
> Several people have taken this position, so I'm going to mark https://commitfest.postgresql.org/48/4704/ as Rejected.
>
> My own opinion is that syntax checking is a useful thing to expose, but I don't believe that this is a useful way to expose it. I think this comment that Tom made upthread is right on target:
>
> # Another thing I don't like is that this exposes to the user what ought
> # to be purely an implementation detail, namely the division of labor
> # between gram.y (raw_parser()) and the rest of the parser. There are
> # checks that a user would probably see as "syntax checks" that don't
> # happen in gram.y, and conversely there are some things we reject there
> # that seem more like semantic than syntax issues.
>
> I think that what that means in practice is that, while this patch may seem to give reasonable results in simple tests, as soon as you try to do slightly more complicated things with it, it's going to give weird results, either failing to flag things that you'd expect it to flag, or flagging things you'd expect it not to flag. Fixing that would be either impossible or a huge amount of work depending on your point of view. If you take the point of view that we need to adjust things so that the raw parser reports all the things that ought to be reported by a tool like this and none of the things that it shouldn't, then it's probably just a huge amount of work. If you take the point of view that what goes into the raw parser and what goes into parse analysis ought to be an implementation decision untethered to what a tool like this ought to report, then fixing the problems would be impossible, or at least, it would amount to throwing away this patch and starting over. I think the latter point of view, which Tom has already taken, would be the more prevalent view among hackers by far, but even if the former view prevailed, who is going to do all that work?
>
> I strongly suspect that doing something useful in this area requires about two orders of magnitude more code than are included in this patch, and a completely different design. If it actually worked well to do something this simple, somebody probably would have done it already. In fact, there are already tools out there for validation, like https://github.com/okbob/plpgsql_check for example. That tool doesn't do exactly the same thing as this patch is trying to do, but it does do other kinds of validation, and it's 19k lines of code, vs. the 45 lines of code in this patch, which I think reinforces the point that you need to do something much more complicated than this to create real value.
>
> Also, the patch as proposed, besides being 45 lines, also has zero lines of comments, no test cases, no documentation, and doesn't follow the PostgreSQL coding standards. I'm not saying that to be mean, nor am I suggesting that Josef should go fix that stuff. It's perfectly reasonable to propose a small patch without a lot of research to see what people think -- but now we have the answer to that question: people think it won't work. So Josef should now decide to either give up, or try a new approach, or if he's really sure that all the feedback that has been given so far is completely wrong, he can try to demonstrate that the patch does all kinds of wonderful things with very few disadvantages. But I think if he does that last, he's going to find that it's not really possible, because I'm pretty sure that Tom is right.

I'm totally OK with marking this as rejected, and indeed the 45 lines were just a minimal patch to create a PoC (I coded this during the last PGConf.eu lunch break) and mainly to start a discussion.

I'm not sure everyone in this thread understands the reason for this patch, which is clearly my fault, since I have failed to explain. The main idea is to make a tool to validate that a query can be parsed, that's all. Similar to running "EXPLAIN query", but not caring about the result and not caring about the DB structure (ignoring missing tables, ...), just checking it was successfully executed. This definitely belongs to the server side and not to the client side; it is just a tool to validate that, for this running PostgreSQL backend, it is a "parseable" query.

I'm not giving up on this, but I hear there are various problems to explore. If I understand it well, just running the parser on a query doesn't guarantee the query is valid, since it can fail later for various reasons (I mean other than a missing table, ...). I wasn't aware of that. Also, exposing this inside the postgres binary seems controversial. I had an idea to expose the parser result at the SQL level with a new command (similar to EXPLAIN), but you'd need a running PostgreSQL backend to be able to use this capability, which is against one of the original ideas. On the other side, PostgreSQL exposes a lot of "meta" functionality already, and this could be a nice addition.

I'll revisit the discussion again and try another submission once I get more context and experience. Thanks everyone for the constructive comments!

> --
> Robert Haas
> EDB: http://www.enterprisedb.com
On Fri, Dec 15, 2023 at 15:50 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
> > On Fri, 2023-12-15 at 13:21 +0100, Josef Šimánek wrote:
> >> Inspired by Simon Riggs' keynote talk at PGConf.eu 2023 sharing a list of ideas for PostgreSQL (https://riggs.business/blog/f/postgresql-todo-2023), I have crafted a quick patch to do SQL syntax validation.
> >>
> >> What do you think?
>
> > I like the idea. But what will happen if the SQL statement references tables or other objects, since we have no database?
>
> This seems like a fairly useless wart to me.

this hurts :'(

> In the big picture a command line switch in the postgres executable doesn't feel like the right place for this. There's no good reason to assume that the server executable will be installed where you want this capability; not to mention the possibility of version skew between that executable and whatever installation you're actually running on.
>
> Another thing I don't like is that this exposes to the user what ought to be purely an implementation detail, namely the division of labor between gram.y (raw_parser()) and the rest of the parser. There are checks that a user would probably see as "syntax checks" that don't happen in gram.y, and conversely there are some things we reject there that seem more like semantic than syntax issues.
>
> regards, tom lane
Josef Šimánek <josef.simanek@gmail.com> writes:
> I'm not sure everyone in this thread understands the reason for this patch, which is clearly my fault, since I have failed to explain. The main idea is to make a tool to validate that a query can be parsed, that's all. Similar to running "EXPLAIN query", but not caring about the result and not caring about the DB structure (ignoring missing tables, ...), just checking it was successfully executed. This definitely belongs to the server side and not to the client side; it is just a tool to validate that, for this running PostgreSQL backend, it is a "parseable" query.

The thing that was bothering me most about this is that I don't understand why that's a useful check. If I meant to type

UPDATE mytab SET mycol = 42;

and instead I type

UPDATEE mytab SET mycol = 42;

your proposed feature would catch that; great. But if I type

UPDATE mytabb SET mycol = 42;

it won't. How does that make sense? I'm not entirely sure where to draw the line about what a "syntax check" should catch, but this seems a bit south of what I'd want in a syntax-checking editor.

BTW, if you do feel that a pure grammar check is worthwhile, you should look at the ecpg preprocessor, which does more or less that with the SQL portions of its input. ecpg could be a better model to follow because it doesn't bring all the dependencies of the server and so is much more likely to appear in a client-side installation. It's kind of an ugly, unmaintained mess at the moment, sadly.

The big knock on doing this client-side is that there might be version skew compared to the server you're using --- but if you are not doing anything beyond a grammar-level check then your results are pretty approximate anyway, ISTM. We've not heard anything suggesting that version skew is a huge problem for ecpg users.

regards, tom lane
Tom Lane:
> The thing that was bothering me most about this is that I don't understand why that's a useful check. If I meant to type
>
> UPDATE mytab SET mycol = 42;
>
> and instead I type
>
> UPDATEE mytab SET mycol = 42;
>
> your proposed feature would catch that; great. But if I type
>
> UPDATE mytabb SET mycol = 42;
>
> it won't. How does that make sense? I'm not entirely sure where to draw the line about what a "syntax check" should catch, but this seems a bit south of what I'd want in a syntax-checking editor.
>
> BTW, if you do feel that a pure grammar check is worthwhile, you should look at the ecpg preprocessor, which does more or less that with the SQL portions of its input. ecpg could be a better model to follow because it doesn't bring all the dependencies of the server and so is much more likely to appear in a client-side installation. It's kind of an ugly, unmaintained mess at the moment, sadly.

Would working with ecpg allow to get back a parse tree of the query to do stuff with that? This is really what is missing for the ecosystem. A libpqparser for tools to use: Formatters, linters, query rewriters, simple syntax checkers... they are all missing access to postgres' own parser.

Best,

Wolfgang
On Wed, May 15, 2024 at 20:39 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Josef Šimánek <josef.simanek@gmail.com> writes:
> > I'm not sure everyone in this thread understands the reason for this patch, which is clearly my fault, since I have failed to explain. The main idea is to make a tool to validate that a query can be parsed, that's all. Similar to running "EXPLAIN query", but not caring about the result and not caring about the DB structure (ignoring missing tables, ...), just checking it was successfully executed. This definitely belongs to the server side and not to the client side; it is just a tool to validate that, for this running PostgreSQL backend, it is a "parseable" query.
>
> The thing that was bothering me most about this is that I don't understand why that's a useful check. If I meant to type
>
> UPDATE mytab SET mycol = 42;
>
> and instead I type
>
> UPDATEE mytab SET mycol = 42;
>
> your proposed feature would catch that; great. But if I type
>
> UPDATE mytabb SET mycol = 42;
>
> it won't. How does that make sense? I'm not entirely sure where to draw the line about what a "syntax check" should catch, but this seems a bit south of what I'd want in a syntax-checking editor.

This is exactly where the line is drawn. My motivation is not to use this feature for syntax checking in an editor (even though it could be used to find those banalities). I'm looking to improve automation, to be able to detect those banalities as early as possible. Let's say there is a complex CI automation configuring and starting a PostgreSQL backend, loading some data, ... and in the end all of this is useless, since there is this kind of simple mistake, like UPDATEE. I would like to detect this problem as early as possible and stop the complex CI pipeline, to save time and also to save resources (= money) by failing super early and reporting back. This kind of mistake could be introduced simply by, for example, a wrongly resolved git conflict or a human typing error...

This kind of mistake (typo, ...) can usually be spotted super early. In compiled languages during compilation; in interpreted languages (like Ruby) at program start (since the code is parsed as one of the first steps). There is no such early detection possible for PostgreSQL currently IMHO.

> BTW, if you do feel that a pure grammar check is worthwhile, you should look at the ecpg preprocessor, which does more or less that with the SQL portions of its input. ecpg could be a better model to follow because it doesn't bring all the dependencies of the server and so is much more likely to appear in a client-side installation. It's kind of an ugly, unmaintained mess at the moment, sadly.
>
> The big knock on doing this client-side is that there might be version skew compared to the server you're using --- but if you are not doing anything beyond a grammar-level check then your results are pretty approximate anyway, ISTM. We've not heard anything suggesting that version skew is a huge problem for ecpg users.

Thanks for the info, I'll check.

> regards, tom lane
walther@technowledgy.de writes:
> Tom Lane:
>> BTW, if you do feel that a pure grammar check is worthwhile, you should look at the ecpg preprocessor, which does more or less that with the SQL portions of its input.

> Would working with ecpg allow to get back a parse tree of the query to do stuff with that?

No, ecpg isn't interested in building a syntax tree.

> This is really what is missing for the ecosystem. A libpqparser for tools to use: Formatters, linters, query rewriters, simple syntax checkers... they are all missing access to postgres' own parser.

To get to that, you'd need some kind of agreement on what the syntax tree is. I doubt our existing implementation would be directly useful to very many tools, and even if it is, do they want to track constant version-to-version changes?

regards, tom lane
On Wed, May 15, 2024 at 2:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The thing that was bothering me most about this is that I don't understand why that's a useful check. If I meant to type
>
> UPDATE mytab SET mycol = 42;
>
> and instead I type
>
> UPDATEE mytab SET mycol = 42;
>
> your proposed feature would catch that; great. But if I type
>
> UPDATE mytabb SET mycol = 42;
>
> it won't. How does that make sense? I'm not entirely sure where to draw the line about what a "syntax check" should catch, but this seems a bit south of what I'd want in a syntax-checking editor.

I don't agree with this, actually. The first wrong example can never be valid, while the second one can be valid given the right table definitions. That lines up quite nicely with the distinction between parsing and parse analysis.

There is a problem with actually getting all the way there, I'm fairly sure, because we've got thousands of lines of gram.y and thousands of lines of parse analysis code that weren't all written with the idea of making a crisp distinction here. For example, I'd like both EXPLAIN (WAFFLES) SELECT 1 and EXPLAIN WAFFLES SELECT 1 to be flagged as syntactically invalid, and with things as they are that would not happen. Even for plannable statements I would not be at all surprised to hear that there are a bunch of corner cases that we'd get wrong.

But I don't understand the idea that the concept doesn't make sense. I think it is perfectly reasonable to imagine a world in which the initial parsing takes care of reporting everything that can be determined by static analysis without knowing anything about catalog contents, and parse analysis checks all the things that require catalog access, and you can run the first set of checks and then decide whether to proceed further. I think if I were designing a new system from scratch, I'd definitely want to set it up that way, and I think moving our existing system in that direction would probably let us clean up a variety of warts along the way. Really, the only argument I see against it is that getting from where we are to there would be a gigantic amount of work for the value we'd derive.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, May 15, 2024 at 2:12 PM Josef Šimánek <josef.simanek@gmail.com> wrote:
> I'm totally OK with marking this as rejected, and indeed the 45 lines were just a minimal patch to create a PoC (I coded this during the last PGConf.eu lunch break) and mainly to start a discussion.

Thanks for understanding.

> I'm not sure everyone in this thread understands the reason for this patch, which is clearly my fault, since I have failed to explain. The main idea is to make a tool to validate that a query can be parsed, that's all. Similar to running "EXPLAIN query", but not caring about the result and not caring about the DB structure (ignoring missing tables, ...), just checking it was successfully executed. This definitely belongs to the server side and not to the client side; it is just a tool to validate that, for this running PostgreSQL backend, it is a "parseable" query.

I don't think it's at all obvious that this belongs on the server side rather than the client side. I think it could be done in either place, or both. I just think we don't have the infrastructure to do it cleanly, at present.

--
Robert Haas
EDB: http://www.enterprisedb.com
Tom Lane:
>> This is really what is missing for the ecosystem. A libpqparser for tools to use: Formatters, linters, query rewriters, simple syntax checkers... they are all missing access to postgres' own parser.
>
> To get to that, you'd need some kind of agreement on what the syntax tree is. I doubt our existing implementation would be directly useful to very many tools, and even if it is, do they want to track constant version-to-version changes?

Correct, on top of what the syntax tree currently has, one would probably need:
- comments
- locations (line number / character) for everything, including those of comments

Otherwise it's impossible to print proper SQL again without losing information. And then on top of that, to be really really useful, you'd need to be able to parse partial statements, too, to support all kinds of "language server" applications.

Tracking version-to-version changes is exactly the reason why it would be good to have that from upstream, imho. New syntax is added in (almost?) every release and everyone outside core trying to write their own parser and staying up2date with **all** the new syntax.. will eventually fail. Yes, there could be changes to the produced parse tree as well and you'd also need to adjust, for example, your SQL-printers. But it should be easier to stay up2date than right now.

Best,

Wolfgang
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 15, 2024 at 2:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The thing that was bothering me most about this is that I don't understand why that's a useful check. ...

> But I don't understand the idea that the concept doesn't make sense.

Sorry: "make sense" was a poorly chosen phrase there. What I was doubting, and continue to doubt, is that 100% checking of what you can check without catalog access and 0% checking of the rest is a behavior that end users will find especially useful.

> I think it is perfectly reasonable to imagine a world in which the initial parsing takes care of reporting everything that can be determined by static analysis without knowing anything about catalog contents, and parse analysis checks all the things that require catalog access, and you can run the first set of checks and then decide whether to proceed further. I think if I were designing a new system from scratch, I'd definitely want to set it up that way, and I think moving our existing system in that direction would probably let us clean up a variety of warts along the way. Really, the only argument I see against it is that getting from where we are to there would be a gigantic amount of work for the value we'd derive.

I'm less enthusiastic about this than you are. I think it would likely produce a slower and less maintainable system. Slower because we'd need more passes over the query: what parse analysis does today would have to be done in at least two separate steps. Less maintainable because knowledge about certain things would end up being more spread around the system. Taking your example of getting syntax checking to recognize invalid EXPLAIN keywords: right now there's just one piece of code that knows what those options are, but this'd require two. Also, "run the first set of checks and then decide whether to proceed further" seems like optimizing the system for broken queries over valid ones, which I don't think is an appropriate goal.

Now, I don't say that there'd be *no* benefit to reorganizing the system that way. But it wouldn't be an unalloyed win, and so I share your bottom line that the costs would be out of proportion to the benefits.

regards, tom lane
On Wed, May 15, 2024 at 12:18 PM <walther@technowledgy.de> wrote:
Tom Lane:
>> This is really what is missing for the ecosystem. A libpqparser for
>> tools to use: Formatters, linters, query rewriters, simple syntax
>> checkers... they are all missing access to postgres' own parser.
>
> To get to that, you'd need some kind of agreement on what the syntax
> tree is. I doubt our existing implementation would be directly useful
> to very many tools, and even if it is, do they want to track constant
> version-to-version changes?
Correct, on top of what the syntax tree currently has, one would
probably need:
- comments
- locations (line number / character) for everything, including those of
comments
I'm with the original patch idea at this point. A utility that simply runs text through the parser, not parse analysis, and answers the question: "Were you able to parse this?" has both value and seems like something that can be patched into core in a couple of hundred lines, not thousands, as has already been demonstrated.
Sure, other questions are valid and other goals exist in the ecosystem, but that doesn't diminish this one which is sufficiently justified for my +1 on the idea.
Now, in my ideal world something like this could be made as an extension so that it can work on older versions and not have to be maintained by core. And likely grow more features over time. Is there anything fundamental about this that prevents it being implemented in an extension and, if so, what can we add to core in v18 to alleviate that limitation?
David J.
On Wed, May 15, 2024 at 21:28 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, May 15, 2024 at 2:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> The thing that was bothering me most about this is that I don't understand why that's a useful check. ...
>
> > But I don't understand the idea that the concept doesn't make sense.
>
> Sorry: "make sense" was a poorly chosen phrase there. What I was doubting, and continue to doubt, is that 100% checking of what you can check without catalog access and 0% checking of the rest is a behavior that end users will find especially useful.

But that's a completely different feature, which is not exclusive and shouldn't block this feature from doing only exactly what is specified.

> > I think it is perfectly reasonable to imagine a world in which the initial parsing takes care of reporting everything that can be determined by static analysis without knowing anything about catalog contents, and parse analysis checks all the things that require catalog access, and you can run the first set of checks and then decide whether to proceed further. I think if I were designing a new system from scratch, I'd definitely want to set it up that way, and I think moving our existing system in that direction would probably let us clean up a variety of warts along the way. Really, the only argument I see against it is that getting from where we are to there would be a gigantic amount of work for the value we'd derive.
>
> I'm less enthusiastic about this than you are. I think it would likely produce a slower and less maintainable system. Slower because we'd need more passes over the query: what parse analysis does today would have to be done in at least two separate steps. Less maintainable because knowledge about certain things would end up being more spread around the system. Taking your example of getting syntax checking to recognize invalid EXPLAIN keywords: right now there's just one piece of code that knows what those options are, but this'd require two. Also, "run the first set of checks and then decide whether to proceed further" seems like optimizing the system for broken queries over valid ones, which I don't think is an appropriate goal.
>
> Now, I don't say that there'd be *no* benefit to reorganizing the system that way. But it wouldn't be an unalloyed win, and so I share your bottom line that the costs would be out of proportion to the benefits.
>
> regards, tom lane
On Wed, May 15, 2024 at 21:33 David G. Johnston <david.g.johnston@gmail.com> wrote:
>
> On Wed, May 15, 2024 at 12:18 PM <walther@technowledgy.de> wrote:
>>
>> Tom Lane:
>> >> This is really what is missing for the ecosystem. A libpqparser for tools to use: Formatters, linters, query rewriters, simple syntax checkers... they are all missing access to postgres' own parser.
>> >
>> > To get to that, you'd need some kind of agreement on what the syntax tree is. I doubt our existing implementation would be directly useful to very many tools, and even if it is, do they want to track constant version-to-version changes?
>>
>> Correct, on top of what the syntax tree currently has, one would probably need:
>> - comments
>> - locations (line number / character) for everything, including those of comments
>
> I'm with the original patch idea at this point. A utility that simply runs text through the parser, not parse analysis, and answers the question: "Were you able to parse this?" has both value and seems like something that can be patched into core in a couple of hundred lines, not thousands, as has already been demonstrated.
>
> Sure, other questions are valid and other goals exist in the ecosystem, but that doesn't diminish this one which is sufficiently justified for my +1 on the idea.
>
> Now, in my ideal world something like this could be made as an extension so that it can work on older versions and not have to be maintained by core. And likely grow more features over time. Is there anything fundamental about this that prevents it being implemented in an extension and, if so, what can we add to core in v18 to alleviate that limitation?

Like an extension providing an additional binary? Or what kind of extension do you mean? One of the original ideas was to be able to do so (parse a query) without running postgres itself. Could an extension be useful without a running postgres backend?

> David J.
On Wed, May 15, 2024 at 12:35 PM Josef Šimánek <josef.simanek@gmail.com> wrote:
On Wed, May 15, 2024 at 21:33 David G. Johnston <david.g.johnston@gmail.com> wrote:
> Now, in my ideal world something like this could be made as an extension so that it can work on older versions and not have to be maintained by core. And likely grow more features over time. Is there anything fundamental about this that prevents it being implemented in an extension and, if so, what can we add to core in v18 to alleviate that limitation?
Like extension providing additional binary? Or what kind of extension
do you mean? One of the original ideas was to be able to do so (parse
query) without running postgres itself. Could extension be useful
without running postgres backend?
Pushing beyond my experience level here...but yes a separately installed binary (extension is being used conceptually here, this doesn't involve "create extension") that can inspect pg_config to find out where backend/postmaster library objects are installed and link to them.
David J.

"David G. Johnston" <david.g.johnston@gmail.com> writes:
> Now, in my ideal world something like this could be made as an extension so that it can work on older versions and not have to be maintained by core. And likely grow more features over time. Is there anything fundamental about this that prevents it being implemented in an extension and, if so, what can we add to core in v18 to alleviate that limitation?

It'd be pretty trivial to create a function that takes a string and runs it through raw_parser --- I've got such things laying about for microbenchmarking purposes, in fact. But the API that'd present for tools such as editors is enormously different from the proposed patch: there would need to be a running server and they'd need to be able to log into it, plus there are more minor concerns like having to wrap the string in valid quoting.

Now on the plus side, once you'd bought into that environment, it'd be equally trivial to offer alternatives like "run raw parsing and parse analysis, but don't run the query". I continue to maintain that that's the set of checks you'd really want in a lot of use-cases.

regards, tom lane
On Wed, May 15, 2024 at 3:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sorry: "make sense" was a poorly chosen phrase there. What I was doubting, and continue to doubt, is that 100% checking of what you can check without catalog access and 0% checking of the rest is a behavior that end users will find especially useful.

You might be right, but my guess is that you're wrong. Syntax highlighting is very popular, and seems like a more sophisticated version of that same concept. I don't personally like it or use it myself, but I bet I'm hugely in the minority these days. And EDB certainly gets customer requests for syntax checking of various kinds; whether this particular kind would get more or less traction than other things is somewhat moot in view of the low likelihood of it actually happening.

> I'm less enthusiastic about this than you are. I think it would likely produce a slower and less maintainable system. Slower because we'd need more passes over the query: what parse analysis does today would have to be done in at least two separate steps. Less maintainable because knowledge about certain things would end up being more spread around the system. Taking your example of getting syntax checking to recognize invalid EXPLAIN keywords: right now there's just one piece of code that knows what those options are, but this'd require two. Also, "run the first set of checks and then decide whether to proceed further" seems like optimizing the system for broken queries over valid ones, which I don't think is an appropriate goal.

Well, we've talked before about other problems that stem from the fact that DDL doesn't have a clear separation between parse analysis and execution. I vaguely imagine that it would be valuable to clean that up for various reasons. But I haven't really thought it through, so I'm prepared to concede that it might have various downsides that aren't presently obvious to me.

> Now, I don't say that there'd be *no* benefit to reorganizing the system that way. But it wouldn't be an unalloyed win, and so I share your bottom line that the costs would be out of proportion to the benefits.

I'm glad we agree on that much, and don't feel a compelling need to litigate the remaining differences between our positions, unless you really want to. I'm just telling you what I think, and I'm pleased that we think as similarly as we do, despite remaining differences.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, May 15, 2024 at 1:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 15, 2024 at 3:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sorry: "make sense" was a poorly chosen phrase there. What I was
> doubting, and continue to doubt, is that 100% checking of what
> you can check without catalog access and 0% checking of the rest
> is a behavior that end users will find especially useful.
You might be right, but my guess is that you're wrong. Syntax
highlighting is very popular, and seems like a more sophisticated
version of that same concept.
The proposal seems distinctly less sophisticated, though it would be a starting point.
I think the documentation for --syntax check would read something like this:
postgres --syntax-check={filename | -}
Performs a pass over the lexical structure of the script supplied in filename or, if - is specified, standard input, then exits. The exit code is zero if no errors were found, otherwise it is 1, and the errors, at full verbosity, are printed to standard error. This does not involve reading the configuration file and, by extension, will not detect errors that require knowledge of a database schema, including the system catalogs, to manifest. There will be no false positives, but due to the prior rule, false negatives must be factored into its usage. Thus this option is most useful as an initial triage point, quickly rejecting SQL code without requiring a running PostgreSQL service.
Note: This is exposed as a convenient way to get access to the outcome of performing a raw parse within the specific version of the postgres binary being executed. The specific implementation of that parse is still non-public. Likewise, PostgreSQL doesn't itself have a use for a raw parse output beyond sending it to post-parse analysis. All of the catalog required checks, and potentially some that don't obviously need the catalogs, happen in this post-parse step; which the syntax checking API does not expose. In short, the API here doesn't include any guarantees regarding the specific errors one should expect to see output, only the no false positive test result of performing the first stage raw parse.
David J.
If in core, I would still want to expose this as, say, a contrib module binary instead of hacking it into postgres. It would be our first server program entry there.
On Wed, May 15, 2024 at 6:35 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
If in core, I would still want to expose this as, say, a contrib module binary instead of hacking it into postgres. It would be our first server program entry there.
Sorry for self-reply but:
Maybe name it "pg_script_check" with a, for now mandatory, "--syntax-only" option that enables this raw parse mode. Leaving room for executing this in an environment where there is, or it can launch, a running instance that then performs post-parse analysis as well.
David J.
On Wed, 2024-05-15 at 14:39 -0400, Tom Lane wrote:
> The thing that was bothering me most about this is that I don't understand why that's a useful check. If I meant to type
>
> UPDATE mytab SET mycol = 42;
>
> and instead I type
>
> UPDATEE mytab SET mycol = 42;
>
> your proposed feature would catch that; great. But if I type
>
> UPDATE mytabb SET mycol = 42;
>
> it won't. How does that make sense?

It makes sense to me. I see a clear distinction between "this is a valid SQL statement" and "this is an SQL statement that will run on a specific database with certain objects in it". To me, "correct syntax" is the former.

Yours,
Laurenz Albe
On 15.05.24 21:05, Robert Haas wrote:
> I don't think it's at all obvious that this belongs on the server side rather than the client side. I think it could be done in either place, or both. I just think we don't have the infrastructure to do it cleanly, at present.

I think if you're going to do a syntax-check-with-catalog-lookup mode, you need authentication and access control. The mode without catalog lookup doesn't need that. But it might be better to offer both modes through a similar interface (or at least plan ahead for that).