Thread: hint infrastructure setup (v3)
Dear patchers, Well, as the discussion rages over my previous patch submissions, I've time to improve the code;-) I finally figured out that there is 2 errhint functions (elog.c vs ipc_text.c), and the one I'm calling is the fist one, so I need to put a format. The second appears to ignore it's arguments after the first. Anyway, please consider the following patch for inclusion to current head. It validates for me. I need it to be able to go on. Have a nice day, -- Fabien Coelho - coelho@cri.ensmp.fr
Attachment
Fabien COELHO wrote: > > Dear patchers, > > Well, as the discussion rages over my previous patch submissions, I've > time to improve the code;-) > > I finally figured out that there is 2 errhint functions (elog.c vs > ipc_text.c), and the one I'm calling is the fist one, so I need to put a > format. The second appears to ignore it's arguments after the first. > > Anyway, please consider the following patch for inclusion to current > head. It validates for me. I need it to be able to go on. Why did all the tags have to be renamed: + cmdGRANT: GRANT {noH;}; Also, what is typical output for a hint? Can you show one? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Dear Bruce, > Why did all the tags have to be renamed: > > + cmdGRANT: GRANT {noH;}; that's a good question. In order to add hints, I want to attach them to the states of the parser automaton. So as to do that, I need I put a name on every state, so I need to refactor the prefix that lead to a state, at give it a name. Otherwise, I would have : xxx: GRANT {noH;} ALL | GRANT {noH;} CREATE ; (1) I would have to put the "after GRANT" hint twice, one after each GRANT occurence. (2) this would result in shift/reduce conflicts, as the parser does not know which hint should be put. Well, it is the same code in both case, but yacc does not look at that. Also, I need to stop hints, otherwise the last 'pushed' hint would be shown on any error later. > Also, what is typical output for a hint? Can you show one? Well, the current status of the infrastructure is that there is no hint;-) The only hint with the patch is if you do not put an SQL command. It may look like : psql> creat table foo(); ERROR: syntax error at or near "creat" at character 1 creat table foo(); ^ HINT: sql command such as SELECT or CREATE... ? All other syntax errors result in no hint being displayed, thanks to all added noH calls I put. Also, As far as I remember, I put a new option to activate these hints. Hints are disactivated by default. This is because some people do not like advices and may want to turn them on or off. -- Fabien Coelho - coelho@cri.ensmp.fr
Fabien COELHO <coelho@cri.ensmp.fr> writes: >> Also, what is typical output for a hint? Can you show one? > Well, the current status of the infrastructure is that there is no hint;-) Ah, now I remember why I was waiting to review that stuff: I was expecting you to come out with a version that actually did some things :-( What you've got at the moment is a patch that significantly uglifies the grammar without actually doing anything useful, or even clearly pointing the way to what else will need to happen before it does do something useful. So it's difficult to form any reasoned judgment about whether we want to commit to following this path or not. As I said back at the beginning, I'm unconvinced that this path leads to anything useful --- it seems like an extremely painful way of accomplishing less than what a simple change to auto-show the command's syntax summary could do. To change my mind, you've got to show me some concrete results of hints that are more useful than "\h <command>" is. regards, tom lane
Dear Tom, > > Well, the current status of the infrastructure is that there is no hint;-) > > Ah, now I remember why I was waiting to review that stuff: I was expecting > you to come out with a version that actually did some things :-( Well, if you wait for something from me, it is better to tell me directly. I was waiting for anything to happen to the patch (accept, discuss or reject) before submitting anything else. > What you've got at the moment is a patch that significantly uglifies the > grammar without actually doing anything useful, or even clearly pointing > the way to what else will need to happen before it does do something > useful. So it's difficult to form any reasoned judgment about whether > we want to commit to following this path or not. I can see your point. The reasonnable way out of the deadlock that I can suggest is the following: I can resubmit a new patch that would provide the needed infrastructure AND some hints on some commands as a proof of concept of what can be achieved. Then you can decide whether it is worth having this kind of thing on all commands, or not. If not, I won't have passed 3 week-ends in the parser instead if my garden for nothing. If you are okay then you apply, and I'll submit some new patches later on, one command at a time, and when I have time. Does this sound reasonnable enough? -- Fabien Coelho - coelho@cri.ensmp.fr
Fabien COELHO <coelho@cri.ensmp.fr> writes: > I can resubmit a new patch that would provide the needed infrastructure > AND some hints on some commands as a proof of concept of what can be > achieved. I quite agree that you shouldn't do a complete implementation when it's not clear if we'll accept it or not. What I'd like to see is a demo, basically. How about one example of each of several different kinds of hints? We need to get a feeling for what can be accomplished and how messy the grammar would get. BTW, it seems like you are putting in a lot of infrastructure to recreate externally the parse-stack state that bison has internally. Maybe an interesting alternative would be to look at touching that state directly. I realize that bison doesn't consider that part of their official API, but bison is a stable tool and seems unlikely to change much in the future. We could probably get away with it... regards, tom lane
Dear Tom, > I quite agree that you shouldn't do a complete implementation when it's > not clear if we'll accept it or not. What I'd like to see is a demo, > basically. How about one example of each of several different kinds > of hints? We need to get a feeling for what can be accomplished and how > messy the grammar would get. I've already tried some implementation but it was not showable, and because the infrastructure was not in place, inappropriate hints could be shown. It is not "that" messy. Just prefix are given a name to attach a rule for hint updates. > BTW, it seems like you are putting in a lot of infrastructure to > recreate externally the parse-stack state that bison has internally. Yes. > Maybe an interesting alternative would be to look at touching that > state directly. I realize that bison doesn't consider that part of > their official API, but bison is a stable tool and seems unlikely to > change much in the future. We could probably get away with it... Interesting, however: I would not like to break postgresql portability with other parser generators. It would be enough to reject the patch from my point of view;-) Using some non standard undocumented API would not help other people who would like to touch the parser. It would not help me either, because I would have to learn the undocumented API;-) I really need to give a name to the state so as to be able to attach a hint. I cannot set how I could guess the state number given after "GRANT" just from the source, and without generating lots of conflicts. So I'll do some more programming maybe over the week-end a resubmit something soon. -- Fabien.
Fabien COELHO wrote: > I would not like to break postgresql portability with other parser > generators. It would be enough to reject the patch from my point of > view;-) We require bison to build, period. I am sure we use enough bison-specific stuff now. No one has suggested another parser generator to support either. > > Using some non standard undocumented API would not help other people who > would like to touch the parser. > > It would not help me either, because I would have to learn the > undocumented API;-) > > I really need to give a name to the state so as to be able to attach a > hint. I cannot set how I could guess the state number given after "GRANT" > just from the source, and without generating lots of conflicts. > > So I'll do some more programming maybe over the week-end a resubmit > something soon. Please show us an example of what a typical hint would look like to the user. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
> Please show us an example of what a typical hint would look like to the > user. As an illustration, here is the kind of thing I had with an early quick and dirty proof of concept implementation I made for myself one month ago. As a result of this first try, I've convinced myself that I should do that cleanly and incrementally, one command at a time, with a proper clean infrastructure, and intermediate validations... My idea of the target audience for those hints is my students while they're learning SQL. -- -- test HINTs provided on parser errors. -- -- it should: -- (1) trigger all possible hints -- (2) include ACTUAL syntax errors by REAL people -- -- Note that these test assume a psql like interface which -- can only send "lexable" and "even" commands to the backend. -- Thus I cannot test badly closed quotes or parentheses. -- -- -- (1) all hints -- -- -- sql command expected foo; ERROR: syntax error at or near "foo" at character 1 HINT: sql command such as SELECT, CREATE, GRANT... expected 123; ERROR: syntax error at or near "123" at character 1 HINT: sql command such as SELECT, CREATE, GRANT... expected +; ERROR: syntax error at or near "+" at character 1 HINT: sql command such as SELECT, CREATE, GRANT... expected 'hello world'; ERROR: syntax error at or near "'hello world'" at character 1 HINT: sql command such as SELECT, CREATE, GRANT... expected NULL; ERROR: syntax error at or near "NULL" at character 1 HINT: sql command such as SELECT, CREATE, GRANT... expected COLUMN; ERROR: syntax error at or near "COLUMN" at character 1 HINT: sql command such as SELECT, CREATE, GRANT... expected CREAT TABLE foo(id INT4); ERROR: syntax error at or near "CREAT" at character 1 HINT: sql command such as SELECT, CREATE, GRANT... expected -- -- CREATE xxx CREATE 123; ERROR: syntax error at or near "123" at character 8 HINT: OR REPLACE, TABLE, DATABASE, USER... expected CREATE foo; ERROR: syntax error at or near "foo" at character 8 HINT: OR REPLACE, TABLE, DATABASE, USER... expected CREATE + CREATE (); ERROR: syntax error at or near "+" at character 8 HINT: OR REPLACE, TABLE, DATABASE, USER... expected CREATE INT4; ERROR: syntax error at or near "INT4" at character 8 HINT: OR REPLACE, TABLE, DATABASE, USER... expected CREATE OR foo; ERROR: syntax error at or near "foo" at character 11 HINT: REPLACE expected -- -- DROP xxx DROP 123; ERROR: syntax error at or near "123" at character 6 HINT: object such as TABLE, DATABASE, USER... expected DROP foo; ERROR: syntax error at or near "foo" at character 6 HINT: object such as TABLE, DATABASE, USER... expected DROP USER 123; ERROR: syntax error at or near "123" at character 11 HINT: user id such as calvin expected DROP SYSID 123; ERROR: syntax error at or near "SYSID" at character 6 HINT: object such as TABLE, DATABASE, USER... expected DROP USER SYSID 123; ERROR: syntax error at or near "123" at character 17 DROP GROUP 321; ERROR: syntax error at or near "321" at character 12 HINT: group id such as admin expected DROP GROUP calvin, foo; ERROR: syntax error at or near "," at character 18 HINT: ; expected DROP TABLE (foo); ERROR: syntax error at or near "(" at character 12 HINT: table name... expected -- -- CREATE USER CREATE USER COLUMN; ERROR: syntax error at or near "COLUMN" at character 13 HINT: user id such as calvin expected CREATE USER 123; ERROR: syntax error at or near "123" at character 13 HINT: user id such as calvin expected CREATE USER 'hello'; ERROR: syntax error at or near "'hello'" at character 13 HINT: user id such as calvin expected CREATE USER calvin foo; ERROR: syntax error at or near "foo" at character 20 HINT: user options or SET or RESET expected CREATE USER calvin WITH foo; ERROR: syntax error at or near "foo" at character 25 HINT: user options expected CREATE USER calvin WITH CREATEDB foo; ERROR: syntax error at or near "foo" at character 34 HINT: other user options or ';' expected CREATE USER calvin PASSWORD 123; ERROR: syntax error at or near "123" at character 29 HINT: password string such as 'secret' expected CREATE USER calvin WITH PASSWORD 123; ERROR: syntax error at or near "123" at character 34 HINT: password string such as 'secret' expected CREATE USER calvin WITH ENCRYPTED 'pass'; ERROR: syntax error at or near "'pass'" at character 35 HINT: PASSWORD expected CREATE USER calvin WITH UNENCRYPTED 'pass'; ERROR: syntax error at or near "'pass'" at character 37 HINT: PASSWORD expected CREATE USER calvin WITH ENCRYPTED PASS 123; ERROR: syntax error at or near "PASS" at character 35 HINT: PASSWORD expected CREATE USER calvin WITH ENCRYPTED PASSWORD 123; ERROR: syntax error at or near "123" at character 44 HINT: password string such as 'secret' expected CREATE USER calvin WITH UNENCRYPTED PASSWORD 'secret' CREADB; ERROR: syntax error at or near "CREADB" at character 55 HINT: other user options or ';' expected CREATE USER calvin NOCREATEDB foo; ERROR: syntax error at or near "foo" at character 31 HINT: other user options or ';' expected CREATE USER calvin WITH IN foo; ERROR: syntax error at or near "foo" at character 28 HINT: GROUP expected CREATE USER calvin WITH IN GROUP 'foo'; ERROR: syntax error at or near "'foo'" at character 34 HINT: group id list expected CREATE USER calvin WITH IN GROUP 123; ERROR: syntax error at or near "123" at character 34 HINT: group id list expected CREATE USER calvin WITH IN GROUP admin, column; ERROR: syntax error at or near "column" at character 41 HINT: group id such as admin expected CREATE USER calvin WITH IN GROUP admin, 123; ERROR: syntax error at or near "123" at character 41 HINT: group id such as admin expected CREATE USER calvin WITH IN GROUP admin, CREATEDB; ERROR: group "admin" does not exist CREATE USER calvin WITH IN GROUP admin CREATEDB foo; ERROR: syntax error at or near "foo" at character 49 HINT: other user options or ';' expected CREATE USER calvin NOCREATEUSER CREATEDB foo; ERROR: syntax error at or near "foo" at character 42 HINT: other user options or ';' expected CREATE USER calvin CREATEUSER foo; ERROR: syntax error at or near "foo" at character 31 HINT: other user options or ';' expected CREATE USER calvin SYSID foo; ERROR: syntax error at or near "foo" at character 26 HINT: integer constant expected CREATE USER calvin SYSID 12 foo; ERROR: syntax error at or near "foo" at character 29 HINT: other user options or ';' expected CREATE USER calvin VALID UNTIL 12; ERROR: syntax error at or near "12" at character 32 HINT: date string such as '01-02-2000' expected CREATE USER calvin VALID foo; ERROR: syntax error at or near "foo" at character 26 HINT: UNTIL expected -- -- ALTER USER ALTER USER calvin foo; ERROR: syntax error at or near "foo" at character 19 HINT: user options or SET or RESET or ; expected ALTER USER calvin WITH foo; ERROR: syntax error at or near "foo" at character 24 HINT: user options expected ALTER USER calvin SET foo; ERROR: syntax error at or near ";" at character 26 HINT: SET stuff expected ALTER USER calvin RESET 123; ERROR: syntax error at or near "123" at character 25 HINT: user id such as calvin expected -- -- CREATE GROUP CREATE GROUP 123; ERROR: syntax error at or near "123" at character 14 HINT: group id such as admin expected CREATE GROUP grp foo; ERROR: syntax error at or near "foo" at character 18 HINT: user list such as "USER calvin" or "SYSID 123" expected CREATE GROUP grp WITH foo; ERROR: syntax error at or near "foo" at character 23 HINT: user list such as "USER calvin" or "SYSID 123" expected CREATE GROUP grp WITH USER ; ERROR: syntax error at or near ";" at character 28 HINT: user id such as calvin expected CREATE GROUP grp WITH USER 123; ERROR: syntax error at or near "123" at character 28 HINT: user id such as calvin expected CREATE GROUP grp WITH USER calvin, 123; ERROR: syntax error at or near "123" at character 36 HINT: user id such as calvin expected CREATE GROUP grp WITH SYSID calvin; ERROR: syntax error at or near "calvin" at character 29 HINT: integer constant expected CREATE GROUP grp WITH SYSID 'hello'; ERROR: syntax error at or near "'hello'" at character 29 HINT: integer constant expected CREATE GROUP grp WITH USER calvin, hobbes SYSID 'hello'; ERROR: syntax error at or near "'hello'" at character 49 HINT: integer constant expected CREATE GROUP grtp USER 123; ERROR: syntax error at or near "123" at character 24 HINT: user id such as calvin expected CREATE GROUP grtp SYSID calvin; ERROR: syntax error at or near "calvin" at character 25 HINT: integer constant expected CREATE GROUP grp WITH USER calvin, hobbes SYSID 123 foo; ERROR: syntax error at or near "foo" at character 53 HINT: user list such as "USER calvin" or "SYSID 123" expected -- -- ALTER GROUP ALTER GROUP grp; ERROR: syntax error at or near ";" at character 16 HINT: ADD or DROP expected ALTER GROUP grp foo; ERROR: syntax error at or near "foo" at character 17 HINT: ADD or DROP expected ALTER GROUP grp ADD calvin; ERROR: syntax error at or near "calvin" at character 21 HINT: USER user id list expected ALTER GROUP grp DROP calvin; ERROR: syntax error at or near "calvin" at character 22 HINT: USER user id list expected ALTER GROUP grp ADD USER 123; ERROR: syntax error at or near "123" at character 26 HINT: user id such as calvin expected ALTER GROUP grp DROP USER 123; ERROR: syntax error at or near "123" at character 27 HINT: user id such as calvin expected ALTER GROUP grp ADD USER calvin, USER 123; ERROR: syntax error at or near "USER" at character 34 HINT: user id such as calvin expected ALTER GROUP grp DROP USER calvin, 123; ERROR: syntax error at or near "123" at character 35 HINT: user id such as calvin expected ALTER GROUP grp ADD SYSID calvin; ERROR: syntax error at or near "SYSID" at character 21 HINT: USER user id list expected ALTER GROUP grp DROP SYSID calvin; ERROR: syntax error at or near "SYSID" at character 22 HINT: USER user id list expected -- -- CREATE SCHEMA... CREATE SCHEMA 123; ERROR: syntax error at or near "123" at character 15 HINT: schema name or AUTHORIZATION expected CREATE SCHEMA sch foo; ERROR: syntax error at or near "foo" at character 19 HINT: AUTHORIZATION or ... expected CREATE SCHEMA sch AUTHORIZATION 123; ERROR: syntax error at or near "123" at character 33 HINT: user id such as calvin expected CREATE SCHEMA AUTHORIZATION 123; ERROR: syntax error at or near "123" at character 29 HINT: user id such as calvin expected CREATE SCHEMA sch AUTHORIZATION calvin CREATE foo; ERROR: syntax error at or near "foo" at character 47 HINT: TABLE, INDEX, SEQUENCE, TRIGGER or VIEW expected CREATE SCHEMA sch AUTHORIZATION calvin CREATE TABLE +; ERROR: syntax error at or near "+" at character 53 HINT: table name... expected CREATE SCHEMA sch AUTHORIZATION calvin CREATE TABLE foo() xxx; ERROR: syntax error at or near "xxx" at character 59 HINT: schema statement list such as CREATE TABLE/INDEX/SEQUENCE/TRIGGER/VIEW or GRANT or ; expected CREATE SCHEMA sch AUTHORIZATION calvin CREATE TABLE foo() CREATE TABLE bla() x; ERROR: syntax error at or near "x" at character 78 HINT: schema statement list such as CREATE TABLE/INDEX/SEQUENCE/TRIGGER/VIEW or GRANT or ; expected -- Fabien Coelho - coelho@cri.ensmp.fr
Fabien COELHO <coelho@cri.ensmp.fr> writes: > I would not like to break postgresql portability with other parser > generators. I agree with Bruce's opinion that this is a nonissue. In particular I'd point out that the current grammar is already too big for any known yacc clone besides bison (even with bison you need a very recent version), and your proposed changes are going to make the grammar a whole lot larger yet, at least in terms of states and actions. Even if there was any usefulness to supporting other yacc clones, the odds of making one work seem minuscule. > Using some non standard undocumented API would not help other people who > would like to touch the parser. It would help them a *lot* if it meant that they didn't have to be aware of this stuff in order to do anything at all with the grammar. My fundamental objection to this proposal continues to be that it will make the grammar unreadable and unmaintainable. If we could push the error message generation issues into a subroutine of yyerror() and not touch the grammar rules at all, the chances of getting this accepted would rise about a thousand percent. The sample hints you show in another message still aren't impressing me much, but in any case they look like they only depend on being able to see what grammar symbols can follow the current point. The last time I studied this stuff (which was quite a while back) the follow set was something an LR parser generator would have to compute anyway. I don't know whether bison's internal tables expose that set in any useful fashion, but it surely seems worth a look. regards, tom lane PS: another reason for doing it that way is that I suspect your current approach is a dead end anyhow. You've already hit one blocker problem where you got reduce/reduce errors when you tried to insert state-recording actions, and you've only gone as far as hinting the very first symbol, which is hardly one of the more complex parts of the grammar. There are large parts of the grammar that only manage to avoid ambiguity by the skin of their teeth --- I don't believe you'll be able to do anything of this sort in those areas without breaking it. Take a look at SelectStmt and subsidiary productions as an example ... that's a fragile bit of work that took a long time to get right.
Dear Tom, > > I would not like to break postgresql portability with other parser > > generators. > > I agree with Bruce's opinion that this is a nonissue. In particular I'd > point out that the current grammar is already too big for any known yacc > clone besides bison (even with bison you need a very recent version), > and your proposed changes are going to make the grammar a whole lot > larger yet, at least in terms of states and actions. Not really in terms of state. The state should basically be the same. However yes in terms of "explicit" state that are given explicit names. And definitely in terms of actions, as you say. > > Using some non standard undocumented API would not help other people who > > would like to touch the parser. > > It would help them a *lot* if it meant that they didn't have to be aware > of this stuff in order to do anything at all with the grammar. My > fundamental objection to this proposal continues to be that it will make > the grammar unreadable and unmaintainable. If we could push the error > message generation issues into a subroutine of yyerror() and not touch > the grammar rules at all, the chances of getting this accepted would > rise about a thousand percent. Hmmm. I'm so much below zero! So maybe I think I won't resubmit a "infrastructure patch with some hints for some commands", as it seems just a lose of time. > The sample hints you show in another message still aren't impressing me > much, I don't mean them to be impressive! It is not an expert system or a seer I'm planing. I just want them to be useful to my students! > but in any case they look like they only depend on being able to > see what grammar symbols can follow the current point. Yes. That's basically where the parser is when it generates an error, it is expecting something and cannot find it. I could do more, but with more efforts. > The last time I studied this stuff (which was quite a while back) the > follow set was something an LR parser generator would have to compute > anyway. Yes. > I don't know whether bison's internal tables expose that set in any > useful fashion, but it surely seems worth a look. Having a look is something I can do;-) I'm afraid it looks like "internal state 1232, 43425 and 42523", but there may be some support enough in the generated code to get something more interesting. It would require to be able to get textual meta informations out of the state number, which is possible if bison/flex people did something about it. I can remembre something like that, somewhere deep in my memory... > PS: another reason for doing it that way is that I suspect your current > approach is a dead end anyhow. You've already hit one blocker problem > where you got reduce/reduce errors when you tried to insert > state-recording actions, and you've only gone as far as hinting the very > first symbol, which is hardly one of the more complex parts of the > grammar. These states exists anyway. Once the first keyword(s) are passed, there is much less troubles as the SQL syntax is really verbose, so it is usually hard to get lost. The only real problem I had passed the first symbol, for the part I played with, is with SET [SESSION|LOCAL] SESSION AUTHORIZATION which needed a real bit of refactoring. Also, it is better to avoid some factorizations if the semantics is different. For instance, user_list is used both for groups and users, so I switched during my tests to a user_list and a group_list. > There are large parts of the grammar that only manage to avoid > ambiguity by the skin of their teeth --- I don't believe you'll be able > to do anything of this sort in those areas without breaking it. Take a > look at SelectStmt and subsidiary productions as an example ... that's > a fragile bit of work that took a long time to get right. I noticed that great care has been taken to avoid any ambiguity. I have no intention to change that. -- Fabien Coelho - coelho@cri.ensmp.fr
Fabien COELHO wrote: >>The last time I studied this stuff (which was quite a while back) the >>follow set was something an LR parser generator would have to compute >>anyway. >> >> > >Yes. > > > > >>I don't know whether bison's internal tables expose that set in any >>useful fashion, but it surely seems worth a look. >> >> > >Having a look is something I can do;-) > >I'm afraid it looks like "internal state 1232, 43425 and 42523", but there >may be some support enough in the generated code to get something more >interesting. It would require to be able to get textual meta informations >out of the state number, which is possible if bison/flex people did >something about it. > > You *really* don't want to go there. If you want to see what the parser is doing you can run "bison -r all" over the grammar and examine the .output file. But please, let's not examine the internal states. Talk about unmaintainability! Also, I suspect that bison does a good bit of optimisation by way of combining states that removes some of the information you might need, but I haven't looked into it closely. If we really wanted to go down this route, a predictive (i.e. LL(n) type) parser rather than a bottom up parser would probably be more suitable, but I doubt anyone has any stomach for doing that work, even if there was agreement on the need for it. cheers andrew
Fabien COELHO <coelho@cri.ensmp.fr> writes: >> ... your proposed changes are going to make the grammar a whole lot >> larger yet, at least in terms of states and actions. > Not really in terms of state. The state should basically be the same. > However yes in terms of "explicit" state that are given explicit names. > And definitely in terms of actions, as you say. But mid-rule actions are implemented by inventing additional internal productions (see the bison manual's discussion of them; the O'Reilly lex & yacc book says that all yaccs do it like that). That's not only more states, but more symbols, which is going to impose an O(N^2) cost on the raw tabular representation of the parsing rules. Maybe much of this will be bought back when bison compresses the tables, and maybe not. Have you checked how much the size of gram.o grows with the stuff you've installed so far? (I'm also slightly worried about what this will do to parsing speed, but evidence about that would have to wait for further development of the patch. In any case it'd be more attractive if the cost of this could be paid only when we've already detected an error...) > I'm afraid it looks like "internal state 1232, 43425 and 42523", but there > may be some support enough in the generated code to get something more > interesting. It would require to be able to get textual meta informations > out of the state number, which is possible if bison/flex people did > something about it. The string names of the grammar symbols are all embedded in gram.c anyway, so if you can figure out the symbols that are expected next, their names could be printed directly. We could alter the symbol names to be more useful to novices, or alternatively install an additional lookup table to substitute reader-friendly phrases. regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes: > You *really* don't want to go there. If you want to see what the parser > is doing you can run "bison -r all" over the grammar and examine the > .output file. But please, let's not examine the internal states. Talk > about unmaintainability! What I was suggesting was that we might be able to extract the "follow set" from bison's tables, ie, the set of grammar symbols that are legal next inputs given the current parse state stack. I surely agree that we don't want code that goes like "if state is N then print message X" ... but the follow set should be stable. One way of describing Fabien's existing patch is that it's essentially keeping track of the follow set by hand :-( > Also, I suspect that bison does a good bit of > optimisation by way of combining states that removes some of the > information you might need, but I haven't looked into it closely. That could be a showstopper if true, but it's all speculation at this point. regards, tom lane
Fabien COELHO wrote: > CREATE USER calvin WITH IN GROUP admin, CREATEDB; > ERROR: group "admin" does not exist > > CREATE USER calvin WITH IN GROUP admin CREATEDB foo; > ERROR: syntax error at or near "foo" at character 49 > HINT: other user options or ';' expected > > CREATE USER calvin NOCREATEUSER CREATEDB foo; > ERROR: syntax error at or near "foo" at character 42 > HINT: other user options or ';' expected > > CREATE USER calvin CREATEUSER foo; > ERROR: syntax error at or near "foo" at character 31 > HINT: other user options or ';' expected I hate to say it but I don't see these hints as being very helpful. Basically, if you look at the developers FAQ, you should discuss what you want to do, how you want to do it, then code. I realize you spent a lot of time on a patch, but I don't see a lot of value in adding such hints. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
On Fri, 2 Apr 2004, Bruce Momjian wrote: > Fabien COELHO wrote: > > CREATE USER calvin WITH IN GROUP admin, CREATEDB; > > ERROR: group "admin" does not exist > > > > CREATE USER calvin WITH IN GROUP admin CREATEDB foo; > > ERROR: syntax error at or near "foo" at character 49 > > HINT: other user options or ';' expected > > > > CREATE USER calvin NOCREATEUSER CREATEDB foo; > > ERROR: syntax error at or near "foo" at character 42 > > HINT: other user options or ';' expected > > > > CREATE USER calvin CREATEUSER foo; > > ERROR: syntax error at or near "foo" at character 31 > > HINT: other user options or ';' expected > > I hate to say it but I don't see these hints as being very helpful. I can see them as potentially being useful for people who don't have alot of knowledge of SQL or our dialect thereof. I think some of the ones shown may need better wording (for example the ones above basically seem to mean go look at the help for create user which is pretty much the same as the error on its own, but perhaps with a longer hint, it might be less so) but I think it illustrates the point for these kinds of queries. I'm not sure that the HINT strings would be as meaningful in the middle of complicated select/update/etc queries, but that would be something to see. I'm not sure it's PostgreSQL's responsibility to teach SQL or even really to teach our own commands, but if it were possible to do without much of a performance, readability or maintenance cost, it'd probably be worth doing. I can't really say much specifically about the patch either way on any of those grounds (having not actually looked).
Stephan Szabo wrote: > > I hate to say it but I don't see these hints as being very helpful. > > I can see them as potentially being useful for people who don't have alot > of knowledge of SQL or our dialect thereof. I think some of the ones > shown may need better wording (for example the ones above basically seem > to mean go look at the help for create user which is pretty much the same > as the error on its own, but perhaps with a longer hint, it might be less > so) but I think it illustrates the point for these kinds of queries. I'm > not sure that the HINT strings would be as meaningful in the middle of > complicated select/update/etc queries, but that would be something to see. > > I'm not sure it's PostgreSQL's responsibility to teach SQL or even really > to teach our own commands, but if it were possible to do without much of a > performance, readability or maintenance cost, it'd probably be worth > doing. I can't really say much specifically about the patch either way on > any of those grounds (having not actually looked). I wonder if we could just do a \h command as a hint. In some cases, it might be clearer than printing some text. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > I'm not sure that the HINT strings would be as meaningful in the middle of > complicated select/update/etc queries, but that would be something to see. That's pretty much my stumbling block too. The examples Fabien has shown so far don't seem to me to be all that much more helpful than a simple facility to show the "\h command" syntax summary. Now "\h" doesn't know anything about syntax errors within expressions, and so I'm prepared to believe that there's some gold to be mined here if we can get the parser to offer useful hints in the context of complex expressions. What I've not seen yet is any proof that that can be done at all, much less at a reasonable cost in terms of grammar complexity... > I'm not sure it's PostgreSQL's responsibility to teach SQL or even really > to teach our own commands, but if it were possible to do without much of a > performance, readability or maintenance cost, it'd probably be worth > doing. Right; we have to consider those costs and trade them off against the benefit of better error reporting. One thought here is that the extra costs might be better spent in a client program that's in a "training wheels" mode. Would it be reasonable to invent a "psql-student" client that has its own version of the parser with better error detection? Then production clients wouldn't have to pay any cycles towards support of novices. I'm not sure right now how hard it would be to keep such a client in sync with the production backend, but it's an idea to consider... regards, tom lane
Dear Stephan, > I can see them as potentially being useful for people who don't have alot > of knowledge of SQL or our dialect thereof. That is my audience. > I think some of the ones shown may need better wording [...] Sure. I'm not the man for writing very clear English sentences. I may be more fluent in C;-) > I'm not sure it's PostgreSQL's responsibility to teach SQL or even really > to teach our own commands, I don't know about postgreSQL responsability, but I am a teacher using postgreSQL as a tool for teaching database. I don't need hints, but my students may find sql a little bit easier to grasp if they have a little bit more help from the error reports. I can understand that this is not the main agenda of postgresql, which is more about new features, better performance and sustained stability. But it is my agenda as a teacher. -- Fabien Coelho - coelho@cri.ensmp.fr
> I wonder if we could just do a \h command as a hint. In some cases, it > might be clearer than printing some text. \h SELECT or \h CREATE TABLE are 2 pages long. For the beginner, it is a lot of help... a little bit too much. So I think that a better job can be done. If I'm wrong, you don't apply a patch and that's all! Basically, this is the current status;-) -- Fabien Coelho - coelho@cri.ensmp.fr
Dear Tom, > > Not really in terms of state. The state should basically be the same. > > However yes in terms of "explicit" state that are given explicit names. > > And definitely in terms of actions, as you say. > > But mid-rule actions are implemented by inventing additional internal > productions Sure. > That's not only more states, but more symbols, which is going to impose > an O(N^2) cost on the raw tabular representation of the parsing rules. > Maybe much of this will be bought back when bison compresses the tables, > and maybe not. Mmh. Maybe. I don't know at the time. > Have you checked how much the size of gram.o grows with the stuff > you've installed so far? I have not looked at that. It was just a quick and dirty implementation, just to convince myself of what can be achieved. > (I'm also slightly worried about what this will do to parsing speed, Well, more reductions are performed, and I'm not sure that the switch() implementation is really intelligent. Having a hash table could help big grammars, but that is bison problems, and I will not rewrite that. However, I'm not sure that parsing overhead is a big issue wrt other costs in the backend, but this is not a reason to make it explode. > > I'm afraid it looks like "internal state 1232, 43425 and [...], > > The string names of the grammar symbols are all embedded in gram.c Yes, I've noticed yytname[]. > anyway, so if you can figure out the symbols that are expected next, > their names could be printed directly. That is done with YYERROR_VERBOSE, but the result is really poor most of the time, because it does not look for all possible terminals, just the ones easilly accessible. So you have something like: DROP TABL foo; ERROR: syntax error at unexpected Ident "TABL", LANGUAGE expected. no comment. A better job could be done, but you need to "touch" a little bit "gram.c" for that. > We could alter the symbol names to be more useful to novices, or > alternatively install an additional lookup table to substitute > reader-friendly phrases. Yep, I thought about that also. Some symbols are appended "_P" to avoid clashes. I'm investigating the "internal" way. Not really optimistic because of the details, but I may find workarounds. I'll post later when I'll have a definite opinion. -- Fabien COELHO _ http://www.cri.ensmp.fr/~coelho _ Fabien.Coelho@ensmp.fr CRI-ENSMP, 35, rue Saint-Honoré, 77305 Fontainebleau cedex, France phone: (+33|0) 1 64 69 {voice: 48 52, fax: 47 09, standard: 47 08} ________ All opinions expressed here are mine _________
> > You *really* don't want to go there. If you want to see what the parser > > is doing you can run "bison -r all" over the grammar and examine the > > .output file. But please, let's not examine the internal states. Talk > > about unmaintainability! > > What I was suggesting was that we might be able to extract the "follow > set" from bison's tables, ie, the set of grammar symbols that are legal > next inputs given the current parse state stack. > I surely agree that we don't want code that goes like "if state is N > then print message X" ... but the follow set should be stable These are 2 different issues. (1) computing the set of expected following tokens. It is possible to do that, with some processing of bison tables. (2) give advices based on guesses: for what I have looked so far, it could look like: - I'm here for rule "user_list", the previous token was ',' OR I'm here for "select_expressions_list", last token was ',' and current token is "FROM" => "HINT: remove previous comma" I don't think that (2) would be a bad idea, especially for my students. The good point is that the cost would only be paid at error time. > One way of describing Fabien's existing patch is that it's essentially > keeping track of the follow set by hand :-( Yes. I give name to existing states, and states leads to expected follow tokens. > > Also, I suspect that bison does a good bit of > > optimisation by way of combining states that removes some of the > > information you might need, but I haven't looked into it closely. > > That could be a showstopper if true, but it's all speculation at this > point. I think the information is there. -- Fabien Coelho - coelho@cri.ensmp.fr
Fabien COELHO <coelho@cri.ensmp.fr> writes: > That is done with YYERROR_VERBOSE, but the result is really poor > most of the time, because it does not look for all possible terminals, > just the ones easilly accessible. I wasn't aware that bison had a built-in facility for better messages --- this is kind of cool actually. I played with it a little and found that: 1. There's an arbitrary restriction in the bison code to show no more than four alternatives; if there's more than four then it shows none of them, rather than doing something helpful like "etc." 2. The problem you exhibit with DROP seems to be due to our use of empty productions: > DROP TABL foo; > ERROR: syntax error at unexpected Ident "TABL", LANGUAGE expected. It looks like the parser makes one additional state move, reducing "opt_procedural" to "empty", before raising the error. We might be able to suppress that; alternatively we could look at the stacked state(s) and add their follow sets to the printout. In any case it is clear that we can extract the follow set from the tables. A bigger problem with this, though, is the verbosity of the results in some cases. I diked out the limitation to four outputs and soon found examples like regression=# grant; ERROR: syntax error, unexpected ';', expecting ALL or CREATE or DELETE_P or EXECUTE or INSERT or REFERENCES or RULE or SELECTor TEMP or TEMPORARY or TRIGGER or UPDATE or USAGE at or near ";" at character 6 LINE 1: grant; ^ which is useful, and regression=# grant select on t to; ERROR: syntax error, unexpected ';', expecting ABORT_P or ABSOLUTE_P or ACCESS or ACTION or ADD or AFTER or AGGREGATE orALTER or ASSERTION or ASSIGNMENT or AT or BACKWARD or BEFORE or BEGIN_P or BIGINT or BIT or BOOLEAN_P or BY or CACHE orCALLED or CASCADE or CHAIN or CHAR_P or CHARACTER or CHARACTERISTICS or CHECKPOINT or CLASS or CLOSE or CLUSTER or COALESCEor COMMENT or COMMIT or COMMITTED or CONSTRAINTS or CONVERSION_P or CONVERT or COPY or CREATEDB or CREATEUSER orCURSOR or CYCLE or DATABASE or DAY_P or DEALLOCATE or DEC or DECIMAL_P or DECLARE or DEFAULTS or DEFERRED or DEFINER orDELETE_P or DELIMITER or DELIMITERS or DOMAIN_P or DOUBLE_P or DROP or EACH or ENCODING or ENCRYPTED or ESCAPE or EXCLUDINGor EXCLUSIVE or EXECUTE or EXISTS or EXPLAIN or EXTERNAL or EXTRACT or FETCH or FIRST_P or FLOAT_P or FORCE or FORWARDor FUNCTION or GLOBAL or GROUP_P or HANDLER or HOLD or HOUR_P or IMMEDIATE or IMMUTABLE or IMPLICIT_P or INCLUDINGor INCREMENT or INDEX or INHERIT! S or INOUT or INPUT_P or INSENSITIVE or INSERT or INSTEAD or INT_P or INTEGER or INTERVAL or INVOKER or ISOLATION or KEYor LANCOMPILER or LANGUAGE or LARGE_P or LAST_P or LEVEL or LISTEN or LOAD or LOCAL or LOCATION or LOCK_P or MATCH orMAXVALUE or MINUTE_P or MINVALUE or MODE or MONTH_P or MOVE or NAMES or NATIONAL or NCHAR or NEXT or NO or NOCREATEDB orNOCREATEUSER or NONE or NOTHING or NOTIFY or NULLIF or NUMERIC or OBJECT_P or OF or OIDS or OPERATOR or OPTION or OUT_Por OVERLAY or OWNER or PARTIAL or PASSWORD or PATH_P or PENDANT or POSITION or PRECISION or PRESERVE or PREPARE or PRIORor PRIVILEGES or PROCEDURAL or PROCEDURE or READ or REAL or RECHECK or REINDEX or RELATIVE_P or RENAME or REPEATABLEor REPLACE or RESET or RESTART or RESTRICT or RETURNS or REVOKE or ROLLBACK or ROW or ROWS or RULE or SCHEMA orSCROLL or SECOND_P or SECURITY or SEQUENCE or SERIALIZABLE or SESSION or SET or SETOF or SHARE or SHOW or SIMPLE or SMALLINTor STABLE or START or STATEMENT o! r STATISTICS or STDIN or STDOUT or STORAGE or STRICT_P or SUBSTRING or SYSID or TEMP or TEMPLATE or TEMPORARY or TIME or TIMESTAMP or TOAST or TRANSACTION or TREAT or TRIGGER or TRIM or TRUNCATEor TRUSTED or TYPE_P or UNCOMMITTED or UNENCRYPTED or UNKNOWN or UNLISTEN or UNTIL or UPDATE or USAGE or VACUUM orVALID or VALIDATOR or VALUES or VARCHAR or VARYING or VERSION or VIEW or VOLATILE or WITH or WITHOUT or WORK or WRITE orYEAR_P or ZONE or IDENT at or near ";" at character 22 LINE 1: grant select on t to; ^ which is not real useful at all :-(. You really want to see just "expecting IDENT" in such a case. Still we might be able to do some postprocessing on the collected set of valid follow symbols, such as removing all the unreserved_keywords when they are present along with IDENT. It'd be fairly reasonable to embed knowledge about this in keywords.c and/or scan.l. We'd have to write our own version of bison's verbose-error code anyway, because the canned code doesn't support localization --- it uses hardwired strings for "expecting" and so on. But it looks possibly doable to me. regards, tom lane
Dear Tom, First, thanks for the discussion about my "hint" infrastructure patch submission. Whatever the opinion, it helps. > We'd have to write our own version of bison's verbose-error code anyway, > because the canned code doesn't support localization --- it uses > hardwired strings for "expecting" and so on. But it looks possibly > doable to me. I also played a little bit with the parser over the weekend. I found the same issues that you noted in your mail, tried more options, and found other problems. I sent here the full mail I wrote. Although I agree that it is "doable", I have stronger reserve than yours. Also, I do not find it an appealing solution to change "gram.c" a lot. The problems I found are: (1) not available raw information. The automaton stack keeps states, which are not directly linked to rules and terminals. The terminals are not available, they must be kept separatly if you want them. This can be done in yylex(). The internal state, stack, token... are local variables within yyparse(). As a result, they are not accessible from yyerror. I haven't found any available hook, so you have to hack "gram.c" to get this information. It is a 1 line hack, but it modifies the generated code. Or you have to produce a new hacked "gram.c", which is what you suggest. (2) hard to compute follow set The information about the syntax is there, obviously! However, from a given automaton state, the actually allowed tokens are not directly available. Only some are directly available, so the result is: DROP ? ; ERROR: syntax error at or near "?" at character 6 HINT: hint status is state=422 n=796 char=575 type=320=Op stack= 0 16 422 tokens= 88:DROP 320:Op expected= 150:LANGUAGE? Other tokens may be found, but you have either: - to *simulate* the automaton forward through reductions to check whether it is finally accepted. Not really beautiful, but that can be done. It is just the bison code without the reduction production rules;-) - move backwards before doing the above, if some reductions where performed because of the submitted token and finally resulted in the error, the state that lead to the error may not be the highest available one, so maybe other allowed tokens may also be missed. We would need to have the last state before any reduction. Another small hack in generated "gram.c" would be needed to get it. However, (3) the computed follow set is often useless, as you noted. The reason for that is that keywords are accepted in place of identifiers. The use of rules: reserved_keyword, func_name_keyword, col_name_keyword, unreserved_keyword as possible column/function/... identifiers is IMHO the main reason for the size of the generated automaton. If you drop that, everything would be much simpler, because it would reduce a lot the number of arcs in the automaton. In particular, anything which is just an identifier should not be given a name (say, BOOLEAN_P or SMALLINT should not be tokens, but could rather be recognized as a special case of identifier from within the parser, maybe with some simple post-filtering). As you noted, for things like "SHOW 123", the follow set basically includes all keywords although you can have SHOW ALL or SHOW name. So, as you suggested, you can either say "ident" as a simplification, but you miss ALL which is meaningful, or you say all keywords, which is useless. An alternative is to know that you're within a "SHOW something", that is you somehow reparse the code from within the syntax error:-( It is necessary if you want to say something more interesting than "ident", say "option name". You may also get this information by simulating the automaton forward, or noticing that some of the current on stack states can only lead to the reduction of a ShowStmt... Well, on the positive side, it is sometimes right, say: REVOKE ? ; ERROR: syntax error, unexpected Op at or near "?" at character 8 HINT: hint status is state=481 n=1297 char=575 type=320=Op stack= 0 31 481 tokens= 234:REVOKE 320:Op expected= 10:ALL 59:CREATE 80:DELETE_P 98:EXECUTE 136:INSERT 224:REFERENCES 239:RULE 244:SELECT 268:TEMP 270:TEMPORARY 279:TRIGGER 292:UPDATE 293:USAGE? (4) the available automaton data are inverted with respect to what would be needed... It would be nice to know whether we're heading for a create statement or something like that, but the information is hidden down the automaton. basically, the automaton would need to be "inverted" to compute this information. We would need the initial information in "gram.y", which was "inverted" to build the automaton. So the code somehow needs to undo what bison compilation has done. (5) anything that can be done would be hardwired to one version of bison. There is a lot of asumptions in the code and data structures, and any newer/older version with some different internal representation would basically break any code that would rely on that. So postgres would not be "bison" portable:-( I don't think it is an real option that old postgresql source would be broken against future bison releases. So, as far as I'm concerned, I don't find the internal way really convincing as the way to go. Thus I can see three options: (a) put hints in "gram.y" as I already suggested + I can do it + it can be incremental once the infrastructure is in place. + hints are simple to add or fix = hints are not necessarily marvellous - it would take time to develop - it may slow down the parser, because of added production rules. - it would change "gram.y" a lot and could make it harder to maintain - patchers don't like it... (b) write a new "recursive descendant" parser, and drop "gram.y" + I could do it + hints could be more intelligent + I think the parser would be faster, at least not slower = I'm not sure it would be really easier to maintain, but maybe not harder - it cannot be incremental at all - hints are at the price of some programming - it would take time to develop, but basically the structure would look like "gram.y" a lot, and existing data structures can be kept. - I'm not sure patchers would like it, and if it is thrown down the drain, I would not be happy for the one who spent the time (say, me;-) (c) do nothing. + I can do that quite easily;-) + patchers are ok with that + it does not take time = the maintainability of "gram.y" is not changed. - my students won't have hints. A funny technical detail is that the refactoring which is needed for the (a) option would be also needed for (b). (b) would require more time than (a), but the results could be better for the hint/parser error handling. I can have some time for this, especially over the next few months. Incremental options looked a better way to me, but (a) seems stuck and (b) is risky, and the internal way looks ugly. As a side effect of my inspection is that the automaton generated by bison could be simplified if a different tradeoff between the lexer, the parser and the post-processing would be chosen. Namelly, all tokens that are just identifiers should be dropped and processed differently. Have a nice day, -- Fabien Coelho - coelho@cri.ensmp.fr
Fabien COELHO <coelho@cri.ensmp.fr> writes: > Although I agree that it is "doable", I have stronger reserve than yours. > Also, I do not find it an appealing solution to change "gram.c" a lot. I was not proposing hand-editing gram.c after bison generates it, if that's what you meant ;-). It seems however perfectly doable to reference bison's state stack from yyerror. We'd need some macro hacking to make yystate and the stack available to yyerror, but nothing worse than is already documented and recommended practice for other bison tweaks. > The automaton stack keeps states, which are not directly linked to rules > and terminals. The terminals are not available, they must be kept > separatly if you want them. This can be done in yylex(). We don't need them. Any already-shifted tokens are to the left of where the error is, no? > The internal state, stack, token... are local variables within yyparse(). > As a result, they are not accessible from yyerror. I haven't found any > available hook, so you have to hack "gram.c" to get this information. No, just redefine "yyerror" as a macro that passes additional parameters. > - move backwards before doing the above, if some reductions where > performed because of the submitted token and finally resulted in the error, > the state that lead to the error may not be the highest available one, so > maybe other allowed tokens may also be missed. We would need to have > the last state before any reduction. Yeah, I had come to the same conclusion --- state moves made without consuming input would need to be backed out if we wanted to report the complete follow set. I haven't yet looked to see how to do that. > As you noted, for things like "SHOW 123", the follow set basically > includes all keywords although you can have SHOW ALL or SHOW name. > So, as you suggested, you can either say "ident" as a simplification, but > you miss ALL which is meaningful, or you say all keywords, which is > useless. You're ignoring the distinction between classes of keywords. I would not recommend treating reserved_keywords as a subclass of ident. > (5) anything that can be done would be hardwired to one version of bison. > There is a lot of asumptions in the code and data structures, and any > newer/older version with some different internal representation would > basically break any code that would rely on that. So postgres would not be > "bison" portable:-( I don't think it is an real option that old postgresql > source would be broken against future bison releases. I think this argument is completely without merit. The technology of LALR parsers has been stable for what, thirty years now? The parts of bison that we'd want to look at are inherited lock stock and barrel from AT&T yacc, and are unlikely to change in the foreseeable future; even more unlikely to change in a way that we couldn't easily adapt to. You might as well argue that we shouldn't use autoconf because the autoconf authors sometimes make not-very-compatible changes. > (b) write a new "recursive descendant" parser, and drop "gram.y" Been there, done that, not impressed with the idea. There's a reason why people invented parser generators... > As a side effect of my inspection is that the automaton generated by bison > could be simplified if a different tradeoff between the lexer, the parser > and the post-processing would be chosen. Namelly, all tokens that are > just identifiers should be dropped and processed differently. We are not going to reserve the keywords that are presently unreserved. If you can think of a reasonable way to stop treating them as separate tokens inside the grammar without altering the user-visible behavior, I'm certainly interested. I think that will be rather difficult, however, considering for one thing that SQL specifies different case-folding behavior for identifiers and keywords. regards, tom lane
Dear Tom, > No, just redefine "yyerror" as a macro that passes additional parameters. Ok! That's just the kind of the hack-hook I was looking for!! > > The terminals are not available, they must be kept separatly if you > > want them. This can be done in yylex(). > > We don't need them. Any already-shifted tokens are to the left of where > the error is, no? Yes and no. I agree that you don't need them for computing the follow set. However it would help a lot to generate nicer hints. I'm looking for help the user, and the follow set is just part of the help. Think of typical "SELECT foo, bla, FROM xxx" (it looks stupid on a one line query, but not so on a very large query) which is quite easier to detect and hint about because of FROM is just after a comma. The rule part is also a real issue. There is no easy way to translate states into rules, to know whether we're somewhere in a "ShowStmt" or "AlterTableStmt"... If you're after a comma in state 1232, should you just say IDENT... I'd rather say "user name" or "table name" if I can... Otherwise the hint stays at a low parser level. Good hints requires to know about the current context, and it is not easy to get that deep in the automaton. > Yeah, I had come to the same conclusion --- state moves made without > consuming input would need to be backed out if we wanted to report the > complete follow set. I haven't yet looked to see how to do that. Well, following you're previous suggestion, one can redefine the YYLEX macro to save the current state each time a token is required. > > As you noted, for things like "SHOW 123", the follow set basically > > includes all keywords although you can have SHOW ALL or SHOW name. > > So, as you suggested, you can either say "ident" as a simplification, but > > You're ignoring the distinction between classes of keywords. I would > not recommend treating reserved_keywords as a subclass of ident. Ok, I agree that it can help in this very case a little bit here because ALL is reserved. I did not make this distinction when I played with bison. > > (5) anything that can be done would be hardwired to one version of bison. > > I think this argument is completely without merit. Possible;-) However I don't like writing hacks that depends on other people future behavior. > > (b) write a new "recursive descendant" parser, and drop "gram.y" > > Been there, done that, not impressed with the idea. There's a reason > why people invented parser generators... Sure. It was just a list of options;-) > > As a side effect of my inspection is that the automaton generated by bison > > could be simplified if a different tradeoff between the lexer, the parser > > and the post-processing would be chosen. Namelly, all tokens that are > > just identifiers should be dropped and processed differently. > > We are not going to reserve the keywords that are presently unreserved. Reserving keywords would help, but I would not think it could be accepted, because it is not SQL philosophy. SQL is about 30 years also, as old as yacc ideas... but they did not care at the time;-) When you look at the syntax, it was designed with a recursive parser in mind. Part of my comment was to explain "why" the generated automaton is large. SQL is a small language, but it has a big automaton in postgresql, and this is IMHO the explanation. > If you can think of a reasonable way to stop treating them as separate > tokens inside the grammar without altering the user-visible behavior, > I'm certainly interested. I was thinking about type names especially, and maybe others. I join a small proof-of-concept patch to drop some tokens out of the parser. As a results, 6 tokens, 6 rules and 9 states are removed, and the automaton table is reduced by 438 entries (1.5%). Not too bad;-) I think others could be dealt with similarily, or with more work. Thanks for the discussion, -- Fabien.
Attachment
> As another teaching aid idea, is there something that would auto-format > queries to they are more readable? Not that I know of. But I guess it must exist in some interface, maybe with Oracle or DB2. My emacs does seem to do that. It just put colors on keywords. If I had to put it somewhere, I would try to enhance emacs sql-mode. But I don't like much lisp programming. -- Fabien Coelho - coelho@cri.ensmp.fr
As another teaching aid idea, is there something that would auto-format queries to they are more readable? --------------------------------------------------------------------------- Fabien COELHO wrote: > > > > Dear Tom, > > > No, just redefine "yyerror" as a macro that passes additional parameters. > > Ok! That's just the kind of the hack-hook I was looking for!! > > > > > The terminals are not available, they must be kept separatly if you > > > want them. This can be done in yylex(). > > > > We don't need them. Any already-shifted tokens are to the left of where > > the error is, no? > > Yes and no. I agree that you don't need them for computing the follow set. > However it would help a lot to generate nicer hints. I'm looking for help > the user, and the follow set is just part of the help. > > Think of typical "SELECT foo, bla, FROM xxx" (it looks stupid on a one > line query, but not so on a very large query) which is quite easier to > detect and hint about because of FROM is just after a comma. > > The rule part is also a real issue. There is no easy way to translate > states into rules, to know whether we're somewhere in a "ShowStmt" or > "AlterTableStmt"... > > If you're after a comma in state 1232, should you just say IDENT... I'd > rather say "user name" or "table name" if I can... Otherwise the hint > stays at a low parser level. Good hints requires to know about the > current context, and it is not easy to get that deep in the automaton. > > > > Yeah, I had come to the same conclusion --- state moves made without > > consuming input would need to be backed out if we wanted to report the > > complete follow set. I haven't yet looked to see how to do that. > > Well, following you're previous suggestion, one can redefine the YYLEX > macro to save the current state each time a token is required. > > > > > As you noted, for things like "SHOW 123", the follow set basically > > > includes all keywords although you can have SHOW ALL or SHOW name. > > > So, as you suggested, you can either say "ident" as a simplification, but > > > > You're ignoring the distinction between classes of keywords. I would > > not recommend treating reserved_keywords as a subclass of ident. > > Ok, I agree that it can help in this very case a little bit here because > ALL is reserved. I did not make this distinction when I played with bison. > > > > > (5) anything that can be done would be hardwired to one version of bison. > > > > I think this argument is completely without merit. > > Possible;-) > > However I don't like writing hacks that depends on other people future > behavior. > > > > > (b) write a new "recursive descendant" parser, and drop "gram.y" > > > > Been there, done that, not impressed with the idea. There's a reason > > why people invented parser generators... > > Sure. It was just a list of options;-) > > > > > As a side effect of my inspection is that the automaton generated by bison > > > could be simplified if a different tradeoff between the lexer, the parser > > > and the post-processing would be chosen. Namelly, all tokens that are > > > just identifiers should be dropped and processed differently. > > > > We are not going to reserve the keywords that are presently unreserved. > > Reserving keywords would help, but I would not think it could be accepted, > because it is not SQL philosophy. SQL is about 30 years also, as old > as yacc ideas... but they did not care at the time;-) When you look at > the syntax, it was designed with a recursive parser in mind. > > Part of my comment was to explain "why" the generated automaton is large. > SQL is a small language, but it has a big automaton in postgresql, and > this is IMHO the explanation. > > > > If you can think of a reasonable way to stop treating them as separate > > tokens inside the grammar without altering the user-visible behavior, > > I'm certainly interested. > > I was thinking about type names especially, and maybe others. > > I join a small proof-of-concept patch to drop some tokens out of the > parser. As a results, 6 tokens, 6 rules and 9 states are removed, > and the automaton table is reduced by 438 entries (1.5%). Not too bad;-) > I think others could be dealt with similarily, or with more work. > > > Thanks for the discussion, > > -- > Fabien. Content-Description: [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 9: the planner will ignore your desire to choose an index scan if your > joining column's datatypes do not match -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
Fabien COELHO <coelho@cri.ensmp.fr> writes: >> If you can think of a reasonable way to stop treating them as separate >> tokens inside the grammar without altering the user-visible behavior, >> I'm certainly interested. > I join a small proof-of-concept patch to drop some tokens out of the > parser. I believe these were treated this way *specifically* because of the keyword-is-not-an-identifier issue. SQL99 calls out most of these as being keywords: SQL defines predefined data types named by the following <key word>s: CHARACTER, CHARACTER VARYING, CHARACTER LARGE OBJECT, BINARY LARGE OBJECT, BIT, BIT VARYING, NUMERIC, DECIMAL, INTEGER, SMALLINT, FLOAT, REAL, DOUBLE PRECISION, BOOLEAN, DATE, TIME, TIMESTAMP, and INTERVAL. These names are used in the type and if we don't treat them as keywords then we will have a couple of problems. One is case-conversion issues in locales where the standard downcasing is not an extension of ASCII (Turkish is the only one I know of offhand). Another is that depending on where you put the renaming that this patch removes without replacing :-(, it would be possible for the renaming transformation to get applied to user-defined types with similar names, or for user-defined types to unexpectedly shadow system definitions. The former would be surprising and the latter would violate the spec. Check the archives; I'm sure this was discussed in the 7.3 development cycle and we concluded that treating these names as keywords was the only practical solution. regards, tom lane
Tom Lane wrote: >Fabien COELHO <coelho@cri.ensmp.fr> writes: > > >>(b) write a new "recursive descendant" parser, and drop "gram.y" >> >> er, that's "recursive descent" :-) > >Been there, done that, not impressed with the idea. There's a reason >why people invented parser generators... > > Well, unless you are a serious masochist, cutting a parser of the LR family (including LALR(1), such as yacc/bison produce) by hand is very difficult and error prone. That is not the case with LL type parsers. Also, there are parser generators for this family, both table driven and recursive descent. The table driven variety especially can have quite well tuned error reporting and recovery. (Among the reasons I know this is that I actually wrote such a beast around 13 years ago). That is not to say that I think we should move from the present setup. Familiarity of the developer community with the tool used suggests we should not, quite apart from any other reasons. Changing to, say, an RD parser, would be a massive and probably error prone change and the benefit just doesn't seem worth it. In fact, considering this thread I'm not sure any of the suggested steps will achieve Fabien's goal. ISTM that a smarter "training wheels" frontend, perhaps with some modest backend improvements, is likely to have better results. ("Oh, you found an error near there - now what do I suggest belongs there?") cheers andrew
Dear Tom, > No, just redefine "yyerror" as a macro that passes additional parameters. Ok! That's just the kind of the hack-hook I was looking for!! > > The terminals are not available, they must be kept separatly if you > > want them. This can be done in yylex(). > > We don't need them. Any already-shifted tokens are to the left of where > the error is, no? Yes and no. I agree that you don't need them for computing the follow set. However it would help a lot to generate nicer hints. I'm looking for help the user, and the follow set is just part of the help. Think of typical "SELECT foo, bla, FROM xxx" (it looks stupid on a one line query, but not so on a very large query) which is quite easier to detect and hint about because of FROM is just after a comma. The rule part is also a real issue. There is no easy way to translate states into rules, to know whether we're somewhere in a "ShowStmt" or "AlterTableStmt"... If you're after a comma in state 1232, should you just say IDENT... I'd rather say "user name" or "table name" if I can... Otherwise the hint stays at a low parser level. Good hints requires to know about the current context, and it is not easy to get that deep in the automaton. > Yeah, I had come to the same conclusion --- state moves made without > consuming input would need to be backed out if we wanted to report the > complete follow set. I haven't yet looked to see how to do that. Well, following you're previous suggestion, one can redefine the YYLEX macro to save the current state each time a token is required. > > As you noted, for things like "SHOW 123", the follow set basically > > includes all keywords although you can have SHOW ALL or SHOW name. > > So, as you suggested, you can either say "ident" as a simplification, but > > You're ignoring the distinction between classes of keywords. I would > not recommend treating reserved_keywords as a subclass of ident. Ok, I agree that it can help in this very case a little bit here because ALL is reserved. I did not make this distinction when I played with bison. > > (5) anything that can be done would be hardwired to one version of bison. > > I think this argument is completely without merit. Possible;-) However I don't like writing hacks that depends on other people future behavior. > > (b) write a new "recursive descendant" parser, and drop "gram.y" > > Been there, done that, not impressed with the idea. There's a reason > why people invented parser generators... Sure. It was just a list of options;-) > > As a side effect of my inspection is that the automaton generated by bison > > could be simplified if a different tradeoff between the lexer, the parser > > and the post-processing would be chosen. Namelly, all tokens that are > > just identifiers should be dropped and processed differently. > > We are not going to reserve the keywords that are presently unreserved. Reserving keywords would help, but I would not think it could be accepted, because it is not SQL philosophy. SQL is about 30 years also, as old as yacc ideas... but they did not care at the time;-) When you look at the syntax, it was designed with a recursive parser in mind. Part of my comment was to explain "why" the generated automaton is large. SQL is a small language, but it has a big automaton in postgresql, and this is IMHO the explanation. > If you can think of a reasonable way to stop treating them as separate > tokens inside the grammar without altering the user-visible behavior, > I'm certainly interested. I was thinking about type names especially, and maybe others. I join a small proof-of-concept patch to drop some tokens out of the parser. As a results, 6 tokens, 6 rules and 9 states are removed, and the automaton table is reduced by 438 entries (1.5%). Not too bad;-) I think others could be dealt with similarily, or with more work. Thanks for the discussion, -- Fabien.
Attachment
Fabien COELHO <coelho@cri.ensmp.fr> writes: >> Another is that depending on where you put the renaming that this patch >> removes without replacing :-(, > I do not understand your point. It seems to me that the renaming is > performed when a type name is expected? The "boolean" keyword (not token) > is translated to system "bool" type in the GenericType rule?? ??? I mean that you removed functionality without putting it back; the modified parser will fail to recognize BOOLEAN as a type name at all, because it doesn't match "bool" which is in the catalogs. (And changing the entry to "boolean" is not a solution, it just moves the problem.) I assume you intended to handle this by doing the substitutions in type name lookup elsewhere in the parser, but I don't think that is a valid solution, because there is no longer enough information. In particular you can't any longer tell the difference between BOOLEAN and "boolean" (with quotes), which are not the same thing --- a quoted string is never a keyword, per spec. Possibly a better example than boolean is the REAL => pg_catalog.float4 transformation. If a user has defined his own type named foo.real, he ought to be able to refer to it as "real" (with quotes) and not get messed up by the keyword transformation. I think our original motivation for converting all these things to keywords was the realization that pg_dump would in fact screw up and fail to dump such a type definition correctly if "real" wasn't recognized as conflicting with a keyword (which is what prompts pg_dump to stick quotes on). The basic point here is that eliminating tokens as you propose will result in small changes in behavior, none of which are good or per spec. Making the parser automaton smaller would be nice, but not at that price. > My point is that you can have the very same *semantical* result with a > smaller automaton if you chose a different trade-off within the > lexer/parser/post filtering. I don't want to change the language. You have not proven that you can have the same result. regards, tom lane
Dear Tom, > > I join a small proof-of-concept patch to drop some tokens out of the > > parser. > > I believe these were treated this way *specifically* because of the > keyword-is-not-an-identifier issue. SQL99 calls out most of these > as being keywords: Well, I think that the "reserved keywords" are fine as tokens in a lexer/parser, but think that the "unreserved keywords" should be dropped of the token status if possible. > and if we don't treat them as keywords then we will have a couple of > problems. One is case-conversion issues in locales where the standard > downcasing is not an extension of ASCII (Turkish is the only one I know > of offhand). Do you mean it should use an ASCII-only strcasecmp, not a possibly "localised" version? I agree, but this is just a "proof of concept" patch to show that you don't need so many tokens in the parser. > Another is that depending on where you put the renaming that this patch > removes without replacing :-(, I do not understand your point. It seems to me that the renaming is performed when a type name is expected? The "boolean" keyword (not token) is translated to system "bool" type in the GenericType rule?? ??? > it would be possible for the renaming transformation to get applied to > user-defined types with similar names, or for user-defined types to > unexpectedly shadow system definitions. I don't think that the patch changes the result of the parsing. It drops *TOKENS* out of the lexer, but they are still *KEYWORDS*, although they are not explicitly in the lexer list. "keyword.c" deals with tokens, the file name was ill-chosen. If you think that keywords can only be lexical tokens, then you end-up with an automaton larger than necessary, IMVHO. Note that the removed tokens are still "keywords" as they are treated *especially* anyway. It is not a semantical transformation. Also, if you don't want these names as candidate function names, they could be filtered out at some other point in the parser. They really don't need to be special tokens. My point is that you can have the very same *semantical* result with a smaller automaton if you chose a different trade-off within the lexer/parser/post filtering. I don't want to change the language. > The former would be surprising and the latter would violate the spec. I'm really not sure this is the case with the patch I sent. > Check the archives; I'm sure this was discussed in the 7.3 development > cycle and we concluded that treating these names as keywords was the > only practical solution. Hmmm... I can check the archive, but I cannot see how different the language is with the patch. Maybe there is a missing filter out, or strcasecmp is not the right version, but no more. I think it is a small technical issue in the parser internals, and has nothing to do with great principles and whether this or that is a keyword. It's about what keywords need to be tokens. -- Fabien Coelho - coelho@cri.ensmp.fr
> >>(b) write a new "recursive descendant" parser, and drop "gram.y" > > er, that's "recursive descent" :-) Sorry for my French version. > Well, unless you are a serious masochist, I'm not a masochist;-) I'm arguing about where hint should/could be put. > In fact, considering this thread I'm not sure any of the suggested steps > will achieve Fabien's goal. ISTM that a smarter "training wheels" > frontend, perhaps with some modest backend improvements, is likely to > have better results. ("Oh, you found an error near there - now what do I > suggest belongs there?") I cannot see what you're suggesting "practically" as a frontend. I don't think having another parser next to the first one for better error messages is a serious option? I would not like to put another parser that need to be kept synchronized with the first one. So either it is integrated or linked with the current parser, or it is not there? Out of the parser, the only information is the offending token (embedded in the error string) and the character number in the string, that is quite small a clue to give a hint. -- Fabien Coelho - coelho@cri.ensmp.fr
Dear Tom, > In particular you can't any longer tell the difference between BOOLEAN > and "boolean" (with quotes), which are not the same thing --- a quoted > string is never a keyword, per spec. [...] Ok, so you mean that on -boolean- the lexer returns a BOOLEAN_P token, but with -"boolean"- it returns an Ident and -boolean- as a lval. Indeed, in such a case I cannot recognize that simply boolean vs "boolean" if they are both idents that look the same. As a matter of fact, this can also be fixed with some post-filtering. Say, all quoted idents could be returned with a leading " to show it was dquoted, and the IDENT rules in the parser could remove when it is not needed anymore to distinguish the case. Not beautiful, I agree, but my point is that the current number of tokens and number of states and automaton size are not inherent to SQL but to the way the lexing/parsing is performed in postgresql. > The basic point here is that eliminating tokens as you propose will > result in small changes in behavior, none of which are good or per spec. > Making the parser automaton smaller would be nice, but not at that > price. Ok. I don't want to change the spec. I still stand that it can be done, although some more twicking is required. It was just a "proof of concept", not a patch submission. Well, a "proof of concept" must still be a proof;-) I attach a small patch that solve the boolean vs "boolean" issue, still as a proof of concept that it is 'doable' to preserve semantics with a different lexer/parser balance. I don't claim that it should be applied, I just claim that the automaton size could be smaller, especially by shortening the "unreserved_keyword" list. > You have not proven that you can have the same result. Well, I passed the regression tests, but that does not indeed prove anything, because these issues are not tested at all. Maybe you could consider to add the "regression" part of the attached patcht, which creates a user "boolean" type. Anyway, my motivation is about "hints" and "advises", and that does not help a lot to solve these issues. -- Fabien.