Thread: RE: [HACKERS] Postgres' lexer
Leon, if you manage to find a replacement for this, please let me know. I'll probably only pick it up after the weekend. I think that we need to find another way to tokenise the minus. First of all, though, how is the parser supposed to tell whether this: a -2 means this: (a - 2) or this: a (-2) i.e.: does the unary - operator take precedence over the binary - operator or not? Is there even a difference. If the parser runs into this: 'a -2', perhaps we could replace it with 'a + (-2)' instead. How does a C compiler tokenize this? Or some other standard SQL parser? MikeA >> -----Original Message----- >> From: Leon [mailto:leon@udmnet.ru] >> Sent: Friday, August 20, 1999 3:34 PM >> To: hackers >> Subject: Re: [HACKERS] Postgres' lexer >> >> >> Ansley, Michael wrote: >> > >> > Leon, I see that you have been running into the vltc >> problem ;-) I just run >> > a flex -p, and went to line 314. >> >> I got it. It is done to prevent minus from sticking to number in >> expressions like 'a -2'. Dirty, but it works. >> >> -- >> Leon. >> --------- >> "This may seem a bit weird, but that's okay, because it is weird." - >> Perl manpage. >> >> >> ************ >>
I think that we need to find another way to tokenise the minus. First of all, though, how is the parser supposed totell whether this: a -2 means this: (a - 2) or this: a (-2) i.e.: does the unary - operator take precedence over the binary - operator or not? Is there even a difference. If theparser runs into this: 'a -2', perhaps we could replace it with 'a + (-2)' instead. How does a C compiler tokenize this? Or some other standard SQL parser? For the C compiler a -2 can only mean (a - 2); a (-2) must explicitly be a function call and isn't generated by the compiler from a -2. I think the question for SQL is, does the language allow an ambiguity here? If not, wouldn't it be much smarter to keep the minus sign as its own token and deal with the semantics in the parser? Cheers, Brook
Brook Milligan <brook@biology.nmsu.edu> writes: > I think the question for SQL is, does the language allow an ambiguity > here? If not, wouldn't it be much smarter to keep the minus sign as > its own token and deal with the semantics in the parser? I don't see a good reason to tokenize the '-' as part of the number either. I think that someone may have hacked the lexer to try to merge unary minus into numeric constants, so that in an expression likeWHERE field < -2 the -2 would be treated as a constant rather than an expression involving application of unary minus --- which is important because the optimizer is too dumb to optimize the query if it looks like an expression. However, trying to make that happen at lex time is just silly. The lexer doesn't have enough context to handle all the cases anyway. We've currently got code in the grammar to do the same reduction. (IMHO that's still too early, and it ought to be done post-rewriter as part of a general-purpose constant expression reducer; will get around to that someday ;-).) So it seems to me that we should just rip *all* this cruft out of the lexer, and always return '-' as a separate token, never as part of a number. (*) Then we wouldn't need this lookahead feature. But it'd be good to get an opinion from the other tgl first ;-). I'm just a kibitzer when it comes to the lex/yacc stuff. regards, tom lane (*) not counting float exponents, eg "1.234e-56" of course.
Ansley, Michael wrote: > > Leon, if you manage to find a replacement for this, please let me know. > I'll probably only pick it up after the weekend. > > I think that we need to find another way to tokenise the minus. First of > all, though, how is the parser supposed to tell whether this: > a -2 > means this: > (a - 2) > or this: > a (-2) I think that the current behavior is ok - it is what we would expect from expressions like 'a -2'. I have produced a patch to cleanup the code. It works due to the fact that unary minus gets processed in doNegate() in parser anyway, and it is by no way lexer's job to do grammatical parsing - i.e. deciding if operator is to be treated as binary or unary. I ran regression tests, everything seems to be ok. It is my first diff/patch experience in *NIX, so take it with mercy :) But it seems to be correct. It is to be applied against 6.5.0 (I have not upgraded to 6.5.1 yet, but hope lexer hasn't changed since then.) The patch mainly contains nuked code. The only thing added is my short comment :) Have I done some right thing? :) -- Leon. --------- "This may seem a bit weird, but that's okay, because it is weird." - Perl manpage.
Attachment
> But it'd be good to get an opinion from the other tgl first ;-). Sadly, the former "tgl" ;) Sorry, I was away on vacation. I've waded through ~300 mail messages already, but have ~700 to go, so I apologize if I've missed some more developments. I added the <xm> exclusive state to accomodate the possibility of a unary minus. The change was provoked by Vadim's addition of CREATE SEQUENCE, which should allow negative numbers for some arguments. But this just uncovered the tip of the general problem... There are several cases which need to be handled (I'm doing this from memory, so may miss a few): o Positive and negative numbers as standalone arguments, with and without spaces between the "-" and the digits. o Positive and negative numbers as first arguments to binary operators, with and without spaces at all possible places. o Positive and negative numbers as second arguments to binary operators, or as arguments to unary operators. o Positive and negative numbers in the presence of operators containing minus signs, including a trailing minus sign where possible. 'taint easy to do it completely right. Perhaps trying to do less in the scanner is the right thing to do, but istm that it may put restrictions on the grammar which are not currently there. Not a good trade for a longer query length... - Thomas -- Thomas Lockhart lockhart@alumni.caltech.edu South Pasadena, California
Thomas Lockhart <lockhart@alumni.caltech.edu> writes: > I added the <xm> exclusive state to accomodate the possibility of a > unary minus. The change was provoked by Vadim's addition of CREATE > SEQUENCE, which should allow negative numbers for some arguments. But > this just uncovered the tip of the general problem... It seems awfully hard and dangerous to try to identify unary minus in the lexer. The grammar at least has enough knowledge to recognize that a minus *is* unary and not binary. Looking into gram.y, I find that the CREATE SEQUENCE productions handle collapsing unary minus all by themselves! So in that particular case, there is still no need for the lexer to do it. AFAICT in a quick look through gram.y, there are no places where unary minus is recognized that gram.y won't try to collapse it. In short, I still think that the whole mess ought to come out of the lexer... regards, tom lane
> > I added the <xm> exclusive state to accomodate the possibility of a > > unary minus. The change was provoked by Vadim's addition of CREATE > > SEQUENCE, which should allow negative numbers for some arguments. But > > this just uncovered the tip of the general problem... > It seems awfully hard and dangerous to try to identify unary minus in > the lexer. The grammar at least has enough knowledge to recognize that > a minus *is* unary and not binary. Looking into gram.y, I find that the > CREATE SEQUENCE productions handle collapsing unary minus all by > themselves! So in that particular case, there is still no need for the > lexer to do it. AFAICT in a quick look through gram.y, there are no > places where unary minus is recognized that gram.y won't try to collapse > it. > In short, I still think that the whole mess ought to come out of the > lexer... My recollection of the whole point is that, as you mention, *you can't identify a unary minus in the lexer*. So the minus sign is kept distinct, to be reconciled later as either a unary minus *or* an operator *or* whatever. The problem was that before, things like (-2) and (- 2) were handled differently just because the spacing was different. Anyway, I'll look at the defacto changes; perhaps they are just fine but I'm worried that we've reverted the behavior... - Thomas -- Thomas Lockhart lockhart@alumni.caltech.edu South Pasadena, California
Oh, there isn't really much change. The minus is passed standalone as it always was. The only thing is that currently in numbers with unary minus it gets coerced not in lexer, but in parser in doNegate(). I wonder why that hasn't been done earlier - especially considering that doNegate() existed long before my, hmm, fiddling. To tell the truth, there is some ambiguity in various operators. That ambiguity is stemming from Postgres's type-extension system. Consider this: SELECT 3+-2; What would you expect from that? I personally would expect the result of 1. But it produces an error, because '+-' is treated as some user-defined operator, which is not true. Such innocent expression as SELECT --2 puts Postgres in daze - it (psql) waits for 'completion' of such query (it treats symbols '--' as comment start :-) See? There are more pitfalls beside minus coercing :-) This all was done to clean up the code and 'straighten' the parser. There was a performance breaker, officially called AFAIR 'variable trailing context'. Thomas Lockhart wrote: > > There are several cases which need to be handled (I'm doing this from > memory, so may miss a few): > > o Positive and negative numbers as standalone arguments, with and > without spaces between the "-" and the digits. > > o Positive and negative numbers as first arguments to binary > operators, with and without spaces at all possible places. > > o Positive and negative numbers as second arguments to binary > operators, or as arguments to unary operators. > > o Positive and negative numbers in the presence of operators > containing minus signs, including a trailing minus sign where > possible. > > 'taint easy to do it completely right. Perhaps trying to do less in > the scanner is the right thing to do, but istm that it may put > restrictions on the grammar which are not currently there. Not a good > trade for a longer query length... -- Leon.
> Oh, there isn't really much change. The minus is passed standalone > as it always was. The only thing is that currently in numbers with unary > minus it gets coerced not in lexer, but in parser in doNegate(). > I wonder why that hasn't been done earlier - especially considering > that doNegate() existed long before my, hmm, fiddling. For good reasons. See below :( > To tell the truth, there is some ambiguity in various operators. > That ambiguity is stemming from Postgres's type-extension system. There is the possibility for ambiguity. But it is our responsibility to minimize that ambiguity and to make a predictable system, in the presence of Postgres' unique and valuable features such as type extension. imho this is more important than, for example, allowing infinite-length queries. > Consider this: SELECT 3+-2; What would you expect from that? I > personally would expect the result of 1. But it produces an error, > because '+-' is treated as some user-defined operator, which is > not true. That is part of my concern here. The current behavior is what you say you would expect! Your patches change that behavor!! postgres=> select 3+-2; ?column? -------- 1 (1 row) > Such innocent expression as SELECT --2 puts Postgres in > daze - it (psql) waits for 'completion' of such query (it treats > symbols '--' as comment start :-) See? There are more pitfalls > beside minus coercing :-) There are some well-defined features of SQL92 which we try hard to support; the comment convention is one of them. That is a special case and shouldn't confuse the issue here; we'll need a different test case to make the point for me... > This all was done to clean up the code and 'straighten' the parser. > There was a performance breaker, officially called AFAIR 'variable > trailing context'. Sorry, what is the performance penalty for that feature, and how do we measure that against breakage of expected, predictable behavior? Please quantify. So far, I'm not a fan of the proposed change; we're giving up behavior that impacts Postgres' unique type extension features for an arbitrarily large query buffer (as opposed to a generously large query buffer, which can be accomplished just by changing the fixed size). - Thomas -- Thomas Lockhart lockhart@alumni.caltech.edu South Pasadena, California
Thomas Lockhart <lockhart@alumni.caltech.edu> writes: >> Consider this: SELECT 3+-2; What would you expect from that? I >> personally would expect the result of 1. But it produces an error, >> because '+-' is treated as some user-defined operator, which is >> not true. > That is part of my concern here. The current behavior is what you say > you would expect! Your patches change that behavor!! > postgres=> select 3+-2; > ?column? > -------- > 1 > (1 row) OTOH, with current sources: regression=> select 3+- 2; ERROR: Unable to identify an operator '+-' for types 'int4' and 'int4' You will have to retype this query using anexplicit cast regression=> select f1+-f1 from int4_tbl; ERROR: Unable to identify an operator '+-' for types 'int4' and 'int4' You will have to retype this query using anexplicit cast To my mind, without spaces this construction *is* ambiguous, and frankly I'd have expected the second interpretation ('+-' is a single operator name). Almost every computer language in the world uses "greedy" tokenization where the next token is the longest series of characters that can validly be a token. I don't regard the above behavior as predictable, natural, nor obvious. In fact, I'd say it's a bug that "3+-2" and "3+-x" are not lexed in the same way. However, aside from arguing about whether the current behavior is good or bad, these examples seem to indicate that it doesn't take an infinite amount of lookahead to reproduce the behavior. It looks to me like we could preserve the current behavior by parsing a '-' as a separate token if it *immediately* precedes a digit, and otherwise allowing it to be folded into the preceding operator. That could presumably be done without VLTC. regards, tom lane
Tom Lane wrote: > To my mind, without spaces this construction *is* ambiguous, and frankly > I'd have expected the second interpretation ('+-' is a single operator > name). Almost every computer language in the world uses "greedy" > tokenization where the next token is the longest series of characters > that can validly be a token. I don't regard the above behavior as > predictable, natural, nor obvious. In fact, I'd say it's a bug that > "3+-2" and "3+-x" are not lexed in the same way. > Completely agree with that. This differentiating behavior looks like a bug. > However, aside from arguing about whether the current behavior is good > or bad, these examples seem to indicate that it doesn't take an infinite > amount of lookahead to reproduce the behavior. It looks to me like we > could preserve the current behavior by parsing a '-' as a separate token > if it *immediately* precedes a digit, and otherwise allowing it to be > folded into the preceding operator. That could presumably be done > without VLTC. Ok. If we *have* to preserve old weird behavior, here is the patch. It is to be applied over all my other patches. Though if I were to decide whether to restore old behavior, I wouldn't do it. Because it is inconsistency in grammar, i.e. a bug. -- Leon.