Thread: RE: [HACKERS] Postgres' lexer
>> > To my mind, without spaces this construction *is* ambiguous, and frankly >> > I'd have expected the second interpretation ('+-' is a single operator >> > name). Almost every computer language in the world uses "greedy" >> > tokenization where the next token is the longest series of characters >> > that can validly be a token. I don't regard the above behavior as >> > predictable, natural, nor obvious. In fact, I'd say it's a bug that >> > "3+-2" and "3+-x" are not lexed in the same way. >> > >> >> Completely agree with that. This differentiating behavior looks like a bug. >> >> > However, aside from arguing about whether the current behavior is good >> > or bad, these examples seem to indicate that it doesn't take an infinite >> > amount of lookahead to reproduce the behavior. It looks to me like we >> > could preserve the current behavior by parsing a '-' as a separate token >> > if it *immediately* precedes a digit, and otherwise allowing it to be >> > folded into the preceding operator. That could presumably be done >> > without VLTC. >> >> Ok. If we *have* to preserve old weird behavior, here is the patch. >> It is to be applied over all my other patches. Though if I were to >> decide whether to restore old behavior, I wouldn't do it. Because it >> is inconsistency in grammar, i.e. a bug. >> If a construct is ambiguous, then the behaviour should be undefined (i.e.: we can do what we like, within reason). If the user wants something predictable, then she should use brackets ;-) If 3+-2 presents an ambiguity (which it does) then make sure that you do this: 3+(-2). If you have an operator +- then you should do this (3)+-(2). However, if you have 3+-2 without brackets, then, because this is ambiguous (assuming no +- operator), this is undefined, and we can do pretty much whatever we feel like with it. Unless there is an operator +- defined, because then the behaviour is no longer ambiguous. The longest possible identifier is always matched, and this means that the +- will be identified. Especially with the unary minus, my feeling is that it should be placed in brackets if correct behaviour is desired. MikeA
Ansley, Michael wrote: > If a construct is ambiguous, then the behaviour should be undefined (i.e.: > we can do what we like, within reason). If the user wants something > predictable, then she should use brackets ;-) > > If 3+-2 presents an ambiguity (which it does) then make sure that you do > this: 3+(-2). If you have an operator +- then you should do this (3)+-(2). > However, if you have 3+-2 without brackets, then, because this is ambiguous > (assuming no +- operator), this is undefined, and we can do pretty much > whatever we feel like with it. Unless there is an operator +- defined, > because then the behaviour is no longer ambiguous. The longest possible > identifier is always matched, and this means that the +- will be identified. > > Especially with the unary minus, my feeling is that it should be placed in > brackets if correct behaviour is desired. When I first read that, I thought "can sign every word of that". But suddenly realized that there are more buggy situations here: consider a>-2. It is parsed as (a) >- (2). Even in original Thomas Lockhart's version there is a bug: it parses a>-b as (a) >- (b). So I decided to simply forbid long operators to end with minus. If you think that it is right, here is the patch (today is my patch bomb day :). It is to be applied *instead* my earlier today's patch. Seems that it is the only more or less clean way to deal with big operator/unary minus conflict. -- Leon.
Leon <leon@udmnet.ru> writes: > So I decided to simply forbid long operators to end with minus. No good: we already have some. There are three standard geometric operators named "?-" ... not to mention lord-knows-what user-defined operators out in the field. This might have been a good solution if we'd put it in on day one, but it's too late. I still like just telling people to write "a > -2". They don't expect "ab" to mean the same thing as "a b", nor "24" to be the same as "2 4", so why should ">-" necessarily mean the same as "> -" ? It would also be worth remembering that "-" is far from the only unary operator name we have, and so a solution that creates special behavior just for "-" is really no solution at all. Making a special case for "-" just increases the potential for confusion, not decreases it, IMHO. regards, tom lane
Tom Lane wrote: > It would also be worth remembering that "-" is far from the only unary > operator name we have, and so a solution that creates special behavior > just for "-" is really no solution at all. Making a special case for > "-" just increases the potential for confusion, not decreases it, IMHO. Ok. Especially if there are more unary operators (I always wondered what unary % in gram.y stands for :) it is reasonable not to make a special case of uminus and slightly change the old behavior. That is even more convincing that constructs like 3+-2 and 3+-b were parsed in different way, and, what is worse, a>-2 and a>-b also parsed differently. So let us ask the (hopefully) last question: Thomas (Lockhart), do you agree on always parsing constructs like '+-' or '>-' as is, and not as '+' '-' or '>' '-' ? -- Leon.
> When I first read that, I thought "can sign every word of that". > But suddenly realized that there are more buggy situations here: > consider a>-2. It is parsed as (a) >- (2). Even in original > Thomas Lockhart's version there is a bug: it parses a>-b as (a) >- (b). Bugs can be fixed. We don't always need to perform radical surgery. > So I decided to simply forbid long operators to end with minus. If you > think that it is right, here is the patch (today is my patch bomb > day :). It is to be applied *instead* my earlier today's patch. > Seems that it is the only more or less clean way to deal with > big operator/unary minus conflict. That would disallow an existing built-in operator ("?-", the "is horizontal?" test; of course, that one's my fault too ;). This is a great conversation, because at the end of it we are going to have a more solid parser. But I would suggest that we do at least two things: 1) generate a *complete* list of test cases. I'll include them in the regression tests to make sure that we preserve capabilities when changes are made in the future. This should include cases which we think *should* change behavior later. 2) move slowly on patching the parser for this, since we clearly have incomplete coverage in our regression tests and since we aren't perfectly predicting the ramifications yet. My recollection is that my last patches for the lexer stemmed from trying to fix unary minus behavior for constants used as arguments to DDL statements like CREATE SEQUENCE/START, but as I did that I started seeing other cases which weren't handled correctly. I fixed, to my understanding of what desirable behavior should be, the cases which involved numeric constants. imho this same consideration should be given to other expressions just as you are doing now. The overall parser behavior should meet some criteria, such as (in decreasing priority): o don't produce non-intuitive or unexpected results o fully expose underlying capabilities of the backend o try to do the right thing in common cases o try to do the right thing in unusual cases I'll make the (perhaps incorrect) claim that the current behavior is about right for numeric constants (common cases involving various whitespace possibilities work about right once everything is through the parser). (The "+-" operator is a good unusual case to focus on, and we may conclude that it isn't done right at the moment.) Where things happen in the parser can change. If the current behavior can't be reconciled with improved behaviors with other non-constant expressions, then maybe it should be sacrificed, but not until we try hard to improve it, rather than disallow it... - Thomas -- Thomas Lockhart lockhart@alumni.caltech.edu South Pasadena, California
Thomas Lockhart wrote: > I'll make the (perhaps incorrect) claim that the current behavior is > about right for numeric constants (common cases involving various > whitespace possibilities work about right once everything is through > the parser). (The "+-" operator is a good unusual case to focus on, > and we may conclude that it isn't done right at the moment.) Where > things happen in the parser can change. If the current behavior can't > be reconciled with improved behaviors with other non-constant > expressions, then maybe it should be sacrificed, but not until we try > hard to improve it, rather than disallow it... Suppose you parse a***-b (where *** are any operator-like symbols) as (a) *** - (b). Hence you parse a?-b as (a) ? - (b). No good. Solution? No clean solution within horizon - must then have hardwired list of operators somwhere in pasrer. If we could dream of changing ?- operator ... ;) But we can't. Even your model of system which sticks uminus to number isn't fit for type-extension system. Imagine there is crazy user some place out there who wants to define operator like +- or #- . It doesn't seem to be senseless - if Postgres itself has ?- operator, it then could live with my homegrown %- operator! And then suppose that the second argument to that operator is number. See the pitfall? The only possible thing seems to be to state in documentation that we have a peculiar type-extension system which is biased towards long operators - when it sees long operator string, it swallows it as a whole. Thus - users, use spaces where needed! This is the way to introduce type-extension ideology throughout the system from parser onwards. This ideology could be the guiding light in parser matters (there is now lack thereof). -- Leon.