Thread: RE: [HACKERS] Postgres' lexer

RE: [HACKERS] Postgres' lexer

From
"Ansley, Michael"
Date:
>> > To my mind, without spaces this construction *is* ambiguous, and
frankly
>> > I'd have expected the second interpretation ('+-' is a single operator
>> > name).  Almost every computer language in the world uses "greedy"
>> > tokenization where the next token is the longest series of characters
>> > that can validly be a token.  I don't regard the above behavior as
>> > predictable, natural, nor obvious.  In fact, I'd say it's a bug that
>> > "3+-2" and "3+-x" are not lexed in the same way.
>> > 
>> 
>> Completely agree with that. This differentiating behavior looks like a
bug.
>> 
>> > However, aside from arguing about whether the current behavior is good
>> > or bad, these examples seem to indicate that it doesn't take an
infinite
>> > amount of lookahead to reproduce the behavior.  It looks to me like we
>> > could preserve the current behavior by parsing a '-' as a separate
token
>> > if it *immediately* precedes a digit, and otherwise allowing it to be
>> > folded into the preceding operator.  That could presumably be done
>> > without VLTC.
>> 
>> Ok. If we *have* to preserve old weird behavior, here is the patch.
>> It is to be applied over all my other patches. Though if I were to
>> decide whether to restore old behavior, I wouldn't do it. Because it
>> is inconsistency in grammar, i.e. a bug.
>> 
If a construct is ambiguous, then the behaviour should be undefined (i.e.:
we can do what we like, within reason).  If the user wants something
predictable, then she should use brackets ;-)

If 3+-2 presents an ambiguity (which it does) then make sure that you do
this: 3+(-2).  If you have an operator +- then you should do this (3)+-(2).
However, if you have 3+-2 without brackets, then, because this is ambiguous
(assuming no +- operator), this is undefined, and we can do pretty much
whatever we feel like with it.  Unless there is an operator +- defined,
because then the behaviour is no longer ambiguous.  The longest possible
identifier is always matched, and this means that the +- will be identified.

Especially with the unary minus, my feeling is that it should be placed in
brackets if correct behaviour is desired.

MikeA



Re: [HACKERS] Postgres' lexer

From
Leon
Date:
Ansley, Michael wrote:

> If a construct is ambiguous, then the behaviour should be undefined (i.e.:
> we can do what we like, within reason).  If the user wants something
> predictable, then she should use brackets ;-)
> 
> If 3+-2 presents an ambiguity (which it does) then make sure that you do
> this: 3+(-2).  If you have an operator +- then you should do this (3)+-(2).
> However, if you have 3+-2 without brackets, then, because this is ambiguous
> (assuming no +- operator), this is undefined, and we can do pretty much
> whatever we feel like with it.  Unless there is an operator +- defined,
> because then the behaviour is no longer ambiguous.  The longest possible
> identifier is always matched, and this means that the +- will be identified.
> 
> Especially with the unary minus, my feeling is that it should be placed in
> brackets if correct behaviour is desired.

When I first read that, I thought "can sign every word of that".
But suddenly realized that there are more buggy situations here:
consider a>-2. It is parsed as (a) >- (2). Even in original 
Thomas Lockhart's version there is a bug: it parses a>-b as (a) >- (b).
So I decided to simply forbid long operators to end with minus. If you
think that it is right, here is the patch (today is my patch bomb
day :). It is to be applied *instead* my earlier today's patch.

Seems that it is the only more or less clean way to deal with 
big operator/unary minus conflict.
-- 
Leon.

Re: [HACKERS] Postgres' lexer

From
Tom Lane
Date:
Leon <leon@udmnet.ru> writes:
> So I decided to simply forbid long operators to end with minus.

No good: we already have some.  There are three standard geometric
operators named "?-" ... not to mention lord-knows-what user-defined
operators out in the field.  This might have been a good solution if
we'd put it in on day one, but it's too late.

I still like just telling people to write "a > -2".  They don't expect
"ab" to mean the same thing as "a b", nor "24" to be the same as "2 4",
so why should ">-" necessarily mean the same as "> -" ?

It would also be worth remembering that "-" is far from the only unary
operator name we have, and so a solution that creates special behavior
just for "-" is really no solution at all.  Making a special case for
"-" just increases the potential for confusion, not decreases it, IMHO.
        regards, tom lane


Re: [HACKERS] Postgres' lexer

From
Leon
Date:
Tom Lane wrote:

> It would also be worth remembering that "-" is far from the only unary
> operator name we have, and so a solution that creates special behavior
> just for "-" is really no solution at all.  Making a special case for
> "-" just increases the potential for confusion, not decreases it, IMHO.

Ok. Especially if there are more unary operators (I always wondered
what unary % in gram.y stands for :)  it is reasonable not to make
a special case of uminus and slightly change the old behavior. That
is even more convincing that constructs like 3+-2 and 3+-b were 
parsed in different way, and, what is worse, a>-2 and a>-b also
parsed differently. So let us ask the (hopefully) last question:
Thomas (Lockhart), do you agree on always parsing constructs like
'+-' or '>-' as is, and not as '+' '-' or '>' '-'  ?

-- 
Leon.



Re: [HACKERS] Postgres' lexer

From
Thomas Lockhart
Date:
> When I first read that, I thought "can sign every word of that".
> But suddenly realized that there are more buggy situations here:
> consider a>-2. It is parsed as (a) >- (2). Even in original
> Thomas Lockhart's version there is a bug: it parses a>-b as (a) >- (b).

Bugs can be fixed. We don't always need to perform radical surgery.

> So I decided to simply forbid long operators to end with minus. If you
> think that it is right, here is the patch (today is my patch bomb
> day :). It is to be applied *instead* my earlier today's patch.
> Seems that it is the only more or less clean way to deal with
> big operator/unary minus conflict.

That would disallow an existing built-in operator ("?-", the "is
horizontal?" test; of course, that one's my fault too ;).

This is a great conversation, because at the end of it we are going to
have a more solid parser. But I would suggest that we do at least two
things:

1) generate a *complete* list of test cases. I'll include them in the
regression tests to make sure that we preserve capabilities when
changes are made in the future. This should include cases which we
think *should* change behavior later.

2) move slowly on patching the parser for this, since we clearly have
incomplete coverage in our regression tests and since we aren't
perfectly predicting the ramifications yet.

My recollection is that my last patches for the lexer stemmed from
trying to fix unary minus behavior for constants used as arguments to
DDL statements like CREATE SEQUENCE/START, but as I did that I started
seeing other cases which weren't handled correctly. I fixed, to my
understanding of what desirable behavior should be, the cases which
involved numeric constants. imho this same consideration should be
given to other expressions just as you are doing now.

The overall parser behavior should meet some criteria, such as (in
decreasing priority):

o don't produce non-intuitive or unexpected results
o fully expose underlying capabilities of the backend
o try to do the right thing in common cases
o try to do the right thing in unusual cases

I'll make the (perhaps incorrect) claim that the current behavior is
about right for numeric constants (common cases involving various
whitespace possibilities work about right once everything is through
the parser). (The "+-" operator is a good unusual case to focus on,
and we may conclude that it isn't done right at the moment.) Where
things happen in the parser can change. If the current behavior can't
be reconciled with improved behaviors with other non-constant
expressions, then maybe it should be sacrificed, but not until we try
hard to improve it, rather than disallow it...
                   - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Postgres' lexer

From
Leon
Date:
Thomas Lockhart wrote:

> I'll make the (perhaps incorrect) claim that the current behavior is
> about right for numeric constants (common cases involving various
> whitespace possibilities work about right once everything is through
> the parser). (The "+-" operator is a good unusual case to focus on,
> and we may conclude that it isn't done right at the moment.) Where
> things happen in the parser can change. If the current behavior can't
> be reconciled with improved behaviors with other non-constant
> expressions, then maybe it should be sacrificed, but not until we try
> hard to improve it, rather than disallow it...

Suppose you parse a***-b (where *** are any operator-like symbols)
as (a)  ***  -  (b).  Hence you parse a?-b as (a) ?  - (b). No good.
Solution? No clean solution within horizon - must then have hardwired
list of operators somwhere in pasrer. If we could dream of changing
?- operator ... ;) But we can't. Even your model of system which
sticks uminus to number isn't fit for type-extension  system. Imagine
there is crazy user some place out there who wants to define operator
like +- or #- . It doesn't seem to be senseless - if Postgres itself
has ?- operator, it then could live with my homegrown %- operator!
And then suppose that the second argument to that operator is number.
See the pitfall? 

The only possible thing seems to be to state in documentation that we 
have a peculiar type-extension system which is biased towards long
operators - when it sees long operator string, it swallows it as a whole.
Thus - users, use spaces where needed! This is the way to introduce
type-extension ideology throughout the system from parser onwards.
This ideology could be the guiding light in parser matters 
(there is now lack thereof).

-- 
Leon.