Thread: RE: [HACKERS] Postgres' lexer

RE: [HACKERS] Postgres' lexer

From
"Ansley, Michael"
Date:
Leon, if you manage to find a replacement for this, please let me know.
I'll probably only pick it up after the weekend.

I think that we need to find another way to tokenise the minus.  First of
all, though, how is the parser supposed to tell whether this:
a -2
means this:
(a - 2)
or this:
a (-2)

i.e.: does the unary - operator take precedence over the binary - operator
or not?  Is there even a difference.  If the parser runs into this: 'a -2',
perhaps we could replace it with 'a + (-2)' instead.

How does a C compiler tokenize this?  Or some other standard SQL parser?

MikeA

>> -----Original Message-----
>> From: Leon [mailto:leon@udmnet.ru]
>> Sent: Friday, August 20, 1999 3:34 PM
>> To: hackers
>> Subject: Re: [HACKERS] Postgres' lexer
>> 
>> 
>> Ansley, Michael wrote:
>> > 
>> > Leon, I see that you have been running into the vltc 
>> problem ;-)  I just run
>> > a flex -p, and went to line 314.
>> 
>> I got it. It is done to prevent minus from sticking to number in
>> expressions like 'a -2'. Dirty, but it works.
>> 
>> -- 
>> Leon.
>> ---------
>> "This may seem a bit weird, but that's okay, because it is weird." -
>> Perl manpage.
>> 
>> 
>> ************
>> 


Re: [HACKERS] Postgres' lexer

From
Brook Milligan
Date:
I think that we need to find another way to tokenise the minus.  First of  all, though, how is the parser supposed
totell whether this:  a -2  means this:  (a - 2)  or this:  a (-2)
 
  i.e.: does the unary - operator take precedence over the binary - operator  or not?  Is there even a difference.  If
theparser runs into this: 'a -2',  perhaps we could replace it with 'a + (-2)' instead.
 
  How does a C compiler tokenize this?  Or some other standard SQL parser?

For the C compiler a -2 can only mean (a - 2); a (-2) must explicitly
be a function call and isn't generated by the compiler from a -2.

I think the question for SQL is, does the language allow an ambiguity
here?  If not, wouldn't it be much smarter to keep the minus sign as
its own token and deal with the semantics in the parser?

Cheers,
Brook


Re: [HACKERS] Postgres' lexer

From
Tom Lane
Date:
Brook Milligan <brook@biology.nmsu.edu> writes:
> I think the question for SQL is, does the language allow an ambiguity
> here?  If not, wouldn't it be much smarter to keep the minus sign as
> its own token and deal with the semantics in the parser?

I don't see a good reason to tokenize the '-' as part of the number
either.  I think that someone may have hacked the lexer to try to
merge unary minus into numeric constants, so that in an expression likeWHERE field < -2
the -2 would be treated as a constant rather than an expression
involving application of unary minus --- which is important because
the optimizer is too dumb to optimize the query if it looks like an
expression.

However, trying to make that happen at lex time is just silly.
The lexer doesn't have enough context to handle all the cases
anyway.  We've currently got code in the grammar to do the same
reduction.  (IMHO that's still too early, and it ought to be done
post-rewriter as part of a general-purpose constant expression
reducer; will get around to that someday ;-).)

So it seems to me that we should just rip *all* this cruft out of the
lexer, and always return '-' as a separate token, never as part of
a number. (*)  Then we wouldn't need this lookahead feature.

But it'd be good to get an opinion from the other tgl first ;-).
I'm just a kibitzer when it comes to the lex/yacc stuff.
        regards, tom lane

(*) not counting float exponents, eg "1.234e-56" of course.


Re: [HACKERS] Postgres' lexer

From
Leon
Date:
Ansley, Michael wrote:
>
> Leon, if you manage to find a replacement for this, please let me know.
> I'll probably only pick it up after the weekend.
>
> I think that we need to find another way to tokenise the minus.  First of
> all, though, how is the parser supposed to tell whether this:
> a -2
> means this:
> (a - 2)
> or this:
> a (-2)

I think that the current behavior is ok - it is what we would expect
from expressions like 'a -2'.

I have produced a patch to cleanup the code. It works due to the
fact that unary minus gets processed in doNegate() in parser anyway,
and it is by no way lexer's job to do grammatical parsing - i.e.
deciding if operator is to be treated as binary or unary.

I ran regression tests, everything seems to be ok. It is my first
diff/patch experience in *NIX, so take it with mercy :) But it
seems to be correct. It is to be applied against 6.5.0 (I have
not upgraded to 6.5.1 yet, but hope lexer hasn't changed since
then.) The patch mainly contains nuked code. The only thing added
is my short comment :)

Have I done some right thing? :)

--
Leon.
---------
"This may seem a bit weird, but that's okay, because it is weird." -
Perl manpage.
Attachment

Re: [HACKERS] Postgres' lexer

From
Thomas Lockhart
Date:
> But it'd be good to get an opinion from the other tgl first ;-).

Sadly, the former "tgl" ;)

Sorry, I was away on vacation. I've waded through ~300 mail messages
already, but have ~700 to go, so I apologize if I've missed some more
developments.

I added the <xm> exclusive state to accomodate the possibility of a
unary minus. The change was provoked by Vadim's addition of CREATE
SEQUENCE, which should allow negative numbers for some arguments. But
this just uncovered the tip of the general problem...

There are several cases which need to be handled (I'm doing this from
memory, so may miss a few):

o Positive and negative numbers as standalone arguments, with and
without spaces between the "-" and the digits.

o Positive and negative numbers as first arguments to binary
operators, with and without spaces at all possible places.

o Positive and negative numbers as second arguments to binary
operators, or as arguments to unary operators.

o Positive and negative numbers in the presence of operators
containing minus signs, including a trailing minus sign where
possible.

'taint easy to do it completely right. Perhaps trying to do less in
the scanner is the right thing to do, but istm that it may put
restrictions on the grammar which are not currently there. Not a good
trade for a longer query length...
                   - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Postgres' lexer

From
Tom Lane
Date:
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
> I added the <xm> exclusive state to accomodate the possibility of a
> unary minus. The change was provoked by Vadim's addition of CREATE
> SEQUENCE, which should allow negative numbers for some arguments. But
> this just uncovered the tip of the general problem...

It seems awfully hard and dangerous to try to identify unary minus in
the lexer.  The grammar at least has enough knowledge to recognize that
a minus *is* unary and not binary.  Looking into gram.y, I find that the
CREATE SEQUENCE productions handle collapsing unary minus all by
themselves!  So in that particular case, there is still no need for the
lexer to do it.  AFAICT in a quick look through gram.y, there are no
places where unary minus is recognized that gram.y won't try to collapse
it.

In short, I still think that the whole mess ought to come out of the
lexer...
        regards, tom lane


Re: [HACKERS] Postgres' lexer

From
Thomas Lockhart
Date:
> > I added the <xm> exclusive state to accomodate the possibility of a
> > unary minus. The change was provoked by Vadim's addition of CREATE
> > SEQUENCE, which should allow negative numbers for some arguments. But
> > this just uncovered the tip of the general problem...
> It seems awfully hard and dangerous to try to identify unary minus in
> the lexer.  The grammar at least has enough knowledge to recognize that
> a minus *is* unary and not binary.  Looking into gram.y, I find that the
> CREATE SEQUENCE productions handle collapsing unary minus all by
> themselves!  So in that particular case, there is still no need for the
> lexer to do it.  AFAICT in a quick look through gram.y, there are no
> places where unary minus is recognized that gram.y won't try to collapse
> it.
> In short, I still think that the whole mess ought to come out of the
> lexer...

My recollection of the whole point is that, as you mention, *you can't
identify a unary minus in the lexer*. So the minus sign is kept
distinct, to be reconciled later as either a unary minus *or* an
operator *or* whatever. The problem was that before, things like (-2)
and (- 2) were handled differently just because the spacing was
different.

Anyway, I'll look at the defacto changes; perhaps they are just fine
but I'm worried that we've reverted the behavior...
                     - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Postgres' lexer

From
Leon
Date:
Oh, there isn't really much change. The minus is passed standalone
as it always was. The only thing is that currently in numbers with unary
minus it gets coerced not in lexer, but in parser in doNegate().
I wonder why that hasn't been done earlier - especially considering
that doNegate() existed long before my, hmm, fiddling.

To tell the truth, there is some ambiguity in various operators.
That ambiguity is stemming from Postgres's type-extension system.
Consider this: SELECT 3+-2; What would you expect from that?  I
personally would expect the result of 1. But it produces an error,
because '+-' is treated as some user-defined operator, which is 
not true. Such innocent expression as SELECT --2 puts Postgres in
daze - it (psql) waits for 'completion' of such query (it treats
symbols '--' as comment start :-)  See? There are more pitfalls
beside minus coercing :-) 

This all was done to clean up the code and 'straighten' the parser.
There was a performance breaker, officially called AFAIR 'variable
trailing context'.

Thomas Lockhart wrote:

> 
> There are several cases which need to be handled (I'm doing this from
> memory, so may miss a few):
> 
> o Positive and negative numbers as standalone arguments, with and
> without spaces between the "-" and the digits.
> 
> o Positive and negative numbers as first arguments to binary
> operators, with and without spaces at all possible places.
> 
> o Positive and negative numbers as second arguments to binary
> operators, or as arguments to unary operators.
> 
> o Positive and negative numbers in the presence of operators
> containing minus signs, including a trailing minus sign where
> possible.
> 
> 'taint easy to do it completely right. Perhaps trying to do less in
> the scanner is the right thing to do, but istm that it may put
> restrictions on the grammar which are not currently there. Not a good
> trade for a longer query length...


-- 
Leon.



Re: [HACKERS] Postgres' lexer

From
Thomas Lockhart
Date:
> Oh, there isn't really much change. The minus is passed standalone
> as it always was. The only thing is that currently in numbers with unary
> minus it gets coerced not in lexer, but in parser in doNegate().
> I wonder why that hasn't been done earlier - especially considering
> that doNegate() existed long before my, hmm, fiddling.

For good reasons. See below :(

> To tell the truth, there is some ambiguity in various operators.
> That ambiguity is stemming from Postgres's type-extension system.

There is the possibility for ambiguity. But it is our responsibility
to minimize that ambiguity and to make a predictable system, in the
presence of Postgres' unique and valuable features such as type
extension. imho this is more important than, for example, allowing
infinite-length queries.

> Consider this: SELECT 3+-2; What would you expect from that?  I
> personally would expect the result of 1. But it produces an error,
> because '+-' is treated as some user-defined operator, which is
> not true.

That is part of my concern here. The current behavior is what you say
you would expect! Your patches change that behavor!!

postgres=> select 3+-2;
?column?
--------      1
(1 row)

> Such innocent expression as SELECT --2 puts Postgres in
> daze - it (psql) waits for 'completion' of such query (it treats
> symbols '--' as comment start :-)  See? There are more pitfalls
> beside minus coercing :-)

There are some well-defined features of SQL92 which we try hard to
support; the comment convention is one of them. That is a special case
and shouldn't confuse the issue here; we'll need a different test case
to make the point for me...

> This all was done to clean up the code and 'straighten' the parser.
> There was a performance breaker, officially called AFAIR 'variable
> trailing context'.

Sorry, what is the performance penalty for that feature, and how do we
measure that against breakage of expected, predictable behavior?
Please quantify.

So far, I'm not a fan of the proposed change; we're giving up behavior
that impacts Postgres' unique type extension features for an
arbitrarily large query buffer (as opposed to a generously large query
buffer, which can be accomplished just by changing the fixed size).
                    - Thomas

-- 
Thomas Lockhart                lockhart@alumni.caltech.edu
South Pasadena, California


Re: [HACKERS] Postgres' lexer

From
Tom Lane
Date:
Thomas Lockhart <lockhart@alumni.caltech.edu> writes:
>> Consider this: SELECT 3+-2; What would you expect from that?  I
>> personally would expect the result of 1. But it produces an error,
>> because '+-' is treated as some user-defined operator, which is
>> not true.

> That is part of my concern here. The current behavior is what you say
> you would expect! Your patches change that behavor!!

> postgres=> select 3+-2;
> ?column?
> --------
>        1
> (1 row)

OTOH, with current sources:

regression=> select 3+- 2;
ERROR:  Unable to identify an operator '+-' for types 'int4' and 'int4'       You will have to retype this query using
anexplicit cast
 

regression=> select f1+-f1 from int4_tbl;
ERROR:  Unable to identify an operator '+-' for types 'int4' and 'int4'       You will have to retype this query using
anexplicit cast
 

To my mind, without spaces this construction *is* ambiguous, and frankly
I'd have expected the second interpretation ('+-' is a single operator
name).  Almost every computer language in the world uses "greedy"
tokenization where the next token is the longest series of characters
that can validly be a token.  I don't regard the above behavior as
predictable, natural, nor obvious.  In fact, I'd say it's a bug that
"3+-2" and "3+-x" are not lexed in the same way.

However, aside from arguing about whether the current behavior is good
or bad, these examples seem to indicate that it doesn't take an infinite
amount of lookahead to reproduce the behavior.  It looks to me like we
could preserve the current behavior by parsing a '-' as a separate token
if it *immediately* precedes a digit, and otherwise allowing it to be
folded into the preceding operator.  That could presumably be done
without VLTC.
        regards, tom lane


Re: [HACKERS] Postgres' lexer

From
Leon
Date:
Tom Lane wrote:


> To my mind, without spaces this construction *is* ambiguous, and frankly
> I'd have expected the second interpretation ('+-' is a single operator
> name).  Almost every computer language in the world uses "greedy"
> tokenization where the next token is the longest series of characters
> that can validly be a token.  I don't regard the above behavior as
> predictable, natural, nor obvious.  In fact, I'd say it's a bug that
> "3+-2" and "3+-x" are not lexed in the same way.
>

Completely agree with that. This differentiating behavior looks like a bug.

> However, aside from arguing about whether the current behavior is good
> or bad, these examples seem to indicate that it doesn't take an infinite
> amount of lookahead to reproduce the behavior.  It looks to me like we
> could preserve the current behavior by parsing a '-' as a separate token
> if it *immediately* precedes a digit, and otherwise allowing it to be
> folded into the preceding operator.  That could presumably be done
> without VLTC.

Ok. If we *have* to preserve old weird behavior, here is the patch.
It is to be applied over all my other patches. Though if I were to
decide whether to restore old behavior, I wouldn't do it. Because it
is inconsistency in grammar, i.e. a bug.

--
Leon.
Attachment