Thread: Replacing plpgsql's lexer

Replacing plpgsql's lexer

From: Tom Lane
Whichever way the current discussion about Unicode literals turns out,
it's clear that plpgsql is not up to speed on matching the core lexer's
behavior --- it's wrong anyway with respect to
standard_conforming_strings.

I had earlier speculated semi-facetiously about ripping out the plpgsql
lexer altogether, but the more I think about it the less silly the idea
looks.  Suppose that we change the core lexer so that the keyword lookup
table it's supposed to use is passed to scanner_init() rather than being
hard-wired in.  Then make plpgsql call the core lexer using its own
keyword table.  Everything else would match core lexical behavior
automatically.  The special behavior that we do want, such as being
able to construct a string representing a desired subrange of the input,
could all be handled in plpgsql-specific wrapper code.

I've just spent a few minutes looking for trouble spots in this theory,
and so far the only real ugliness I can see is that plpgsql treats
":=" and ".." as single tokens whereas the core would parse them as two
tokens.  We could hack the core lexer to have an additional switch that
controls that.  Or maybe just make it always return them as single
tokens --- AFAICS, neither combination is legal in core SQL anyway,
so this would only result in a small change in the exact syntax error
you get if you write such a thing in core SQL.

Another trouble spot is the #option syntax, but that could be handled
by a special-purpose prescan, or just dropped altogether; it's not like
we've ever used that for anything but debugging.

It looks like this might take about a day's worth of work (IOW two
or three days real time) to get done.

Normally I'd only consider doing such a thing during development phase,
but since we're staring at at least one and maybe two bugs that are
going to be hard to fix in any materially-less-intrusive way, I'm
thinking about doing it now.  Theoretically this change shouldn't break
any working code, so letting it hit the streets in 8.4beta2 doesn't seem
totally unreasonable.

Comments, objections, better ideas?
        regards, tom lane


Re: Replacing plpgsql's lexer

From: Robert Haas
On Tue, Apr 14, 2009 at 4:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Whichever way the current discussion about Unicode literals turns out,
> it's clear that plpgsql is not up to speed on matching the core lexer's
> behavior --- it's wrong anyway with respect to
> standard_conforming_strings.
>
> I had earlier speculated semi-facetiously about ripping out the plpgsql
> lexer altogether, but the more I think about it the less silly the idea
> looks.  Suppose that we change the core lexer so that the keyword lookup
> table it's supposed to use is passed to scanner_init() rather than being
> hard-wired in.  Then make plpgsql call the core lexer using its own
> keyword table.  Everything else would match core lexical behavior
> automatically.  The special behavior that we do want, such as being
> able to construct a string representing a desired subrange of the input,
> could all be handled in plpgsql-specific wrapper code.
>
> I've just spent a few minutes looking for trouble spots in this theory,
> and so far the only real ugliness I can see is that plpgsql treats
> ":=" and ".." as single tokens whereas the core would parse them as two
> tokens.  We could hack the core lexer to have an additional switch that
> controls that.  Or maybe just make it always return them as single
> tokens --- AFAICS, neither combination is legal in core SQL anyway,
> so this would only result in a small change in the exact syntax error
> you get if you write such a thing in core SQL.
>
> Another trouble spot is the #option syntax, but that could be handled
> by a special-purpose prescan, or just dropped altogether; it's not like
> we've ever used that for anything but debugging.
>
> It looks like this might take about a day's worth of work (IOW two
> or three days real time) to get done.
>
> Normally I'd only consider doing such a thing during development phase,
> but since we're staring at at least one and maybe two bugs that are
> going to be hard to fix in any materially-less-intrusive way, I'm
> thinking about doing it now.  Theoretically this change shouldn't break
> any working code, so letting it hit the streets in 8.4beta2 doesn't seem
> totally unreasonable.
>
> Comments, objections, better ideas?

All this sounds good.  As for how to handle := and .., I think making
them lex the same way in PL/pgsql and core SQL would be a good thing.

...Robert


Re: Replacing plpgsql's lexer

From: Simon Riggs
On Tue, 2009-04-14 at 16:37 -0400, Tom Lane wrote:
> Comments, objections, better ideas?

Please, if you do this, make it optional.

Potentially changing the behaviour of thousands of functions just to fix
a rare bug will not endear us to our users. The bug may be something
that people are relying on in some subtle way, ugly as that sounds.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Replacing plpgsql's lexer

From: Andrew Dunstan

Robert Haas wrote:
> All this sounds good.  As for how to handle := and .., I think making
> them lex the same way in PL/pgsql and core SQL would be a good thing.
>

They don't have any significance in core SQL. What would we do with the 
lexeme?

ISTR we've used some hacks in the past to split lexemes into pieces, and 
presumably we'd have to do something similar with these.

The only thing that makes me nervous about this is that we're very close 
to Beta. OTOH, this is one area the regression suite should give a 
fairly good workout to.

cheers

andrew


Re: Replacing plpgsql's lexer

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> Robert Haas wrote:
>> All this sounds good.  As for how to handle := and .., I think making
>> them lex the same way in PL/pgsql and core SQL would be a good thing.

> They don't have any significance in core SQL. What would we do with the 
> lexeme?

It would just fail --- the core grammar will have no production that can
accept it.  Right offhand I think the only difference is that instead of

regression=# select a .. 2;
ERROR:  syntax error at or near "."
LINE 1: select a .. 2;
                  ^

you'd see 

regression=# select a .. 2;
ERROR:  syntax error at or near ".."
LINE 1: select a .. 2;
                 ^

ie it acts like one token not two in the error message.

This solution would become problematic if the core grammar ever had a
meaning for := or .. that required treating them as two tokens (eg,
the grammar allowed this sequence with whitespace between).  I don't
think that's very likely though; and if it did happen we could fix it
with the aforementioned control switch.

> The only thing that makes me nervous about this is that we're very close 
> to Beta. OTOH, this is one area the regression suite should give a 
> fairly good workout to.

Yeah, I'd rather have done it before beta1, but too late.  The other
solution still entails massive changes to the plpgsql lexer, so it
doesn't really look like much lower risk.  AFAICS the practical
alternatives are a reimplementation in beta2, or no fix until 8.5.
        regards, tom lane


Re: Replacing plpgsql's lexer

From: Tom Lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Tue, 2009-04-14 at 16:37 -0400, Tom Lane wrote:
>> Comments, objections, better ideas?

> Please, if you do this, make it optional.

I don't think making the plpgsql lexer pluggable is realistic.

> Potentially changing the behaviour of thousands of functions just to fix
> a rare bug will not endear us to our users. The bug may be something
> that people are relying on in some subtle way, ugly as that sounds.

That's why I don't want to change it in a minor release.  In a major
release, however, it's fair game.
        regards, tom lane


Re: Replacing plpgsql's lexer

From: Bruce Momjian
Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Tue, 2009-04-14 at 16:37 -0400, Tom Lane wrote:
> >> Comments, objections, better ideas?
> 
> > Please, if you do this, make it optional.
> 
> I don't think making the plpgsql lexer pluggable is realistic.
> 
> > Potentially changing the behaviour of thousands of functions just to fix
> > a rare bug will not endear us to our users. The bug may be something
> > that people are relying on in some subtle way, ugly as that sounds.
> 
> That's why I don't want to change it in a minor release.  In a major
> release, however, it's fair game.

Well, this bug has existed long before 8.4 so we could just leave it for
8.5, and it is not like we have had tons of complaints;  the only
complaint I saw was one from March, 2008.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


Re: Replacing plpgsql's lexer

From: Tom Lane
Bruce Momjian <bruce@momjian.us> writes:
> Well, this bug has existed long before 8.4 so we could just leave it for
> 8.5, and it is not like we have had tons of complaints;  the only
> complaint I saw was one from March, 2008.

We had one last week, which is what prompted me to start looking at the
plpgsql lexer situation in the first place.  Also, if the unicode
literal situation doesn't change, that's going to be problematic as
well.
        regards, tom lane


Re: Replacing plpgsql's lexer

From: Stephen Frost
* Bruce Momjian (bruce@momjian.us) wrote:
> Well, this bug has existed long before 8.4 so we could just leave it for
> 8.5, and it is not like we have had tons of complaints;  the only
> complaint I saw was one from March, 2008.

I think it's a good thing to do in general.  I'm also concerned about
if it will impact the plpgsql functions we have (which are pretty
numerous..) but in the end I'd rather have it fixed in 8.4 than possibly
delayed indefinitely (after all, if it's in 8.4, why fix it for 8.5?).
Thanks,

Stephen

Re: Replacing plpgsql's lexer

From: Simon Riggs
On Tue, 2009-04-14 at 18:29 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Tue, 2009-04-14 at 16:37 -0400, Tom Lane wrote:
> >> Comments, objections, better ideas?
> 
> > Please, if you do this, make it optional.
> 
> I don't think making the plpgsql lexer pluggable is realistic.

Doesn't sound easy, no. (I didn't suggest pluggable, just optional).

> > Potentially changing the behaviour of thousands of functions just to fix
> > a rare bug will not endear us to our users. The bug may be something
> > that people are relying on in some subtle way, ugly as that sounds.
> 
> That's why I don't want to change it in a minor release.  In a major
> release, however, it's fair game.

If we want to make easy upgrades a reality, this is the type of issue we
must consider. Not much point having perfect binary upgrades if all your
functions start behaving differently after upgrade and then you discover
there isn't a binary downgrade path...

Rather than come up with specific solutions, let me just ask the
question: Is there a workaround for people caught by these changes?
Let's plan that alongside the change itself, so we have a reserve
'chute.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Replacing plpgsql's lexer

From: Simon Riggs
On Wed, 2009-04-15 at 11:36 +0300, Heikki Linnakangas wrote:

> Extract the source of the offending plpgsql function using e.g
> pg_dump, modify it so that it works again, and restore the function.
> There's your workaround.

Forcing manual re-editing of an unknown number of lines of code is not a
useful workaround; it's just the default.

How do you know which is the offending function? If we force a full
application retest we put in place a significant barrier to upgrade.
That isn't useful for us as developers, nor is it useful for users.

I'm happy for Tom to make changes now; delay has no advantage. If we
have to add some lines of code/complexity to help our users then it
seems a reasonable thing to do rather than keeping our code pure and
demanding everybody else make changes.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Replacing plpgsql's lexer

From: Heikki Linnakangas
Simon Riggs wrote:
> On Wed, 2009-04-15 at 11:36 +0300, Heikki Linnakangas wrote:
> 
>> Extract the source of the offending plpgsql function using e.g
>> pg_dump, modify it so that it works again, and restore the function.
>> There's your workaround.
> 
> Forcing manual re-editing of an unknown number of lines of code is not a
> useful workaround; it's just the default.
> 
> How do you know which is the offending function? If we force a full
> application retest we put in place a significant barrier to upgrade.
> That isn't useful for us as developers, nor is it useful for users.

If I understood correctly, the proposed change is not supposed to have 
any user-visible effects. It doesn't force a full application retest any 
more than any of the other changes that have gone into 8.4.

We're talking about what we'll do or tell the users to do if we missed 
something. By definition we don't know what we've missed, so I don't 
think we can come up with a more specific solution than that.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Replacing plpgsql's lexer

From: Greg Stark
On Wed, Apr 15, 2009 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> How do you know which is the offending function? If we force a full
> application retest we put in place a significant barrier to upgrade.
> That isn't useful for us as developers, nor is it useful for users.

This is a fundamental conflict, not one that has a single simple answer.

However this seems like a strange place to pick your battle. Something
as low-level as the lexer is very costly to provide multiple
interfaces to. It's basically impossible short of simply providing two
different plpgsql languages -- something which won't scale at all if
we have to do it every time we make a syntax change to the language.

I'm actually concerned that we've become *too* conservative. Pretty
much any change that doesn't have Tom's full support and credibility
standing behind it ends up being criticized on the basis that we don't
know precisely what effects it will have in every possible scenario.

One of free software's big advantages over commercial software is that
it moves so much more quickly. Oracle, AIX, Windows, etc are burdened
by hundreds of layers of backwards-compatibility which take up a huge
portion of their development and Q/A effort. The reason Linux,
Postgres, and others have been able to come up so quickly and overtake
them is partly because we don't worry about such things.

As far as I'm concerned commercial support companies can put effort
into developing backwards-compatibility modules which add no long-term
value for their paying customers who need it today while the free
software developers can keep improving the software for new users.

-- 
greg


Re: Replacing plpgsql's lexer

From: Simon Riggs
On Wed, 2009-04-15 at 10:56 +0100, Greg Stark wrote:
> On Wed, Apr 15, 2009 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >
> > How do you know which is the offending function? If we force a full
> > application retest we put in place a significant barrier to upgrade.
> > That isn't useful for us as developers, nor is it useful for users.
> 
> This is a fundamental conflict, not one that has a single simple answer.
> 
> However this seems like a strange place to pick your battle. 

I think you are right that you perceive a fundamental conflict and most
things I say become battles. That is not my choice and I will withdraw
from further discussion. My point has been made clearly and has not been
made to cause conflict. I've better things to do with my time than that,
though it's a shame you think that of me.

> As far as I'm concerned commercial support companies can put effort
> into developing backwards-compatibility modules which add no long-term
> value for their paying customers who need it today while the free
> software developers can keep improving the software for new users.

We will all doubtless make money from difficult upgrades, though that is
not my choice, nor that of my customers.

-- 
Simon Riggs           www.2ndQuadrant.com
PostgreSQL Training, Services and Support



Re: Replacing plpgsql's lexer

From: Greg Stark
On Wed, Apr 15, 2009 at 11:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> This is a fundamental conflict, not one that has a single simple answer.
>>
>> However this seems like a strange place to pick your battle.
>
> I think you are right that you perceive a fundamental conflict and most
> things I say become battles. That is not my choice and I will withdraw
> from further discussion. My point has been made clearly and has not been
> made to cause conflict. I've better things to do with my time than that,
> though it's a shame you think that of me.

Uhm, I didn't intend this as criticism at all, except inasmuch as the
judgement about whether the plpgsql lexer was a good choice of place
to make this stand. The use of "battle" was only because of the idiom
"pick your battle".

I think we are in general too conservative about making changes and
you are concerned that we're not giving enough thought to the upgrade
pain and should be more conservative. We can talk about general
policies but ultimately we'll have to debate each change on its
merits.

In this case it would help if we described the specific kinds of code
affected and the consequences for users. I'm not sure we're all on the
same page.

I think changing the lexer to match the SQL lexer will only affect
string constants, and only if standard_conforming_strings is enabled,
and only those instances which are handled internally to plpgsql and
not passed to the SQL engine. So the fix will pretty much always be
local to the behaviour change. It's possible for an escaped string to
need an E'' and for the backslash to migrate to other parts of the
code before triggering a bug (or possibly even get stored in the
database and cause a problem in other parts of the application). But
it should still be pretty straightforward to find the original source
of the string and also pretty easy to recognize string constants
throughout the source code.

As it currently stands a programmer sometimes has to use E'\x' and
sometimes has to use '\x' depending on whether the plpgsql is lexing
the string or is passing it to the SQL engine unlexed. It's not
obvious which parts get handled in which way to a user since some
constructs are handled as SQL which don't appear to be SQL and vice
versa -- at least it's not obvious to me even having read the source
in the past.

If I understand things correctly, I think the change improves the
language for future users by far more than it imposes maintenance
costs on existing users, especially considering that anyone depending
on '\x' strings with standard_conforming_strings enabled is probably
getting it wrong in some places without realizing it anyway.

-- 
greg


Re: Replacing plpgsql's lexer

From: Heikki Linnakangas
Simon Riggs wrote:
> On Tue, 2009-04-14 at 18:29 -0400, Tom Lane wrote:
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>> Potentially changing the behaviour of thousands of functions just to fix
>>> a rare bug will not endear us to our users. The bug may be something
>>> that people are relying on in some subtle way, ugly as that sounds.
>> That's why I don't want to change it in a minor release.  In a major
>> release, however, it's fair game.
> 
> If we want to make easy upgrades a reality, this is the type of issue we
> must consider. Not much point having perfect binary upgrades if all your
> functions start behaving differently after upgrade and then you discover
> there isn't a binary downgrade path...
> 
> Rather than come up with specific solutions, let me just ask the
> question: Is there a workaround for people caught by these changes?
> Let's plan that alongside the change itself, so we have a reserve
> 'chute.

Extract the source of the offending plpgsql function using e.g pg_dump, 
modify it so that it works again, and restore the function. There's your 
workaround.

I haven't been following what the issues we have with the current 
plpgsql lexer are, so I'm not sure what I think of the plan as a whole. 
Sharing the main lexer seems like a good idea, but it also seems like 
it's way too late in the release cycle for such changes. But then again, 
if we have issues that need to be fixed anyway, it might well be the 
best way to fix them.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: Replacing plpgsql's lexer

From: Robert Haas
On Wed, Apr 15, 2009 at 5:56 AM, Greg Stark <stark@enterprisedb.com> wrote:
> On Wed, Apr 15, 2009 at 10:12 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> How do you know which is the offending function? If we force a full
>> application retest we put in place a significant barrier to upgrade.
>> That isn't useful for us as developers, nor is it useful for users.
>
> This is a fundamental conflict, not one that has a single simple answer.
>
> However this seems like a strange place to pick your battle. Something
> as low-level as the lexer is very costly to provide multiple
> interfaces to. It's basically impossible short of simply providing two
> different plpgsql languages -- something which won't scale at all if
> we have to do it every time we make a syntax change to the language.

Completely agreed.

> I'm actually concerned that we've become *too* conservative. Pretty
> much any change that doesn't have Tom's full support and credibility
> standing behind it ends up being criticized on the basis that we don't
> know precisely what effects it will have in every possible scenario.

I think we've become too conservative in some areas and not
conservative enough in others.  In particular, we're not very
conservative AT ALL about changes to the on-disk format - which is
like unto a bullet through the head for in-place upgrade.  And we
sometimes make behavior changes that have potentially catastrophic
user consequences (like that one to TRUNCATE... which one, you ask?
ah, well, you'd better not use TRUNCATE in 8.4 until you RTFM then),
but then we'll have an argument about whether it's OK to make some
change where it's difficult to imagine the user impact being all that
severe, like:

- this one
- removing the special case for %% in the log_filename
- forward-compatible, backward-compatible improvements to CREATE OR REPLACE VIEW
- lots of others

So it seems that there is no consistent heuristic (other than, as you
say, Tom's approval or lack thereof) applied to these changes.

...Robert


Re: Replacing plpgsql's lexer

From: Andrew Dunstan

Simon Riggs wrote:
> How do you know which is the offending function? If we force a full
> application retest we put in place a significant barrier to upgrade.
> That isn't useful for us as developers, nor is it useful for users.
>

We support back branches for a long time for a reason. Nobody in their 
right mind should upgrade to a new version without first 
extensively testing (and if necessary adjusting) their application on 
it. This is true regardless of this issue and will be true for every 
release.

cheers

andrew


Re: Replacing plpgsql's lexer

From: Tom Lane
Andrew Dunstan <andrew@dunslane.net> writes:
> We support back branches for a long time for a reason.

I think that's really the bottom line here.  If we insist on new major
releases always being bug-compatible with prior releases, our ability to
improve the software will go to zero.  The solution we have opted for
instead is to support back branches for a long time and to avoid making
that type of change in a back branch.
        regards, tom lane


Re: Replacing plpgsql's lexer

From: "Joshua D. Drake"
On Wed, 2009-04-15 at 11:45 -0400, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > We support back branches for a long time for a reason.
> 
> I think that's really the bottom line here.  If we insist on new major
> releases always being bug-compatible with prior releases, our ability to
> improve the software will go to zero.  The solution we have opted for
> instead is to support back branches for a long time and to avoid making
> that type of change in a back branch.
> 

+1

Joshua D. Drake


>             regards, tom lane
> 
-- 
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
Consulting, Development, Support, Training  503-667-4564
http://www.commandprompt.com/ - The PostgreSQL Company, serving since 1997



Re: Replacing plpgsql's lexer

From: Tom Lane
I wrote:
> I had earlier speculated semi-facetiously about ripping out the plpgsql
> lexer altogether, but the more I think about it the less silly the idea
> looks.

This little project crashed and burned upon remembering that plpgsql
invokes raw_parser() to syntax-check each chunk of SQL that it extracts.
If plpgsql were using the main lexer, that would mean recursive use of
the lexer --- and it's not re-entrant.

We could think about making the main lexer re-entrant, but that would
involve a bump in the minimum required flex version (I don't know when
%option reentrant got added, but it's not in 2.5.4).  And it definitely
doesn't seem like something to be doing during beta.

Getting rid of the requirement for recursion doesn't look palatable
either.  We don't want to delay the syntax check for reasons explained
in check_sql_expr()'s comments; and that's not the only source of
recursion anyway --- plpgsql_parse_datatype does it too, and there could
be other places.

So I think we are down to a choice of doing nothing for 8.4, or teaching
the existing plpgsql lexer about standard_conforming_strings.  Assuming
the current proposal for U& literals holds up, it should not be
necessary for plpgsql to know about those explicitly as long as it obeys
standard_conforming_strings, so this might not be too horrid a project.
I'll take a look at that next.
        regards, tom lane


Re: Replacing plpgsql's lexer

From: David Fetter
On Fri, Apr 17, 2009 at 12:12:12PM -0400, Tom Lane wrote:
> I wrote:
> > I had earlier speculated semi-facetiously about ripping out the
> > plpgsql lexer altogether, but the more I think about it the less
> > silly the idea looks.
> 
> This little project crashed and burned upon remembering that plpgsql
> invokes raw_parser() to syntax-check each chunk of SQL that it
> extracts.  If plpgsql were using the main lexer, that would mean
> recursive use of the lexer --- and it's not re-entrant.
> 
> We could think about making the main lexer re-entrant, but that
> would involve a bump in the minimum required flex version (I don't
> know when %option reentrant got added, but it's not in 2.5.4).  And
> it definitely doesn't seem like something to be doing during beta.
> 
> Getting rid of the requirement for recursion doesn't look palatable
> either.  We don't want to delay the syntax check for reasons
> explained in check_sql_expr()'s comments; and that's not the only
> source of recursion anyway --- plpgsql_parse_datatype does it too,
> and there could be other places.
> 
> So I think we are down to a choice of doing nothing for 8.4, or
> teaching the existing plpgsql lexer about
> standard_conforming_strings.  Assuming the current proposal for U&
> literals holds up, it should not be necessary for plpgsql to know
> about those explicitly as long as it obeys
> standard_conforming_strings, so this might not be too horrid a
> project.  I'll take a look at that next.

Speaking of standard_conforming_strings, I know it's late, but if we
make it a requirement now, a lot of problems just go away.  Yes, it's
inconvenient, but we're making lots of big changes, so one more
shouldn't halt adoption.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Replacing plpgsql's lexer

From: Alvaro Herrera
David Fetter wrote:

> Speaking of standard_conforming_strings, I know it's late, but if we
> make it a requirement now, a lot of problems just go away.  Yes, it's
> inconvenient, but we're making lots of big changes, so one more
> shouldn't halt adoption.

16 days too late ...

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Replacing plpgsql's lexer

From: David Fetter
On Fri, Apr 17, 2009 at 01:01:39PM -0400, Alvaro Herrera wrote:
> David Fetter wrote:
> 
> > Speaking of standard_conforming_strings, I know it's late, but if
> > we make it a requirement now, a lot of problems just go away.
> > Yes, it's inconvenient, but we're making lots of big changes, so
> > one more shouldn't halt adoption.
> 
> 16 days too late ...

Depends.  If we've found show-stopping bugs, as it appears we may have
done, in not requiring standard_conforming_strings, we can't just
pull a MySQL and ship anyhow.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Replacing plpgsql's lexer

From: Andrew Dunstan

Alvaro Herrera wrote:
> David Fetter wrote:
>
>   
>> Speaking of standard_conforming_strings, I know it's late, but if we
>> make it a requirement now, a lot of problems just go away.  Yes, it's
>> inconvenient, but we're making lots of big changes, so one more
>> shouldn't halt adoption.
>>     
>
> 16 days too late ...
>
>   

More like several months plus 16 days, I'd say.

cheers

andrew


Re: Replacing plpgsql's lexer

From
Tom Lane
Date:
David Fetter <david@fetter.org> writes:
> Depends.  If we've found show-stopping bugs, as it appears we may have
> done, in not requiring standard_conforming_strings, we can't just
> pull a MySQL and ship anyhow.

It's hardly a "show stopping bug", considering it's been there since
standard_conforming_strings was invented.
        regards, tom lane


Re: Replacing plpgsql's lexer

From
David Fetter
Date:
On Fri, Apr 17, 2009 at 02:03:45PM -0400, Andrew Dunstan wrote:
> Alvaro Herrera wrote:
>> David Fetter wrote:
>>   
>>> Speaking of standard_conforming_strings, I know it's late, but if
>>> we make it a requirement now, a lot of problems just go away.
>>> Yes, it's inconvenient, but we're making lots of big changes, so
>>> one more shouldn't halt adoption.
>>
>> 16 days too late ...
>
> More like several months plus 16 days, I'd say.

If our string problems turn out to be fixable short of making this
option mandatory, possibly.  It would depend on how fragile that
collection of fixes turns out to be.

We haven't shipped with glaring known-broken stuff in userland before.
Are we going to start now?

Cheers,
David.


Re: Replacing plpgsql's lexer

From
David Fetter
Date:
On Fri, Apr 17, 2009 at 02:07:55PM -0400, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > Depends.  If we've found show-stopping bugs, as it appears we may
> > have done, in not requiring standard_conforming_strings, we can't
> > just pull a MySQL and ship anyhow.
> 
> It's hardly a "show stopping bug", considering it's been there since
> standard_conforming_strings was invented.

A known sploit would be a show-stopper.

Cheers,
David.


Re: Replacing plpgsql's lexer

From
Tom Lane
Date:
David Fetter <david@fetter.org> writes:
>> It's hardly a "show stopping bug", considering it's been there since
>> standard_conforming_strings was invented.

> A known sploit would be a show-stopper.

We're not turning on standard_conforming_strings right now.  We are
*certainly* not forcing it on without recourse in existing branches,
which would be the logical conclusion if we considered this to be a
security issue fixable in no other way.  Would you mind not wasting the
list's time with this?
        regards, tom lane


Re: Replacing plpgsql's lexer

From
Stephen Cook
Date:
++


David Fetter wrote:
> On Fri, Apr 17, 2009 at 01:01:39PM -0400, Alvaro Herrera wrote:
>> David Fetter wrote:
>>
>>> Speaking of standard_conforming_strings, I know it's late, but if
>>> we make it a requirement now, a lot of problems just go away.
>>> Yes, it's inconvenient, but we're making lots of big changes, so
>>> one more shouldn't halt adoption.
>> 16 days too late ...
> 
> Depends.  If we've found show-stopping bugs, as it appears we may have
> done, in not requiring standard_conforming_strings, we can't just
> pull a MySQL and ship anyhow.
> 
> Cheers,
> David.


Re: Replacing plpgsql's lexer

From
Tom Lane
Date:
I wrote:
> So I think we are down to a choice of doing nothing for 8.4, or teaching
> the existing plpgsql lexer about standard_conforming_strings.  Assuming
> the current proposal for U& literals holds up, it should not be
> necessary for plpgsql to know about those explicitly as long as it obeys
> standard_conforming_strings, so this might not be too horrid a project.
> I'll take a look at that next.

The attached proposed patch rips out plpgsql's handling of comments and
string literals, and puts in scanner rules that are extracted from the
core lexer (but simplified in a few places where we don't need all the
complexity).  The net user-visible effects should be:

* Both regular and E'' literals should now be parsed exactly the same
as the core does it.

* Nested slash-star comments are now handled properly.

* Warnings and errors associated with string parsing should now match
the core, which means they might vary a bit from previous plpgsql
behavior.
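To make these effects concrete, here is a toy C sketch (the helper names are invented; this is a simplification, not the actual flex rules from the patch): decode_literal mimics the difference between the core lexer's <xq> rules (with standard_conforming_strings on, a backslash inside '...' is just a backslash) and its <xe> rules (escapes processed, as for E'...' or with the GUC off), and comment_balanced keeps a nesting-depth counter the way the patch's new xcdepth variable does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Toy decoder for the body of a single-quoted literal (the text between
 * the outer quotes).  It handles only '' doubling and a few backslash
 * escapes; the real work is done by the core lexer's <xq>/<xe> rules.
 */
static char *
decode_literal(const char *body, int process_escapes)
{
    char       *result = malloc(strlen(body) + 1);
    char       *out = result;
    const char *p;

    for (p = body; *p; p++)
    {
        if (p[0] == '\'' && p[1] == '\'')
        {
            *out++ = '\'';      /* '' collapses to one quote in both modes */
            p++;
        }
        else if (process_escapes && p[0] == '\\' && p[1] != '\0')
        {
            p++;                /* <xe>-style: interpret the escape */
            switch (*p)
            {
                case 'n': *out++ = '\n'; break;
                case 't': *out++ = '\t'; break;
                default:  *out++ = *p;   break;     /* \\, \' and friends */
            }
        }
        else
            *out++ = *p;        /* <xq>-style: backslash is just data */
    }
    *out = '\0';
    return result;
}

/*
 * Check that slash-star comments nest and close properly, tracking a
 * depth counter as the patch's xcdepth does.
 */
static int
comment_balanced(const char *s)
{
    int         depth = 0;
    const char *p;

    for (p = s; *p; p++)
    {
        if (p[0] == '/' && p[1] == '*')
        {
            depth++;
            p++;
        }
        else if (p[0] == '*' && p[1] == '/')
        {
            if (depth == 0)
                return 0;       /* stray close */
            depth--;
            p++;
        }
    }
    return depth == 0;
}
```

In this sketch, decoding the four characters a\nb yields a literal backslash when escapes are off but a newline when they are on; and a comment like /* a /* b */ c */ only scans correctly when depth is tracked, whereas the old plpgsql rules returned to INITIAL state at the first star-slash.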

I need to test this a bit more, and it could probably do with adding
a few regression test cases, but I think it's code-complete.

Comments?

            regards, tom lane

PS: in passing I got rid of the scanner_functype/scanner_typereported
kluge, which might once have had some purpose but now just clutters
both the scanner and the grammar.  That removal is left over from my
failed attempt at ripping out the scanner altogether; since it
simplifies the code, I kept it.

Index: src/pl/plpgsql/src/gram.y
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plpgsql/src/gram.y,v
retrieving revision 1.121
diff -c -r1.121 gram.y
*** src/pl/plpgsql/src/gram.y    18 Feb 2009 11:33:04 -0000    1.121
--- src/pl/plpgsql/src/gram.y    19 Apr 2009 16:28:04 -0000
***************
*** 62,67 ****
--- 62,69 ----
                                             int lineno);
  static    void             check_sql_expr(const char *stmt);
  static    void             plpgsql_sql_error_callback(void *arg);
+ static    char            *parse_string_token(const char *token);
+ static    void             plpgsql_string_error_callback(void *arg);
  static    char            *check_label(const char *yytxt);
  static    void             check_labels(const char *start_label,
                                        const char *end_label);
***************
*** 228,235 ****
          /*
           * Other tokens
           */
- %token    T_FUNCTION
- %token    T_TRIGGER
  %token    T_STRING
  %token    T_NUMBER
  %token    T_SCALAR                /* a VAR, RECFIELD, or TRIGARG */
--- 230,235 ----
***************
*** 244,256 ****

  %%

! pl_function        : T_FUNCTION comp_optsect pl_block opt_semi
                      {
!                         yylval.program = (PLpgSQL_stmt_block *)$3;
!                     }
!                 | T_TRIGGER comp_optsect pl_block opt_semi
!                     {
!                         yylval.program = (PLpgSQL_stmt_block *)$3;
                      }
                  ;

--- 244,252 ----

  %%

! pl_function        : comp_optsect pl_block opt_semi
                      {
!                         yylval.program = (PLpgSQL_stmt_block *) $2;
                      }
                  ;

***************
*** 1403,1409 ****
                              if (tok == T_STRING)
                              {
                                  /* old style message and parameters */
!                                 new->message = plpgsql_get_string_value();
                                  /*
                                   * We expect either a semi-colon, which
                                   * indicates no parameters, or a comma that
--- 1399,1405 ----
                              if (tok == T_STRING)
                              {
                                  /* old style message and parameters */
!                                 new->message = parse_string_token(yytext);
                                  /*
                                   * We expect either a semi-colon, which
                                   * indicates no parameters, or a comma that
***************
*** 1435,1441 ****

                                      if (yylex() != T_STRING)
                                          yyerror("syntax error");
!                                     sqlstatestr = plpgsql_get_string_value();

                                      if (strlen(sqlstatestr) != 5)
                                          yyerror("invalid SQLSTATE code");
--- 1431,1437 ----

                                      if (yylex() != T_STRING)
                                          yyerror("syntax error");
!                                     sqlstatestr = parse_string_token(yytext);

                                      if (strlen(sqlstatestr) != 5)
                                          yyerror("invalid SQLSTATE code");
***************
*** 1778,1784 ****
                              /* next token should be a string literal */
                              if (yylex() != T_STRING)
                                  yyerror("syntax error");
!                             sqlstatestr = plpgsql_get_string_value();

                              if (strlen(sqlstatestr) != 5)
                                  yyerror("invalid SQLSTATE code");
--- 1774,1780 ----
                              /* next token should be a string literal */
                              if (yylex() != T_STRING)
                                  yyerror("syntax error");
!                             sqlstatestr = parse_string_token(yytext);

                              if (strlen(sqlstatestr) != 5)
                                  yyerror("invalid SQLSTATE code");
***************
*** 2738,2743 ****
--- 2734,2782 ----
      errposition(0);
  }

+ /*
+  * Convert a string-literal token to the represented string value.
+  *
+  * To do this, we need to invoke the core lexer.  To avoid confusion between
+  * the core bison/flex definitions and our own, the actual invocation is in
+  * pl_funcs.c.  Here we are only concerned with setting up the right errcontext
+  * state, which is handled the same as in check_sql_expr().
+  */
+ static char *
+ parse_string_token(const char *token)
+ {
+     char       *result;
+     ErrorContextCallback  syntax_errcontext;
+     ErrorContextCallback *previous_errcontext;
+
+     /* See comments in check_sql_expr() */
+     Assert(error_context_stack->callback == plpgsql_compile_error_callback);
+
+     previous_errcontext = error_context_stack;
+     syntax_errcontext.callback = plpgsql_string_error_callback;
+     syntax_errcontext.arg = (char *) token;
+     syntax_errcontext.previous = error_context_stack->previous;
+     error_context_stack = &syntax_errcontext;
+
+     result = plpgsql_parse_string_token(token);
+
+     /* Restore former ereport callback */
+     error_context_stack = previous_errcontext;
+
+     return result;
+ }
+
+ static void
+ plpgsql_string_error_callback(void *arg)
+ {
+     Assert(plpgsql_error_funcname);
+
+     errcontext("String literal in PL/PgSQL function \"%s\" near line %d",
+                plpgsql_error_funcname, plpgsql_error_lineno);
+     /* representing the string literal as internalquery seems overkill */
+     errposition(0);
+ }
+
  static char *
  check_label(const char *yytxt)
  {
Index: src/pl/plpgsql/src/pl_comp.c
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plpgsql/src/pl_comp.c,v
retrieving revision 1.134
diff -c -r1.134 pl_comp.c
*** src/pl/plpgsql/src/pl_comp.c    18 Feb 2009 11:33:04 -0000    1.134
--- src/pl/plpgsql/src/pl_comp.c    19 Apr 2009 16:28:04 -0000
***************
*** 261,267 ****
             bool forValidator)
  {
      Form_pg_proc procStruct = (Form_pg_proc) GETSTRUCT(procTup);
!     int            functype = CALLED_AS_TRIGGER(fcinfo) ? T_TRIGGER : T_FUNCTION;
      Datum        prosrcdatum;
      bool        isnull;
      char       *proc_source;
--- 261,267 ----
             bool forValidator)
  {
      Form_pg_proc procStruct = (Form_pg_proc) GETSTRUCT(procTup);
!     bool        is_trigger = CALLED_AS_TRIGGER(fcinfo);
      Datum        prosrcdatum;
      bool        isnull;
      char       *proc_source;
***************
*** 293,299 ****
      if (isnull)
          elog(ERROR, "null prosrc");
      proc_source = TextDatumGetCString(prosrcdatum);
!     plpgsql_scanner_init(proc_source, functype);

      plpgsql_error_funcname = pstrdup(NameStr(procStruct->proname));
      plpgsql_error_lineno = 0;
--- 293,299 ----
      if (isnull)
          elog(ERROR, "null prosrc");
      proc_source = TextDatumGetCString(prosrcdatum);
!     plpgsql_scanner_init(proc_source);

      plpgsql_error_funcname = pstrdup(NameStr(procStruct->proname));
      plpgsql_error_lineno = 0;
***************
*** 359,371 ****
      function->fn_oid = fcinfo->flinfo->fn_oid;
      function->fn_xmin = HeapTupleHeaderGetXmin(procTup->t_data);
      function->fn_tid = procTup->t_self;
!     function->fn_functype = functype;
      function->fn_cxt = func_cxt;
      function->out_param_varno = -1;        /* set up for no OUT param */

!     switch (functype)
      {
!         case T_FUNCTION:

              /*
               * Fetch info about the procedure's parameters. Allocations aren't
--- 359,371 ----
      function->fn_oid = fcinfo->flinfo->fn_oid;
      function->fn_xmin = HeapTupleHeaderGetXmin(procTup->t_data);
      function->fn_tid = procTup->t_self;
!     function->fn_is_trigger = is_trigger;
      function->fn_cxt = func_cxt;
      function->out_param_varno = -1;        /* set up for no OUT param */

!     switch (is_trigger)
      {
!         case false:

              /*
               * Fetch info about the procedure's parameters. Allocations aren't
***************
*** 564,570 ****
              ReleaseSysCache(typeTup);
              break;

!         case T_TRIGGER:
              /* Trigger procedure's return type is unknown yet */
              function->fn_rettype = InvalidOid;
              function->fn_retbyval = false;
--- 564,570 ----
              ReleaseSysCache(typeTup);
              break;

!         case true:
              /* Trigger procedure's return type is unknown yet */
              function->fn_rettype = InvalidOid;
              function->fn_retbyval = false;
***************
*** 645,651 ****
              break;

          default:
!             elog(ERROR, "unrecognized function typecode: %u", functype);
              break;
      }

--- 645,651 ----
              break;

          default:
!             elog(ERROR, "unrecognized function typecode: %d", (int) is_trigger);
              break;
      }

***************
*** 790,796 ****
       * Recognize tg_argv when compiling triggers
       * (XXX this sucks, it should be a regular variable in the namestack)
       */
!     if (plpgsql_curr_compile->fn_functype == T_TRIGGER)
      {
          if (strcmp(cp[0], "tg_argv") == 0)
          {
--- 790,796 ----
       * Recognize tg_argv when compiling triggers
       * (XXX this sucks, it should be a regular variable in the namestack)
       */
!     if (plpgsql_curr_compile->fn_is_trigger)
      {
          if (strcmp(cp[0], "tg_argv") == 0)
          {
Index: src/pl/plpgsql/src/pl_funcs.c
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plpgsql/src/pl_funcs.c,v
retrieving revision 1.76
diff -c -r1.76 pl_funcs.c
*** src/pl/plpgsql/src/pl_funcs.c    18 Feb 2009 11:33:04 -0000    1.76
--- src/pl/plpgsql/src/pl_funcs.c    19 Apr 2009 16:28:04 -0000
***************
*** 17,22 ****
--- 17,24 ----

  #include <ctype.h>

+ #include "parser/gramparse.h"
+ #include "parser/gram.h"
  #include "parser/scansup.h"


***************
*** 460,465 ****
--- 462,502 ----


  /*
+  * plpgsql_parse_string_token - get the value represented by a string literal
+  *
+  * We do not make plpgsql's lexer produce the represented value, because
+  * in many cases we don't need it.  Instead this function is invoked when
+  * we do need it.  The input is the T_STRING token as identified by the lexer.
+  *
+  * The result is a palloc'd string.
+  *
+  * Note: this is called only from plpgsql's gram.y, but we can't just put it
+  * there because including parser/gram.h there would cause confusion.
+  */
+ char *
+ plpgsql_parse_string_token(const char *token)
+ {
+     int        ctoken;
+
+     /*
+      * We use the core lexer to do the dirty work.  Aside from getting the
+      * right results for escape sequences and so on, this helps us produce
+      * appropriate warnings for escape_string_warning etc.
+      */
+     scanner_init(token);
+
+     ctoken = base_yylex();
+
+     if (ctoken != SCONST)
+         elog(ERROR, "unexpected result from base lexer: %d", ctoken);
+
+     scanner_finish();
+
+     return base_yylval.str;
+ }
+
+
+ /*
   * Statement type as a string, for use in error messages etc.
   */
  const char *
Index: src/pl/plpgsql/src/plpgsql.h
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plpgsql/src/plpgsql.h,v
retrieving revision 1.110
diff -c -r1.110 plpgsql.h
*** src/pl/plpgsql/src/plpgsql.h    9 Apr 2009 02:57:53 -0000    1.110
--- src/pl/plpgsql/src/plpgsql.h    19 Apr 2009 16:28:04 -0000
***************
*** 650,656 ****
      Oid            fn_oid;
      TransactionId fn_xmin;
      ItemPointerData fn_tid;
!     int            fn_functype;
      PLpgSQL_func_hashkey *fn_hashkey;    /* back-link to hashtable key */
      MemoryContext fn_cxt;

--- 650,656 ----
      Oid            fn_oid;
      TransactionId fn_xmin;
      ItemPointerData fn_tid;
!     bool        fn_is_trigger;
      PLpgSQL_func_hashkey *fn_hashkey;    /* back-link to hashtable key */
      MemoryContext fn_cxt;

***************
*** 880,885 ****
--- 880,886 ----
   * ----------
   */
  extern void plpgsql_convert_ident(const char *s, char **output, int numidents);
+ extern char *plpgsql_parse_string_token(const char *token);
  extern const char *plpgsql_stmt_typename(PLpgSQL_stmt *stmt);
  extern void plpgsql_dumptree(PLpgSQL_function *func);

***************
*** 894,901 ****
  extern void plpgsql_push_back_token(int token);
  extern void plpgsql_yyerror(const char *message);
  extern int    plpgsql_scanner_lineno(void);
! extern void plpgsql_scanner_init(const char *str, int functype);
  extern void plpgsql_scanner_finish(void);
- extern char *plpgsql_get_string_value(void);

  #endif   /* PLPGSQL_H */
--- 895,901 ----
  extern void plpgsql_push_back_token(int token);
  extern void plpgsql_yyerror(const char *message);
  extern int    plpgsql_scanner_lineno(void);
! extern void plpgsql_scanner_init(const char *str);
  extern void plpgsql_scanner_finish(void);

  #endif   /* PLPGSQL_H */
Index: src/pl/plpgsql/src/scan.l
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plpgsql/src/scan.l,v
retrieving revision 1.67
diff -c -r1.67 scan.l
*** src/pl/plpgsql/src/scan.l    18 Feb 2009 11:33:04 -0000    1.67
--- src/pl/plpgsql/src/scan.l    19 Apr 2009 16:28:04 -0000
***************
*** 19,45 ****
  #include "mb/pg_wchar.h"


- /* No reason to constrain amount of data slurped */
- #define YY_READ_BUF_SIZE 16777216
-
  /* Avoid exit() on fatal scanner errors (a bit ugly -- see yy_fatal_error) */
  #undef fprintf
  #define fprintf(file, fmt, msg)  ereport(ERROR, (errmsg_internal("%s", msg)))

  /* Handles to the buffer that the lexer uses internally */
  static YY_BUFFER_STATE scanbufhandle;
  static char *scanbuf;

  static const char *scanstr;        /* original input string */

- static int    scanner_functype;
- static bool    scanner_typereported;
  static int    pushback_token;
  static bool have_pushback_token;
  static const char *cur_line_start;
  static int    cur_line_num;
  static char    *dolqstart;      /* current $foo$ quote start string */
! static int    dolqlen;            /* signal to plpgsql_get_string_value */

  bool plpgsql_SpaceScanned = false;
  %}
--- 19,49 ----
  #include "mb/pg_wchar.h"


  /* Avoid exit() on fatal scanner errors (a bit ugly -- see yy_fatal_error) */
  #undef fprintf
  #define fprintf(file, fmt, msg)  ereport(ERROR, (errmsg_internal("%s", msg)))

+ /*
+  * When we parse a token that requires multiple lexer rules to process,
+  * remember the token's starting position this way.
+  */
+ #define SAVE_TOKEN_START()  \
+     ( start_lineno = plpgsql_scanner_lineno(), start_charpos = yytext )
+
  /* Handles to the buffer that the lexer uses internally */
  static YY_BUFFER_STATE scanbufhandle;
  static char *scanbuf;

  static const char *scanstr;        /* original input string */

  static int    pushback_token;
  static bool have_pushback_token;
  static const char *cur_line_start;
  static int    cur_line_num;
+ static int        xcdepth = 0;    /* depth of nesting in slash-star comments */
  static char    *dolqstart;      /* current $foo$ quote start string */
!
! extern bool        standard_conforming_strings;

  bool plpgsql_SpaceScanned = false;
  %}
***************
*** 54,84 ****

  %option case-insensitive


! %x    IN_STRING
! %x    IN_COMMENT
! %x    IN_DOLLARQUOTE

  digit            [0-9]
  ident_start        [A-Za-z\200-\377_]
  ident_cont        [A-Za-z\200-\377_0-9\$]

  quoted_ident    (\"[^\"]*\")+

  identifier        ({ident_start}{ident_cont}*|{quoted_ident})

  param            \${digit}+

- space            [ \t\n\r\f]
-
- /* $foo$ style quotes ("dollar quoting")
-  * copied straight from the backend SQL parser
-  */
- dolq_start        [A-Za-z\200-\377_]
- dolq_cont        [A-Za-z\200-\377_0-9]
- dolqdelim        \$({dolq_start}{dolq_cont}*)?\$
- dolqinside        [^$]+
-
  %%
      /* ----------
       * Local variables in scanner to remember where
--- 58,130 ----

  %option case-insensitive

+ /*
+  * Exclusive states are a subset of the core lexer's:
+  *  <xc> extended C-style comments
+  *  <xq> standard quoted strings
+  *  <xe> extended quoted strings (support backslash escape sequences)
+  *  <xdolq> $foo$ quoted strings
+  */
+
+ %x xc
+ %x xe
+ %x xq
+ %x xdolq
+
+ /*
+  * Definitions --- these generally must match the core lexer, but in some
+  * cases we can simplify, since we only care about identifying the token
+  * boundaries and not about deriving the represented value.  Also, we
+  * aren't trying to lex multicharacter operators so their interactions
+  * with comments go away.
+  */
+
+ space            [ \t\n\r\f]
+ horiz_space        [ \t\f]
+ newline            [\n\r]
+ non_newline        [^\n\r]
+
+ comment            ("--"{non_newline}*)
+
+ whitespace        ({space}+|{comment})
+ special_whitespace        ({space}+|{comment}{newline})
+ horiz_whitespace        ({horiz_space}|{comment})
+ whitespace_with_newline    ({horiz_whitespace}*{newline}{special_whitespace}*)
+
+ quote            '
+ quotestop        {quote}{whitespace}*
+ quotecontinue    {quote}{whitespace_with_newline}{quote}
+ quotefail        {quote}{whitespace}*"-"
+
+ xestart            [eE]{quote}
+ xeinside        [^\\']+
+ xeescape        [\\].
+
+ xqstart            {quote}
+ xqdouble        {quote}{quote}
+ xqinside        [^']+
+
+ dolq_start        [A-Za-z\200-\377_]
+ dolq_cont        [A-Za-z\200-\377_0-9]
+ dolqdelim        \$({dolq_start}{dolq_cont}*)?\$
+ dolqfailed        \${dolq_start}{dolq_cont}*
+ dolqinside        [^$]+

! xcstart            \/\*
! xcstop            \*+\/
! xcinside        [^*/]+

  digit            [0-9]
  ident_start        [A-Za-z\200-\377_]
  ident_cont        [A-Za-z\200-\377_0-9\$]

+ /* This is a simpler treatment of quoted identifiers than the core uses */
  quoted_ident    (\"[^\"]*\")+

  identifier        ({ident_start}{ident_cont}*|{quoted_ident})

  param            \${digit}+

  %%
      /* ----------
       * Local variables in scanner to remember where
***************
*** 96,112 ****
      plpgsql_SpaceScanned = false;

      /* ----------
-      * On the first call to a new source report the
-      * function's type (T_FUNCTION or T_TRIGGER)
-      * ----------
-      */
-     if (!scanner_typereported)
-     {
-         scanner_typereported = true;
-         return scanner_functype;
-     }
-
-     /* ----------
       * The keyword rules
       * ----------
       */
--- 142,147 ----
***************
*** 225,343 ****

  {digit}+        { return T_NUMBER;            }

! \".                {
!                 plpgsql_error_lineno = plpgsql_scanner_lineno();
!                 ereport(ERROR,
!                         (errcode(ERRCODE_DATATYPE_MISMATCH),
!                          errmsg("unterminated quoted identifier")));
!             }
!
!     /* ----------
!      * Ignore whitespaces but remember this happened
!      * ----------
!      */
! {space}+        { plpgsql_SpaceScanned = true;        }

      /* ----------
!      * Eat up comments
       * ----------
       */
! --[^\r\n]*        ;
!
! \/\*            { start_lineno = plpgsql_scanner_lineno();
!               BEGIN(IN_COMMENT);
!             }
! <IN_COMMENT>\*\/    { BEGIN(INITIAL); plpgsql_SpaceScanned = true; }
! <IN_COMMENT>\n        ;
! <IN_COMMENT>.        ;
! <IN_COMMENT><<EOF>>    {
!                 plpgsql_error_lineno = start_lineno;
!                 ereport(ERROR,
!                         (errcode(ERRCODE_DATATYPE_MISMATCH),
!                          errmsg("unterminated /* comment")));
!             }

      /* ----------
!      * Collect anything inside of ''s and return one STRING token
!      *
!      * Hacking yytext/yyleng here lets us avoid using yymore(), which is
!      * a win for performance.  It's safe because we know the underlying
!      * input buffer is not changing.
       * ----------
       */
! '            {
!               start_lineno = plpgsql_scanner_lineno();
!               start_charpos = yytext;
!               BEGIN(IN_STRING);
!             }
! [eE]'        {
!               /* for now, treat the same as a regular literal */
!               start_lineno = plpgsql_scanner_lineno();
!               start_charpos = yytext;
!               BEGIN(IN_STRING);
!             }
! <IN_STRING>\\.        { }
! <IN_STRING>\\        { /* can only happen with \ at EOF */ }
! <IN_STRING>''        { }
! <IN_STRING>'        {
!               /* tell plpgsql_get_string_value it's not a dollar quote */
!               dolqlen = 0;
!               /* adjust yytext/yyleng to describe whole string token */
!               yyleng += (yytext - start_charpos);
!               yytext = start_charpos;
!               BEGIN(INITIAL);
!               return T_STRING;
!             }
! <IN_STRING>[^'\\]+    { }
! <IN_STRING><<EOF>>    {
!                 plpgsql_error_lineno = start_lineno;
!                 ereport(ERROR,
!                         (errcode(ERRCODE_DATATYPE_MISMATCH),
!                          errmsg("unterminated quoted string")));
!             }
!
! {dolqdelim}        {
!               start_lineno = plpgsql_scanner_lineno();
!               start_charpos = yytext;
!               dolqstart = pstrdup(yytext);
!               BEGIN(IN_DOLLARQUOTE);
!             }
! <IN_DOLLARQUOTE>{dolqdelim} {
!               if (strcmp(yytext, dolqstart) == 0)
!               {
!                     pfree(dolqstart);
!                     /* tell plpgsql_get_string_value it is a dollar quote */
!                     dolqlen = yyleng;
                      /* adjust yytext/yyleng to describe whole string token */
                      yyleng += (yytext - start_charpos);
                      yytext = start_charpos;
-                     BEGIN(INITIAL);
                      return T_STRING;
!               }
!               else
!               {
!                     /*
!                      * When we fail to match $...$ to dolqstart, transfer
!                      * the $... part to the output, but put back the final
!                      * $ for rescanning.  Consider $delim$...$junk$delim$
!                      */
!                     yyless(yyleng-1);
!               }
!             }
! <IN_DOLLARQUOTE>{dolqinside} { }
! <IN_DOLLARQUOTE>.    { /* needed for $ inside the quoted text */ }
! <IN_DOLLARQUOTE><<EOF>>    {
!                 plpgsql_error_lineno = start_lineno;
!                 ereport(ERROR,
!                         (errcode(ERRCODE_DATATYPE_MISMATCH),
!                          errmsg("unterminated dollar-quoted string")));
!             }

      /* ----------
       * Any unmatched character is returned as is
       * ----------
       */
! .            { return yytext[0];            }

  %%

--- 260,393 ----

  {digit}+        { return T_NUMBER;            }

! \".                { yyerror("unterminated quoted identifier"); }

      /* ----------
!      * Ignore whitespace (including comments) but remember this happened
       * ----------
       */
! {whitespace}    { plpgsql_SpaceScanned = true; }

      /* ----------
!      * Comment and literal handling is mostly copied from the core lexer
       * ----------
       */
! {xcstart}        {
!                     /* Set location in case of syntax error in comment */
!                     SAVE_TOKEN_START();
!                     xcdepth = 0;
!                     BEGIN(xc);
!                     plpgsql_SpaceScanned = true;
!                 }
!
! <xc>{xcstart}    {
!                     xcdepth++;
!                 }
!
! <xc>{xcstop}    {
!                     if (xcdepth <= 0)
!                         BEGIN(INITIAL);
!                     else
!                         xcdepth--;
!                 }
!
! <xc>{xcinside}    {
!                     /* ignore */
!                 }
!
! <xc>\/+            {
!                     /* ignore */
!                 }
!
! <xc>\*+            {
!                     /* ignore */
!                 }
!
! <xc><<EOF>>        { yyerror("unterminated /* comment"); }
!
! {xqstart}        {
!                     SAVE_TOKEN_START();
!                     if (standard_conforming_strings)
!                         BEGIN(xq);
!                     else
!                         BEGIN(xe);
!                 }
! {xestart}        {
!                     SAVE_TOKEN_START();
!                     BEGIN(xe);
!                 }
! <xq,xe>{quotestop}    |
! <xq,xe>{quotefail} {
!                     yyless(1);
!                     BEGIN(INITIAL);
                      /* adjust yytext/yyleng to describe whole string token */
                      yyleng += (yytext - start_charpos);
                      yytext = start_charpos;
                      return T_STRING;
!                 }
! <xq,xe>{xqdouble} {
!                 }
! <xq>{xqinside}  {
!                 }
! <xe>{xeinside}  {
!                 }
! <xe>{xeescape}  {
!                 }
! <xq,xe>{quotecontinue} {
!                     /* ignore */
!                 }
! <xe>.            {
!                     /* This is only needed for \ just before EOF */
!                 }
! <xq,xe><<EOF>>        { yyerror("unterminated quoted string"); }
!
! {dolqdelim}        {
!                     SAVE_TOKEN_START();
!                     dolqstart = pstrdup(yytext);
!                     BEGIN(xdolq);
!                 }
! {dolqfailed}    {
!                     /* throw back all but the initial "$" */
!                     yyless(1);
!                     /* and treat it as {other} */
!                     return yytext[0];
!                 }
! <xdolq>{dolqdelim} {
!                     if (strcmp(yytext, dolqstart) == 0)
!                     {
!                         pfree(dolqstart);
!                         BEGIN(INITIAL);
!                         /* adjust yytext/yyleng to describe whole string */
!                         yyleng += (yytext - start_charpos);
!                         yytext = start_charpos;
!                         return T_STRING;
!                     }
!                     else
!                     {
!                         /*
!                          * When we fail to match $...$ to dolqstart, transfer
!                          * the $... part to the output, but put back the final
!                          * $ for rescanning.  Consider $delim$...$junk$delim$
!                          */
!                         yyless(yyleng-1);
!                     }
!                 }
! <xdolq>{dolqinside} {
!                 }
! <xdolq>{dolqfailed} {
!                 }
! <xdolq>.        {
!                     /* This is only needed for $ inside the quoted text */
!                 }
! <xdolq><<EOF>>    { yyerror("unterminated dollar-quoted string"); }

      /* ----------
       * Any unmatched character is returned as is
       * ----------
       */
! .                {
!                     return yytext[0];
!                 }

  %%

***************
*** 437,443 ****
   * to cite in error messages.
   */
  void
! plpgsql_scanner_init(const char *str, int functype)
  {
      Size    slen;

--- 487,493 ----
   * to cite in error messages.
   */
  void
! plpgsql_scanner_init(const char *str)
  {
      Size    slen;

***************
*** 460,468 ****
      /* Other setup */
      scanstr = str;

-     scanner_functype = functype;
-     scanner_typereported = false;
-
      have_pushback_token = false;

      cur_line_start = scanbuf;
--- 510,515 ----
***************
*** 493,569 ****
      yy_delete_buffer(scanbufhandle);
      pfree(scanbuf);
  }
-
- /*
-  * Called after a T_STRING token is read to get the string literal's value
-  * as a palloc'd string.  (We make this a separate call because in many
-  * scenarios there's no need to get the decoded value.)
-  *
-  * Note: we expect the literal to be the most recently lexed token.  This
-  * would not work well if we supported multiple-token pushback or if
-  * plpgsql_yylex() wanted to read ahead beyond a T_STRING token.
-  */
- char *
- plpgsql_get_string_value(void)
- {
-     char       *result;
-     const char *cp;
-     int            len;
-
-     if (dolqlen > 0)
-     {
-         /* Token is a $foo$...$foo$ string */
-         len = yyleng - 2 * dolqlen;
-         Assert(len >= 0);
-         result = (char *) palloc(len + 1);
-         memcpy(result, yytext + dolqlen, len);
-         result[len] = '\0';
-     }
-     else if (*yytext == 'E' || *yytext == 'e')
-     {
-         /* Token is an E'...' string */
-         result = (char *) palloc(yyleng + 1);    /* more than enough room */
-         len = 0;
-         for (cp = yytext + 2; *cp; cp++)
-         {
-             if (*cp == '\'')
-             {
-                 if (cp[1] == '\'')
-                     result[len++] = *cp++;
-                 /* else it must be string end quote */
-             }
-             else if (*cp == '\\')
-             {
-                 if (cp[1] != '\0')    /* just a paranoid check */
-                     result[len++] = *(++cp);
-             }
-             else
-                 result[len++] = *cp;
-         }
-         result[len] = '\0';
-     }
-     else
-     {
-         /* Token is a '...' string */
-         result = (char *) palloc(yyleng + 1);    /* more than enough room */
-         len = 0;
-         for (cp = yytext + 1; *cp; cp++)
-         {
-             if (*cp == '\'')
-             {
-                 if (cp[1] == '\'')
-                     result[len++] = *cp++;
-                 /* else it must be string end quote */
-             }
-             else if (*cp == '\\')
-             {
-                 if (cp[1] != '\0')    /* just a paranoid check */
-                     result[len++] = *(++cp);
-             }
-             else
-                 result[len++] = *cp;
-         }
-         result[len] = '\0';
-     }
-     return result;
- }
--- 540,542 ----

Re: Replacing plpgsql's lexer

From
Greg Stark
Date:
On Sun, Apr 19, 2009 at 5:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> * Nested slash-star comments are now handled properly.

as opposed to?

-- 
greg


Re: Replacing plpgsql's lexer

From
Grzegorz Jaskiewicz
Date:
On 19 Apr 2009, at 17:42, Tom Lane wrote:
>
> The attached proposed patch rips out plpgsql's handling of comments and
> string literals, and puts in scanner rules that are extracted from the
> core lexer (but simplified in a few places where we don't need all the
> complexity).  The net user-visible effects should be:
>
>
> Comments?

Will it also mean that queries are going to be analyzed more deeply?
I.e., AFAIK I am currently able to create a plpgsql function that tries
to run a query accessing a nonexistent table or columns.
Also, if I rename a column/table/relation now, views etc. get updated --
but not plpgsql functions. Will that change with your patch?



Re: Replacing plpgsql's lexer

From
Greg Stark
Date:
On Sun, Apr 19, 2009 at 6:24 PM, Grzegorz Jaskiewicz
<gj@pointblue.com.pl> wrote:
> Will it also mean that queries are going to be analyzed more deeply?
> I.e., AFAIK I am currently able to create a plpgsql function that tries
> to run a query accessing a nonexistent table or columns.
> Also, if I rename a column/table/relation now, views etc. get updated --
> but not plpgsql functions. Will that change with your patch?


The scanner isn't responsible for anything like this. It just breaks
the input up into tokens. So it's responsible for determining where
strings start and end and where table names start and end, but it
doesn't actually look up the names anywhere -- that's up to the parser
and later steps. So no.

-- 
greg


Re: Replacing plpgsql's lexer

From
Grzegorz Jaskiewicz
Date:
On 19 Apr 2009, at 18:28, Greg Stark wrote:

> On Sun, Apr 19, 2009 at 6:24 PM, Grzegorz Jaskiewicz
> <gj@pointblue.com.pl> wrote:
>> Will it also mean that queries are going to be analyzed more deeply?
>> I.e., AFAIK I am currently able to create a plpgsql function that
>> tries to run a query accessing a nonexistent table or columns.
>> Also, if I rename a column/table/relation now, views etc. get
>> updated -- but not plpgsql functions. Will that change with your patch?
>
>
> The scanner isn't responsible for anything like this. It just breaks
> the input up into tokens. So it's responsible for determining where
> strings start and end and where table names start and end, but it
> doesn't actually look up the names anywhere -- that's up to the parser
> and later steps. So no.
OK, thanks.
To be honest, that would be a great feature.



Re: Replacing plpgsql's lexer

From
Tom Lane
Date:
Greg Stark <stark@enterprisedb.com> writes:
> On Sun, Apr 19, 2009 at 5:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * Nested slash-star comments are now handled properly.

> as opposed to?

They nest, as required by the SQL spec and implemented by our core
lexer.  plpgsql didn't use to get this right.
        regards, tom lane