Thread: Doing psql's lexing with flex

Doing psql's lexing with flex

From
Tom Lane
Date:
I got interested enough in the psql-with-flex problem to go off and
solve it.  Attached is a working patch, which I'm now debating whether
to apply.  Comments solicited...

The patch removes about 200 lines of very spaghetti-ish code in
mainloop.c.  However, it adds an 875-line flex source file, which
might be thought a bad tradeoff :-(.  One bright spot is that about
half of that total is a direct copy of the main backend lexer, so
it's not really as much new, separately maintainable code as all that.
Also, Andrew Dunstan's patch for supporting dollar-quoting would add
about 100 lines to mainloop.c, versus only a dozen or so lines in the
flex implementation.  Once that's taken into account I don't think there
is a lot of difference in effective SLOC to maintain.  I'm also of the
opinion that the new C code in psqlscan.l is much more straightforward
than the code removed from mainloop.c, though having just written it,
I'm no doubt pretty biased.

Bruce was asking about speed.  On normal-size queries I cannot measure
any difference at all.  For testing purposes I made up a file containing
a single 750K query (just a "SELECT big-honking-string-constant", with
the string literal broken into lines of 75 bytes).  The client-side
(psql) CPU time to run this file looks about like this on my machine:

              PGCLIENTENCODING
            UNICODE        SJIS

CVS tip            1.57        1.82

flex implementation    0.93        2.33

The flex implementation is consistently faster than CVS tip when dealing
with backend-compatible encodings (such as UTF-8).  It's consistently
slower when it has to deal with a non-backend-safe encoding such as SJIS
or Big5.  But for real-world cases the differential is down in the noise
either way.

I'm inclined to apply this but I can see where a person not comfortable
with flex might feel differently.  Opinions?

            regards, tom lane


Attachment

Re: Doing psql's lexing with flex

From
Bruce Momjian
Date:
Tom Lane wrote:
> I got interested enough in the psql-with-flex problem to go off and
> solve it.  Attached is a working patch, which I'm now debating whether
> to apply.  Comments solicited...
>
> The patch removes about 200 lines of very spaghetti-ish code in
> mainloop.c.  However, it adds an 875-line flex source file, which
> might be thought a bad tradeoff :-(.  One bright spot is that about
> half of that total is a direct copy of the main backend lexer, so
> it's not really as much new, separately maintainable code as all that.
> Also, Andrew Dunstan's patch for supporting dollar-quoting would add
> about 100 lines to mainloop.c, versus only a dozen or so lines in the
> flex implementation.  Once that's taken into account I don't think there
> is a lot of difference in effective SLOC to maintain.  I'm also of the
> opinion that the new C code in psqlscan.l is much more straightforward
> than the code removed from mainloop.c, though having just written it,
> I'm no doubt pretty biased.
>
> Bruce was asking about speed.  On normal-size queries I cannot measure
> any difference at all.  For testing purposes I made up a file containing
> a single 750K query (just a "SELECT big-honking-string-constant", with
> the string literal broken into lines of 75 bytes).  The client-side
> (psql) CPU time to run this file looks about like this on my machine:
>
>               PGCLIENTENCODING
>             UNICODE        SJIS
>
> CVS tip            1.57        1.82
>
> flex implementation    0.93        2.33

Looks good.  I withdraw my performance concerns.  Thanks for testing.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Doing psql's lexing with flex

From
Neil Conway
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> I'm inclined to apply this but I can see where a person not comfortable
> with flex might feel differently.  Opinions?

Looks good to me. The psql cleanup is nice, and ISTM that much of the
flex code is comments or flex boilerplate anyway, so the actual LOC
increase isn't too bad.

-Neil


Re: Doing psql's lexing with flex

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I'm inclined to apply this but I can see where a person not comfortable
>> with flex might feel differently.  Opinions?

> Looks good to me. The psql cleanup is nice, and ISTM that much of the
> flex code is comments or flex boilerplate anyway, so the actual LOC
> increase isn't too bad.

I just realized that something I had planned to look at later is
actually an essential step if this is to be applied.  I had wanted
to see if lexing of backslash command tokens could be folded into the
flex lexer as well, but thought I could leave it for later.

The case where this doesn't preserve the previous behavior is if the
expansion of a psql variable includes both a backslash command and some
part of its arguments.  As patched, HandleSlashCommand will only look to
the original input string and not see the contents of any psql variables
(because those are only seen inside the lexer, I'm not physically
inserting them into the line buffer as the old code did).  Imagine
for example
    \set foo '\c mydb'
    blah :foo bar
The existing code would interpret this as
    blah \c mydb bar
but my patch as it stands would behave very strangely --- the \c command
would see bar as its argument and then 'mydb' would be regurgitated
after HandleSlashCommand finishes.

The existing code is pretty darn inconsistent on this point anyway when
you look at it carefully.  AFAICS a variable reference is handled quite
differently in the arguments of a backslash command than elsewhere; it
can't supply general text but only a single option value.  The same
variable expanded before control reaches HandleSlashCommand is going to
look indistinguishable from hand-entered text.  If we did translate it
into a flex thing the behavior would be different in corner cases like
this.

Peter, any comments on this?  Is the current behavior actually
intentional, or did it just fall out of the implementation techniques?

            regards, tom lane

Re: Doing psql's lexing with flex

From
Andrew Dunstan
Date:

Tom Lane wrote:

>I got interested enough in the psql-with-flex problem to go off and
>solve it.  Attached is a working patch, which I'm now debating whether
>to apply.  Comments solicited...
>
>The patch removes about 200 lines of very spaghetti-ish code in
>mainloop.c.  However, it adds an 875-line flex source file, which
>might be thought a bad tradeoff :-(.  One bright spot is that about
>half of that total is a direct copy of the main backend lexer, so
>it's not really as much new, separately maintainable code as all that.
>Also, Andrew Dunstan's patch for supporting dollar-quoting would add
>about 100 lines to mainloop.c, versus only a dozen or so lines in the
>flex implementation.
>

Am I missing something, or is dollar quoting not in this patch? Perhaps
you have a followup patch?

>Once that's taken into account I don't think there
>is a lot of difference in effective SLOC to maintain.  I'm also of the
>opinion that the new C code in psqlscan.l is much more straightforward
>than the code removed from mainloop.c, though having just written it,
>I'm no doubt pretty biased.
>
>Bruce was asking about speed.  On normal-size queries I cannot measure
>any difference at all.  For testing purposes I made up a file containing
>a single 750K query (just a "SELECT big-honking-string-constant", with
>the string literal broken into lines of 75 bytes).  The client-side
>(psql) CPU time to run this file looks about like this on my machine:
>
>              PGCLIENTENCODING
>            UNICODE        SJIS
>
>CVS tip            1.57        1.82
>
>flex implementation    0.93        2.33
>
>The flex implementation is consistently faster than CVS tip when dealing
>with backend-compatible encodings (such as UTF-8).  It's consistently
>slower when it has to deal with a non-backend-safe encoding such as SJIS
>or Big5.  But for real-world cases the differential is down in the noise
>either way.
>
>I'm inclined to apply this but I can see where a person not comfortable
>with flex might feel differently.  Opinions?
>
>

In principle this is the Right Thing (tm).

We use flex elsewhere, of course, so the fact that it is a flex scanner
doesn't seem to me to be any sort of strong argument.

If we are going to do this we need to get it in ASAP, so it gets plenty
of beating on.

cheers

andrew


Re: Doing psql's lexing with flex

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Am I missing something, or is dollar quoting not in this patch?

It is not.  If we go this way, then we'd add essentially identical
flex patches to backend and psql to implement dollar quoting (plus
perhaps a few more lines in psql to support signaling dollar quoting
in the prompt, but that's pretty trivial).  My intent with the given
patch was just to replicate existing functionality with flex.

            regards, tom lane

Re: Doing psql's lexing with flex

From
Peter Eisentraut
Date:
Tom Lane wrote:
> I got interested enough in the psql-with-flex problem to go off and
> solve it.  Attached is a working patch, which I'm now debating
> whether to apply.  Comments solicited...

That should teach me a lesson not to leave random comments lying around
in the source code.  Years later someone might take them seriously. :)


Re: Doing psql's lexing with flex

From
Andrew Dunstan
Date:
Tom Lane wrote:

>Andrew Dunstan <andrew@dunslane.net> writes:
>
>
>>Am I missing something, or is dollar quoting not in this patch?
>>
>>
>
>It is not.  If we go this way, then we'd add essentially identical
>flex patches to backend and psql to implement dollar quoting (plus
>perhaps a few more lines in psql to support signaling dollar quoting
>in the prompt, but that's pretty trivial).  My intent with the given
>patch was just to replicate existing functionality with flex.
>
>
>
>

OK.

I vote for applying this ASAP, then, so we can get on with the dollar
quoting work.

I think we should put big warning signs on both the backend's and psql's
.l files saying they must be kept in sync.

cheers

andrew



Re: Doing psql's lexing with flex

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I think we should put big warning signs on both the backend's and psql's
> .l files saying they must be kept in sync.

Right.  I was planning to do that, and also to make some trivial
reformatting in the backend's scan.l to make it easier to compare the
two files by diff.  For example, consistently write

    pattern        {
              action;
            }

not

    pattern        { action; }

so that the diffs are confined to the action lines.

            regards, tom lane

Re: Doing psql's lexing with flex

From
Peter Eisentraut
Date:
Tom Lane wrote:
> Imagine for example
>     \set foo '\c mydb'
>     blah :foo bar
> The existing code would interpret this as
>     blah \c mydb bar
> but my patch as it stands would behave very strangely --- the \c
> command would see bar as its argument and then 'mydb' would be
> regurgitated after HandleSlashCommand finishes.

Feel free (or encouraged) to change the old behavior.  (Of course, the
new behavior needs to adjustments, too.)  Generally, I think a variable
should just stand for a data value and should not result in a slash
command being generated.


Re: Doing psql's lexing with flex

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> Feel free (or encouraged) to change the old behavior.  (Of course, the
> new behavior needs to adjustments, too.)  Generally, I think a variable
> should just stand for a data value and should not result in a slash
> command being generated.

That seems workable for :foo references in the body of a backslash
command --- you can just substitute the value but not allow it to be
split into multiple argument tokens.  However, wouldn't you still want
to rescan it for internal :bar references to other variables?  The psql
man page has some arm-waving about how you can do indirection, and seems
like that ought to work in this context too.

Also, this seems inconsistent with the non-backslash-command case.
Seems like you have to scan what goes into the query buffer to keep the
lexer state in sync, you can't just treat it as an uninterpreted token.
So not triggering on backslashes seems inconsistent.

We can certainly implement different behavior for the two cases, I'm
just unconvinced we should...

            regards, tom lane