Re: benchmarking Flex practices - Mailing list pgsql-hackers

From Tom Lane
Subject Re: benchmarking Flex practices
Date
Msg-id 16941.1562106939@sss.pgh.pa.us
Whole thread Raw
In response to Re: benchmarking Flex practices  (John Naylor <john.naylor@2ndquadrant.com>)
Responses Re: benchmarking Flex practices  (John Naylor <john.naylor@2ndquadrant.com>)
Re: benchmarking Flex practices  (John Naylor <john.naylor@2ndquadrant.com>)
List pgsql-hackers
John Naylor <john.naylor@2ndquadrant.com> writes:
> 0001 is a small patch to remove some unneeded generality from the
> current rules. This lowers the number of elements in the yy_transition
> array from 37045 to 36201.

I don't particularly like 0001.  The two bits like this

-whitespace        ({space}+|{comment})
+whitespace        ({space}|{comment})

seem likely to create performance problems for runs of whitespace, in that
the lexer will now have to execute the associated action once per space
character not just once for the whole run.  Those actions are empty, but
I don't think flex optimizes for that, and it's really flex's per-action
overhead that I'm worried about.  Note the comment in the "Performance"
section of the flex manual:

    Another area where the user can increase a scanner's performance (and
    one that's easier to implement) arises from the fact that the longer
    the tokens matched, the faster the scanner will run.  This is because
    with long tokens the processing of most input characters takes place
    in the (short) inner scanning loop, and does not often have to go
    through the additional work of setting up the scanning environment
    (e.g., `yytext') for the action.

There are a bunch of higher-order productions that use "{whitespace}*",
which is surely a bit redundant given the contents of {whitespace}.
But maybe we could address that by replacing "{whitespace}*" with
"{opt_whitespace}" defined as

opt_whitespace        ({space}*|{comment})

Not sure what impact if any that'd have on table size, but I'm quite sure
that {whitespace} was defined with an eye to avoiding unnecessary
lexer action cycles.

As for the other two bits that are like

-<xe>.            {
-                    /* This is only needed for \ just before EOF */
+<xe>\\            {

my recollection is that those productions are defined that way to avoid a
flex warning about not all possible input characters being accounted for
in the <xe> (resp. <xdolq>) state.  Maybe that warning is
flex-version-dependent, or maybe this was just a worry and not something
that actually produced a warning ... but I'm hesitant to change it.
If we ever did get to flex's default action, that action is to echo the
current input character to stdout, which would be Very Bad.

As far as I can see, the point of 0002 is to have just one set of
flex rules for the various variants of quotecontinue processing.
That sounds OK, though I'm a bit surprised it makes this much difference
in the table size. I would suggest that "state_before" needs a less
generic name (maybe "state_before_xqs"?) and more than no comment.
Possibly more to the point, it's not okay to have static state variables
in the core scanner, so that variable needs to be kept in yyextra.
(Don't remember offhand whether it's any more acceptable in the other
scanners.)

            regards, tom lane



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: "long" type is not appropriate for counting tuples
Next
From: Peter Eisentraut
Date:
Subject: Re: Fix two issues after moving to unified logging system forcommand-line utils