Re: benchmarking Flex practices - Mailing list pgsql-hackers

From John Naylor
Subject Re: benchmarking Flex practices
Date
Msg-id CACPNZCvWPVC_RhNgSgQ7aaU38Ru4HXGZX4xusPDeFS1D0WSY5Q@mail.gmail.com
Whole thread Raw
In response to Re: benchmarking Flex practices  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Jul 3, 2019 at 5:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> John Naylor <john.naylor@2ndquadrant.com> writes:
> > 0001 is a small patch to remove some unneeded generality from the
> > current rules. This lowers the number of elements in the yy_transition
> > array from 37045 to 36201.
>
> I don't particularly like 0001.  The two bits like this
>
> -whitespace             ({space}+|{comment})
> +whitespace             ({space}|{comment})
>
> seem likely to create performance problems for runs of whitespace, in that
> the lexer will now have to execute the associated action once per space
> character not just once for the whole run.

Okay.

> There are a bunch of higher-order productions that use "{whitespace}*",
> which is surely a bit redundant given the contents of {whitespace}.
> But maybe we could address that by replacing "{whitespace}*" with
> "{opt_whitespace}" defined as
>
> opt_whitespace          ({space}*|{comment})
>
> Not sure what impact if any that'd have on table size, but I'm quite sure
> that {whitespace} was defined with an eye to avoiding unnecessary
> lexer action cycles.

It turns out that {opt_whitespace} as defined above is not equivalent
to {whitespace}* , since the former is either a single comment or a
single run of 0 or more whitespace chars (if I understand correctly).
Using {opt_whitespace} for the UESCAPE rules on top of v3-0002, the
regression tests pass, but queries like this fail with a syntax error:

# select U&'d!0061t!+000061' uescape  --comment
'!';

There was in fact a substantial size reduction, though, so for
curiosity's sake I tried just replacing {whitespace}* with {space}* in
the UESCAPE rules, and the table shrank from 30367 (that's with 0002
only) to 24661.

> As for the other two bits that are like
>
> -<xe>.                  {
> -                                       /* This is only needed for \ just before EOF */
> +<xe>\\                 {
>
> my recollection is that those productions are defined that way to avoid a
> flex warning about not all possible input characters being accounted for
> in the <xe> (resp. <xdolq>) state.  Maybe that warning is
> flex-version-dependent, or maybe this was just a worry and not something
> that actually produced a warning ... but I'm hesitant to change it.
> If we ever did get to flex's default action, that action is to echo the
> current input character to stdout, which would be Very Bad.

FWIW, I tried Flex 2.5.35 and 2.6.4 with no warnings, and I did get a
warning when I deleted any of those two rules. I'll leave them out for
now, since this change was only good for ~500 fewer elements in the
transition array.

> As far as I can see, the point of 0002 is to have just one set of
> flex rules for the various variants of quotecontinue processing.
> That sounds OK, though I'm a bit surprised it makes this much difference
> in the table size. I would suggest that "state_before" needs a less
> generic name (maybe "state_before_xqs"?) and more than no comment.
> Possibly more to the point, it's not okay to have static state variables
> in the core scanner, so that variable needs to be kept in yyextra.
> (Don't remember offhand whether it's any more acceptable in the other
> scanners.)

Ah yes, I got this idea from the ECPG scanner, which is not reentrant. Will fix.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Surafel Temesgen
Date:
Subject: Re: Conflict handling for COPY FROM
Next
From: Anthony Nowocien
Date:
Subject: Re: Conflict handling for COPY FROM