Thread: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 10/15/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2018-10-15 16:36:26 -0400, Tom Lane wrote:
>>> We could possibly fix these by changing the data structure so that
>>> what's in a ScanKeywords entry is an offset into some giant string
>>> constant somewhere.  No idea how that would affect performance, but
>>> I do notice that we could reduce the sizeof(ScanKeyword), which can't
>>> hurt.
>
>> Yea, that might even help performancewise. Alternatively we could change
>> ScanKeyword to store the keyword name inline, but that'd be a measurable
>> size increase...
>
> Yeah.  It also seems like doing it this way would improve locality of
> access: the pieces of the giant string would presumably be in the same
> order as the ScanKeywords entries, whereas with the current setup,
> who knows where the compiler has put 'em or in what order.
>
> We'd need some tooling to generate the constants that way, though;
> I can't see how to make it directly from kwlist.h.

A few months ago I was looking into faster search algorithms for
ScanKeywordLookup(), so this is interesting to me. While an optimal
full replacement would be a lot of work, the above ideas are much less
invasive and would still have some benefit. Unless anyone intends to
work on this, I'd like to flesh out the offset-into-giant-string
approach a bit further:

Since there are several callers of the current approach that don't use
the core keyword list, we'd have to keep the existing struct and
lookup function, to keep the complexity manageable. Once we have an
offset-based struct and function, it makes sense to use it for all
searches of core keywords. This includes not only the core scanner,
but also adt/ruleutils.c, fe_utils/string_utils.c, and
ecpg/preproc/keywords.c.

There would need to be a header with offsets replacing name strings,
generated from parser/kwlist.h, maybe kwlist_offset.h. It'd probably
be convenient if it was emitted into the common/ dir. The giant string
would likely need its own header (kwlist_string.h?).

Since PL/pgSQL uses the core scanner, we'd need to use offsets in its
reserved_keywords[], too. Those don't change much, so we can probably
get away with hard-coding the offsets and the giant string in that
case. (If that's not acceptable, we could separate that out to
pl_reserved_kwlist.h and reuse the above tooling to generate
pl_reserved_kwlist_{offset,string}.h, but that's more complex.)

The rest should be just a SMOP. Any issues I left out?

-John Naylor


John Naylor <jcnaylor@gmail.com> writes:
> A few months ago I was looking into faster search algorithms for
> ScanKeywordLookup(), so this is interesting to me. While an optimal
> full replacement would be a lot of work, the above ideas are much less
> invasive and would still have some benefit. Unless anyone intends to
> work on this, I'd like to flesh out the offset-into-giant-string
> approach a bit further:

Have at it...

> Since PL/pgSQL uses the core scanner, we'd need to use offsets in its
> reserved_keywords[], too. Those don't change much, so we can probably
> get away with hard-coding the offsets and the giant string in that
> case. (If that's not acceptable, we could separate that out to
> pl_reserved_kwlist.h and reuse the above tooling to generate
> pl_reserved_kwlist_{offset,string}.h, but that's more complex.)

plpgsql isn't as stable as all that: people propose new syntax for it
all the time.  I do not think a hand-maintained array would be pleasant
at all.

Also, wouldn't we also adopt this technology for its unreserved keywords,
too?

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 12/17/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> John Naylor <jcnaylor@gmail.com> writes:
>> Since PL/pgSQL uses the core scanner, we'd need to use offsets in its
>> reserved_keywords[], too. Those don't change much, so we can probably
>> get away with hard-coding the offsets and the giant string in that
>> case. (If that's not acceptable, we could separate that out to
>> pl_reserved_kwlist.h and reuse the above tooling to generate
>> pl_reserved_kwlist_{offset,string}.h, but that's more complex.)
>
> plpgsql isn't as stable as all that: people propose new syntax for it
> all the time.  I do not think a hand-maintained array would be pleasant
> at all.

Okay.

> Also, wouldn't we also adopt this technology for its unreserved keywords,
> too?

We wouldn't be forced to, but there might be other reasons to do so.
Were you thinking of code consistency (within pl_scanner.c or
globally)? Or something else?

If we did adopt this setup for plpgsql unreserved keywords,
ecpg/preproc/ecpg_keywords.c and ecpg/preproc/c_keywords.c would be
left using the current ScanKeyword struct for search. Using offset
search for all 5 types of keywords would be globally consistent, but
it also means additional headers, generated headers, and makefile
rules.

-John Naylor


John Naylor <jcnaylor@gmail.com> writes:
> On 12/17/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also, wouldn't we also adopt this technology for its unreserved keywords,
>> too?

> We wouldn't be forced to, but there might be other reasons to do so.
> Were you thinking of code consistency (within pl_scanner.c or
> globally)? Or something else?

> If we did adopt this setup for plpgsql unreserved keywords,
> ecpg/preproc/ecpg_keywords.c and ecpg/preproc/c_keywords.c would be
> left using the current ScanKeyword struct for search. Using offset
> search for all 5 types of keywords would be globally consistent, but
> it also means additional headers, generated headers, and makefile
> rules.

I'd be kind of inclined to convert all uses of ScanKeyword to the new way,
if only for consistency's sake.  On the other hand, I'm not the one
volunteering to do the work.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 12/18/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd be kind of inclined to convert all uses of ScanKeyword to the new way,
> if only for consistency's sake.  On the other hand, I'm not the one
> volunteering to do the work.

That's reasonable, as long as the design is nailed down first. Along
those lines, attached is a heavily WIP patch that only touches plpgsql
unreserved keywords, to test out the new methodology in a limited
area. After settling APIs and name/directory bikeshedding, I'll move
on to the other four keyword types.

There's a new Perl script, src/common/gen_keywords.pl, which takes
pl_unreserved_kwlist.h as input and outputs
pl_unreserved_kwlist_offset.h and pl_unreserved_kwlist_string.h. The
output headers are not installed or symlinked anywhere. Since the
input keyword lists will never be #included directly, they might be
better as .txt files, like errcodes.txt. If we went that far, we might
also remove the PG_KEYWORD macros (they'd still be in the output
files) and rename parser/kwlist.h to common/core_kwlist.txt. There's
also a case for not changing things unnecessarily, especially if
there's ever a new reason to include the base keyword list directly.

To keep the other keyword types functional, I had to add a separate
new struct ScanKeywordOffset and new function
ScanKeywordLookupOffset(), so the patch is a bit messier than the
final will be. With a 4-byte offset, ScanKeywordOffset is 8 bytes,
down from 12, and is now a power of 2.

I used the global .gitignore, but maybe that's an abuse of it.

Make check passes, but I don't know how well it stresses keyword use.
I'll create a commitfest entry soon.

-John Naylor

Attachment

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andrew Gierth
Date:
>>>>> "John" == John Naylor <jcnaylor@gmail.com> writes:

 > On 12/18/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
 >> I'd be kind of inclined to convert all uses of ScanKeyword to the
 >> new way, if only for consistency's sake. On the other hand, I'm not
 >> the one volunteering to do the work.

 John> That's reasonable, as long as the design is nailed down first.
 John> Along those lines, attached is a heavily WIP patch that only
 John> touches plpgsql unreserved keywords, to test out the new
 John> methodology in a limited area. After settling APIs and
 John> name/directory bikeshedding, I'll move on to the other four
 John> keyword types.

Is there any particular reason not to go further and use a perfect hash
function for the lookup, rather than binary search?

-- 
Andrew (irc:RhodiumToad)


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2018-12-20 00:54:39 +0000, Andrew Gierth wrote:
> >>>>> "John" == John Naylor <jcnaylor@gmail.com> writes:
> 
>  > On 12/18/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  >> I'd be kind of inclined to convert all uses of ScanKeyword to the
>  >> new way, if only for consistency's sake. On the other hand, I'm not
>  >> the one volunteering to do the work.
> 
>  John> That's reasonable, as long as the design is nailed down first.
>  John> Along those lines, attached is a heavily WIP patch that only
>  John> touches plpgsql unreserved keywords, to test out the new
>  John> methodology in a limited area. After settling APIs and
>  John> name/directory bikeshedding, I'll move on to the other four
>  John> keyword types.
> 
> Is there any particular reason not to go further and use a perfect hash
> function for the lookup, rather than binary search?

The last time I looked into perfect hash functions, it wasn't easy to
find a generator that competed with a decent normal hashtable (in
particular gperf's are very unconvincing). The added tooling is a
concern imo.  OTOH, we're comparing not with a hashtable, but a binary
search, where the latter will usually lose.  Wonder if we shouldn't
generate a serialized non-perfect hashtable instead. The lookup code for
a read-only hashtable without concern for adversarial input is pretty
trivial.

Greetings,

Andres Freund


Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Is there any particular reason not to go further and use a perfect hash
> function for the lookup, rather than binary search?

Tooling?  I seem to recall having looked at gperf and deciding that it
pretty much sucked, so it's not real clear to me what we would use.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 12/19/18, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> Is there any particular reason not to go further and use a perfect hash
> function for the lookup, rather than binary search?

When I was investigating faster algorithms, I ruled out gperf based on
discussions in the archives. The approach here has modest goals and
shouldn't be too invasive.

With the makefile support and separate keyword files in place, that'll
be one less thing to do if we ever decide to replace binary search.
The giant string will likely be useful as well.

Since we're on the subject, I think some kind of trie would be ideal
performance-wise, but a large amount of work. The nice thing about a
trie is that it can be faster than a hash table for a key miss. I
found a paper that described some space-efficient trie variations [1],
but we'd likely have to code the algorithm and a way to emit a C code
representation of it. I've found some libraries, but that would have
more of the same difficulties in practicality that gperf had.

[1] https://infoscience.epfl.ch/record/64394/files/triesearches.pdf

-John Naylor


John Naylor <jcnaylor@gmail.com> writes:
> On 12/18/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd be kind of inclined to convert all uses of ScanKeyword to the new way,
>> if only for consistency's sake.  On the other hand, I'm not the one
>> volunteering to do the work.

> That's reasonable, as long as the design is nailed down first. Along
> those lines, attached is a heavily WIP patch that only touches plpgsql
> unreserved keywords, to test out the new methodology in a limited
> area. After settling APIs and name/directory bikeshedding, I'll move
> on to the other four keyword types.

Let the bikeshedding begin ...

> There's a new Perl script, src/common/gen_keywords.pl,

I'd be inclined to put the script in src/tools, I think.  IMO src/common
is for code that actually gets built into our executables.

> which takes
> pl_unreserved_kwlist.h as input and outputs
> pl_unreserved_kwlist_offset.h and pl_unreserved_kwlist_string.h.

I wonder whether we'd not be better off producing just one output
file, in which we have the offsets emitted as PG_KEYWORD macros
and then the giant string emitted as a macro definition, ie
something like

#define PG_KEYWORD_STRING \
    "absolute\0" \
    "alias\0" \
    ...

That simplifies the Makefile-hacking, at least, and it possibly gives
callers more flexibility about what they actually want to do with the
string.

> The
> output headers are not installed or symlinked anywhere. Since the
> input keyword lists will never be #included directly, they might be
> better as .txt files, like errcodes.txt. If we went that far, we might
> also remove the PG_KEYWORD macros (they'd still be in the output
> files) and rename parser/kwlist.h to common/core_kwlist.txt. There's
> also a case for not changing things unnecessarily, especially if
> there's ever a new reason to include the base keyword list directly.

I'm for "not change things unnecessarily".  People might well be
scraping the keyword list out of parser/kwlist.h for other purposes
right now --- indeed, it's defined the way it is exactly to let
people do that.  I don't see a good reason to force them to redo
whatever tooling they have that depends on that.  So let's build
kwlist_offsets.h alongside that, but not change kwlist.h itself.

> To keep the other keyword types functional, I had to add a separate
> new struct ScanKeywordOffset and new function
> ScanKeywordLookupOffset(), so the patch is a bit messier than the
> final will be.

Check.

> I used the global .gitignore, but maybe that's an abuse of it.

Yeah, I'd say it is.

> +# TODO: Error out if the keyword names are not in ASCII order.

+many for including such a check.

Also note that we don't require people to have Perl installed when
building from a tarball.  Therefore, these derived headers must get
built during "make distprep" and removed by maintainer-clean but
not distclean.  I think this also has some implications for VPATH
builds, but as long as you follow the pattern used for other
derived header files (e.g. fmgroids.h), you should be fine.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 12/20/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd be inclined to put the script in src/tools, I think.  IMO src/common
> is for code that actually gets built into our executables.

Done.

>> which takes
>> pl_unreserved_kwlist.h as input and outputs
>> pl_unreserved_kwlist_offset.h and pl_unreserved_kwlist_string.h.
>
> I wonder whether we'd not be better off producing just one output
> file, in which we have the offsets emitted as PG_KEYWORD macros
> and then the giant string emitted as a macro definition, ie
> something like
>
> #define PG_KEYWORD_STRING \
>     "absolute\0" \
>     "alias\0" \
>     ...
>
> That simplifies the Makefile-hacking, at least, and it possibly gives
> callers more flexibility about what they actually want to do with the
> string.

Okay, I tried that. Since the script is turning one header into
another, I borrowed the "*_d.h" nomenclature from the catalogs. Using
a single file required some #ifdef hacks in the output file. Maybe
there's a cleaner way to do this, but I don't know what it is.

Using a single file also gave me another idea: Take value and category
out of ScanKeyword, and replace them with an index into another array
containing those, which will only be accessed in the event of a hit.
That would shrink ScanKeyword to 4 bytes (offset, index), further
increasing locality of reference. Might not be worth it, but I can try
it after moving on to the core scanner.

> I'm for "not change things unnecessarily".  People might well be
> scraping the keyword list out of parser/kwlist.h for other purposes
> right now --- indeed, it's defined the way it is exactly to let
> people do that.  I don't see a good reason to force them to redo
> whatever tooling they have that depends on that.  So let's build
> kwlist_offsets.h alongside that, but not change kwlist.h itself.

Done.

>> I used the global .gitignore, but maybe that's an abuse of it.
>
> Yeah, I'd say it is.

Moved.

>> +# TODO: Error out if the keyword names are not in ASCII order.
>
> +many for including such a check.

Done.

> Also note that we don't require people to have Perl installed when
> building from a tarball.  Therefore, these derived headers must get
> built during "make distprep" and removed by maintainer-clean but
> not distclean.  I think this also has some implications for VPATH
> builds, but as long as you follow the pattern used for other
> derived header files (e.g. fmgroids.h), you should be fine.

Done. I also blindly added support for MSVC.

-John Naylor

Attachment
John Naylor <jcnaylor@gmail.com> writes:
> Using a single file also gave me another idea: Take value and category
> out of ScanKeyword, and replace them with an index into another array
> containing those, which will only be accessed in the event of a hit.
> That would shrink ScanKeyword to 4 bytes (offset, index), further
> increasing locality of reference. Might not be worth it, but I can try
> it after moving on to the core scanner.

I like that idea a *lot*, actually, because it offers the opportunity
to decouple this mechanism from all assumptions about what the
auxiliary data for a keyword is.  Basically, we'd redefine
ScanKeywordLookup as having the API "given a string, return a
keyword index if it is a keyword, -1 if it isn't"; then the caller
would use the keyword index to look up the auxiliary data in a table
that it owns, and ScanKeywordLookup doesn't know about at all.

So that leads to a design like this: the master data is in a header
that's just like kwlist.h is today, except now we are thinking of
PG_KEYWORD as an N-argument macro not necessarily exactly 3 arguments.
The Perl script reads that, paying attention only to the first argument
of the macro calls, and outputs a file containing, say,

static const uint16 kw_offsets[] = { 0, 6, 15, ... };

static const char kw_strings[] =
    "abort\0"
    "absolute\0"
    ...
;

(it'd be a good idea to have a switch that allows specifying the
prefix of these constant names).  Then ScanKeywordLookup has the
signature

int ScanKeywordLookup(const char *string_to_lookup,
                      const char *kw_strings,
                      const uint16 *kw_offsets,
                      int num_keywords);

and a file using this stuff looks something like

/* Payload data for keywords */
typedef struct MyKeyword
{
    int16        value;
    int16        category;
} MyKeyword;

#define PG_KEYWORD(kwname, value, category) {value, category},

static const MyKeyword MyKeywords[] = {
#include "kwlist.h"
};

/* String lookup table for keywords */
#include "kwlist_d.h"

/* Lookup code looks about like this: */
    kwnum = ScanKeywordLookup(str,
                              kw_strings,
                              kw_offsets,
                              lengthof(kw_offsets));
    if (kwnum >= 0)
       ... look into MyKeywords[kwnum] for info ...

Aside from being arguably better from the locality-of-reference
standpoint, this gets us out of the weird ifdef'ing you've got in
the v2 patch.  The kwlist_d.h headers can be very ordinary headers.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2018-12-22 12:20:00 -0500, Tom Lane wrote:
> John Naylor <jcnaylor@gmail.com> writes:
> > Using a single file also gave me another idea: Take value and category
> > out of ScanKeyword, and replace them with an index into another array
> > containing those, which will only be accessed in the event of a hit.
> > That would shrink ScanKeyword to 4 bytes (offset, index), further
> > increasing locality of reference. Might not be worth it, but I can try
> > it after moving on to the core scanner.
> 
> I like that idea a *lot*, actually, because it offers the opportunity
> to decouple this mechanism from all assumptions about what the
> auxiliary data for a keyword is.

OTOH, it doubles or triples the number of cachelines accessed when
encountering a keyword. The fraction of keywords to not-keywords in SQL
makes me wonder whether that makes it a good deal.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2018-12-22 12:20:00 -0500, Tom Lane wrote:
>> I like that idea a *lot*, actually, because it offers the opportunity
>> to decouple this mechanism from all assumptions about what the
>> auxiliary data for a keyword is.

> OTOH, it doubles or triples the number of cachelines accessed when
> encountering a keyword.

Compared to what?  The current situation in that regard is a mess.

Also, AFAICS this proposal involves the least amount of data touched
during the lookup phase of anything we've discussed, so I do not even
accept that your criticism is correct.  One extra cacheline fetch
to get the aux data for a particular keyword after the search is not
going to tip the scales away from this being a win.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 12/22/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> John Naylor <jcnaylor@gmail.com> writes:
>> Using a single file also gave me another idea: Take value and category
>> out of ScanKeyword, and replace them with an index into another array
>> containing those, which will only be accessed in the event of a hit.
>> That would shrink ScanKeyword to 4 bytes (offset, index), further
>> increasing locality of reference. Might not be worth it, but I can try
>> it after moving on to the core scanner.
>
> I like that idea a *lot*, actually, because it offers the opportunity
> to decouple this mechanism from all assumptions about what the
> auxiliary data for a keyword is.

Okay, in that case I went ahead and did it for WIP v3.

> (it'd be a good idea to have a switch that allows specifying the
> prefix of these constant names).

Done as an optional switch, and tested, but not yet used in favor of
the previous method as a fallback. I'll probably do it in the final
version to keep lines below 80, and to add 'core_' to the core keyword
vars.

> /* Payload data for keywords */
> typedef struct MyKeyword
> {
>     int16        value;
>     int16        category;
> } MyKeyword;

I tweaked this a bit to

typedef struct ScanKeywordAux
{
    int16    value;        /* grammar's token code */
    char        category;        /* see codes above */
} ScanKeywordAux;

It seems that category was only 2 bytes to make ScanKeyword a power of
2 (of course that was on 32-bit machines and doesn't hold true
anymore). Using char will save another few hundred bytes in the core
scanner. Since we're only accessing this once per identifier, we may
not need to worry so much about memory alignment.

> Aside from being arguably better from the locality-of-reference
> standpoint, this gets us out of the weird ifdef'ing you've got in
> the v2 patch.  The kwlist_d.h headers can be very ordinary headers.

Yeah, that's a nice (and for me unexpected) bonus.

-John Naylor

Attachment

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Robert Haas
Date:
On Wed, Dec 19, 2018 at 8:01 PM Andres Freund <andres@anarazel.de> wrote:
> The last time I looked into perfect hash functions, it wasn't easy to
> find a generator that competed with a decent normal hashtable (in
> particular gperf's are very unconvincing). The added tooling is a
> concern imo.  OTOH, we're comparing not with a hashtable, but a binary
> search, where the latter will usually lose.  Wonder if we shouldn't
> generate a serialized non-perfect hashtable instead. The lookup code for
> a read-only hashtable without concern for adversarial input is pretty
> trivial.

I wonder if we could do something really simple like a lookup based on
the first character of the scan keyword. It looks to me like there are
440 keywords right now, and the most common starting letter is 'c',
which is the first letter of 51 keywords. So dispatching based on the
first letter clips at least 3 steps off the binary search.  I don't
know whether that's enough to be worthwhile, but it's probably pretty
simple to implement.

I'm not sure that I understand quite what you have in mind for a
serialized non-perfect hashtable.  Are you thinking that we'd just
construct a simplehash and serialize it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Dec 19, 2018 at 8:01 PM Andres Freund <andres@anarazel.de> wrote:
>> The last time I looked into perfect hash functions, it wasn't easy to
>> find a generator that competed with a decent normal hashtable (in
>> particular gperf's are very unconvincing). The added tooling is a
>> concern imo.  OTOH, we're comparing not with a hashtable, but a binary
>> search, where the latter will usually lose.  Wonder if we shouldn't
>> generate a serialized non-perfect hashtable instead. The lookup code for
>> a read-only hashtable without concern for adversarial input is pretty
>> trivial.

> I wonder if we could do something really simple like a lookup based on
> the first character of the scan keyword. It looks to me like there are
> 440 keywords right now, and the most common starting letter is 'c',
> which is the first letter of 51 keywords. So dispatching based on the
> first letter clips at least 3 steps off the binary search.  I don't
> know whether that's enough to be worthwhile, but it's probably pretty
> simple to implement.

I think there's a lot of goalpost-moving going on here.  The original
idea was to trim the physical size of the data structure, as stated
in the thread subject, and just reap whatever cache benefits we got
along the way from that.  I am dubious that we actually have any
performance problem in this code that needs a big dollop of added
complexity to fix.

In my hands, the only part of the low-level parsing code that commonly
shows up as interesting in profiles is the Bison engine.  That's probably
because the grammar tables are circa half a megabyte and blow out cache
pretty badly :-(.  I don't know of any way to make that better,
unfortunately.  I suspect that it's just going to get worse, because
people keep submitting additions to the grammar.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
David Fetter
Date:
On Wed, Dec 26, 2018 at 11:22:39AM -0500, Tom Lane wrote:
> 
> In my hands, the only part of the low-level parsing code that
> commonly shows up as interesting in profiles is the Bison engine.

Should we be considering others? As I understand it, steps have been
made in this field since yacc was originally designed. Is LALR
actually suitable for languages like SQL, or is it just there for
historical reasons?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Robert Haas
Date:
On Wed, Dec 26, 2018 at 11:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think there's a lot of goalpost-moving going on here.  The original
> idea was to trim the physical size of the data structure, as stated
> in the thread subject, and just reap whatever cache benefits we got
> along the way from that.  I am dubious that we actually have any
> performance problem in this code that needs a big dollop of added
> complexity to fix.

I have seen ScanKeywordLookup show up in profiles quite often and
fairly prominently -- like several percent of total runtime. I'm not
trying to impose requirements on John's patch, and I agree that
reducing the physical size of the structure is a good step whether
anything else is done or not. However, I don't see that as a reason to
shut down further discussion of other possible improvements.  If his
patch makes this disappear from profiles, cool, but if it doesn't,
then sooner or later somebody's going to want to do more.

FWIW, my bet is this helps but isn't enough to get rid of the problem
completely.  A 9-step binary search has got to be slower than a really
well-optimized hash table lookup. In a perfect world the latter
touches the cache line containing the keyword -- which presumably is
already in cache since we just scanned it -- then computes a hash
value without touching any other cache lines -- and then goes straight
to the right entry.  So it touches ONE new cache line.  That might be a
level of optimization that's hard to achieve in practice, but I don't
think it's crazy to want to get there.

> In my hands, the only part of the low-level parsing code that commonly
> shows up as interesting in profiles is the Bison engine.  That's probably
> because the grammar tables are circa half a megabyte and blow out cache
> pretty badly :-(.  I don't know of any way to make that better,
> unfortunately.  I suspect that it's just going to get worse, because
> people keep submitting additions to the grammar.

I'm kinda surprised that you haven't seen ScanKeywordLookup() in
there, but I agree with you that the size of the main parser tables is
a real issue, and that there's no easy solution. At various times
there has been discussion of using some other parser generator, and
I've also toyed with the idea of writing one specifically for
PostgreSQL. Unfortunately, it seems like bison is all but
unmaintained; the alternatives are immature and have limited adoption
and limited community; and writing something from scratch is a ton of
work.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


David Fetter <david@fetter.org> writes:
> On Wed, Dec 26, 2018 at 11:22:39AM -0500, Tom Lane wrote:
>> In my hands, the only part of the low-level parsing code that
>> commonly shows up as interesting in profiles is the Bison engine.

> Should we be considering others?

We've looked around before, IIRC, and not really seen any arguably
better tools.

            regards, tom lane


Robert Haas <robertmhaas@gmail.com> writes:
> I'm kinda surprised that you haven't seen ScanKeywordLookup() in
> there, but I agree with you that the size of the main parser tables is
> a real issue, and that there's no easy solution. At various times
> there has been discussion of using some other parser generator, and
> I've also toyed with the idea of writing one specifically for
> PostgreSQL. Unfortunately, it seems like bison is all but
> unmaintained; the alternatives are immature and have limited adoption
> and limited community; and writing something from scratch is a ton of
> work.  :-(

Yeah, and also: SQL is a damn big and messy language, and so it's not
very clear that it's really bison's fault that it's slow to parse.
We might do a ton of work to implement an alternative, and then find
ourselves no better off.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2018-12-26 11:50:18 -0500, Robert Haas wrote:
> On Wed, Dec 26, 2018 at 11:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I think there's a lot of goalpost-moving going on here.  The original
> > idea was to trim the physical size of the data structure, as stated
> > in the thread subject, and just reap whatever cache benefits we got
> > along the way from that.  I am dubious that we actually have any
> > performance problem in this code that needs a big dollop of added
> > complexity to fix.
> 
> I have seen ScanKeywordLookup show up in profiles quite often and
> fairly prominently -- like several percent of total runtime. I'm not
> trying to impose requirements on John's patch, and I agree that
> reducing the physical size of the structure is a good step whether
> anything else is done or not. However, I don't see that as a reason to
> shut down further discussion of other possible improvements.  If his
> patch makes this disappear from profiles, cool, but if it doesn't,
> then sooner or later somebody's going to want to do more.

I agree. And most of the patch would be a pre-requisite for anything
more elaborate anyway.


> FWIW, my bet is this helps but isn't enough to get rid of the problem
> completely.  A 9-step binary search has got to be slower than a really
> well-optimized hash table lookup.

Yea, at least with a non-optimized layout. If we'd used a binary search
optimized lookup order it might be different, but probably at best
equivalent to a good hashtable.

> > In my hands, the only part of the low-level parsing code that commonly
> > shows up as interesting in profiles is the Bison engine.  That's probably
> > because the grammar tables are circa half a megabyte and blow out cache
> > pretty badly :-(.  I don't know of any way to make that better,
> > unfortunately.  I suspect that it's just going to get worse, because
> > people keep submitting additions to the grammar.
> 
> I'm kinda surprised that you haven't seen ScanKeywordLookup() in
> there, but I agree with you that the size of the main parser tables is
> a real issue, and that there's no easy solution. At various times
> there has been discussion of using some other parser generator, and
> I've also toyed with the idea of writing one specifically for
> PostgreSQL. Unfortunately, it seems like bison is all but
> unmaintained; the alternatives are immature and have limited adoption
> and limited community; and writing something from scratch is a ton of
> work.  :-(

My bet is, and has been for quite a while, that we'll have to go for a
hand-written recursive descent type parser.  They can be *substantially*
faster, and performance isn't as affected by the grammar size. And,
about as important, they also allow for a lot more heuristics around
grammar errors - I do think we'll soon have to do better than throwing a
generic syntax error for the cases where the grammar doesn't match at
all.

Greetings,

Andres Freund


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2018-12-26 10:45:11 -0500, Robert Haas wrote:
> I'm not sure that I understand quite what you have in mind for a
> serialized non-perfect hashtable.  Are you thinking that we'd just
> construct a simplehash and serialize it?

I was basically thinking that we'd have the perl script implement a
simple hash and put the keyword (pointers) into an array, handling
conflicts with the simplest linear probing thinkable. As there's never a
need for modifications, that ought to be fairly simple.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> My bet is, and has been for quite a while, that we'll have to go for a
> hand-written recursive descent type parser.

I will state right up front that that will happen over my dead body.

It's impossible to write correct RD parsers by hand for any but the most
trivial, conflict-free languages, and what we have got to deal with
is certainly neither of those; moreover, it's a constantly moving target.
We'd be buying into an endless landscape of parser bugs if we go that way.
It's *not* worth it.

            regards, tom lane


Andres Freund <andres@anarazel.de> writes:
> On 2018-12-26 10:45:11 -0500, Robert Haas wrote:
>> I'm not sure that I understand quite what you have in mind for a
>> serialized non-perfect hashtable.  Are you thinking that we'd just
>> construct a simplehash and serialize it?

> I was basically thinking that we'd have the perl script implement a
> simple hash and put the keyword (pointers) into an array, handling
> conflicts with the simplest linear probing thinkable. As there's never a
> need for modifications, that ought to be fairly simple.

I think it was Knuth who said that when you use hashing, you are putting
a great deal of faith in the average case, because the worst case is
terrible.  The applicability of that to this problem is that if you hit
a bad case (say, a long collision chain affecting some common keywords)
you could end up with poor performance that affects a lot of people for
a long time.  And our keyword list is not so static that you could prove
once that the behavior is OK and then forget about it.

So I'm suspicious of proposals to use simplistic hashing here.

There might well be some value in Robert's idea of keying off the first
letter to get rid of the first few binary-search steps, not least because
those steps are particularly terrible from a cache-footprint perspective.
I'm not sold on doing anything significantly more invasive than that.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 12/26/18, Robert Haas <robertmhaas@gmail.com> wrote:
> I wonder if we could do something really simple like a lookup based on
> the first character of the scan keyword. It looks to me like there are
> 440 keywords right now, and the most common starting letter is 'c',
> which is the first letter of 51 keywords. So dispatching based on the
> first letter clips at least 3 steps off the binary search.  I don't
> know whether that's enough to be worthwhile, but it's probably pretty
> simple to implement.

Using radix tree structures for the top couple of node levels is a
known technique to optimize tries that need to be more space-efficient
at lower levels, so this has precedent. In this case there would be a
space trade off of

(alphabet size, rounded up) * (size of index to lower boundary + size
of index to upper boundary) = 32 * (2 + 2) = 128 bytes

which is pretty small compared to what we'll save by offset-based
lookup. On average, there'd be 4.1 binary search steps, which is nice.
I agree it'd be fairly simple to do, and might raise the bar for doing
anything more complex.

-John Naylor


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2018-12-26 14:03:57 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > My bet is, and has been for quite a while, that we'll have to go for a
> > hand-written recursive descent type parser.
> 
> I will state right up front that that will happen over my dead body.
> 
> It's impossible to write correct RD parsers by hand for any but the most
> trivial, conflict-free languages, and what we have got to deal with
> is certainly neither of those; moreover, it's a constantly moving target.
> We'd be buying into an endless landscape of parser bugs if we go that way.
> It's *not* worth it.

It's not exactly new that people end up moving from bison to recursive
descent parsers once they hit the performance problems and want to give
better error messages. E.g. both gcc and clang have hand-written
recursive-descent parsers for C and C++ these days.  I don't buy that
we're unable to write a recursive descent parser by hand.

What I *do* buy is that it's more problematic for the design of our SQL
dialect, because the use of bison often uncovers ambiguities in new
extensions of the language. And I don't really have a good idea how to
handle that.

Greetings,

Andres Freund


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 12/26/18, John Naylor <jcnaylor@gmail.com> wrote:
> On 12/26/18, Robert Haas <robertmhaas@gmail.com> wrote:
>> I wonder if we could do something really simple like a lookup based on
>> the first character of the scan keyword. It looks to me like there are
>> 440 keywords right now, and the most common starting letter is 'c',
>> which is the first letter of 51 keywords. So dispatching based on the
>> first letter clips at least 3 steps off the binary search.  I don't
>> know whether that's enough to be worthwhile, but it's probably pretty
>> simple to implement.

> I agree it'd be fairly simple to do, and might raise the bar for doing
> anything more complex.

I went ahead and did this for v4, but split out into a separate patch.
In addition, I used a heuristic to bypass binary search for the most
common keywords. Normally, the middle value is computed
mathematically, but I found that in each range of keywords beginning
with the same letter, there is often 1 or 2 common keywords that are
good first guesses, such as select, from, join, limit, where. I taught
the lookup to try those first, and then compute subsequent steps the
usual way.

Barring additional bikeshedding on 0001, I'll plan on implementing
offset-based lookup for the other keyword types and retire the old
ScanKeyword. Once that's done, we can benchmark and compare with the
optimizations in 0002.

-John Naylor

Andres Freund <andres@anarazel.de> writes:
> On 2018-12-26 14:03:57 -0500, Tom Lane wrote:
>> It's impossible to write correct RD parsers by hand for any but the most
>> trivial, conflict-free languages, and what we have got to deal with
>> is certainly neither of those; moreover, it's a constantly moving target.
>> We'd be buying into an endless landscape of parser bugs if we go that way.
>> It's *not* worth it.

> It's not exactly new that people end up moving from bison to recursive
> descent parsers once they hit the performance problems and want to give
> better error messages. E.g. both gcc and clang have hand-written
> recursive-descent parsers for C and C++ these days.

Note that they are dealing with fixed language definitions.  Furthermore,
there's no need to worry about whether that code has to be hacked on by
less-than-expert people.  Neither condition applies to us.

The thing that most concerns me about not using a grammar tool of some
sort is that with handwritten RD, it's very easy to get into situations
where you've "defined" (well, implemented, because you never did have
a formal definition) a language that is ambiguous, admitting of more
than one valid parse interpretation.  You won't find out until someone
files a bug report complaining that some apparently-valid statement
isn't doing what they expect.  At that point you are in a world of hurt,
because it's too late to fix it without changing the language definition
and thus creating user-visible compatibility breakage.

Now bison isn't perfect in this regard, because you can shoot yourself
in the foot with ill-considered precedence specifications (and we've
done so ;-(), but it is light-years more likely to detect ambiguous
grammar up-front than any handwritten parser logic is.

If we had a tool that proved a BNF grammar non-ambiguous and then
wrote an RD parser for it, that'd be fine with me --- but we need
a tool, not somebody claiming he can write an error-free RD parser
for an arbitrary language.  My position is that anyone claiming that
is just plain deluded.

I also do not buy your unsupported-by-any-evidence claim that the
error reports would be better.  I've worked on RD parsers in the
past, and they're not really better, at least not without expending
enormous amounts of effort --- and run-time cycles --- specifically
on the error reporting aspect.  Again, I don't see that happening
for us.

> I don't buy that we're unable to write a recursive descent parser by hand.

I do not think that we could write one for the current state of the
PG grammar without an investment of effort so large that it's not
going to happen.  Even if such a parser were to spring fully armed
from somebody's forehead, we absolutely cannot expect that it would
continue to work correctly after non-wizard contributors modify it.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andrew Dunstan
Date:
On 12/27/18 12:12 PM, Tom Lane wrote:

>> I don't buy that we're unable to write a recursive descent parser by hand.
> I do not think that we could write one for the current state of the
> PG grammar without an investment of effort so large that it's not
> going to happen.  Even if such a parser were to spring fully armed
> from somebody's forehead, we absolutely cannot expect that it would
> continue to work correctly after non-wizard contributors modify it.

I just did a quick survey of generator tools. Unfortunately, the best
candidate alternative (ANTLR) no longer supports generating plain C
code. I don't know of another tool that is well maintained, supports C,
and generates top down parsers. Twenty-five years ago or so I wrote a
top-down table-driven parser generator, but that was in another country,
and besides, the wench is dead.


There are well-known techniques (see section 4.4 of the Dragon Book, if
you have a copy) for formal analysis of grammars to determine predictive
parser actions. They aren't hard, and the tables they produce are
typically much smaller than those used for LALR parsers. Still, probably
not for the faint of heart.


The tools that have moved to using hand-cut RD parsers have done so
precisely because they get a significant performance benefit from it.


RD parsers are not terribly hard to write. Yes, the JSON grammar is
tiny, but I think I wrote the basics of the RD parser we use for JSON in
about an hour. I think arguing that our hacker base is not competent to
maintain such a thing for the SQL grammar is wrong.  We successfully
maintain vastly more complex pieces of code.


Having said all that, I don't intend to spend any time on implementing
an alternative parser. It would as you say involve a heck of a lot of
time, which I don't have. It would be a fine academic research project
for some student.


A smaller project might be to see if we can replace the binary keyword
search  in ScanKeyword with a perfect hashing function generated by
gperf, or something similar. I had a quick look at that, too.
Unfortunately the smallest hash table I could generate for our 440
symbols had 1815 entries, so I'm not sure how well that would work.
Worth investigating, though.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



John Naylor <jcnaylor@gmail.com> writes:
> Barring additional bikeshedding on 0001, I'll plan on implementing
> offset-based lookup for the other keyword types and retire the old
> ScanKeyword. Once that's done, we can benchmark and compare with the
> optimizations in 0002.

Sounds like a plan.

Assorted minor bikeshedding on v4-0001 (just from eyeballing it, I didn't
test it):

+/* Like ScanKeywordLookup, but uses offsets into a keyword string. */
+int
+ScanKeywordLookupOffset(const char *string_to_lookup,
+                        const char *kw_strings,

Not really "like" it, since the return value is totally different and
so is the representation of the keyword list.  I realize that your
plan is probably to get rid of ScanKeywordLookup and then adapt the
latter's comment for this code, but don't forget that you need to
adjust said comment.

+/* Payload data for keywords */
+typedef struct ScanKeywordAux
+{
+    int16        value;            /* grammar's token code */
+    char        category;        /* see codes above */
+} ScanKeywordAux;

There isn't really any point in changing category to "char", because
alignment considerations will mandate that sizeof(ScanKeywordAux) be
a multiple of 2 anyway.  With some compilers we could get around that
with a pragma to force non-aligned storage, but doing so would be a
net loss on most non-Intel architectures.

If you really are hot about saving that other 440 bytes, the way to
do it would be to drop the struct entirely and use two parallel
arrays, an int16[] for value and a char[] (or better uint8[]) for
category.  Those would be filled by reading kwlist.h twice with
different definitions for PG_KEYWORD.  Not sure it's worth the
trouble though --- in particular, not clear that it's a win from
the standpoint of number of cache lines touched.

diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore
@@ -1,3 +1,4 @@
+/*kwlist_d.h

Not a fan of using wildcards in .gitignore files, at least not when
there's just one or two files you intend to match.

 # Force these dependencies to be known even without dependency info built:
-pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h
+pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h
pl_unreserved_kwlist_d.h

Hm, do we really need any more than pl_scanner.o to depend on that header?

+/* FIXME: Have to redefine this symbol for the WIP. */
+#undef PG_KEYWORD
+#define PG_KEYWORD(kwname, value, category) {value, category},
+
+static const ScanKeywordAux unreserved_keywords[] = {
+#include "pl_unreserved_kwlist.h"
 };

The category isn't useful for this keyword list, so couldn't you
just make this an array of uint16 values?

diff --git a/src/pl/plpgsql/src/pl_unreserved_kwlist.h b/src/pl/plpgsql/src/pl_unreserved_kwlist.h
+/* name, value, category */
+PG_KEYWORD("absolute", K_ABSOLUTE, UNRESERVED_KEYWORD)

Likewise, I'd just have these be two-argument macros.  There's no reason
for the various kwlist.h headers to agree on the number of payload
arguments for PG_KEYWORD.

diff --git a/src/tools/gen_keywords.pl b/src/tools/gen_keywords.pl
+    elsif ($arg =~ /^-o/)
+    {
+        $output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV;
+    }

My perl-fu is not great, but it looks like this will accept arguments
like "-ofilename", which is a style I don't like at all.  I'd rather
either insist on the filename being separate or write the switch like
"-o=filename".  Also, project style when taking both forms is usually
more like
    -o filename
    --output=filename

+$kw_input_file =~ /((\w*)kwlist)\.h/;
+my $base_filename = $1;
+$prefix = $2 if !defined $prefix;

Hmm, what happens if the input filename does not end with "kwlist.h"?

+# Parse keyword header for names.
+my @keywords;
+while (<$kif>)
+{
+    if (/^PG_KEYWORD\("(\w+)",\s*\w+,\s*\w+\)/)

This is assuming more than it should about the number of arguments for
PG_KEYWORD, as well as what's in them.  I think it'd be sufficient to
match like this:

    if (/^PG_KEYWORD\("(\w+)",/)

+Options:
+    -o               output path
+    -p               optional prefix for generated data structures

This usage message is pretty vague about how you write the options
(cf gripe above).


I looked very briefly at v4-0002, and I'm not very convinced about
the "middle" aspect of that optimization.  It seems unmaintainable,
plus you've not exhibited how the preferred keywords would get selected
in the first place (wiring them into the Perl script is surely not
acceptable).  If you want to pursue that, please separate it into
an 0002 that just adds the letter-range aspect and then an 0003
that adds the "middle" business on top.  Then we can do testing to
see whether either of those ideas are worthwhile.

            regards, tom lane


Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> RD parsers are not terribly hard to write.

Sure, as long as they are for grammars that are (a) small, (b) static,
and (c) LL(1), which is strictly weaker than the LALR(1) grammar class
that bison can handle.  We already have a whole lot of constructs that
are at the edges of what bison can handle, which makes me dubious that
an RD parser could be built at all without a lot of performance-eating
lookahead and/or backtracking.

> A smaller project might be to see if we can replace the binary keyword
> search in ScanKeyword with a perfect hashing function generated by
> gperf, or something similar. I had a quick look at that, too.

Yeah, we've looked at gperf before, eg

https://www.postgresql.org/message-id/20170927183156.jqzcsy7ocjcbdnmo@alap3.anarazel.de

Perhaps it'd be a win but I'm not very convinced.

I don't know much about the theory of perfect hashing, but I wonder
if we could just roll our own tool for that.  Since we're not dealing
with extremely large keyword sets, perhaps brute force search for a
set of multipliers for a hash computation like
(char[0] * some_prime + char[1] * some_other_prime ...) mod table_size
would work.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 12/27/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> diff --git a/src/tools/gen_keywords.pl b/src/tools/gen_keywords.pl
> +    elsif ($arg =~ /^-o/)
> +    {
> +        $output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV;
> +    }
>
> My perl-fu is not great, but it looks like this will accept arguments
> like "-ofilename", which is a style I don't like at all.  I'd rather
> either insist on the filename being separate or write the switch like
> "-o=filename".  Also, project style when taking both forms is usually
> more like
>     -o filename
>     --output=filename

This style was cargo-culted from the catalog scripts. I can settle on
just the first form if you like.

> +$kw_input_file =~ /((\w*)kwlist)\.h/;
> +my $base_filename = $1;
> +$prefix = $2 if !defined $prefix;
>
> Hmm, what happens if the input filename does not end with "kwlist.h"?

If that's a maintainability hazard, I can force every invocation to
provide a prefix instead.

> I looked very briefly at v4-0002, and I'm not very convinced about
> the "middle" aspect of that optimization.  It seems unmaintainable,
> plus you've not exhibited how the preferred keywords would get selected
> in the first place (wiring them into the Perl script is surely not
> acceptable).

What if the second argument of the macro held this info? Something like:

PG_KEYWORD("security", FULL_SEARCH, UNRESERVED_KEYWORD)
PG_KEYWORD("select", OPTIMIZE, SELECT, RESERVED_KEYWORD)

with a warning emitted if more than one keyword per range has
OPTIMIZE. That would require all keyword lists to have that second
argument, but selecting a preferred keyword would be optional.

-John Naylor


John Naylor <jcnaylor@gmail.com> writes:
> On 12/27/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> +$kw_input_file =~ /((\w*)kwlist)\.h/;
>> +my $base_filename = $1;
>> +$prefix = $2 if !defined $prefix;
>> 
>> Hmm, what happens if the input filename does not end with "kwlist.h"?

> If that's a maintainability hazard, I can force every invocation to
> provide a prefix instead.

I don't mind allowing the prefix to default to empty.  What I was
concerned about was that base_filename could end up undefined.
Probably the thing to do is to generate base_filename separately,
say by stripping any initial ".*/" sequence and then substituting
'_' for '.'.

>> I looked very briefly at v4-0002, and I'm not very convinced about
>> the "middle" aspect of that optimization.  It seems unmaintainable,
>> plus you've not exhibited how the preferred keywords would get selected
>> in the first place (wiring them into the Perl script is surely not
>> acceptable).

> What if the second argument of the macro held this info?

Yeah, you'd have to do something like that.  But I'm still concerned
about the maintainability aspect: if we mark say "commit" as the
starting point in the "c" group, future additions or deletions of
keywords starting with "c" might render that an increasingly poor
choice.  But most likely nobody would ever notice that the marking
was getting more and more suboptimal.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andrew Dunstan
Date:
On 12/27/18 3:00 PM, John Naylor wrote:
> On 12/27/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> diff --git a/src/tools/gen_keywords.pl b/src/tools/gen_keywords.pl
>> +    elsif ($arg =~ /^-o/)
>> +    {
>> +        $output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV;
>> +    }
>>
>> My perl-fu is not great, but it looks like this will accept arguments
>> like "-ofilename", which is a style I don't like at all.  I'd rather
>> either insist on the filename being separate or write the switch like
>> "-o=filename".  Also, project style when taking both forms is usually
>> more like
>>     -o filename
>>     --output=filename
> This style was cargo-culted from the catalog scripts. I can settle on
> just the first form if you like.
>


I would rather we used the standard perl module Getopt::Long, as
numerous programs we have already do.


cheers


andrew





Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 12/27/18 3:00 PM, John Naylor wrote:
>> This style was cargo-culted from the catalog scripts. I can settle on
>> just the first form if you like.

> I would rather we used the standard perl module Getopt::Long, as
> numerous programs we have already do.

Hmm ... grepping finds that used only in

src/tools/pgindent/pgindent
src/tools/git_changelog
src/pl/plperl/text2macro.pl

so I'm not quite sure about the "numerous" claim.  Adopting that
here would possibly impose the requirement of having Getopt::Long
on some developers who are getting by without it today.  However,
that's a pretty thin argument, and if Getopt::Long is present even
in the most minimal Perl installations then it's certainly moot.

On the whole I'm +1 for this.  Perhaps also, as an independent patch,
we should change the catalog scripts to use Getopt::Long.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andrew Dunstan
Date:
On 12/27/18 3:34 PM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On 12/27/18 3:00 PM, John Naylor wrote:
>>> This style was cargo-culted from the catalog scripts. I can settle on
>>> just the first form if you like.
>> I would rather we used the standard perl module Getopt::Long, as
>> numerous programs we have already do.
> Hmm ... grepping finds that used only in
>
> src/tools/pgindent/pgindent
> src/tools/git_changelog
> src/pl/plperl/text2macro.pl
>
> so I'm not quite sure about the "numerous" claim.  Adopting that
> here would possibly impose the requirement of having Getopt::Long
> on some developers who are getting by without it today.  However,
> that's a pretty thin argument, and if Getopt::Long is present even
> in the most minimal Perl installations then it's certainly moot.


It's bundled separately, but on both systems I looked at it's needed by
the base perl package. I don't recall ever seeing a system where it's
not available. I'm reasonably careful about what packages the buildfarm
requires, and it's used Getopt::Long from day one.


>
> On the whole I'm +1 for this.  Perhaps also, as an independent patch,
> we should change the catalog scripts to use Getopt::Long.
>



Probably some others, too.


cheers


andrew






Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 12/27/18 3:34 PM, Tom Lane wrote:
>> ... that's a pretty thin argument, and if Getopt::Long is present even
>> in the most minimal Perl installations then it's certainly moot.

> It's bundled separately, but on both systems I looked at it's needed by
> the base perl package. I don't recall ever seeing a system where it's
> not available. I'm reasonably careful about what packages the buildfarm
> requires, and it's used Getopt::Long from day one.

I poked around a little on my own machines, and I can confirm that
Getopt::Long is present in a default Perl install-from-source at
least as far back as perl 5.6.1.  It's barely conceivable that some
packager might omit it from their minimal package, but Red Hat,
Apple, NetBSD, and OpenBSD all include it.  So it sure looks to
me like relying on it should be non-problematic.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Alvaro Herrera
Date:
On 2018-Dec-27, Tom Lane wrote:

> I poked around a little on my own machines, and I can confirm that
> Getopt::Long is present in a default Perl install-from-source at
> least as far back as perl 5.6.1.  It's barely conceivable that some
> packager might omit it from their minimal package, but Red Hat,
> Apple, NetBSD, and OpenBSD all include it.  So it sure looks to
> me like relying on it should be non-problematic.

In Debian it's included in package perl-modules-5.24, which packages
perl and libperl5.24 depend on.  I suppose it's possible to install
perl-base and not install perl-modules, but it'd be a really bare-bones
machine.  I'm not sure it's possible to build Postgres on such a
machine.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
David Fetter
Date:
On Thu, Dec 27, 2018 at 07:04:41PM -0300, Alvaro Herrera wrote:
> On 2018-Dec-27, Tom Lane wrote:
> 
> > I poked around a little on my own machines, and I can confirm that
> > Getopt::Long is present in a default Perl install-from-source at
> > least as far back as perl 5.6.1.  It's barely conceivable that some
> > packager might omit it from their minimal package, but Red Hat,
> > Apple, NetBSD, and OpenBSD all include it.  So it sure looks to
> > me like relying on it should be non-problematic.
> 
> In Debian it's included in package perl-modules-5.24, which packages
> perl and libperl5.24 depend on.  I suppose it's possible to install
> perl-base and not install perl-modules, but it'd be a really bare-bones
> machine.  I'm not sure it's possible to build Postgres in such a
> machine.

$ corelist -a Getopt::Long

Data for 2018-11-29
Getopt::Long was first released with perl 5
  5          undef     
  5.001      undef     
  5.002      2.01      
  5.00307    2.04      
  5.004      2.10      
  5.00405    2.19      
  5.005      2.17      
  5.00503    2.19      
  5.00504    2.20      
[much output elided]

Fortunately, this has been part of Perl core a lot further back than
we promise to support for builds, so I think we're clear to use it
everywhere we process options.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
On 2018-12-27 14:22:11 -0500, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> > A smaller project might be to see if we can replace the binary keyword
> > search in ScanKeyword with a perfect hashing function generated by
> > gperf, or something similar. I had a quick look at that, too.
> 
> Yeah, we've looked at gperf before, eg
> 
> https://www.postgresql.org/message-id/20170927183156.jqzcsy7ocjcbdnmo@alap3.anarazel.de
> 
> Perhaps it'd be a win but I'm not very convinced.

Note that the tradeoffs mentioned there, by memory, aren't necessarily
applicable here. As we're dealing with strings anyway, gperf wanting to
deal with strings rather than being able to deal with numbers isn't
problematic.


> I don't know much about the theory of perfect hashing, but I wonder
> if we could just roll our own tool for that.  Since we're not dealing
> with extremely large keyword sets, perhaps brute force search for a
> set of multipliers for a hash computation like
> (char[0] * some_prime + char[1] * some_other_prime ...) mod table_size
> would work.

The usual way to do perfect hashing is basically to have a two-stage
hashtable, with the first stage keyed by a "normal" hash function, and
the second one disambiguating the values that hash into the same bucket,
by additionally keying a hash function with the value in the cell in the
intermediate hash table. Determining the parameters in the intermediate
table is what takes time.  That most perfect hash functions look that
way is also a good part of the reason why I doubt it's worthwhile
to go there over a simple linear-probing hashtable with a good
hash function - computing two hash values will usually be worse than
linear probing for *small* and *not modified* hashtables.

A simple (i.e. slow for large numbers of keys) implementation for
generating a perfect hash function isn't particularly hard. E.g. look at
the Python implementation at http://iswsa.acm.org/mphf/index.html and
http://stevehanov.ca/blog/index.php?id=119 for an easy explanation with
graphics.

Greetings,

Andres Freund


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
I think 0001 with complete keyword lookup replacement is in decent
enough shape to post. Make check-world passes. A few notes and
caveats:

- I added an --extern option to the script for the core keyword
  headers. This also capitalizes variables.
- ECPG keyword lookup is a bit different in that the ecpg and sql
  lookup functions are wrapped in a single function rather than called
  separately within pgc.l. It might be worth untangling that, but I have
  not done so.
- Some variable names haven't changed even though they now refer only
  to token values, which might be confusing.
- I haven't checked whether I need to install the generated headers.
- I haven't measured performance or binary size. If anyone is excited
  enough to do that, great; otherwise I'll do that as time permits.
- There are probably makefile bugs.

Now, on to previous review points:

On 12/27/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> +/* Payload data for keywords */
> +typedef struct ScanKeywordAux
> +{
> +    int16        value;            /* grammar's token code */
> +    char        category;        /* see codes above */
> +} ScanKeywordAux;
>
> There isn't really any point in changing category to "char", because
> alignment considerations will mandate that sizeof(ScanKeywordAux) be
> a multiple of 2 anyway.  With some compilers we could get around that
> with a pragma to force non-aligned storage, but doing so would be a
> net loss on most non-Intel architectures.

Reverted, especially since we can skip the struct entirely for some
callers as you pointed out below.

> diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore
> @@ -1,3 +1,4 @@
> +/*kwlist_d.h
>
> Not a fan of using wildcards in .gitignore files, at least not when
> there's just one or two files you intend to match.

Removed.

>  # Force these dependencies to be known even without dependency info built:
> -pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o:
> plpgsql.h pl_gram.h plerrcodes.h
> +pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o:
> plpgsql.h pl_gram.h plerrcodes.h pl_unreserved_kwlist_d.h
>
> Hm, do we really need any more than pl_scanner.o to depend on that header?

I think you're right, so separated into a new rule.

> +# Parse keyword header for names.
> +my @keywords;
> +while (<$kif>)
> +{
> +    if (/^PG_KEYWORD\("(\w+)",\s*\w+,\s*\w+\)/)
>
> This is assuming more than it should about the number of arguments for
> PG_KEYWORD, as well as what's in them.  I think it'd be sufficient to
> match like this:
>
>     if (/^PG_KEYWORD\("(\w+)",/)

...and...

> diff --git a/src/pl/plpgsql/src/pl_unreserved_kwlist.h
> b/src/pl/plpgsql/src/pl_unreserved_kwlist.h
> +/* name, value, category */
> +PG_KEYWORD("absolute", K_ABSOLUTE, UNRESERVED_KEYWORD)
>
> Likewise, I'd just have these be two-argument macros.  There's no reason
> for the various kwlist.h headers to agree on the number of payload
> arguments for PG_KEYWORD.

Both done, however...

> +/* FIXME: Have to redefine this symbol for the WIP. */
> +#undef PG_KEYWORD
> +#define PG_KEYWORD(kwname, value, category) {value, category},
> +
> +static const ScanKeywordAux unreserved_keywords[] = {
> +#include "pl_unreserved_kwlist.h"
>  };
>
> The category isn't useful for this keyword list, so couldn't you
> just make this an array of uint16 values?

Yes, this works for the unreserved keywords. The reserved ones still
need the aux struct to work with the core scanner, even though scan.l
doesn't reference category either. This has the consequence that we
can't dispense with category, e.g.:

PG_KEYWORD("all", K_ALL, RESERVED_KEYWORD)

...unless we do without the struct entirely, but that's not without
disadvantages as you mentioned.

I decided to export the struct (rather than just int16 for category)
to the frontend, even though we have to set the token values to zero,
since there might someday be another field of use to the frontend,
and also to avoid confusion.

> I don't mind allowing the prefix to default to empty.  What I was
> concerned about was that base_filename could end up undefined.
> Probably the thing to do is to generate base_filename separately,
> say by stripping any initial ".*/" sequence and then substitute
> '_' for '.'.

I removed assumptions about the filename.

> +Options:
> +    -o               output path
> +    -p               optional prefix for generated data structures
>
> This usage message is pretty vague about how you write the options
> (cf gripe above).

I tried it like this:

Usage: gen_keywords.pl [--output/-o <path>] [--prefix/-p <prefix>] input_file
    --output  Output directory
    --prefix  String prepended to var names in the output file


On 12/27/18, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> I would rather we used the standard perl module Getopt::Long, as
> numerous programs we have already do.

Done. I'll also send a patch later to bring some other scripts in line.

-John Naylor

Attachment

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2018-12-29 16:59:52 -0500, John Naylor wrote:
> I think 0001 with complete keyword lookup replacement is in decent
> enough shape to post. Make check-world passes. A few notes and
> caveats:

I tried to take this for a spin, and for me the build fails because various
frontend programs don't have KeywordOffsets/Strings defined, but reference them
through various functions exposed to the frontend (like fmtId()).  That I see
that error but you don't is probably related to me using -fuse-ld=gold in
CFLAGS.

I can "fix" this by including kwlist_d.h in common/keywords.c
regardless of FRONTEND. That also led me to discover that the build
dependencies somewhere aren't correctly set up, because I need to
force a clean rebuild to trigger the problem again; just changing
keywords.c back doesn't trigger the problem.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2018-12-29 16:59:52 -0500, John Naylor wrote:
>> I think 0001 with complete keyword lookup replacement is in decent
>> enough shape to post. Make check-world passes. A few notes and
>> caveats:

> I tried to take this for a spin, and for me the build fails because various
> frontend programs don't have KeywordOffsets/Strings defined, but reference them
> through various functions exposed to the frontend (like fmtId()).  That I see
> that error but you don't is probably related to me using -fuse-ld=gold in
> CFLAGS.

I was just about to point out that the cfbot is seeing that too ...

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Joerg Sonnenberger
Date:
On Sun, Dec 16, 2018 at 11:50:15AM -0500, John Naylor wrote:
> A few months ago I was looking into faster search algorithms for
> ScanKeywordLookup(), so this is interesting to me. While an optimal
> full replacement would be a lot of work, the above ideas are much less
> invasive and would still have some benefit. Unless anyone intends to
> work on this, I'd like to flesh out the offset-into-giant-string
> approach a bit further:

Hello John,
I was pointed at your patch on IRC and decided to look into adding my
own pieces. What I can provide you is a fast perfect hash function
generator.  I've attached a sample hash function based on the current
main keyword list. hash() essentially gives you the number of the only
possible match; a final strcmp/memcmp is still necessary to verify that
it is an actual keyword, though. The |0x20 can be dropped if all cases
have pre-lower-cased the input already. This would replace the binary
search in the lookup functions. Returning offsets directly would be easy
as well. That allows writing a single string where each entry is prefixed
with a type mask, the token id, the length of the keyword and the actual
keyword text. Does that sound useful to you?

Joerg

Attachment
I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2018-12-29 16:59:52 -0500, John Naylor wrote:
>>> I think 0001 with complete keyword lookup replacement is in decent
>>> enough shape to post. Make check-world passes. A few notes and
>>> caveats:

>> I tried to take this for a spin, and for me the build fails because various
>> frontend programs don't have KeywordOffsets/Strings defined, but reference them
>> through various functions exposed to the frontend (like fmtId()).  That I see
>> that error but you don't is probably related to me using -fuse-ld=gold in
>> CFLAGS.

> I was just about to point out that the cfbot is seeing that too ...

Aside from the possible linkage problem, this will need a minor rebase
over 4879a5172, which rearranged some of plpgsql's calls of
ScanKeywordLookup.

While I don't think it's going to be hard to resolve these issues,
I'm wondering where we want to go with this.  Is anyone excited
about pursuing the perfect-hash-function idea?  (Joerg's example
function looked pretty neat to me.)  If we are going to do that,
does it make sense to push this version beforehand?

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 12/30/18, Andres Freund <andres@anarazel.de> wrote:
> I tried to take this for a spin, and for me the build fails because
> various frontend programs don't have KeywordOffsets/Strings defined,
> but reference them through various functions exposed to the frontend
> (like fmtId()).  That I see that error but you don't is probably
> related to me using -fuse-ld=gold in CFLAGS.
>
> I can "fix" this by including kwlist_d.h in common/keywords.c
> regardless of FRONTEND. That also led me to discover that the build
> dependencies somewhere aren't correctly set up, because I need to
> force a clean rebuild to trigger the problem again; just changing
> keywords.c back doesn't trigger the problem.

Hmm, that was a typo, and I didn't notice even when I found I had to
include kwlist_d.h in ecpg/keywords.c. :-(  I've fixed both of those
in the attached v6.

As far as dependencies go, I'm far from sure I have them up to par. That
piece could use some discussion.

On 1/4/19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Aside from the possible linkage problem, this will need a minor rebase
> over 4879a5172, which rearranged some of plpgsql's calls of
> ScanKeywordLookup.
>
> While I don't think it's going to be hard to resolve these issues,
> I'm wondering where we want to go with this.  Is anyone excited
> about pursuing the perfect-hash-function idea?  (Joerg's example
> function looked pretty neat to me.)  If we are going to do that,
> does it make sense to push this version beforehand?

If it does, for v6 I've also done the rebase, updated the copyright
year, and fixed an error in MSVC.

-John Naylor

Attachment

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 1/3/19, Joerg Sonnenberger <joerg@bec.de> wrote:
> Hello John,
> I was pointed at your patch on IRC and decided to look into adding my
> own pieces. What I can provide you is a fast perfect hash function
> generator.  I've attached a sample hash function based on the current
> main keyword list. hash() essentially gives you the number of the only
> possible match, a final strcmp/memcmp is still necessary to verify that
> it is an actual keyword though. The |0x20 can be dropped if all cases
> have pre-lower-cased the input already. This would replace the binary
> search in the lookup functions. Returning offsets directly would be easy
> as well. That allows writing a single string where each entry is prefixed
> with a type mask, the token id, the length of the keyword and the actual
> keyword text. Does that sound useful to you?

Judging by previous responses, there is still interest in using
perfect hash functions, so thanks for this. I'm not knowledgeable
enough to judge its implementation, so I'll leave that for others.

-John Naylor


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
On 2019-01-04 12:26:18 -0500, Tom Lane wrote:
> I wrote:
> > Andres Freund <andres@anarazel.de> writes:
> >> On 2018-12-29 16:59:52 -0500, John Naylor wrote:
> >>> I think 0001 with complete keyword lookup replacement is in decent
> >>> enough shape to post. Make check-world passes. A few notes and
> >>> caveats:
> 
> >> I tried to take this for a spin, and for me the build fails because various
> >> frontend programs don't have KeywordOffsets/Strings defined, but reference them
> >> through various functions exposed to the frontend (like fmtId()).  That I see
> >> that error but you don't is probably related to me using -fuse-ld=gold in
> >> CFLAGS.
> 
> > I was just about to point out that the cfbot is seeing that too ...
> 
> Aside from the possible linkage problem, this will need a minor rebase
> over 4879a5172, which rearranged some of plpgsql's calls of
> ScanKeywordLookup.
> 
> While I don't think it's going to be hard to resolve these issues,
> I'm wondering where we want to go with this.  Is anyone excited
> about pursuing the perfect-hash-function idea?  (Joerg's example
> function looked pretty neat to me.)  If we are going to do that,
> does it make sense to push this version beforehand?

I think it does make sense to push this version beforehand. Most of
the code would be needed anyway, so it's not like this is going to
cause a lot of churn.

Greetings,

Andres Freund


John Naylor <jcnaylor@gmail.com> writes:
> On 1/3/19, Joerg Sonnenberger <joerg@bec.de> wrote:
>> I was pointed at your patch on IRC and decided to look into adding my
>> own pieces. What I can provide you is a fast perfect hash function
>> generator.  I've attached a sample hash function based on the current
>> main keyword list. hash() essentially gives you the number of the only
>> possible match, a final strcmp/memcmp is still necessary to verify that
>> it is an actual keyword though. The |0x20 can be dropped if all cases
>> have pre-lower-cased the input already. This would replace the binary
>> search in the lookup functions. Returning offsets directly would be easy
>> as well. That allows writing a single string where each entry is prefixed
>> with a type mask, the token id, the length of the keyword and the actual
>> keyword text. Does that sound useful to you?

> Judging by previous responses, there is still interest in using
> perfect hash functions, so thanks for this. I'm not knowledgeable
> enough to judge its implementation, so I'll leave that for others.

We haven't actually seen the implementation, so it's hard to judge ;-).

The sample hash function certainly looks great.  I'm not terribly on board
with using |0x20 as a substitute for lower-casing, but that's a minor
detail.

The foremost questions in my mind are:

* What's the generator written in?  (if the answer's not "Perl", wedging
it into our build processes might be painful)

* What license is it under?

* Does it always succeed in producing a single-level lookup table?

            regards, tom lane


Andres Freund <andres@anarazel.de> writes:
> On 2019-01-04 12:26:18 -0500, Tom Lane wrote:
>> I'm wondering where we want to go with this.  Is anyone excited
>> about pursuing the perfect-hash-function idea?  (Joerg's example
>> function looked pretty neat to me.)  If we are going to do that,
>> does it make sense to push this version beforehand?

> I think it does make sense to push this version beforehand. Most of
> the code would be needed anyway, so it's not like this is going to
> cause a lot of churn.

Yeah, I'm leaning in that direction too, first on the grounds of
"don't let the perfect be the enemy of the good", and second because
if we do end up with perfect hashing, we'd still need a table-generation
step.  The build infrastructure this adds would support a generator
that produces perfect hashes just as well as what this is doing,
even if we end up having to whack the API of ScanKeywordLookup around
some more.  So barring objections, I'll have a look at pushing this,
and then we can think about using perfect hashing instead.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 12/27/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If you really are hot about saving that other 440 bytes, the way to
> do it would be to drop the struct entirely and use two parallel
> arrays, an int16[] for value and a char[] (or better uint8[]) for
> category.  Those would be filled by reading kwlist.h twice with
> different definitions for PG_KEYWORD.  Not sure it's worth the
> trouble though --- in particular, not clear that it's a win from
> the standpoint of number of cache lines touched.

Understood. That said, after re-implementing all keyword lookups, I
wondered if there'd be a notational benefit to dropping the struct,
especially since as yet no caller uses both token and category. It
makes pl_scanner.c and its reserved keyword list a bit nicer, and gets
rid of the need to force the frontend to have 'zero' token numbers, but
I'm not sure it's a clear win. I've attached a patch (applies on top
of v6), gzipped to avoid confusing the cfbot.

-John Naylor

Attachment

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Joerg Sonnenberger
Date:
On Fri, Jan 04, 2019 at 03:31:11PM -0500, Tom Lane wrote:
> John Naylor <jcnaylor@gmail.com> writes:
> > On 1/3/19, Joerg Sonnenberger <joerg@bec.de> wrote:
> >> I was pointed at your patch on IRC and decided to look into adding my
> >> own pieces. What I can provide you is a fast perfect hash function
> >> generator.  I've attached a sample hash function based on the current
> >> main keyword list. hash() essentially gives you the number of the only
> >> possible match, a final strcmp/memcmp is still necessary to verify that
> >> it is an actual keyword though. The |0x20 can be dropped if all cases
> >> have pre-lower-cased the input already. This would replace the binary
> >> search in the lookup functions. Returning offsets directly would be easy
> >> as well. That allows writing a single string where each entry is prefixed
> >> with a type mask, the token id, the length of the keyword and the actual
> >> keyword text. Does that sound useful to you?
> 
> > Judging by previous responses, there is still interest in using
> > perfect hash functions, so thanks for this. I'm not knowledgeable
> > enough to judge its implementation, so I'll leave that for others.
> 
> We haven't actually seen the implementation, so it's hard to judge ;-).

It's a temporarily hacked version of nbperf in the NetBSD tree.

> The sample hash function certainly looks great.  I'm not terribly on board
> with using |0x20 as a substitute for lower-casing, but that's a minor
> detail.

Yeah, I've included that part more because I don't know the current use
cases enough. If all instances are already doing lower-casing in
advance, it is trivial to drop.

> The foremost questions in my mind are:
> 
> * What's the generator written in?  (if the answer's not "Perl", wedging
> it into our build processes might be painful)

Plain C, nothing really fancy in it.

> * What license is it under?

Two clause BSD license.

> * Does it always succeed in producing a single-level lookup table?

This question is a bit tricky. The short answer is: yes. The longer
answer: the chosen hash function in the example is very simple (e.g.
just two variations of a DJB-style hash), so with that: no, not without
potentially fiddling a bit with the hash function if things ever get
nasty, like having two keywords that hit a funnel for both variants. The
main concern for the choice was to be fast. When using two families of
independent hash functions, the generator runs in probabilistic linear
time in the number of keys. That means that with a strong enough hash
function like the Jenkins hash used elsewhere in PG, it will succeed
very fast. So if it fails on new keywords, making the mixing a bit
stronger should be enough.

Joerg


Joerg Sonnenberger <joerg@bec.de> writes:
> On Fri, Jan 04, 2019 at 03:31:11PM -0500, Tom Lane wrote:
>> The sample hash function certainly looks great.  I'm not terribly on board
>> with using |0x20 as a substitute for lower-casing, but that's a minor
>> detail.

> Yeah, I've included that part more because I don't know the current use
> cases enough. If all instances are already doing lower-casing in
> advance, it is trivial to drop.

I think we probably don't need that, because we'd always need to generate
a lower-cased version of the input anyway: either to compare to the
potential keyword match, or to use as the normalized identifier if it
turns out not to be a keyword.  I don't think there are any cases where
it's useful to delay downcasing till after the keyword lookup.

>> * What's the generator written in?  (if the answer's not "Perl", wedging
>> it into our build processes might be painful)

> Plain C, nothing really fancy in it.

That's actually a bigger problem than you might think, because it
doesn't fit in very nicely in a cross-compiling build: we might not
have any C compiler at hand that generates programs that can execute
on the build machine.  That's why we prefer Perl for tools that need
to execute during the build.  However, if the code is pretty small
and fast, maybe translating it to Perl is feasible.  Or perhaps
we could add sufficient autoconfiscation infrastructure to identify
a native C compiler.  It's not very likely that there isn't one,
but it is possible that nothing we learned about the configured
target compiler would apply to it :-(

>> * What license is it under?

> Two clause BSD license.

OK, that works, at least.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2019-01-04 16:43:39 -0500, Tom Lane wrote:
> Joerg Sonnenberger <joerg@bec.de> writes:
> >> * What's the generator written in?  (if the answer's not "Perl", wedging
> >> it into our build processes might be painful)
>
> > Plain C, nothing really fancy in it.
>
> That's actually a bigger problem than you might think, because it
> doesn't fit in very nicely in a cross-compiling build: we might not
> have any C compiler at hand that generates programs that can execute
> on the build machine.  That's why we prefer Perl for tools that need
> to execute during the build.  However, if the code is pretty small
> and fast, maybe translating it to Perl is feasible.  Or perhaps
> we could add sufficient autoconfiscation infrastructure to identify
> a native C compiler.  It's not very likely that there isn't one,
> but it is possible that nothing we learned about the configured
> target compiler would apply to it :-(

I think it might be ok if we included the output of the generator in the
buildtree? Not being able to add keywords while cross-compiling sounds like
an acceptable restriction to me.  I assume we'd likely grow further users
of such a generator over time, and some of the input lists might be big
enough that we'd not want to force it to be recomputed on every machine.

Greetings,

Andres Freund


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Joerg Sonnenberger
Date:
On Fri, Jan 04, 2019 at 02:36:15PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2019-01-04 16:43:39 -0500, Tom Lane wrote:
> > Joerg Sonnenberger <joerg@bec.de> writes:
> > >> * What's the generator written in?  (if the answer's not "Perl", wedging
> > >> it into our build processes might be painful)
> >
> > > Plain C, nothing really fancy in it.
> >
> > That's actually a bigger problem than you might think, because it
> > doesn't fit in very nicely in a cross-compiling build: we might not
> > have any C compiler at hand that generates programs that can execute
> > on the build machine.  That's why we prefer Perl for tools that need
> > to execute during the build.  However, if the code is pretty small
> > and fast, maybe translating it to Perl is feasible.  Or perhaps
> > we could add sufficient autoconfiscation infrastructure to identify
> > a native C compiler.  It's not very likely that there isn't one,
> > but it is possible that nothing we learned about the configured
> > target compiler would apply to it :-(

There is a pre-made autoconf macro for doing the basic glue for
CC_FOR_BUILD; it's been used by various projects already, including libXt
and friends.

> I think it might be ok if we included the output of the generator in the
> buildtree? Not being able to add keywords while cross-compiling sounds like
> an acceptable restriction to me.  I assume we'd likely grow further users
> of such a generator over time, and some of the input lists might be big
> enough that we'd not want to force it to be recomputed on every machine.

This is quite reasonable as well. I wouldn't worry about the size of the
input list at all. Processing the Webster dictionary takes less than
0.4s on my laptop for 235k entries.

Joerg


John Naylor <jcnaylor@gmail.com> writes:
> [ v6-0001-Use-offset-based-keyword-lookup.patch ]

I spent some time hacking on this today, and I think it's committable
now, but I'm putting it back up in case anyone wants to have another
look (and also so the cfbot can check it on Windows).

Given the discussion about possibly switching to perfect hashing,
I thought it'd be a good idea to try to make the APIs less dependent
on the exact table representation.  So in the attached, I created
a struct ScanKeywordList that holds all the data ScanKeywordLookup
needs, and the generated headers declare variables of that type,
and we just pass around a pointer to that instead of passing several
different things.

I also went ahead with the idea of splitting the category and token
data into separate arrays.  That allows moving the backend token
array out of src/common entirely, which I think is a good thing
because of the dependency situation: we no longer need to run the
bison build before we can compile src/common/keywords_srv.o.

There's one remaining refactoring issue that I think we'd want to consider
before trying to jack this up and wheel a perfect-hash lookup under it:
where to do the downcasing transform.  Right now, ecpg's c_keywords.c
has its own copy of the binary-search logic because it doesn't want the
downcasing transform that ScanKeywordLookup does.  So unless we want it
to also have a copy of the hash lookup logic, we need to rearrange that
somehow.  We could give ScanKeywordLookup a "bool downcase" argument,
or we could refactor things so that the downcasing is done by callers
if they need it (which many don't).  I'm not very sure which of those
three alternatives is best.

My argument upthread that we could always do the downcasing before
keyword lookup now feels a bit shaky, because I was reminded while
working on this code that we actually have different downcasing rules
for keywords and identifiers (yes, really), so that it's not possible
for those code paths to share a downcasing transform.  So the idea of
moving the keyword-downcasing logic to the callers is likely to not
work out quite as nicely as I thought.  (This might also mean that
I was overly hasty to reject Joerg's |0x20 hack.  It's still an
ugly hack, but it would save doing the keyword downcasing transform
if we don't get a hashcode match...)

            regards, tom lane

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index e8ef966..9131991 100644
*** a/contrib/pg_stat_statements/pg_stat_statements.c
--- b/contrib/pg_stat_statements/pg_stat_statements.c
*************** fill_in_constant_lengths(pgssJumbleState
*** 3075,3082 ****
      /* initialize the flex scanner --- should match raw_parser() */
      yyscanner = scanner_init(query,
                               &yyextra,
!                              ScanKeywords,
!                              NumScanKeywords);

      /* we don't want to re-emit any escape string warnings */
      yyextra.escape_string_warning = false;
--- 3075,3082 ----
      /* initialize the flex scanner --- should match raw_parser() */
      yyscanner = scanner_init(query,
                               &yyextra,
!                              &ScanKeywords,
!                              ScanKeywordTokens);

      /* we don't want to re-emit any escape string warnings */
      yyextra.escape_string_warning = false;
diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index 7e9b122..4c0c258 100644
*** a/src/backend/parser/parser.c
--- b/src/backend/parser/parser.c
*************** raw_parser(const char *str)
*** 41,47 ****

      /* initialize the flex scanner */
      yyscanner = scanner_init(str, &yyextra.core_yy_extra,
!                              ScanKeywords, NumScanKeywords);

      /* base_yylex() only needs this much initialization */
      yyextra.have_lookahead = false;
--- 41,47 ----

      /* initialize the flex scanner */
      yyscanner = scanner_init(str, &yyextra.core_yy_extra,
!                              &ScanKeywords, ScanKeywordTokens);

      /* base_yylex() only needs this much initialization */
      yyextra.have_lookahead = false;
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index fbeb86f..e1cae85 100644
*** a/src/backend/parser/scan.l
--- b/src/backend/parser/scan.l
*************** bool        escape_string_warning = true;
*** 67,72 ****
--- 67,87 ----
  bool        standard_conforming_strings = true;

  /*
+  * Constant data exported from this file.  This array maps from the
+  * zero-based keyword numbers returned by ScanKeywordLookup to the
+  * Bison token numbers needed by gram.y.  This is exported because
+  * callers need to pass it to scanner_init, if they are using the
+  * standard keyword list ScanKeywords.
+  */
+ #define PG_KEYWORD(kwname, value, category) value,
+
+ const uint16 ScanKeywordTokens[] = {
+ #include "parser/kwlist.h"
+ };
+
+ #undef PG_KEYWORD
+
+ /*
   * Set the type of YYSTYPE.
   */
  #define YYSTYPE core_YYSTYPE
*************** other            .
*** 504,521 ****
                       * We will pass this along as a normal character string,
                       * but preceded with an internally-generated "NCHAR".
                       */
!                     const ScanKeyword *keyword;

                      SET_YYLLOC();
                      yyless(1);    /* eat only 'n' this time */

!                     keyword = ScanKeywordLookup("nchar",
!                                                 yyextra->keywords,
!                                                 yyextra->num_keywords);
!                     if (keyword != NULL)
                      {
!                         yylval->keyword = keyword->name;
!                         return keyword->value;
                      }
                      else
                      {
--- 519,536 ----
                       * We will pass this along as a normal character string,
                       * but preceded with an internally-generated "NCHAR".
                       */
!                     int        kwnum;

                      SET_YYLLOC();
                      yyless(1);    /* eat only 'n' this time */

!                     kwnum = ScanKeywordLookup("nchar",
!                                               yyextra->keywordlist);
!                     if (kwnum >= 0)
                      {
!                         yylval->keyword = GetScanKeyword(kwnum,
!                                                          yyextra->keywordlist);
!                         return yyextra->keyword_tokens[kwnum];
                      }
                      else
                      {
*************** other            .
*** 1021,1039 ****


  {identifier}    {
!                     const ScanKeyword *keyword;
                      char       *ident;

                      SET_YYLLOC();

                      /* Is it a keyword? */
!                     keyword = ScanKeywordLookup(yytext,
!                                                 yyextra->keywords,
!                                                 yyextra->num_keywords);
!                     if (keyword != NULL)
                      {
!                         yylval->keyword = keyword->name;
!                         return keyword->value;
                      }

                      /*
--- 1036,1054 ----


  {identifier}    {
!                     int            kwnum;
                      char       *ident;

                      SET_YYLLOC();

                      /* Is it a keyword? */
!                     kwnum = ScanKeywordLookup(yytext,
!                                               yyextra->keywordlist);
!                     if (kwnum >= 0)
                      {
!                         yylval->keyword = GetScanKeyword(kwnum,
!                                                          yyextra->keywordlist);
!                         return yyextra->keyword_tokens[kwnum];
                      }

                      /*
*************** scanner_yyerror(const char *message, cor
*** 1142,1149 ****
  core_yyscan_t
  scanner_init(const char *str,
               core_yy_extra_type *yyext,
!              const ScanKeyword *keywords,
!              int num_keywords)
  {
      Size        slen = strlen(str);
      yyscan_t    scanner;
--- 1157,1164 ----
  core_yyscan_t
  scanner_init(const char *str,
               core_yy_extra_type *yyext,
!              const ScanKeywordList *keywordlist,
!              const uint16 *keyword_tokens)
  {
      Size        slen = strlen(str);
      yyscan_t    scanner;
*************** scanner_init(const char *str,
*** 1153,1160 ****

      core_yyset_extra(yyext, scanner);

!     yyext->keywords = keywords;
!     yyext->num_keywords = num_keywords;

      yyext->backslash_quote = backslash_quote;
      yyext->escape_string_warning = escape_string_warning;
--- 1168,1175 ----

      core_yyset_extra(yyext, scanner);

!     yyext->keywordlist = keywordlist;
!     yyext->keyword_tokens = keyword_tokens;

      yyext->backslash_quote = backslash_quote;
      yyext->escape_string_warning = escape_string_warning;
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 7b69b82..746b7d2 100644
*** a/src/backend/utils/adt/misc.c
--- b/src/backend/utils/adt/misc.c
*************** pg_get_keywords(PG_FUNCTION_ARGS)
*** 417,431 ****

      funcctx = SRF_PERCALL_SETUP();

!     if (funcctx->call_cntr < NumScanKeywords)
      {
          char       *values[3];
          HeapTuple    tuple;

          /* cast-away-const is ugly but alternatives aren't much better */
!         values[0] = unconstify(char *, ScanKeywords[funcctx->call_cntr].name);

!         switch (ScanKeywords[funcctx->call_cntr].category)
          {
              case UNRESERVED_KEYWORD:
                  values[1] = "U";
--- 417,433 ----

      funcctx = SRF_PERCALL_SETUP();

!     if (funcctx->call_cntr < ScanKeywords.num_keywords)
      {
          char       *values[3];
          HeapTuple    tuple;

          /* cast-away-const is ugly but alternatives aren't much better */
!         values[0] = unconstify(char *,
!                                GetScanKeyword(funcctx->call_cntr,
!                                               &ScanKeywords));

!         switch (ScanKeywordCategories[funcctx->call_cntr])
          {
              case UNRESERVED_KEYWORD:
                  values[1] = "U";
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 368eacf..77811f6 100644
*** a/src/backend/utils/adt/ruleutils.c
--- b/src/backend/utils/adt/ruleutils.c
*************** quote_identifier(const char *ident)
*** 10601,10611 ****
           * Note: ScanKeywordLookup() does case-insensitive comparison, but
           * that's fine, since we already know we have all-lower-case.
           */
!         const ScanKeyword *keyword = ScanKeywordLookup(ident,
!                                                        ScanKeywords,
!                                                        NumScanKeywords);

!         if (keyword != NULL && keyword->category != UNRESERVED_KEYWORD)
              safe = false;
      }

--- 10601,10609 ----
           * Note: ScanKeywordLookup() does case-insensitive comparison, but
           * that's fine, since we already know we have all-lower-case.
           */
!         int            kwnum = ScanKeywordLookup(ident, &ScanKeywords);

!         if (kwnum >= 0 && ScanKeywordCategories[kwnum] != UNRESERVED_KEYWORD)
              safe = false;
      }

diff --git a/src/common/.gitignore b/src/common/.gitignore
index ...ffa3284 .
*** a/src/common/.gitignore
--- b/src/common/.gitignore
***************
*** 0 ****
--- 1 ----
+ /kwlist_d.h
diff --git a/src/common/Makefile b/src/common/Makefile
index ec8139f..317b071 100644
*** a/src/common/Makefile
--- b/src/common/Makefile
*************** override CPPFLAGS += -DVAL_LDFLAGS_EX="\
*** 41,51 ****
  override CPPFLAGS += -DVAL_LDFLAGS_SL="\"$(LDFLAGS_SL)\""
  override CPPFLAGS += -DVAL_LIBS="\"$(LIBS)\""

! override CPPFLAGS := -DFRONTEND $(CPPFLAGS)
  LIBS += $(PTHREAD_LIBS)

  OBJS_COMMON = base64.o config_info.o controldata_utils.o exec.o file_perm.o \
!     ip.o keywords.o link-canary.o md5.o pg_lzcompress.o \
      pgfnames.o psprintf.o relpath.o \
      rmtree.o saslprep.o scram-common.o string.o unicode_norm.o \
      username.o wait_error.o
--- 41,51 ----
  override CPPFLAGS += -DVAL_LDFLAGS_SL="\"$(LDFLAGS_SL)\""
  override CPPFLAGS += -DVAL_LIBS="\"$(LIBS)\""

! override CPPFLAGS := -DFRONTEND -I. -I$(top_srcdir)/src/common $(CPPFLAGS)
  LIBS += $(PTHREAD_LIBS)

  OBJS_COMMON = base64.o config_info.o controldata_utils.o exec.o file_perm.o \
!     ip.o keywords.o kwlookup.o link-canary.o md5.o pg_lzcompress.o \
      pgfnames.o psprintf.o relpath.o \
      rmtree.o saslprep.o scram-common.o string.o unicode_norm.o \
      username.o wait_error.o
*************** OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o)
*** 65,70 ****
--- 65,72 ----

  all: libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a

+ distprep: kwlist_d.h
+
  # libpgcommon is needed by some contrib
  install: all installdirs
      $(INSTALL_STLIB) libpgcommon.a '$(DESTDIR)$(libdir)/libpgcommon.a'
*************** libpgcommon_srv.a: $(OBJS_SRV)
*** 115,130 ****
  %_srv.o: %.c %.o
      $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@

! # Dependencies of keywords.o need to be managed explicitly to make sure
! # that you don't get broken parsing code, even in a non-enable-depend build.
! # Note that gram.h isn't required for the frontend versions of keywords.o.
! $(top_builddir)/src/include/parser/gram.h: $(top_srcdir)/src/backend/parser/gram.y
!     $(MAKE) -C $(top_builddir)/src/backend $(top_builddir)/src/include/parser/gram.h

! keywords.o: $(top_srcdir)/src/include/parser/kwlist.h
! keywords_shlib.o: $(top_srcdir)/src/include/parser/kwlist.h
! keywords_srv.o: $(top_builddir)/src/include/parser/gram.h $(top_srcdir)/src/include/parser/kwlist.h

! clean distclean maintainer-clean:
      rm -f libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a
      rm -f $(OBJS_FRONTEND) $(OBJS_SHLIB) $(OBJS_SRV)
--- 117,134 ----
  %_srv.o: %.c %.o
      $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@

! # generate SQL keyword lookup table to be included into keywords*.o.
! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(top_srcdir)/src/tools/gen_keywordlist.pl
!     $(PERL) $(top_srcdir)/src/tools/gen_keywordlist.pl --extern $<

! # Dependencies of keywords*.o need to be managed explicitly to make sure
! # that you don't get broken parsing code, even in a non-enable-depend build.
! keywords.o keywords_shlib.o keywords_srv.o: kwlist_d.h

! # kwlist_d.h is in the distribution tarball, so it is not cleaned here.
! clean distclean:
      rm -f libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a
      rm -f $(OBJS_FRONTEND) $(OBJS_SHLIB) $(OBJS_SRV)
+
+ maintainer-clean: distclean
+     rm -f kwlist_d.h
diff --git a/src/common/keywords.c b/src/common/keywords.c
index 6f99090..103166c 100644
*** a/src/common/keywords.c
--- b/src/common/keywords.c
***************
*** 1,7 ****
  /*-------------------------------------------------------------------------
   *
   * keywords.c
!  *      lexical token lookup for key words in PostgreSQL
   *
   *
   * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
--- 1,7 ----
  /*-------------------------------------------------------------------------
   *
   * keywords.c
!  *      PostgreSQL's list of SQL keywords
   *
   *
   * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
***************
*** 19,114 ****
  #include "postgres_fe.h"
  #endif

! #ifndef FRONTEND
!
! #include "parser/gramparse.h"
!
! #define PG_KEYWORD(a,b,c) {a,b,c},

- #else

! #include "common/keywords.h"

! /*
!  * We don't need the token number for frontend uses, so leave it out to avoid
!  * requiring backend headers that won't compile cleanly here.
!  */
! #define PG_KEYWORD(a,b,c) {a,0,c},

! #endif                            /* FRONTEND */


! const ScanKeyword ScanKeywords[] = {
  #include "parser/kwlist.h"
  };

! const int    NumScanKeywords = lengthof(ScanKeywords);
!
!
! /*
!  * ScanKeywordLookup - see if a given word is a keyword
!  *
!  * The table to be searched is passed explicitly, so that this can be used
!  * to search keyword lists other than the standard list appearing above.
!  *
!  * Returns a pointer to the ScanKeyword table entry, or NULL if no match.
!  *
!  * The match is done case-insensitively.  Note that we deliberately use a
!  * dumbed-down case conversion that will only translate 'A'-'Z' into 'a'-'z',
!  * even if we are in a locale where tolower() would produce more or different
!  * translations.  This is to conform to the SQL99 spec, which says that
!  * keywords are to be matched in this way even though non-keyword identifiers
!  * receive a different case-normalization mapping.
!  */
! const ScanKeyword *
! ScanKeywordLookup(const char *text,
!                   const ScanKeyword *keywords,
!                   int num_keywords)
! {
!     int            len,
!                 i;
!     char        word[NAMEDATALEN];
!     const ScanKeyword *low;
!     const ScanKeyword *high;
!
!     len = strlen(text);
!     /* We assume all keywords are shorter than NAMEDATALEN. */
!     if (len >= NAMEDATALEN)
!         return NULL;
!
!     /*
!      * Apply an ASCII-only downcasing.  We must not use tolower() since it may
!      * produce the wrong translation in some locales (eg, Turkish).
!      */
!     for (i = 0; i < len; i++)
!     {
!         char        ch = text[i];
!
!         if (ch >= 'A' && ch <= 'Z')
!             ch += 'a' - 'A';
!         word[i] = ch;
!     }
!     word[len] = '\0';
!
!     /*
!      * Now do a binary search using plain strcmp() comparison.
!      */
!     low = keywords;
!     high = keywords + (num_keywords - 1);
!     while (low <= high)
!     {
!         const ScanKeyword *middle;
!         int            difference;
!
!         middle = low + (high - low) / 2;
!         difference = strcmp(middle->name, word);
!         if (difference == 0)
!             return middle;
!         else if (difference < 0)
!             low = middle + 1;
!         else
!             high = middle - 1;
!     }
!
!     return NULL;
! }
--- 19,37 ----
  #include "postgres_fe.h"
  #endif

! #include "common/keywords.h"


! /* ScanKeywordList lookup data for SQL keywords */

! #include "kwlist_d.h"

! /* Keyword categories for SQL keywords */

+ #define PG_KEYWORD(kwname, value, category) category,

! const uint8 ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS] = {
  #include "parser/kwlist.h"
  };

! #undef PG_KEYWORD
diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c
index ...db62623 .
*** a/src/common/kwlookup.c
--- b/src/common/kwlookup.c
***************
*** 0 ****
--- 1,91 ----
+ /*-------------------------------------------------------------------------
+  *
+  * kwlookup.c
+  *      Key word lookup for PostgreSQL
+  *
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *      src/common/kwlookup.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "c.h"
+
+ #include "common/kwlookup.h"
+
+
+ /*
+  * ScanKeywordLookup - see if a given word is a keyword
+  *
+  * The list of keywords to be matched against is passed as a ScanKeywordList.
+  *
+  * Returns the keyword number (0..N-1) of the keyword, or -1 if no match.
+  * Callers typically use the keyword number to index into information
+  * arrays, but that is no concern of this code.
+  *
+  * The match is done case-insensitively.  Note that we deliberately use a
+  * dumbed-down case conversion that will only translate 'A'-'Z' into 'a'-'z',
+  * even if we are in a locale where tolower() would produce more or different
+  * translations.  This is to conform to the SQL99 spec, which says that
+  * keywords are to be matched in this way even though non-keyword identifiers
+  * receive a different case-normalization mapping.
+  */
+ int
+ ScanKeywordLookup(const char *text,
+                   const ScanKeywordList *keywords)
+ {
+     int            len,
+                 i;
+     char        word[NAMEDATALEN];
+     const char *kw_string;
+     const uint16 *kw_offsets;
+     const uint16 *low;
+     const uint16 *high;
+
+     len = strlen(text);
+     /* We assume all keywords are shorter than NAMEDATALEN. */
+     if (len >= NAMEDATALEN)
+         return -1;
+
+     /*
+      * Apply an ASCII-only downcasing.  We must not use tolower() since it may
+      * produce the wrong translation in some locales (eg, Turkish).
+      */
+     for (i = 0; i < len; i++)
+     {
+         char        ch = text[i];
+
+         if (ch >= 'A' && ch <= 'Z')
+             ch += 'a' - 'A';
+         word[i] = ch;
+     }
+     word[len] = '\0';
+
+     /*
+      * Now do a binary search using plain strcmp() comparison.
+      */
+     kw_string = keywords->kw_string;
+     kw_offsets = keywords->kw_offsets;
+     low = kw_offsets;
+     high = kw_offsets + (keywords->num_keywords - 1);
+     while (low <= high)
+     {
+         const uint16 *middle;
+         int            difference;
+
+         middle = low + (high - low) / 2;
+         difference = strcmp(kw_string + *middle, word);
+         if (difference == 0)
+             return middle - kw_offsets;
+         else if (difference < 0)
+             low = middle + 1;
+         else
+             high = middle - 1;
+     }
+
+     return -1;
+ }
diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c
index 9b47b62..5c1732a 100644
*** a/src/fe_utils/string_utils.c
--- b/src/fe_utils/string_utils.c
*************** fmtId(const char *rawid)
*** 104,114 ****
           * Note: ScanKeywordLookup() does case-insensitive comparison, but
           * that's fine, since we already know we have all-lower-case.
           */
!         const ScanKeyword *keyword = ScanKeywordLookup(rawid,
!                                                        ScanKeywords,
!                                                        NumScanKeywords);

!         if (keyword != NULL && keyword->category != UNRESERVED_KEYWORD)
              need_quotes = true;
      }

--- 104,112 ----
           * Note: ScanKeywordLookup() does case-insensitive comparison, but
           * that's fine, since we already know we have all-lower-case.
           */
!         int            kwnum = ScanKeywordLookup(rawid, &ScanKeywords);

!         if (kwnum >= 0 && ScanKeywordCategories[kwnum] != UNRESERVED_KEYWORD)
              need_quotes = true;
      }

diff --git a/src/include/common/keywords.h b/src/include/common/keywords.h
index 8f22f32..fb18858 100644
*** a/src/include/common/keywords.h
--- b/src/include/common/keywords.h
***************
*** 1,7 ****
  /*-------------------------------------------------------------------------
   *
   * keywords.h
!  *      lexical token lookup for key words in PostgreSQL
   *
   *
   * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
--- 1,7 ----
  /*-------------------------------------------------------------------------
   *
   * keywords.h
!  *      PostgreSQL's list of SQL keywords
   *
   *
   * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
***************
*** 14,44 ****
  #ifndef KEYWORDS_H
  #define KEYWORDS_H

  /* Keyword categories --- should match lists in gram.y */
  #define UNRESERVED_KEYWORD        0
  #define COL_NAME_KEYWORD        1
  #define TYPE_FUNC_NAME_KEYWORD    2
  #define RESERVED_KEYWORD        3

-
- typedef struct ScanKeyword
- {
-     const char *name;            /* in lower case */
-     int16        value;            /* grammar's token code */
-     int16        category;        /* see codes above */
- } ScanKeyword;
-
  #ifndef FRONTEND
! extern PGDLLIMPORT const ScanKeyword ScanKeywords[];
! extern PGDLLIMPORT const int NumScanKeywords;
  #else
! extern const ScanKeyword ScanKeywords[];
! extern const int NumScanKeywords;
  #endif

-
- extern const ScanKeyword *ScanKeywordLookup(const char *text,
-                   const ScanKeyword *keywords,
-                   int num_keywords);
-
  #endif                            /* KEYWORDS_H */
--- 14,33 ----
  #ifndef KEYWORDS_H
  #define KEYWORDS_H

+ #include "common/kwlookup.h"
+
  /* Keyword categories --- should match lists in gram.y */
  #define UNRESERVED_KEYWORD        0
  #define COL_NAME_KEYWORD        1
  #define TYPE_FUNC_NAME_KEYWORD    2
  #define RESERVED_KEYWORD        3

  #ifndef FRONTEND
! extern PGDLLIMPORT const ScanKeywordList ScanKeywords;
! extern PGDLLIMPORT const uint8 ScanKeywordCategories[];
  #else
! extern const ScanKeywordList ScanKeywords;
! extern const uint8 ScanKeywordCategories[];
  #endif

  #endif                            /* KEYWORDS_H */
diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h
index ...3098df3 .
*** a/src/include/common/kwlookup.h
--- b/src/include/common/kwlookup.h
***************
*** 0 ****
--- 1,39 ----
+ /*-------------------------------------------------------------------------
+  *
+  * kwlookup.h
+  *      Key word lookup for PostgreSQL
+  *
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/common/kwlookup.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef KWLOOKUP_H
+ #define KWLOOKUP_H
+
+ /*
+  * This struct contains the data needed by ScanKeywordLookup to perform a
+  * search within a set of keywords.  The contents are typically generated by
+  * src/tools/gen_keywordlist.pl from a header containing PG_KEYWORD macros.
+  */
+ typedef struct ScanKeywordList
+ {
+     const char *kw_string;        /* all keywords in order, separated by \0 */
+     const uint16 *kw_offsets;    /* offsets to the start of each keyword */
+     int            num_keywords;    /* number of keywords */
+ } ScanKeywordList;
+
+
+ extern int    ScanKeywordLookup(const char *text, const ScanKeywordList *keywords);
+
+ /* Code that wants to retrieve the text of the N'th keyword should use this. */
+ static inline const char *
+ GetScanKeyword(int n, const ScanKeywordList *keywords)
+ {
+     return keywords->kw_string + keywords->kw_offsets[n];
+ }
+
+ #endif                            /* KWLOOKUP_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 0256d53..b8902d3 100644
*** a/src/include/parser/kwlist.h
--- b/src/include/parser/kwlist.h
***************
*** 2,8 ****
   *
   * kwlist.h
   *
!  * The keyword list is kept in its own source file for possible use by
   * automatic tools.  The exact representation of a keyword is determined
   * by the PG_KEYWORD macro, which is not defined in this file; it can
   * be defined by the caller for special purposes.
--- 2,8 ----
   *
   * kwlist.h
   *
!  * The keyword lists are kept in their own source files for use by
   * automatic tools.  The exact representation of a keyword is determined
   * by the PG_KEYWORD macro, which is not defined in this file; it can
   * be defined by the caller for special purposes.
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 009550f..91e1c83 100644
*** a/src/include/parser/scanner.h
--- b/src/include/parser/scanner.h
*************** typedef struct core_yy_extra_type
*** 73,82 ****
      Size        scanbuflen;

      /*
!      * The keyword list to use.
       */
!     const ScanKeyword *keywords;
!     int            num_keywords;

      /*
       * Scanner settings to use.  These are initialized from the corresponding
--- 73,82 ----
      Size        scanbuflen;

      /*
!      * The keyword list to use, and the associated grammar token codes.
       */
!     const ScanKeywordList *keywordlist;
!     const uint16 *keyword_tokens;

      /*
       * Scanner settings to use.  These are initialized from the corresponding
*************** typedef struct core_yy_extra_type
*** 116,126 ****
  typedef void *core_yyscan_t;


  /* Entry points in parser/scan.l */
  extern core_yyscan_t scanner_init(const char *str,
               core_yy_extra_type *yyext,
!              const ScanKeyword *keywords,
!              int num_keywords);
  extern void scanner_finish(core_yyscan_t yyscanner);
  extern int core_yylex(core_YYSTYPE *lvalp, YYLTYPE *llocp,
             core_yyscan_t yyscanner);
--- 116,129 ----
  typedef void *core_yyscan_t;


+ /* Constant data exported from parser/scan.l */
+ extern PGDLLIMPORT const uint16 ScanKeywordTokens[];
+
  /* Entry points in parser/scan.l */
  extern core_yyscan_t scanner_init(const char *str,
               core_yy_extra_type *yyext,
!              const ScanKeywordList *keywordlist,
!              const uint16 *keyword_tokens);
  extern void scanner_finish(core_yyscan_t yyscanner);
  extern int core_yylex(core_YYSTYPE *lvalp, YYLTYPE *llocp,
             core_yyscan_t yyscanner);
diff --git a/src/interfaces/ecpg/preproc/.gitignore b/src/interfaces/ecpg/preproc/.gitignore
index 38ae2fe..958a826 100644
*** a/src/interfaces/ecpg/preproc/.gitignore
--- b/src/interfaces/ecpg/preproc/.gitignore
***************
*** 2,6 ****
--- 2,8 ----
  /preproc.c
  /preproc.h
  /pgc.c
+ /c_kwlist_d.h
+ /ecpg_kwlist_d.h
  /typename.c
  /ecpg
diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile
index 69ddd8e..9b145a1 100644
*** a/src/interfaces/ecpg/preproc/Makefile
--- b/src/interfaces/ecpg/preproc/Makefile
*************** OBJS=    preproc.o pgc.o type.o ecpg.o outp
*** 28,33 ****
--- 28,35 ----
      keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \
      $(WIN32RES)

+ GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl
+
  # Suppress parallel build to avoid a bug in GNU make 3.82
  # (see comments in ../Makefile)
  ifeq ($(MAKE_VERSION),3.82)
*************** preproc.y: ../../../backend/parser/gram.
*** 53,61 ****
      $(PERL) $(srcdir)/parse.pl $(srcdir) < $< > $@
      $(PERL) $(srcdir)/check_rules.pl $(srcdir) $<

  ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h

! distprep: preproc.y preproc.c preproc.h pgc.c

  install: all installdirs
      $(INSTALL_PROGRAM) ecpg$(X) '$(DESTDIR)$(bindir)'
--- 55,73 ----
      $(PERL) $(srcdir)/parse.pl $(srcdir) < $< > $@
      $(PERL) $(srcdir)/check_rules.pl $(srcdir) $<

+ # generate keyword headers
+ c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST)
+     $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $<
+
+ ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST)
+     $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $<
+
+ # Force these dependencies to be known even without dependency info built:
  ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h
+ ecpg_keywords.o: ecpg_kwlist_d.h
+ c_keywords.o: c_kwlist_d.h

! distprep: preproc.y preproc.c preproc.h pgc.c c_kwlist_d.h ecpg_kwlist_d.h

  install: all installdirs
      $(INSTALL_PROGRAM) ecpg$(X) '$(DESTDIR)$(bindir)'
*************** installdirs:
*** 66,77 ****
  uninstall:
      rm -f '$(DESTDIR)$(bindir)/ecpg$(X)'

  clean distclean:
      rm -f *.o ecpg$(X)
      rm -f typename.c

- # `make distclean' must not remove preproc.y, preproc.c, preproc.h, or pgc.c
- # since we want to ship those files in the distribution for people with
- # inadequate tools.  Instead, `make maintainer-clean' will remove them.
  maintainer-clean: distclean
!     rm -f preproc.y preproc.c preproc.h pgc.c
--- 78,88 ----
  uninstall:
      rm -f '$(DESTDIR)$(bindir)/ecpg$(X)'

+ # preproc.y, preproc.c, preproc.h, pgc.c, c_kwlist_d.h, and ecpg_kwlist_d.h
+ # are in the distribution tarball, so they are not cleaned here.
  clean distclean:
      rm -f *.o ecpg$(X)
      rm -f typename.c

  maintainer-clean: distclean
!     rm -f preproc.y preproc.c preproc.h pgc.c c_kwlist_d.h ecpg_kwlist_d.h
diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c
index c367dbf..521992f 100644
*** a/src/interfaces/ecpg/preproc/c_keywords.c
--- b/src/interfaces/ecpg/preproc/c_keywords.c
***************
*** 14,85 ****
  #include "preproc_extern.h"
  #include "preproc.h"

! /*
!  * List of (keyword-name, keyword-token-value) pairs.
!  *
!  * !!WARNING!!: This list must be sorted, because binary
!  *         search is used to locate entries.
!  */
! static const ScanKeyword ScanCKeywords[] = {
!     /* name, value, category */

!     /*
!      * category is not needed in ecpg, it is only here so we can share the
!      * data structure with the backend
!      */
!     {"VARCHAR", VARCHAR, 0},
!     {"auto", S_AUTO, 0},
!     {"bool", SQL_BOOL, 0},
!     {"char", CHAR_P, 0},
!     {"const", S_CONST, 0},
!     {"enum", ENUM_P, 0},
!     {"extern", S_EXTERN, 0},
!     {"float", FLOAT_P, 0},
!     {"hour", HOUR_P, 0},
!     {"int", INT_P, 0},
!     {"long", SQL_LONG, 0},
!     {"minute", MINUTE_P, 0},
!     {"month", MONTH_P, 0},
!     {"register", S_REGISTER, 0},
!     {"second", SECOND_P, 0},
!     {"short", SQL_SHORT, 0},
!     {"signed", SQL_SIGNED, 0},
!     {"static", S_STATIC, 0},
!     {"struct", SQL_STRUCT, 0},
!     {"to", TO, 0},
!     {"typedef", S_TYPEDEF, 0},
!     {"union", UNION, 0},
!     {"unsigned", SQL_UNSIGNED, 0},
!     {"varchar", VARCHAR, 0},
!     {"volatile", S_VOLATILE, 0},
!     {"year", YEAR_P, 0},
  };


  /*
   * Do a binary search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
! const ScanKeyword *
  ScanCKeywordLookup(const char *text)
  {
!     const ScanKeyword *low = &ScanCKeywords[0];
!     const ScanKeyword *high = &ScanCKeywords[lengthof(ScanCKeywords) - 1];

      while (low <= high)
      {
!         const ScanKeyword *middle;
          int            difference;

          middle = low + (high - low) / 2;
!         difference = strcmp(middle->name, text);
          if (difference == 0)
!             return middle;
          else if (difference < 0)
              low = middle + 1;
          else
              high = middle - 1;
      }

!     return NULL;
  }
--- 14,67 ----
  #include "preproc_extern.h"
  #include "preproc.h"

! /* ScanKeywordList lookup data for C keywords */
! #include "c_kwlist_d.h"

! /* Token codes for C keywords */
! #define PG_KEYWORD(kwname, value) value,
!
! static const uint16 ScanCKeywordTokens[] = {
! #include "c_kwlist.h"
  };

+ #undef PG_KEYWORD
+

  /*
+  * ScanCKeywordLookup - see if a given word is a keyword
+  *
+  * Returns the token value of the keyword, or -1 if no match.
+  *
   * Do a binary search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
! int
  ScanCKeywordLookup(const char *text)
  {
!     const char *kw_string;
!     const uint16 *kw_offsets;
!     const uint16 *low;
!     const uint16 *high;
!
!     kw_string = ScanCKeywords.kw_string;
!     kw_offsets = ScanCKeywords.kw_offsets;
!     low = kw_offsets;
!     high = kw_offsets + (ScanCKeywords.num_keywords - 1);

      while (low <= high)
      {
!         const uint16 *middle;
          int            difference;

          middle = low + (high - low) / 2;
!         difference = strcmp(kw_string + *middle, text);
          if (difference == 0)
!             return ScanCKeywordTokens[middle - kw_offsets];
          else if (difference < 0)
              low = middle + 1;
          else
              high = middle - 1;
      }

!     return -1;
  }
diff --git a/src/interfaces/ecpg/preproc/c_kwlist.h b/src/interfaces/ecpg/preproc/c_kwlist.h
index ...4545505 .
*** a/src/interfaces/ecpg/preproc/c_kwlist.h
--- b/src/interfaces/ecpg/preproc/c_kwlist.h
***************
*** 0 ****
--- 1,53 ----
+ /*-------------------------------------------------------------------------
+  *
+  * c_kwlist.h
+  *
+  * The keyword lists are kept in their own source files for use by
+  * automatic tools.  The exact representation of a keyword is determined
+  * by the PG_KEYWORD macro, which is not defined in this file; it can
+  * be defined by the caller for special purposes.
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/interfaces/ecpg/preproc/c_kwlist.h
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ /* There is deliberately not an #ifndef C_KWLIST_H here. */
+
+ /*
+  * List of (keyword-name, keyword-token-value) pairs.
+  *
+  * !!WARNING!!: This list must be sorted by ASCII name, because binary
+  *         search is used to locate entries.
+  */
+
+ /* name, value */
+ PG_KEYWORD("VARCHAR", VARCHAR)
+ PG_KEYWORD("auto", S_AUTO)
+ PG_KEYWORD("bool", SQL_BOOL)
+ PG_KEYWORD("char", CHAR_P)
+ PG_KEYWORD("const", S_CONST)
+ PG_KEYWORD("enum", ENUM_P)
+ PG_KEYWORD("extern", S_EXTERN)
+ PG_KEYWORD("float", FLOAT_P)
+ PG_KEYWORD("hour", HOUR_P)
+ PG_KEYWORD("int", INT_P)
+ PG_KEYWORD("long", SQL_LONG)
+ PG_KEYWORD("minute", MINUTE_P)
+ PG_KEYWORD("month", MONTH_P)
+ PG_KEYWORD("register", S_REGISTER)
+ PG_KEYWORD("second", SECOND_P)
+ PG_KEYWORD("short", SQL_SHORT)
+ PG_KEYWORD("signed", SQL_SIGNED)
+ PG_KEYWORD("static", S_STATIC)
+ PG_KEYWORD("struct", SQL_STRUCT)
+ PG_KEYWORD("to", TO)
+ PG_KEYWORD("typedef", S_TYPEDEF)
+ PG_KEYWORD("union", UNION)
+ PG_KEYWORD("unsigned", SQL_UNSIGNED)
+ PG_KEYWORD("varchar", VARCHAR)
+ PG_KEYWORD("volatile", S_VOLATILE)
+ PG_KEYWORD("year", YEAR_P)
diff --git a/src/interfaces/ecpg/preproc/ecpg_keywords.c b/src/interfaces/ecpg/preproc/ecpg_keywords.c
index 37c97e1..4839c37 100644
*** a/src/interfaces/ecpg/preproc/ecpg_keywords.c
--- b/src/interfaces/ecpg/preproc/ecpg_keywords.c
***************
*** 16,97 ****
  #include "preproc_extern.h"
  #include "preproc.h"

! /*
!  * List of (keyword-name, keyword-token-value) pairs.
!  *
!  * !!WARNING!!: This list must be sorted, because binary
!  *         search is used to locate entries.
!  */
! static const ScanKeyword ECPGScanKeywords[] = {
!     /* name, value, category */

!     /*
!      * category is not needed in ecpg, it is only here so we can share the
!      * data structure with the backend
!      */
!     {"allocate", SQL_ALLOCATE, 0},
!     {"autocommit", SQL_AUTOCOMMIT, 0},
!     {"bool", SQL_BOOL, 0},
!     {"break", SQL_BREAK, 0},
!     {"cardinality", SQL_CARDINALITY, 0},
!     {"connect", SQL_CONNECT, 0},
!     {"count", SQL_COUNT, 0},
!     {"datetime_interval_code", SQL_DATETIME_INTERVAL_CODE, 0},
!     {"datetime_interval_precision", SQL_DATETIME_INTERVAL_PRECISION, 0},
!     {"describe", SQL_DESCRIBE, 0},
!     {"descriptor", SQL_DESCRIPTOR, 0},
!     {"disconnect", SQL_DISCONNECT, 0},
!     {"found", SQL_FOUND, 0},
!     {"free", SQL_FREE, 0},
!     {"get", SQL_GET, 0},
!     {"go", SQL_GO, 0},
!     {"goto", SQL_GOTO, 0},
!     {"identified", SQL_IDENTIFIED, 0},
!     {"indicator", SQL_INDICATOR, 0},
!     {"key_member", SQL_KEY_MEMBER, 0},
!     {"length", SQL_LENGTH, 0},
!     {"long", SQL_LONG, 0},
!     {"nullable", SQL_NULLABLE, 0},
!     {"octet_length", SQL_OCTET_LENGTH, 0},
!     {"open", SQL_OPEN, 0},
!     {"output", SQL_OUTPUT, 0},
!     {"reference", SQL_REFERENCE, 0},
!     {"returned_length", SQL_RETURNED_LENGTH, 0},
!     {"returned_octet_length", SQL_RETURNED_OCTET_LENGTH, 0},
!     {"scale", SQL_SCALE, 0},
!     {"section", SQL_SECTION, 0},
!     {"short", SQL_SHORT, 0},
!     {"signed", SQL_SIGNED, 0},
!     {"sqlerror", SQL_SQLERROR, 0},
!     {"sqlprint", SQL_SQLPRINT, 0},
!     {"sqlwarning", SQL_SQLWARNING, 0},
!     {"stop", SQL_STOP, 0},
!     {"struct", SQL_STRUCT, 0},
!     {"unsigned", SQL_UNSIGNED, 0},
!     {"var", SQL_VAR, 0},
!     {"whenever", SQL_WHENEVER, 0},
  };

  /*
   * ScanECPGKeywordLookup - see if a given word is a keyword
   *
!  * Returns a pointer to the ScanKeyword table entry, or NULL if no match.
   * Keywords are matched using the same case-folding rules as in the backend.
   */
! const ScanKeyword *
  ScanECPGKeywordLookup(const char *text)
  {
!     const ScanKeyword *res;

      /* First check SQL symbols defined by the backend. */
!     res = ScanKeywordLookup(text, SQLScanKeywords, NumSQLScanKeywords);
!     if (res)
!         return res;

      /* Try ECPG-specific keywords. */
!     res = ScanKeywordLookup(text, ECPGScanKeywords, lengthof(ECPGScanKeywords));
!     if (res)
!         return res;

!     return NULL;
  }
--- 16,55 ----
  #include "preproc_extern.h"
  #include "preproc.h"

! /* ScanKeywordList lookup data for ECPG keywords */
! #include "ecpg_kwlist_d.h"

! /* Token codes for ECPG keywords */
! #define PG_KEYWORD(kwname, value) value,
!
! static const uint16 ECPGScanKeywordTokens[] = {
! #include "ecpg_kwlist.h"
  };

+ #undef PG_KEYWORD
+
+
  /*
   * ScanECPGKeywordLookup - see if a given word is a keyword
   *
!  * Returns the token value of the keyword, or -1 if no match.
!  *
   * Keywords are matched using the same case-folding rules as in the backend.
   */
! int
  ScanECPGKeywordLookup(const char *text)
  {
!     int            kwnum;

      /* First check SQL symbols defined by the backend. */
!     kwnum = ScanKeywordLookup(text, &ScanKeywords);
!     if (kwnum >= 0)
!         return SQLScanKeywordTokens[kwnum];

      /* Try ECPG-specific keywords. */
!     kwnum = ScanKeywordLookup(text, &ScanECPGKeywords);
!     if (kwnum >= 0)
!         return ECPGScanKeywordTokens[kwnum];

!     return -1;
  }
diff --git a/src/interfaces/ecpg/preproc/ecpg_kwlist.h b/src/interfaces/ecpg/preproc/ecpg_kwlist.h
index ...97ef254 .
*** a/src/interfaces/ecpg/preproc/ecpg_kwlist.h
--- b/src/interfaces/ecpg/preproc/ecpg_kwlist.h
***************
*** 0 ****
--- 1,68 ----
+ /*-------------------------------------------------------------------------
+  *
+  * ecpg_kwlist.h
+  *
+  * The keyword lists are kept in their own source files for use by
+  * automatic tools.  The exact representation of a keyword is determined
+  * by the PG_KEYWORD macro, which is not defined in this file; it can
+  * be defined by the caller for special purposes.
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/interfaces/ecpg/preproc/ecpg_kwlist.h
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ /* There is deliberately not an #ifndef ECPG_KWLIST_H here. */
+
+ /*
+  * List of (keyword-name, keyword-token-value) pairs.
+  *
+  * !!WARNING!!: This list must be sorted by ASCII name, because binary
+  *         search is used to locate entries.
+  */
+
+ /* name, value */
+ PG_KEYWORD("allocate", SQL_ALLOCATE)
+ PG_KEYWORD("autocommit", SQL_AUTOCOMMIT)
+ PG_KEYWORD("bool", SQL_BOOL)
+ PG_KEYWORD("break", SQL_BREAK)
+ PG_KEYWORD("cardinality", SQL_CARDINALITY)
+ PG_KEYWORD("connect", SQL_CONNECT)
+ PG_KEYWORD("count", SQL_COUNT)
+ PG_KEYWORD("datetime_interval_code", SQL_DATETIME_INTERVAL_CODE)
+ PG_KEYWORD("datetime_interval_precision", SQL_DATETIME_INTERVAL_PRECISION)
+ PG_KEYWORD("describe", SQL_DESCRIBE)
+ PG_KEYWORD("descriptor", SQL_DESCRIPTOR)
+ PG_KEYWORD("disconnect", SQL_DISCONNECT)
+ PG_KEYWORD("found", SQL_FOUND)
+ PG_KEYWORD("free", SQL_FREE)
+ PG_KEYWORD("get", SQL_GET)
+ PG_KEYWORD("go", SQL_GO)
+ PG_KEYWORD("goto", SQL_GOTO)
+ PG_KEYWORD("identified", SQL_IDENTIFIED)
+ PG_KEYWORD("indicator", SQL_INDICATOR)
+ PG_KEYWORD("key_member", SQL_KEY_MEMBER)
+ PG_KEYWORD("length", SQL_LENGTH)
+ PG_KEYWORD("long", SQL_LONG)
+ PG_KEYWORD("nullable", SQL_NULLABLE)
+ PG_KEYWORD("octet_length", SQL_OCTET_LENGTH)
+ PG_KEYWORD("open", SQL_OPEN)
+ PG_KEYWORD("output", SQL_OUTPUT)
+ PG_KEYWORD("reference", SQL_REFERENCE)
+ PG_KEYWORD("returned_length", SQL_RETURNED_LENGTH)
+ PG_KEYWORD("returned_octet_length", SQL_RETURNED_OCTET_LENGTH)
+ PG_KEYWORD("scale", SQL_SCALE)
+ PG_KEYWORD("section", SQL_SECTION)
+ PG_KEYWORD("short", SQL_SHORT)
+ PG_KEYWORD("signed", SQL_SIGNED)
+ PG_KEYWORD("sqlerror", SQL_SQLERROR)
+ PG_KEYWORD("sqlprint", SQL_SQLPRINT)
+ PG_KEYWORD("sqlwarning", SQL_SQLWARNING)
+ PG_KEYWORD("stop", SQL_STOP)
+ PG_KEYWORD("struct", SQL_STRUCT)
+ PG_KEYWORD("unsigned", SQL_UNSIGNED)
+ PG_KEYWORD("var", SQL_VAR)
+ PG_KEYWORD("whenever", SQL_WHENEVER)
diff --git a/src/interfaces/ecpg/preproc/keywords.c b/src/interfaces/ecpg/preproc/keywords.c
index 12409e9..0380409 100644
*** a/src/interfaces/ecpg/preproc/keywords.c
--- b/src/interfaces/ecpg/preproc/keywords.c
***************
*** 17,40 ****

  /*
   * This is much trickier than it looks.  We are #include'ing kwlist.h
!  * but the "value" numbers that go into the table are from preproc.h
!  * not the backend's gram.h.  Therefore this table will recognize all
!  * keywords known to the backend, but will supply the token numbers used
   * by ecpg's grammar, which is what we need.  The ecpg grammar must
   * define all the same token names the backend does, else we'll get
   * undefined-symbol failures in this compile.
   */

- #include "common/keywords.h"
-
  #include "preproc_extern.h"
  #include "preproc.h"


! #define PG_KEYWORD(a,b,c) {a,b,c},
!
! const ScanKeyword SQLScanKeywords[] = {
  #include "parser/kwlist.h"
  };

! const int    NumSQLScanKeywords = lengthof(SQLScanKeywords);
--- 17,38 ----

  /*
   * This is much trickier than it looks.  We are #include'ing kwlist.h
!  * but the token numbers that go into the table are from preproc.h
!  * not the backend's gram.h.  Therefore this token table will match
!  * the ScanKeywords table supplied from common/keywords.c, including all
!  * keywords known to the backend, but it will supply the token numbers used
   * by ecpg's grammar, which is what we need.  The ecpg grammar must
   * define all the same token names the backend does, else we'll get
   * undefined-symbol failures in this compile.
   */

  #include "preproc_extern.h"
  #include "preproc.h"

+ #define PG_KEYWORD(kwname, value, category) value,

! const uint16 SQLScanKeywordTokens[] = {
  #include "parser/kwlist.h"
  };

! #undef PG_KEYWORD
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index a60564c..3131f5f 100644
*** a/src/interfaces/ecpg/preproc/pgc.l
--- b/src/interfaces/ecpg/preproc/pgc.l
*************** cppline            {space}*#([^i][A-Za-z]*|{if}|{
*** 920,938 ****
                  }

  {identifier}    {
-                     const ScanKeyword  *keyword;
-
                      if (!isdefine())
                      {
                          /* Is it an SQL/ECPG keyword? */
!                         keyword = ScanECPGKeywordLookup(yytext);
!                         if (keyword != NULL)
!                             return keyword->value;

                          /* Is it a C keyword? */
!                         keyword = ScanCKeywordLookup(yytext);
!                         if (keyword != NULL)
!                             return keyword->value;

                          /*
                           * None of the above.  Return it as an identifier.
--- 920,938 ----
                  }

  {identifier}    {
                      if (!isdefine())
                      {
+                         int        kwvalue;
+
                          /* Is it an SQL/ECPG keyword? */
!                         kwvalue = ScanECPGKeywordLookup(yytext);
!                         if (kwvalue >= 0)
!                             return kwvalue;

                          /* Is it a C keyword? */
!                         kwvalue = ScanCKeywordLookup(yytext);
!                         if (kwvalue >= 0)
!                             return kwvalue;

                          /*
                           * None of the above.  Return it as an identifier.
*************** cppline            {space}*#([^i][A-Za-z]*|{if}|{
*** 1010,1021 ****
                          return CPP_LINE;
                      }
  <C>{identifier}        {
-                         const ScanKeyword        *keyword;
-
                          /*
                           * Try to detect a function name:
                           * look for identifiers at the global scope
!                          * keep the last identifier before the first '(' and '{' */
                          if (braces_open == 0 && parenths_open == 0)
                          {
                              if (current_function)
--- 1010,1020 ----
                          return CPP_LINE;
                      }
  <C>{identifier}        {
                          /*
                           * Try to detect a function name:
                           * look for identifiers at the global scope
!                          * keep the last identifier before the first '(' and '{'
!                          */
                          if (braces_open == 0 && parenths_open == 0)
                          {
                              if (current_function)
*************** cppline            {space}*#([^i][A-Za-z]*|{if}|{
*** 1026,1034 ****
                          /* however, some defines have to be taken care of for compatibility */
                          if ((!INFORMIX_MODE || !isinformixdefine()) && !isdefine())
                          {
!                             keyword = ScanCKeywordLookup(yytext);
!                             if (keyword != NULL)
!                                 return keyword->value;
                              else
                              {
                                  base_yylval.str = mm_strdup(yytext);
--- 1025,1035 ----
                          /* however, some defines have to be taken care of for compatibility */
                          if ((!INFORMIX_MODE || !isinformixdefine()) && !isdefine())
                          {
!                             int        kwvalue;
!
!                             kwvalue = ScanCKeywordLookup(yytext);
!                             if (kwvalue >= 0)
!                                 return kwvalue;
                              else
                              {
                                  base_yylval.str = mm_strdup(yytext);
diff --git a/src/interfaces/ecpg/preproc/preproc_extern.h b/src/interfaces/ecpg/preproc/preproc_extern.h
index 13eda67..9746780 100644
*** a/src/interfaces/ecpg/preproc/preproc_extern.h
--- b/src/interfaces/ecpg/preproc/preproc_extern.h
*************** extern struct when when_error,
*** 59,66 ****
  extern struct ECPGstruct_member *struct_member_list[STRUCT_DEPTH];

  /* Globals from keywords.c */
! extern const ScanKeyword SQLScanKeywords[];
! extern const int NumSQLScanKeywords;

  /* functions */

--- 59,65 ----
  extern struct ECPGstruct_member *struct_member_list[STRUCT_DEPTH];

  /* Globals from keywords.c */
! extern const uint16 SQLScanKeywordTokens[];

  /* functions */

*************** extern void check_indicator(struct ECPGt
*** 102,109 ****
  extern void remove_typedefs(int);
  extern void remove_variables(int);
  extern struct variable *new_variable(const char *, struct ECPGtype *, int);
! extern const ScanKeyword *ScanCKeywordLookup(const char *);
! extern const ScanKeyword *ScanECPGKeywordLookup(const char *text);
  extern void parser_init(void);
  extern int    filtered_base_yylex(void);

--- 101,108 ----
  extern void remove_typedefs(int);
  extern void remove_variables(int);
  extern struct variable *new_variable(const char *, struct ECPGtype *, int);
! extern int    ScanCKeywordLookup(const char *text);
! extern int    ScanECPGKeywordLookup(const char *text);
  extern void parser_init(void);
  extern int    filtered_base_yylex(void);

diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore
index ff6ac96..3ab9a22 100644
*** a/src/pl/plpgsql/src/.gitignore
--- b/src/pl/plpgsql/src/.gitignore
***************
*** 1,5 ****
--- 1,7 ----
  /pl_gram.c
  /pl_gram.h
+ /pl_reserved_kwlist_d.h
+ /pl_unreserved_kwlist_d.h
  /plerrcodes.h
  /log/
  /results/
diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile
index 25a5a9d..9dd4a74 100644
*** a/src/pl/plpgsql/src/Makefile
--- b/src/pl/plpgsql/src/Makefile
*************** REGRESS_OPTS = --dbname=$(PL_TESTDB)
*** 29,34 ****
--- 29,36 ----
  REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \
      plpgsql_cache plpgsql_transaction plpgsql_varprops

+ GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl
+
  all: all-lib

  # Shared library stuff
*************** uninstall-headers:
*** 61,66 ****
--- 63,69 ----

  # Force these dependencies to be known even without dependency info built:
  pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h
+ pl_scanner.o: pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h

  # See notes in src/backend/parser/Makefile about the following two rules
  pl_gram.h: pl_gram.c
*************** pl_gram.c: BISONFLAGS += -d
*** 72,77 ****
--- 75,87 ----
  plerrcodes.h: $(top_srcdir)/src/backend/utils/errcodes.txt generate-plerrcodes.pl
      $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@

+ # generate keyword headers for the scanner
+ pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST)
+     $(PERL) $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $<
+
+ pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST)
+     $(PERL) $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $<
+

  check: submake
      $(pg_regress_check) $(REGRESS_OPTS) $(REGRESS)
*************** submake:
*** 84,96 ****
      $(MAKE) -C $(top_builddir)/src/test/regress pg_regress$(X)


! distprep: pl_gram.h pl_gram.c plerrcodes.h

! # pl_gram.c, pl_gram.h and plerrcodes.h are in the distribution tarball,
! # so they are not cleaned here.
  clean distclean: clean-lib
      rm -f $(OBJS)
      rm -rf $(pg_regress_clean_files)

  maintainer-clean: distclean
!     rm -f pl_gram.c pl_gram.h plerrcodes.h
--- 94,107 ----
      $(MAKE) -C $(top_builddir)/src/test/regress pg_regress$(X)


! distprep: pl_gram.h pl_gram.c plerrcodes.h pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h

! # pl_gram.c, pl_gram.h, plerrcodes.h, pl_reserved_kwlist_d.h, and
! # pl_unreserved_kwlist_d.h are in the distribution tarball, so they
! # are not cleaned here.
  clean distclean: clean-lib
      rm -f $(OBJS)
      rm -rf $(pg_regress_clean_files)

  maintainer-clean: distclean
!     rm -f pl_gram.c pl_gram.h plerrcodes.h pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h
diff --git a/src/pl/plpgsql/src/pl_reserved_kwlist.h b/src/pl/plpgsql/src/pl_reserved_kwlist.h
index ...5c2e0c1 .
*** a/src/pl/plpgsql/src/pl_reserved_kwlist.h
--- b/src/pl/plpgsql/src/pl_reserved_kwlist.h
***************
*** 0 ****
--- 1,53 ----
+ /*-------------------------------------------------------------------------
+  *
+  * pl_reserved_kwlist.h
+  *
+  * The keyword lists are kept in their own source files for use by
+  * automatic tools.  The exact representation of a keyword is determined
+  * by the PG_KEYWORD macro, which is not defined in this file; it can
+  * be defined by the caller for special purposes.
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/pl/plpgsql/src/pl_reserved_kwlist.h
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ /* There is deliberately not an #ifndef PL_RESERVED_KWLIST_H here. */
+
+ /*
+  * List of (keyword-name, keyword-token-value) pairs.
+  *
+  * Be careful not to put the same word in both lists.
+  *
+  * !!WARNING!!: This list must be sorted by ASCII name, because binary
+  *         search is used to locate entries.
+  */
+
+ /* name, value */
+ PG_KEYWORD("all", K_ALL)
+ PG_KEYWORD("begin", K_BEGIN)
+ PG_KEYWORD("by", K_BY)
+ PG_KEYWORD("case", K_CASE)
+ PG_KEYWORD("declare", K_DECLARE)
+ PG_KEYWORD("else", K_ELSE)
+ PG_KEYWORD("end", K_END)
+ PG_KEYWORD("execute", K_EXECUTE)
+ PG_KEYWORD("for", K_FOR)
+ PG_KEYWORD("foreach", K_FOREACH)
+ PG_KEYWORD("from", K_FROM)
+ PG_KEYWORD("if", K_IF)
+ PG_KEYWORD("in", K_IN)
+ PG_KEYWORD("into", K_INTO)
+ PG_KEYWORD("loop", K_LOOP)
+ PG_KEYWORD("not", K_NOT)
+ PG_KEYWORD("null", K_NULL)
+ PG_KEYWORD("or", K_OR)
+ PG_KEYWORD("strict", K_STRICT)
+ PG_KEYWORD("then", K_THEN)
+ PG_KEYWORD("to", K_TO)
+ PG_KEYWORD("using", K_USING)
+ PG_KEYWORD("when", K_WHEN)
+ PG_KEYWORD("while", K_WHILE)
diff --git a/src/pl/plpgsql/src/pl_scanner.c b/src/pl/plpgsql/src/pl_scanner.c
index 8340628..c260438 100644
*** a/src/pl/plpgsql/src/pl_scanner.c
--- b/src/pl/plpgsql/src/pl_scanner.c
***************
*** 22,37 ****
  #include "pl_gram.h"            /* must be after parser/scanner.h */


- #define PG_KEYWORD(a,b,c) {a,b,c},
-
-
  /* Klugy flag to tell scanner how to look up identifiers */
  IdentifierLookup plpgsql_IdentifierLookup = IDENTIFIER_LOOKUP_NORMAL;

  /*
   * A word about keywords:
   *
!  * We keep reserved and unreserved keywords in separate arrays.  The
   * reserved keywords are passed to the core scanner, so they will be
   * recognized before (and instead of) any variable name.  Unreserved words
   * are checked for separately, usually after determining that the identifier
--- 22,36 ----
  #include "pl_gram.h"            /* must be after parser/scanner.h */


  /* Klugy flag to tell scanner how to look up identifiers */
  IdentifierLookup plpgsql_IdentifierLookup = IDENTIFIER_LOOKUP_NORMAL;

  /*
   * A word about keywords:
   *
!  * We keep reserved and unreserved keywords in separate headers.  Be careful
!  * not to put the same word in both headers.  Also be sure that pl_gram.y's
!  * unreserved_keyword production agrees with the unreserved header.  The
   * reserved keywords are passed to the core scanner, so they will be
   * recognized before (and instead of) any variable name.  Unreserved words
   * are checked for separately, usually after determining that the identifier
*************** IdentifierLookup plpgsql_IdentifierLooku
*** 57,186 ****
   * BEGIN BY DECLARE EXECUTE FOREACH IF LOOP STRICT WHILE
   */

! /*
!  * Lists of keyword (name, token-value, category) entries.
!  *
!  * !!WARNING!!: These lists must be sorted by ASCII name, because binary
!  *         search is used to locate entries.
!  *
!  * Be careful not to put the same word in both lists.  Also be sure that
!  * pl_gram.y's unreserved_keyword production agrees with the second list.
!  */

! static const ScanKeyword reserved_keywords[] = {
!     PG_KEYWORD("all", K_ALL, RESERVED_KEYWORD)
!     PG_KEYWORD("begin", K_BEGIN, RESERVED_KEYWORD)
!     PG_KEYWORD("by", K_BY, RESERVED_KEYWORD)
!     PG_KEYWORD("case", K_CASE, RESERVED_KEYWORD)
!     PG_KEYWORD("declare", K_DECLARE, RESERVED_KEYWORD)
!     PG_KEYWORD("else", K_ELSE, RESERVED_KEYWORD)
!     PG_KEYWORD("end", K_END, RESERVED_KEYWORD)
!     PG_KEYWORD("execute", K_EXECUTE, RESERVED_KEYWORD)
!     PG_KEYWORD("for", K_FOR, RESERVED_KEYWORD)
!     PG_KEYWORD("foreach", K_FOREACH, RESERVED_KEYWORD)
!     PG_KEYWORD("from", K_FROM, RESERVED_KEYWORD)
!     PG_KEYWORD("if", K_IF, RESERVED_KEYWORD)
!     PG_KEYWORD("in", K_IN, RESERVED_KEYWORD)
!     PG_KEYWORD("into", K_INTO, RESERVED_KEYWORD)
!     PG_KEYWORD("loop", K_LOOP, RESERVED_KEYWORD)
!     PG_KEYWORD("not", K_NOT, RESERVED_KEYWORD)
!     PG_KEYWORD("null", K_NULL, RESERVED_KEYWORD)
!     PG_KEYWORD("or", K_OR, RESERVED_KEYWORD)
!     PG_KEYWORD("strict", K_STRICT, RESERVED_KEYWORD)
!     PG_KEYWORD("then", K_THEN, RESERVED_KEYWORD)
!     PG_KEYWORD("to", K_TO, RESERVED_KEYWORD)
!     PG_KEYWORD("using", K_USING, RESERVED_KEYWORD)
!     PG_KEYWORD("when", K_WHEN, RESERVED_KEYWORD)
!     PG_KEYWORD("while", K_WHILE, RESERVED_KEYWORD)
! };

! static const int num_reserved_keywords = lengthof(reserved_keywords);

! static const ScanKeyword unreserved_keywords[] = {
!     PG_KEYWORD("absolute", K_ABSOLUTE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("alias", K_ALIAS, UNRESERVED_KEYWORD)
!     PG_KEYWORD("array", K_ARRAY, UNRESERVED_KEYWORD)
!     PG_KEYWORD("assert", K_ASSERT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("backward", K_BACKWARD, UNRESERVED_KEYWORD)
!     PG_KEYWORD("call", K_CALL, UNRESERVED_KEYWORD)
!     PG_KEYWORD("close", K_CLOSE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("collate", K_COLLATE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("column", K_COLUMN, UNRESERVED_KEYWORD)
!     PG_KEYWORD("column_name", K_COLUMN_NAME, UNRESERVED_KEYWORD)
!     PG_KEYWORD("commit", K_COMMIT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("constant", K_CONSTANT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("constraint", K_CONSTRAINT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("constraint_name", K_CONSTRAINT_NAME, UNRESERVED_KEYWORD)
!     PG_KEYWORD("continue", K_CONTINUE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("current", K_CURRENT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("cursor", K_CURSOR, UNRESERVED_KEYWORD)
!     PG_KEYWORD("datatype", K_DATATYPE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("debug", K_DEBUG, UNRESERVED_KEYWORD)
!     PG_KEYWORD("default", K_DEFAULT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("detail", K_DETAIL, UNRESERVED_KEYWORD)
!     PG_KEYWORD("diagnostics", K_DIAGNOSTICS, UNRESERVED_KEYWORD)
!     PG_KEYWORD("do", K_DO, UNRESERVED_KEYWORD)
!     PG_KEYWORD("dump", K_DUMP, UNRESERVED_KEYWORD)
!     PG_KEYWORD("elseif", K_ELSIF, UNRESERVED_KEYWORD)
!     PG_KEYWORD("elsif", K_ELSIF, UNRESERVED_KEYWORD)
!     PG_KEYWORD("errcode", K_ERRCODE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("error", K_ERROR, UNRESERVED_KEYWORD)
!     PG_KEYWORD("exception", K_EXCEPTION, UNRESERVED_KEYWORD)
!     PG_KEYWORD("exit", K_EXIT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("fetch", K_FETCH, UNRESERVED_KEYWORD)
!     PG_KEYWORD("first", K_FIRST, UNRESERVED_KEYWORD)
!     PG_KEYWORD("forward", K_FORWARD, UNRESERVED_KEYWORD)
!     PG_KEYWORD("get", K_GET, UNRESERVED_KEYWORD)
!     PG_KEYWORD("hint", K_HINT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("import", K_IMPORT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("info", K_INFO, UNRESERVED_KEYWORD)
!     PG_KEYWORD("insert", K_INSERT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("is", K_IS, UNRESERVED_KEYWORD)
!     PG_KEYWORD("last", K_LAST, UNRESERVED_KEYWORD)
!     PG_KEYWORD("log", K_LOG, UNRESERVED_KEYWORD)
!     PG_KEYWORD("message", K_MESSAGE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("message_text", K_MESSAGE_TEXT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("move", K_MOVE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("next", K_NEXT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("no", K_NO, UNRESERVED_KEYWORD)
!     PG_KEYWORD("notice", K_NOTICE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("open", K_OPEN, UNRESERVED_KEYWORD)
!     PG_KEYWORD("option", K_OPTION, UNRESERVED_KEYWORD)
!     PG_KEYWORD("perform", K_PERFORM, UNRESERVED_KEYWORD)
!     PG_KEYWORD("pg_context", K_PG_CONTEXT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("pg_datatype_name", K_PG_DATATYPE_NAME, UNRESERVED_KEYWORD)
!     PG_KEYWORD("pg_exception_context", K_PG_EXCEPTION_CONTEXT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("pg_exception_detail", K_PG_EXCEPTION_DETAIL, UNRESERVED_KEYWORD)
!     PG_KEYWORD("pg_exception_hint", K_PG_EXCEPTION_HINT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("print_strict_params", K_PRINT_STRICT_PARAMS, UNRESERVED_KEYWORD)
!     PG_KEYWORD("prior", K_PRIOR, UNRESERVED_KEYWORD)
!     PG_KEYWORD("query", K_QUERY, UNRESERVED_KEYWORD)
!     PG_KEYWORD("raise", K_RAISE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("relative", K_RELATIVE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("reset", K_RESET, UNRESERVED_KEYWORD)
!     PG_KEYWORD("return", K_RETURN, UNRESERVED_KEYWORD)
!     PG_KEYWORD("returned_sqlstate", K_RETURNED_SQLSTATE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("reverse", K_REVERSE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("rollback", K_ROLLBACK, UNRESERVED_KEYWORD)
!     PG_KEYWORD("row_count", K_ROW_COUNT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("rowtype", K_ROWTYPE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("schema", K_SCHEMA, UNRESERVED_KEYWORD)
!     PG_KEYWORD("schema_name", K_SCHEMA_NAME, UNRESERVED_KEYWORD)
!     PG_KEYWORD("scroll", K_SCROLL, UNRESERVED_KEYWORD)
!     PG_KEYWORD("set", K_SET, UNRESERVED_KEYWORD)
!     PG_KEYWORD("slice", K_SLICE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("sqlstate", K_SQLSTATE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("stacked", K_STACKED, UNRESERVED_KEYWORD)
!     PG_KEYWORD("table", K_TABLE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("table_name", K_TABLE_NAME, UNRESERVED_KEYWORD)
!     PG_KEYWORD("type", K_TYPE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("use_column", K_USE_COLUMN, UNRESERVED_KEYWORD)
!     PG_KEYWORD("use_variable", K_USE_VARIABLE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("variable_conflict", K_VARIABLE_CONFLICT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("warning", K_WARNING, UNRESERVED_KEYWORD)
  };

! static const int num_unreserved_keywords = lengthof(unreserved_keywords);

  /*
   * This macro must recognize all tokens that can immediately precede a
--- 56,77 ----
   * BEGIN BY DECLARE EXECUTE FOREACH IF LOOP STRICT WHILE
   */

! /* ScanKeywordList lookup data for PL/pgSQL keywords */
! #include "pl_reserved_kwlist_d.h"
! #include "pl_unreserved_kwlist_d.h"

! /* Token codes for PL/pgSQL keywords */
! #define PG_KEYWORD(kwname, value) value,

! static const uint16 ReservedPLKeywordTokens[] = {
! #include "pl_reserved_kwlist.h"
! };

! static const uint16 UnreservedPLKeywordTokens[] = {
! #include "pl_unreserved_kwlist.h"
  };

! #undef PG_KEYWORD

  /*
   * This macro must recognize all tokens that can immediately precede a
*************** plpgsql_yylex(void)
*** 256,262 ****
  {
      int            tok1;
      TokenAuxData aux1;
!     const ScanKeyword *kw;

      tok1 = internal_yylex(&aux1);
      if (tok1 == IDENT || tok1 == PARAM)
--- 147,153 ----
  {
      int            tok1;
      TokenAuxData aux1;
!     int            kwnum;

      tok1 = internal_yylex(&aux1);
      if (tok1 == IDENT || tok1 == PARAM)
*************** plpgsql_yylex(void)
*** 333,344 ****
                                         &aux1.lval.word))
                      tok1 = T_DATUM;
                  else if (!aux1.lval.word.quoted &&
!                          (kw = ScanKeywordLookup(aux1.lval.word.ident,
!                                                  unreserved_keywords,
!                                                  num_unreserved_keywords)))
                  {
!                     aux1.lval.keyword = kw->name;
!                     tok1 = kw->value;
                  }
                  else
                      tok1 = T_WORD;
--- 224,235 ----
                                         &aux1.lval.word))
                      tok1 = T_DATUM;
                  else if (!aux1.lval.word.quoted &&
!                          (kwnum = ScanKeywordLookup(aux1.lval.word.ident,
!                                                     &UnreservedPLKeywords)) >= 0)
                  {
!                     aux1.lval.keyword = GetScanKeyword(kwnum,
!                                                        &UnreservedPLKeywords);
!                     tok1 = UnreservedPLKeywordTokens[kwnum];
                  }
                  else
                      tok1 = T_WORD;
*************** plpgsql_yylex(void)
*** 375,386 ****
                                     &aux1.lval.word))
                  tok1 = T_DATUM;
              else if (!aux1.lval.word.quoted &&
!                      (kw = ScanKeywordLookup(aux1.lval.word.ident,
!                                              unreserved_keywords,
!                                              num_unreserved_keywords)))
              {
!                 aux1.lval.keyword = kw->name;
!                 tok1 = kw->value;
              }
              else
                  tok1 = T_WORD;
--- 266,277 ----
                                     &aux1.lval.word))
                  tok1 = T_DATUM;
              else if (!aux1.lval.word.quoted &&
!                      (kwnum = ScanKeywordLookup(aux1.lval.word.ident,
!                                                 &UnreservedPLKeywords)) >= 0)
              {
!                 aux1.lval.keyword = GetScanKeyword(kwnum,
!                                                    &UnreservedPLKeywords);
!                 tok1 = UnreservedPLKeywordTokens[kwnum];
              }
              else
                  tok1 = T_WORD;
*************** plpgsql_token_is_unreserved_keyword(int
*** 497,505 ****
  {
      int            i;

!     for (i = 0; i < num_unreserved_keywords; i++)
      {
!         if (unreserved_keywords[i].value == token)
              return true;
      }
      return false;
--- 388,396 ----
  {
      int            i;

!     for (i = 0; i < lengthof(UnreservedPLKeywordTokens); i++)
      {
!         if (UnreservedPLKeywordTokens[i] == token)
              return true;
      }
      return false;
*************** plpgsql_scanner_init(const char *str)
*** 696,702 ****
  {
      /* Start up the core scanner */
      yyscanner = scanner_init(str, &core_yy,
!                              reserved_keywords, num_reserved_keywords);

      /*
       * scanorig points to the original string, which unlike the scanner's
--- 587,593 ----
  {
      /* Start up the core scanner */
      yyscanner = scanner_init(str, &core_yy,
!                              &ReservedPLKeywords, ReservedPLKeywordTokens);

      /*
       * scanorig points to the original string, which unlike the scanner's
diff --git a/src/pl/plpgsql/src/pl_unreserved_kwlist.h b/src/pl/plpgsql/src/pl_unreserved_kwlist.h
index ...ef2aea0 .
*** a/src/pl/plpgsql/src/pl_unreserved_kwlist.h
--- b/src/pl/plpgsql/src/pl_unreserved_kwlist.h
***************
*** 0 ****
--- 1,111 ----
+ /*-------------------------------------------------------------------------
+  *
+  * pl_unreserved_kwlist.h
+  *
+  * The keyword lists are kept in their own source files for use by
+  * automatic tools.  The exact representation of a keyword is determined
+  * by the PG_KEYWORD macro, which is not defined in this file; it can
+  * be defined by the caller for special purposes.
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/pl/plpgsql/src/pl_unreserved_kwlist.h
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ /* There is deliberately not an #ifndef PL_UNRESERVED_KWLIST_H here. */
+
+ /*
+  * List of (keyword-name, keyword-token-value) pairs.
+  *
+  * Be careful not to put the same word in both lists.  Also be sure that
+  * pl_gram.y's unreserved_keyword production agrees with this list.
+  *
+  * !!WARNING!!: This list must be sorted by ASCII name, because binary
+  *         search is used to locate entries.
+  */
+
+ /* name, value */
+ PG_KEYWORD("absolute", K_ABSOLUTE)
+ PG_KEYWORD("alias", K_ALIAS)
+ PG_KEYWORD("array", K_ARRAY)
+ PG_KEYWORD("assert", K_ASSERT)
+ PG_KEYWORD("backward", K_BACKWARD)
+ PG_KEYWORD("call", K_CALL)
+ PG_KEYWORD("close", K_CLOSE)
+ PG_KEYWORD("collate", K_COLLATE)
+ PG_KEYWORD("column", K_COLUMN)
+ PG_KEYWORD("column_name", K_COLUMN_NAME)
+ PG_KEYWORD("commit", K_COMMIT)
+ PG_KEYWORD("constant", K_CONSTANT)
+ PG_KEYWORD("constraint", K_CONSTRAINT)
+ PG_KEYWORD("constraint_name", K_CONSTRAINT_NAME)
+ PG_KEYWORD("continue", K_CONTINUE)
+ PG_KEYWORD("current", K_CURRENT)
+ PG_KEYWORD("cursor", K_CURSOR)
+ PG_KEYWORD("datatype", K_DATATYPE)
+ PG_KEYWORD("debug", K_DEBUG)
+ PG_KEYWORD("default", K_DEFAULT)
+ PG_KEYWORD("detail", K_DETAIL)
+ PG_KEYWORD("diagnostics", K_DIAGNOSTICS)
+ PG_KEYWORD("do", K_DO)
+ PG_KEYWORD("dump", K_DUMP)
+ PG_KEYWORD("elseif", K_ELSIF)
+ PG_KEYWORD("elsif", K_ELSIF)
+ PG_KEYWORD("errcode", K_ERRCODE)
+ PG_KEYWORD("error", K_ERROR)
+ PG_KEYWORD("exception", K_EXCEPTION)
+ PG_KEYWORD("exit", K_EXIT)
+ PG_KEYWORD("fetch", K_FETCH)
+ PG_KEYWORD("first", K_FIRST)
+ PG_KEYWORD("forward", K_FORWARD)
+ PG_KEYWORD("get", K_GET)
+ PG_KEYWORD("hint", K_HINT)
+ PG_KEYWORD("import", K_IMPORT)
+ PG_KEYWORD("info", K_INFO)
+ PG_KEYWORD("insert", K_INSERT)
+ PG_KEYWORD("is", K_IS)
+ PG_KEYWORD("last", K_LAST)
+ PG_KEYWORD("log", K_LOG)
+ PG_KEYWORD("message", K_MESSAGE)
+ PG_KEYWORD("message_text", K_MESSAGE_TEXT)
+ PG_KEYWORD("move", K_MOVE)
+ PG_KEYWORD("next", K_NEXT)
+ PG_KEYWORD("no", K_NO)
+ PG_KEYWORD("notice", K_NOTICE)
+ PG_KEYWORD("open", K_OPEN)
+ PG_KEYWORD("option", K_OPTION)
+ PG_KEYWORD("perform", K_PERFORM)
+ PG_KEYWORD("pg_context", K_PG_CONTEXT)
+ PG_KEYWORD("pg_datatype_name", K_PG_DATATYPE_NAME)
+ PG_KEYWORD("pg_exception_context", K_PG_EXCEPTION_CONTEXT)
+ PG_KEYWORD("pg_exception_detail", K_PG_EXCEPTION_DETAIL)
+ PG_KEYWORD("pg_exception_hint", K_PG_EXCEPTION_HINT)
+ PG_KEYWORD("print_strict_params", K_PRINT_STRICT_PARAMS)
+ PG_KEYWORD("prior", K_PRIOR)
+ PG_KEYWORD("query", K_QUERY)
+ PG_KEYWORD("raise", K_RAISE)
+ PG_KEYWORD("relative", K_RELATIVE)
+ PG_KEYWORD("reset", K_RESET)
+ PG_KEYWORD("return", K_RETURN)
+ PG_KEYWORD("returned_sqlstate", K_RETURNED_SQLSTATE)
+ PG_KEYWORD("reverse", K_REVERSE)
+ PG_KEYWORD("rollback", K_ROLLBACK)
+ PG_KEYWORD("row_count", K_ROW_COUNT)
+ PG_KEYWORD("rowtype", K_ROWTYPE)
+ PG_KEYWORD("schema", K_SCHEMA)
+ PG_KEYWORD("schema_name", K_SCHEMA_NAME)
+ PG_KEYWORD("scroll", K_SCROLL)
+ PG_KEYWORD("set", K_SET)
+ PG_KEYWORD("slice", K_SLICE)
+ PG_KEYWORD("sqlstate", K_SQLSTATE)
+ PG_KEYWORD("stacked", K_STACKED)
+ PG_KEYWORD("table", K_TABLE)
+ PG_KEYWORD("table_name", K_TABLE_NAME)
+ PG_KEYWORD("type", K_TYPE)
+ PG_KEYWORD("use_column", K_USE_COLUMN)
+ PG_KEYWORD("use_variable", K_USE_VARIABLE)
+ PG_KEYWORD("variable_conflict", K_VARIABLE_CONFLICT)
+ PG_KEYWORD("warning", K_WARNING)
diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl
index ...eb5ed65 .
*** a/src/tools/gen_keywordlist.pl
--- b/src/tools/gen_keywordlist.pl
***************
*** 0 ****
--- 1,148 ----
+ #----------------------------------------------------------------------
+ #
+ # gen_keywordlist.pl
+ #    Perl script that transforms a list of keywords into a ScanKeywordList
+ #    data structure that can be passed to ScanKeywordLookup().
+ #
+ # The input is a C header file containing a series of macro calls
+ #    PG_KEYWORD("keyword", ...)
+ # Lines not starting with PG_KEYWORD are ignored.  The keywords are
+ # implicitly numbered 0..N-1 in order of appearance in the header file.
+ # Currently, the keywords are required to appear in ASCII order.
+ #
+ # The output is a C header file that defines a "const ScanKeywordList"
+ # variable named according to the -v switch ("ScanKeywords" by default).
+ # The variable is marked "static" unless the -e switch is given.
+ #
+ #
+ # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ # Portions Copyright (c) 1994, Regents of the University of California
+ #
+ # src/tools/gen_keywordlist.pl
+ #
+ #----------------------------------------------------------------------
+
+ use strict;
+ use warnings;
+ use Getopt::Long;
+
+ my $output_path = '';
+ my $extern = 0;
+ my $varname = 'ScanKeywords';
+
+ GetOptions(
+     'output:s' => \$output_path,
+     'extern'   => \$extern,
+     'varname:s' => \$varname) || usage();
+
+ my $kw_input_file = shift @ARGV || die "No input file.\n";
+
+ # Make sure output_path ends in a slash if needed.
+ if ($output_path ne '' && substr($output_path, -1) ne '/')
+ {
+     $output_path .= '/';
+ }
+
+ $kw_input_file =~ /(\w+)\.h$/ || die "Input file must be named something.h.\n";
+ my $base_filename = $1 . '_d';
+ my $kw_def_file = $output_path . $base_filename . '.h';
+
+ open(my $kif, '<', $kw_input_file) || die "$kw_input_file: $!\n";
+ open(my $kwdef, '>', $kw_def_file) || die "$kw_def_file: $!\n";
+
+ # Opening boilerplate for keyword definition header.
+ printf $kwdef <<EOM, $base_filename, uc $base_filename, uc $base_filename;
+ /*-------------------------------------------------------------------------
+  *
+  * %s.h
+  *    List of keywords represented as a ScanKeywordList.
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * NOTES
+  *  ******************************
+  *  *** DO NOT EDIT THIS FILE! ***
+  *  ******************************
+  *
+  *  It has been GENERATED by src/tools/gen_keywordlist.pl
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ #ifndef %s_H
+ #define %s_H
+
+ #include "common/kwlookup.h"
+
+ EOM
+
+ # Parse input file for keyword names.
+ my @keywords;
+ while (<$kif>)
+ {
+     if (/^PG_KEYWORD\("(\w+)"/)
+     {
+         push @keywords, $1;
+     }
+ }
+
+ # Error out if the keyword names are not in ASCII order.
+ for my $i (0..$#keywords - 1)
+ {
+     die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n|
+       if ($keywords[$i] cmp $keywords[$i + 1]) >= 0;
+ }
+
+ # Emit the string containing all the keywords.
+
+ printf $kwdef qq|static const char %s_kw_string[] =\n\t"|, $varname;
+ print $kwdef join qq|\\0"\n\t"|, @keywords;
+ print $kwdef qq|";\n\n|;
+
+ # Emit an array of numerical offsets which will be used to index into the
+ # keyword string.
+
+ printf $kwdef "static const uint16 %s_kw_offsets[] = {\n", $varname;
+
+ my $offset = 0;
+ foreach my $name (@keywords)
+ {
+     print $kwdef "\t$offset,\n";
+
+     # Calculate the cumulative offset of the next keyword,
+     # taking into account the null terminator.
+     $offset += length($name) + 1;
+ }
+
+ print $kwdef "};\n\n";
+
+ # Emit a macro defining the number of keywords.
+
+ printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords;
+
+ # Emit the struct that wraps all this lookup info into one variable.
+
+ print $kwdef "static " if !$extern;
+ printf $kwdef "const ScanKeywordList %s = {\n", $varname;
+ printf $kwdef qq|\t%s_kw_string,\n|, $varname;
+ printf $kwdef qq|\t%s_kw_offsets,\n|, $varname;
+ printf $kwdef qq|\t%s_NUM_KEYWORDS\n|, uc $varname;
+ print $kwdef "};\n\n";
+
+ printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename;
+
+
+ sub usage
+ {
+     die <<EOM;
+ Usage: gen_keywordlist.pl [--output/-o <path>] [--varname/-v <varname>] [--extern/-e] input_file
+     --output   Output directory (default '.')
+     --varname  Name for ScanKeywordList variable (default 'ScanKeywords')
+     --extern   Allow the ScanKeywordList variable to be globally visible
+
+ gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList.
+ The output filename is derived from the input file by inserting _d,
+ for example kwlist_d.h is produced from kwlist.h.
+ EOM
+ }
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index eb2346b..937bf18 100644
*** a/src/tools/msvc/Solution.pm
--- b/src/tools/msvc/Solution.pm
*************** sub GenerateFiles
*** 410,415 ****
--- 410,451 ----
      }

      if (IsNewer(
+             'src/common/kwlist_d.h',
+             'src/include/parser/kwlist.h'))
+     {
+         print "Generating kwlist_d.h...\n";
+         system('perl src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h');
+     }
+
+     if (IsNewer(
+             'src/pl/plpgsql/src/pl_reserved_kwlist_d.h',
+             'src/pl/plpgsql/src/pl_reserved_kwlist.h')
+         || IsNewer(
+             'src/pl/plpgsql/src/pl_unreserved_kwlist_d.h',
+             'src/pl/plpgsql/src/pl_unreserved_kwlist.h'))
+     {
+         print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n";
+         chdir('src/pl/plpgsql/src');
+         system('perl ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h');
+         system('perl ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h');
+         chdir('../../../..');
+     }
+
+     if (IsNewer(
+             'src/interfaces/ecpg/preproc/c_kwlist_d.h',
+             'src/interfaces/ecpg/preproc/c_kwlist.h')
+         || IsNewer(
+             'src/interfaces/ecpg/preproc/ecpg_kwlist_d.h',
+             'src/interfaces/ecpg/preproc/ecpg_kwlist.h'))
+     {
+         print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n";
+         chdir('src/interfaces/ecpg/preproc');
+         system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h');
+         system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h');
+         chdir('../../../..');
+     }
+
+     if (IsNewer(
              'src/interfaces/ecpg/preproc/preproc.y',
              'src/backend/parser/gram.y'))
      {
diff --git a/src/tools/msvc/clean.bat b/src/tools/msvc/clean.bat
index 7a23a2b..069d6eb 100755
*** a/src/tools/msvc/clean.bat
--- b/src/tools/msvc/clean.bat
*************** if %DIST%==1 if exist src\pl\tcl\pltcler
*** 64,69 ****
--- 64,74 ----
  if %DIST%==1 if exist src\backend\utils\sort\qsort_tuple.c del /q src\backend\utils\sort\qsort_tuple.c
  if %DIST%==1 if exist src\bin\psql\sql_help.c del /q src\bin\psql\sql_help.c
  if %DIST%==1 if exist src\bin\psql\sql_help.h del /q src\bin\psql\sql_help.h
+ if %DIST%==1 if exist src\common\kwlist_d.h del /q src\common\kwlist_d.h
+ if %DIST%==1 if exist src\pl\plpgsql\src\pl_reserved_kwlist_d.h del /q src\pl\plpgsql\src\pl_reserved_kwlist_d.h
+ if %DIST%==1 if exist src\pl\plpgsql\src\pl_unreserved_kwlist_d.h del /q src\pl\plpgsql\src\pl_unreserved_kwlist_d.h
+ if %DIST%==1 if exist src\interfaces\ecpg\preproc\c_kwlist_d.h del /q src\interfaces\ecpg\preproc\c_kwlist_d.h
+ if %DIST%==1 if exist src\interfaces\ecpg\preproc\ecpg_kwlist_d.h del /q src\interfaces\ecpg\preproc\ecpg_kwlist_d.h
  if %DIST%==1 if exist src\interfaces\ecpg\preproc\preproc.y del /q src\interfaces\ecpg\preproc\preproc.y
  if %DIST%==1 if exist src\backend\catalog\postgres.bki del /q src\backend\catalog\postgres.bki
  if %DIST%==1 if exist src\backend\catalog\postgres.description del /q src\backend\catalog\postgres.description

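[Editor's note: for readers skimming the patch above, this is roughly the shape of header that gen_keywordlist.pl generates. The keyword names and the DemoKeywords variable below are made up for illustration; the struct mirrors the patch's ScanKeywordList from common/kwlookup.h, redeclared here so the sketch compiles standalone.]

```c
#include <string.h>

/*
 * Stand-in for the ScanKeywordList declaration in common/kwlookup.h
 * (redeclared so this sketch is self-contained).
 */
typedef struct ScanKeywordList
{
	const char *kw_string;		/* all keywords, NUL-separated, in order */
	const unsigned short *kw_offsets;	/* offset of each keyword's start */
	int			num_keywords;	/* number of keywords */
} ScanKeywordList;

/*
 * What the generator would emit for a hypothetical three-keyword input:
 * one big string constant (adjacent literals concatenate, and each
 * embedded \0 terminates one keyword)...
 */
static const char DemoKeywords_kw_string[] =
	"alias\0"
	"array\0"
	"assert";

/* ...plus a parallel array of offsets into that string... */
static const unsigned short DemoKeywords_kw_offsets[] = {
	0,							/* "alias" */
	6,							/* "array" */
	12,							/* "assert" */
};

#define DEMOKEYWORDS_NUM_KEYWORDS 3

/* ...wrapped into one const variable for ScanKeywordLookup to consume. */
static const ScanKeywordList DemoKeywords = {
	DemoKeywords_kw_string,
	DemoKeywords_kw_offsets,
	DEMOKEYWORDS_NUM_KEYWORDS
};
```

The payoff is that the per-keyword cost is a 2-byte offset rather than a pointer plus padding, and all the names sit adjacent in one read-only blob.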
I wrote:
> I spent some time hacking on this today, and I think it's committable
> now, but I'm putting it back up in case anyone wants to have another
> look (and also so the cfbot can check it on Windows).

... and indeed, the cfbot doesn't like it.  Here's v8, with the
missing addition to Mkvcbuild.pm.

            regards, tom lane
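[Editor's note: the lookup side of the v8 patch below can be sketched in miniature like this. The keyword names are invented; the real list comes from kwlist.h, and the loop mirrors the binary search in the patch's src/common/kwlookup.c.]

```c
#include <string.h>

/* Hypothetical four-keyword list in the patch's offset representation. */
static const char demo_kw_string[] =
	"abort\0"
	"begin\0"
	"commit\0"
	"select";

static const unsigned short demo_kw_offsets[] = {0, 6, 12, 19};

#define DEMO_NUM_KEYWORDS 4

/*
 * Binary search over the offsets array, comparing against the keyword
 * text found at each offset.  Returns the keyword number (0..N-1), or
 * -1 if no match --- the same contract as the patched ScanKeywordLookup.
 * (Case-folding of the input is omitted here for brevity.)
 */
static int
demo_keyword_lookup(const char *word)
{
	int			low = 0;
	int			high = DEMO_NUM_KEYWORDS - 1;

	while (low <= high)
	{
		int			middle = low + (high - low) / 2;
		int			difference = strcmp(demo_kw_string +
										demo_kw_offsets[middle], word);

		if (difference == 0)
			return middle;		/* keyword number indexes token arrays */
		else if (difference < 0)
			low = middle + 1;
		else
			high = middle - 1;
	}
	return -1;					/* not a keyword */
}
```

Callers then use that keyword number to index whatever per-keyword array they need (Bison token values, categories), which is why those arrays can live with the callers instead of inside the shared keyword list.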

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index e8ef966..9131991 100644
*** a/contrib/pg_stat_statements/pg_stat_statements.c
--- b/contrib/pg_stat_statements/pg_stat_statements.c
*************** fill_in_constant_lengths(pgssJumbleState
*** 3075,3082 ****
      /* initialize the flex scanner --- should match raw_parser() */
      yyscanner = scanner_init(query,
                               &yyextra,
!                              ScanKeywords,
!                              NumScanKeywords);

      /* we don't want to re-emit any escape string warnings */
      yyextra.escape_string_warning = false;
--- 3075,3082 ----
      /* initialize the flex scanner --- should match raw_parser() */
      yyscanner = scanner_init(query,
                               &yyextra,
!                              &ScanKeywords,
!                              ScanKeywordTokens);

      /* we don't want to re-emit any escape string warnings */
      yyextra.escape_string_warning = false;
diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index 7e9b122..4c0c258 100644
*** a/src/backend/parser/parser.c
--- b/src/backend/parser/parser.c
*************** raw_parser(const char *str)
*** 41,47 ****

      /* initialize the flex scanner */
      yyscanner = scanner_init(str, &yyextra.core_yy_extra,
!                              ScanKeywords, NumScanKeywords);

      /* base_yylex() only needs this much initialization */
      yyextra.have_lookahead = false;
--- 41,47 ----

      /* initialize the flex scanner */
      yyscanner = scanner_init(str, &yyextra.core_yy_extra,
!                              &ScanKeywords, ScanKeywordTokens);

      /* base_yylex() only needs this much initialization */
      yyextra.have_lookahead = false;
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index fbeb86f..e1cae85 100644
*** a/src/backend/parser/scan.l
--- b/src/backend/parser/scan.l
*************** bool        escape_string_warning = true;
*** 67,72 ****
--- 67,87 ----
  bool        standard_conforming_strings = true;

  /*
+  * Constant data exported from this file.  This array maps from the
+  * zero-based keyword numbers returned by ScanKeywordLookup to the
+  * Bison token numbers needed by gram.y.  This is exported because
+  * callers need to pass it to scanner_init, if they are using the
+  * standard keyword list ScanKeywords.
+  */
+ #define PG_KEYWORD(kwname, value, category) value,
+
+ const uint16 ScanKeywordTokens[] = {
+ #include "parser/kwlist.h"
+ };
+
+ #undef PG_KEYWORD
+
+ /*
   * Set the type of YYSTYPE.
   */
  #define YYSTYPE core_YYSTYPE
*************** other            .
*** 504,521 ****
                       * We will pass this along as a normal character string,
                       * but preceded with an internally-generated "NCHAR".
                       */
!                     const ScanKeyword *keyword;

                      SET_YYLLOC();
                      yyless(1);    /* eat only 'n' this time */

!                     keyword = ScanKeywordLookup("nchar",
!                                                 yyextra->keywords,
!                                                 yyextra->num_keywords);
!                     if (keyword != NULL)
                      {
!                         yylval->keyword = keyword->name;
!                         return keyword->value;
                      }
                      else
                      {
--- 519,536 ----
                       * We will pass this along as a normal character string,
                       * but preceded with an internally-generated "NCHAR".
                       */
!                     int        kwnum;

                      SET_YYLLOC();
                      yyless(1);    /* eat only 'n' this time */

!                     kwnum = ScanKeywordLookup("nchar",
!                                               yyextra->keywordlist);
!                     if (kwnum >= 0)
                      {
!                         yylval->keyword = GetScanKeyword(kwnum,
!                                                          yyextra->keywordlist);
!                         return yyextra->keyword_tokens[kwnum];
                      }
                      else
                      {
*************** other            .
*** 1021,1039 ****


  {identifier}    {
!                     const ScanKeyword *keyword;
                      char       *ident;

                      SET_YYLLOC();

                      /* Is it a keyword? */
!                     keyword = ScanKeywordLookup(yytext,
!                                                 yyextra->keywords,
!                                                 yyextra->num_keywords);
!                     if (keyword != NULL)
                      {
!                         yylval->keyword = keyword->name;
!                         return keyword->value;
                      }

                      /*
--- 1036,1054 ----


  {identifier}    {
!                     int            kwnum;
                      char       *ident;

                      SET_YYLLOC();

                      /* Is it a keyword? */
!                     kwnum = ScanKeywordLookup(yytext,
!                                               yyextra->keywordlist);
!                     if (kwnum >= 0)
                      {
!                         yylval->keyword = GetScanKeyword(kwnum,
!                                                          yyextra->keywordlist);
!                         return yyextra->keyword_tokens[kwnum];
                      }

                      /*
*************** scanner_yyerror(const char *message, cor
*** 1142,1149 ****
  core_yyscan_t
  scanner_init(const char *str,
               core_yy_extra_type *yyext,
!              const ScanKeyword *keywords,
!              int num_keywords)
  {
      Size        slen = strlen(str);
      yyscan_t    scanner;
--- 1157,1164 ----
  core_yyscan_t
  scanner_init(const char *str,
               core_yy_extra_type *yyext,
!              const ScanKeywordList *keywordlist,
!              const uint16 *keyword_tokens)
  {
      Size        slen = strlen(str);
      yyscan_t    scanner;
*************** scanner_init(const char *str,
*** 1153,1160 ****

      core_yyset_extra(yyext, scanner);

!     yyext->keywords = keywords;
!     yyext->num_keywords = num_keywords;

      yyext->backslash_quote = backslash_quote;
      yyext->escape_string_warning = escape_string_warning;
--- 1168,1175 ----

      core_yyset_extra(yyext, scanner);

!     yyext->keywordlist = keywordlist;
!     yyext->keyword_tokens = keyword_tokens;

      yyext->backslash_quote = backslash_quote;
      yyext->escape_string_warning = escape_string_warning;
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 7b69b82..746b7d2 100644
*** a/src/backend/utils/adt/misc.c
--- b/src/backend/utils/adt/misc.c
*************** pg_get_keywords(PG_FUNCTION_ARGS)
*** 417,431 ****

      funcctx = SRF_PERCALL_SETUP();

!     if (funcctx->call_cntr < NumScanKeywords)
      {
          char       *values[3];
          HeapTuple    tuple;

          /* cast-away-const is ugly but alternatives aren't much better */
!         values[0] = unconstify(char *, ScanKeywords[funcctx->call_cntr].name);

!         switch (ScanKeywords[funcctx->call_cntr].category)
          {
              case UNRESERVED_KEYWORD:
                  values[1] = "U";
--- 417,433 ----

      funcctx = SRF_PERCALL_SETUP();

!     if (funcctx->call_cntr < ScanKeywords.num_keywords)
      {
          char       *values[3];
          HeapTuple    tuple;

          /* cast-away-const is ugly but alternatives aren't much better */
!         values[0] = unconstify(char *,
!                                GetScanKeyword(funcctx->call_cntr,
!                                               &ScanKeywords));

!         switch (ScanKeywordCategories[funcctx->call_cntr])
          {
              case UNRESERVED_KEYWORD:
                  values[1] = "U";
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 368eacf..77811f6 100644
*** a/src/backend/utils/adt/ruleutils.c
--- b/src/backend/utils/adt/ruleutils.c
*************** quote_identifier(const char *ident)
*** 10601,10611 ****
           * Note: ScanKeywordLookup() does case-insensitive comparison, but
           * that's fine, since we already know we have all-lower-case.
           */
!         const ScanKeyword *keyword = ScanKeywordLookup(ident,
!                                                        ScanKeywords,
!                                                        NumScanKeywords);

!         if (keyword != NULL && keyword->category != UNRESERVED_KEYWORD)
              safe = false;
      }

--- 10601,10609 ----
           * Note: ScanKeywordLookup() does case-insensitive comparison, but
           * that's fine, since we already know we have all-lower-case.
           */
!         int            kwnum = ScanKeywordLookup(ident, &ScanKeywords);

!         if (kwnum >= 0 && ScanKeywordCategories[kwnum] != UNRESERVED_KEYWORD)
              safe = false;
      }

diff --git a/src/common/.gitignore b/src/common/.gitignore
index ...ffa3284 .
*** a/src/common/.gitignore
--- b/src/common/.gitignore
***************
*** 0 ****
--- 1 ----
+ /kwlist_d.h
diff --git a/src/common/Makefile b/src/common/Makefile
index ec8139f..317b071 100644
*** a/src/common/Makefile
--- b/src/common/Makefile
*************** override CPPFLAGS += -DVAL_LDFLAGS_EX="\
*** 41,51 ****
  override CPPFLAGS += -DVAL_LDFLAGS_SL="\"$(LDFLAGS_SL)\""
  override CPPFLAGS += -DVAL_LIBS="\"$(LIBS)\""

! override CPPFLAGS := -DFRONTEND $(CPPFLAGS)
  LIBS += $(PTHREAD_LIBS)

  OBJS_COMMON = base64.o config_info.o controldata_utils.o exec.o file_perm.o \
!     ip.o keywords.o link-canary.o md5.o pg_lzcompress.o \
      pgfnames.o psprintf.o relpath.o \
      rmtree.o saslprep.o scram-common.o string.o unicode_norm.o \
      username.o wait_error.o
--- 41,51 ----
  override CPPFLAGS += -DVAL_LDFLAGS_SL="\"$(LDFLAGS_SL)\""
  override CPPFLAGS += -DVAL_LIBS="\"$(LIBS)\""

! override CPPFLAGS := -DFRONTEND -I. -I$(top_srcdir)/src/common $(CPPFLAGS)
  LIBS += $(PTHREAD_LIBS)

  OBJS_COMMON = base64.o config_info.o controldata_utils.o exec.o file_perm.o \
!     ip.o keywords.o kwlookup.o link-canary.o md5.o pg_lzcompress.o \
      pgfnames.o psprintf.o relpath.o \
      rmtree.o saslprep.o scram-common.o string.o unicode_norm.o \
      username.o wait_error.o
*************** OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o)
*** 65,70 ****
--- 65,72 ----

  all: libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a

+ distprep: kwlist_d.h
+
  # libpgcommon is needed by some contrib
  install: all installdirs
      $(INSTALL_STLIB) libpgcommon.a '$(DESTDIR)$(libdir)/libpgcommon.a'
*************** libpgcommon_srv.a: $(OBJS_SRV)
*** 115,130 ****
  %_srv.o: %.c %.o
      $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@

! # Dependencies of keywords.o need to be managed explicitly to make sure
! # that you don't get broken parsing code, even in a non-enable-depend build.
! # Note that gram.h isn't required for the frontend versions of keywords.o.
! $(top_builddir)/src/include/parser/gram.h: $(top_srcdir)/src/backend/parser/gram.y
!     $(MAKE) -C $(top_builddir)/src/backend $(top_builddir)/src/include/parser/gram.h

! keywords.o: $(top_srcdir)/src/include/parser/kwlist.h
! keywords_shlib.o: $(top_srcdir)/src/include/parser/kwlist.h
! keywords_srv.o: $(top_builddir)/src/include/parser/gram.h $(top_srcdir)/src/include/parser/kwlist.h

! clean distclean maintainer-clean:
      rm -f libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a
      rm -f $(OBJS_FRONTEND) $(OBJS_SHLIB) $(OBJS_SRV)
--- 117,134 ----
  %_srv.o: %.c %.o
      $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@

! # generate SQL keyword lookup table to be included into keywords*.o.
! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(top_srcdir)/src/tools/gen_keywordlist.pl
!     $(PERL) $(top_srcdir)/src/tools/gen_keywordlist.pl --extern $<

! # Dependencies of keywords*.o need to be managed explicitly to make sure
! # that you don't get broken parsing code, even in a non-enable-depend build.
! keywords.o keywords_shlib.o keywords_srv.o: kwlist_d.h

! # kwlist_d.h is in the distribution tarball, so it is not cleaned here.
! clean distclean:
      rm -f libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a
      rm -f $(OBJS_FRONTEND) $(OBJS_SHLIB) $(OBJS_SRV)
+
+ maintainer-clean: distclean
+     rm -f kwlist_d.h
diff --git a/src/common/keywords.c b/src/common/keywords.c
index 6f99090..103166c 100644
*** a/src/common/keywords.c
--- b/src/common/keywords.c
***************
*** 1,7 ****
  /*-------------------------------------------------------------------------
   *
   * keywords.c
!  *      lexical token lookup for key words in PostgreSQL
   *
   *
   * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
--- 1,7 ----
  /*-------------------------------------------------------------------------
   *
   * keywords.c
!  *      PostgreSQL's list of SQL keywords
   *
   *
   * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
***************
*** 19,114 ****
  #include "postgres_fe.h"
  #endif

! #ifndef FRONTEND
!
! #include "parser/gramparse.h"
!
! #define PG_KEYWORD(a,b,c) {a,b,c},

- #else

! #include "common/keywords.h"

! /*
!  * We don't need the token number for frontend uses, so leave it out to avoid
!  * requiring backend headers that won't compile cleanly here.
!  */
! #define PG_KEYWORD(a,b,c) {a,0,c},

! #endif                            /* FRONTEND */


! const ScanKeyword ScanKeywords[] = {
  #include "parser/kwlist.h"
  };

! const int    NumScanKeywords = lengthof(ScanKeywords);
!
!
! /*
!  * ScanKeywordLookup - see if a given word is a keyword
!  *
!  * The table to be searched is passed explicitly, so that this can be used
!  * to search keyword lists other than the standard list appearing above.
!  *
!  * Returns a pointer to the ScanKeyword table entry, or NULL if no match.
!  *
!  * The match is done case-insensitively.  Note that we deliberately use a
!  * dumbed-down case conversion that will only translate 'A'-'Z' into 'a'-'z',
!  * even if we are in a locale where tolower() would produce more or different
!  * translations.  This is to conform to the SQL99 spec, which says that
!  * keywords are to be matched in this way even though non-keyword identifiers
!  * receive a different case-normalization mapping.
!  */
! const ScanKeyword *
! ScanKeywordLookup(const char *text,
!                   const ScanKeyword *keywords,
!                   int num_keywords)
! {
!     int            len,
!                 i;
!     char        word[NAMEDATALEN];
!     const ScanKeyword *low;
!     const ScanKeyword *high;
!
!     len = strlen(text);
!     /* We assume all keywords are shorter than NAMEDATALEN. */
!     if (len >= NAMEDATALEN)
!         return NULL;
!
!     /*
!      * Apply an ASCII-only downcasing.  We must not use tolower() since it may
!      * produce the wrong translation in some locales (eg, Turkish).
!      */
!     for (i = 0; i < len; i++)
!     {
!         char        ch = text[i];
!
!         if (ch >= 'A' && ch <= 'Z')
!             ch += 'a' - 'A';
!         word[i] = ch;
!     }
!     word[len] = '\0';
!
!     /*
!      * Now do a binary search using plain strcmp() comparison.
!      */
!     low = keywords;
!     high = keywords + (num_keywords - 1);
!     while (low <= high)
!     {
!         const ScanKeyword *middle;
!         int            difference;
!
!         middle = low + (high - low) / 2;
!         difference = strcmp(middle->name, word);
!         if (difference == 0)
!             return middle;
!         else if (difference < 0)
!             low = middle + 1;
!         else
!             high = middle - 1;
!     }
!
!     return NULL;
! }
--- 19,37 ----
  #include "postgres_fe.h"
  #endif

! #include "common/keywords.h"


! /* ScanKeywordList lookup data for SQL keywords */

! #include "kwlist_d.h"

! /* Keyword categories for SQL keywords */

+ #define PG_KEYWORD(kwname, value, category) category,

! const uint8 ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS] = {
  #include "parser/kwlist.h"
  };

! #undef PG_KEYWORD
diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c
index ...db62623 .
*** a/src/common/kwlookup.c
--- b/src/common/kwlookup.c
***************
*** 0 ****
--- 1,91 ----
+ /*-------------------------------------------------------------------------
+  *
+  * kwlookup.c
+  *      Key word lookup for PostgreSQL
+  *
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *      src/common/kwlookup.c
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "c.h"
+
+ #include "common/kwlookup.h"
+
+
+ /*
+  * ScanKeywordLookup - see if a given word is a keyword
+  *
+  * The list of keywords to be matched against is passed as a ScanKeywordList.
+  *
+  * Returns the keyword number (0..N-1) of the keyword, or -1 if no match.
+  * Callers typically use the keyword number to index into information
+  * arrays, but that is no concern of this code.
+  *
+  * The match is done case-insensitively.  Note that we deliberately use a
+  * dumbed-down case conversion that will only translate 'A'-'Z' into 'a'-'z',
+  * even if we are in a locale where tolower() would produce more or different
+  * translations.  This is to conform to the SQL99 spec, which says that
+  * keywords are to be matched in this way even though non-keyword identifiers
+  * receive a different case-normalization mapping.
+  */
+ int
+ ScanKeywordLookup(const char *text,
+                   const ScanKeywordList *keywords)
+ {
+     int            len,
+                 i;
+     char        word[NAMEDATALEN];
+     const char *kw_string;
+     const uint16 *kw_offsets;
+     const uint16 *low;
+     const uint16 *high;
+
+     len = strlen(text);
+     /* We assume all keywords are shorter than NAMEDATALEN. */
+     if (len >= NAMEDATALEN)
+         return -1;
+
+     /*
+      * Apply an ASCII-only downcasing.  We must not use tolower() since it may
+      * produce the wrong translation in some locales (eg, Turkish).
+      */
+     for (i = 0; i < len; i++)
+     {
+         char        ch = text[i];
+
+         if (ch >= 'A' && ch <= 'Z')
+             ch += 'a' - 'A';
+         word[i] = ch;
+     }
+     word[len] = '\0';
+
+     /*
+      * Now do a binary search using plain strcmp() comparison.
+      */
+     kw_string = keywords->kw_string;
+     kw_offsets = keywords->kw_offsets;
+     low = kw_offsets;
+     high = kw_offsets + (keywords->num_keywords - 1);
+     while (low <= high)
+     {
+         const uint16 *middle;
+         int            difference;
+
+         middle = low + (high - low) / 2;
+         difference = strcmp(kw_string + *middle, word);
+         if (difference == 0)
+             return middle - kw_offsets;
+         else if (difference < 0)
+             low = middle + 1;
+         else
+             high = middle - 1;
+     }
+
+     return -1;
+ }
diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c
index 9b47b62..5c1732a 100644
*** a/src/fe_utils/string_utils.c
--- b/src/fe_utils/string_utils.c
*************** fmtId(const char *rawid)
*** 104,114 ****
           * Note: ScanKeywordLookup() does case-insensitive comparison, but
           * that's fine, since we already know we have all-lower-case.
           */
!         const ScanKeyword *keyword = ScanKeywordLookup(rawid,
!                                                        ScanKeywords,
!                                                        NumScanKeywords);

!         if (keyword != NULL && keyword->category != UNRESERVED_KEYWORD)
              need_quotes = true;
      }

--- 104,112 ----
           * Note: ScanKeywordLookup() does case-insensitive comparison, but
           * that's fine, since we already know we have all-lower-case.
           */
!         int            kwnum = ScanKeywordLookup(rawid, &ScanKeywords);

!         if (kwnum >= 0 && ScanKeywordCategories[kwnum] != UNRESERVED_KEYWORD)
              need_quotes = true;
      }

diff --git a/src/include/common/keywords.h b/src/include/common/keywords.h
index 8f22f32..fb18858 100644
*** a/src/include/common/keywords.h
--- b/src/include/common/keywords.h
***************
*** 1,7 ****
  /*-------------------------------------------------------------------------
   *
   * keywords.h
!  *      lexical token lookup for key words in PostgreSQL
   *
   *
   * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
--- 1,7 ----
  /*-------------------------------------------------------------------------
   *
   * keywords.h
!  *      PostgreSQL's list of SQL keywords
   *
   *
   * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
***************
*** 14,44 ****
  #ifndef KEYWORDS_H
  #define KEYWORDS_H

  /* Keyword categories --- should match lists in gram.y */
  #define UNRESERVED_KEYWORD        0
  #define COL_NAME_KEYWORD        1
  #define TYPE_FUNC_NAME_KEYWORD    2
  #define RESERVED_KEYWORD        3

-
- typedef struct ScanKeyword
- {
-     const char *name;            /* in lower case */
-     int16        value;            /* grammar's token code */
-     int16        category;        /* see codes above */
- } ScanKeyword;
-
  #ifndef FRONTEND
! extern PGDLLIMPORT const ScanKeyword ScanKeywords[];
! extern PGDLLIMPORT const int NumScanKeywords;
  #else
! extern const ScanKeyword ScanKeywords[];
! extern const int NumScanKeywords;
  #endif

-
- extern const ScanKeyword *ScanKeywordLookup(const char *text,
-                   const ScanKeyword *keywords,
-                   int num_keywords);
-
  #endif                            /* KEYWORDS_H */
--- 14,33 ----
  #ifndef KEYWORDS_H
  #define KEYWORDS_H

+ #include "common/kwlookup.h"
+
  /* Keyword categories --- should match lists in gram.y */
  #define UNRESERVED_KEYWORD        0
  #define COL_NAME_KEYWORD        1
  #define TYPE_FUNC_NAME_KEYWORD    2
  #define RESERVED_KEYWORD        3

  #ifndef FRONTEND
! extern PGDLLIMPORT const ScanKeywordList ScanKeywords;
! extern PGDLLIMPORT const uint8 ScanKeywordCategories[];
  #else
! extern const ScanKeywordList ScanKeywords;
! extern const uint8 ScanKeywordCategories[];
  #endif

  #endif                            /* KEYWORDS_H */
diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h
index 0000000..3098df3
*** a/src/include/common/kwlookup.h
--- b/src/include/common/kwlookup.h
***************
*** 0 ****
--- 1,39 ----
+ /*-------------------------------------------------------------------------
+  *
+  * kwlookup.h
+  *      Key word lookup for PostgreSQL
+  *
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/include/common/kwlookup.h
+  *
+  *-------------------------------------------------------------------------
+  */
+ #ifndef KWLOOKUP_H
+ #define KWLOOKUP_H
+
+ /*
+  * This struct contains the data needed by ScanKeywordLookup to perform a
+  * search within a set of keywords.  The contents are typically generated by
+  * src/tools/gen_keywordlist.pl from a header containing PG_KEYWORD macros.
+  */
+ typedef struct ScanKeywordList
+ {
+     const char *kw_string;        /* all keywords in order, separated by \0 */
+     const uint16 *kw_offsets;    /* offsets to the start of each keyword */
+     int            num_keywords;    /* number of keywords */
+ } ScanKeywordList;
+
+
+ extern int    ScanKeywordLookup(const char *text, const ScanKeywordList *keywords);
+
+ /* Code that wants to retrieve the text of the N'th keyword should use this. */
+ static inline const char *
+ GetScanKeyword(int n, const ScanKeywordList *keywords)
+ {
+     return keywords->kw_string + keywords->kw_offsets[n];
+ }
+
+ #endif                            /* KWLOOKUP_H */
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 0256d53..b8902d3 100644
*** a/src/include/parser/kwlist.h
--- b/src/include/parser/kwlist.h
***************
*** 2,8 ****
   *
   * kwlist.h
   *
!  * The keyword list is kept in its own source file for possible use by
   * automatic tools.  The exact representation of a keyword is determined
   * by the PG_KEYWORD macro, which is not defined in this file; it can
   * be defined by the caller for special purposes.
--- 2,8 ----
   *
   * kwlist.h
   *
!  * The keyword lists are kept in their own source files for use by
   * automatic tools.  The exact representation of a keyword is determined
   * by the PG_KEYWORD macro, which is not defined in this file; it can
   * be defined by the caller for special purposes.
diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h
index 009550f..91e1c83 100644
*** a/src/include/parser/scanner.h
--- b/src/include/parser/scanner.h
*************** typedef struct core_yy_extra_type
*** 73,82 ****
      Size        scanbuflen;

      /*
!      * The keyword list to use.
       */
!     const ScanKeyword *keywords;
!     int            num_keywords;

      /*
       * Scanner settings to use.  These are initialized from the corresponding
--- 73,82 ----
      Size        scanbuflen;

      /*
!      * The keyword list to use, and the associated grammar token codes.
       */
!     const ScanKeywordList *keywordlist;
!     const uint16 *keyword_tokens;

      /*
       * Scanner settings to use.  These are initialized from the corresponding
*************** typedef struct core_yy_extra_type
*** 116,126 ****
  typedef void *core_yyscan_t;


  /* Entry points in parser/scan.l */
  extern core_yyscan_t scanner_init(const char *str,
               core_yy_extra_type *yyext,
!              const ScanKeyword *keywords,
!              int num_keywords);
  extern void scanner_finish(core_yyscan_t yyscanner);
  extern int core_yylex(core_YYSTYPE *lvalp, YYLTYPE *llocp,
             core_yyscan_t yyscanner);
--- 116,129 ----
  typedef void *core_yyscan_t;


+ /* Constant data exported from parser/scan.l */
+ extern PGDLLIMPORT const uint16 ScanKeywordTokens[];
+
  /* Entry points in parser/scan.l */
  extern core_yyscan_t scanner_init(const char *str,
               core_yy_extra_type *yyext,
!              const ScanKeywordList *keywordlist,
!              const uint16 *keyword_tokens);
  extern void scanner_finish(core_yyscan_t yyscanner);
  extern int core_yylex(core_YYSTYPE *lvalp, YYLTYPE *llocp,
             core_yyscan_t yyscanner);
diff --git a/src/interfaces/ecpg/preproc/.gitignore b/src/interfaces/ecpg/preproc/.gitignore
index 38ae2fe..958a826 100644
*** a/src/interfaces/ecpg/preproc/.gitignore
--- b/src/interfaces/ecpg/preproc/.gitignore
***************
*** 2,6 ****
--- 2,8 ----
  /preproc.c
  /preproc.h
  /pgc.c
+ /c_kwlist_d.h
+ /ecpg_kwlist_d.h
  /typename.c
  /ecpg
diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile
index 69ddd8e..9b145a1 100644
*** a/src/interfaces/ecpg/preproc/Makefile
--- b/src/interfaces/ecpg/preproc/Makefile
*************** OBJS=    preproc.o pgc.o type.o ecpg.o outp
*** 28,33 ****
--- 28,35 ----
      keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \
      $(WIN32RES)

+ GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl
+
  # Suppress parallel build to avoid a bug in GNU make 3.82
  # (see comments in ../Makefile)
  ifeq ($(MAKE_VERSION),3.82)
*************** preproc.y: ../../../backend/parser/gram.
*** 53,61 ****
      $(PERL) $(srcdir)/parse.pl $(srcdir) < $< > $@
      $(PERL) $(srcdir)/check_rules.pl $(srcdir) $<

  ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h

! distprep: preproc.y preproc.c preproc.h pgc.c

  install: all installdirs
      $(INSTALL_PROGRAM) ecpg$(X) '$(DESTDIR)$(bindir)'
--- 55,73 ----
      $(PERL) $(srcdir)/parse.pl $(srcdir) < $< > $@
      $(PERL) $(srcdir)/check_rules.pl $(srcdir) $<

+ # generate keyword headers
+ c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST)
+     $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $<
+
+ ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST)
+     $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $<
+
+ # Force these dependencies to be known even without dependency info built:
  ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h
+ ecpg_keywords.o: ecpg_kwlist_d.h
+ c_keywords.o: c_kwlist_d.h

! distprep: preproc.y preproc.c preproc.h pgc.c c_kwlist_d.h ecpg_kwlist_d.h

  install: all installdirs
      $(INSTALL_PROGRAM) ecpg$(X) '$(DESTDIR)$(bindir)'
*************** installdirs:
*** 66,77 ****
  uninstall:
      rm -f '$(DESTDIR)$(bindir)/ecpg$(X)'

  clean distclean:
      rm -f *.o ecpg$(X)
      rm -f typename.c

- # `make distclean' must not remove preproc.y, preproc.c, preproc.h, or pgc.c
- # since we want to ship those files in the distribution for people with
- # inadequate tools.  Instead, `make maintainer-clean' will remove them.
  maintainer-clean: distclean
!     rm -f preproc.y preproc.c preproc.h pgc.c
--- 78,88 ----
  uninstall:
      rm -f '$(DESTDIR)$(bindir)/ecpg$(X)'

+ # preproc.y, preproc.c, preproc.h, pgc.c, c_kwlist_d.h, and ecpg_kwlist_d.h
+ # are in the distribution tarball, so they are not cleaned here.
  clean distclean:
      rm -f *.o ecpg$(X)
      rm -f typename.c

  maintainer-clean: distclean
!     rm -f preproc.y preproc.c preproc.h pgc.c c_kwlist_d.h ecpg_kwlist_d.h
diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c
index c367dbf..521992f 100644
*** a/src/interfaces/ecpg/preproc/c_keywords.c
--- b/src/interfaces/ecpg/preproc/c_keywords.c
***************
*** 14,85 ****
  #include "preproc_extern.h"
  #include "preproc.h"

! /*
!  * List of (keyword-name, keyword-token-value) pairs.
!  *
!  * !!WARNING!!: This list must be sorted, because binary
!  *         search is used to locate entries.
!  */
! static const ScanKeyword ScanCKeywords[] = {
!     /* name, value, category */

!     /*
!      * category is not needed in ecpg, it is only here so we can share the
!      * data structure with the backend
!      */
!     {"VARCHAR", VARCHAR, 0},
!     {"auto", S_AUTO, 0},
!     {"bool", SQL_BOOL, 0},
!     {"char", CHAR_P, 0},
!     {"const", S_CONST, 0},
!     {"enum", ENUM_P, 0},
!     {"extern", S_EXTERN, 0},
!     {"float", FLOAT_P, 0},
!     {"hour", HOUR_P, 0},
!     {"int", INT_P, 0},
!     {"long", SQL_LONG, 0},
!     {"minute", MINUTE_P, 0},
!     {"month", MONTH_P, 0},
!     {"register", S_REGISTER, 0},
!     {"second", SECOND_P, 0},
!     {"short", SQL_SHORT, 0},
!     {"signed", SQL_SIGNED, 0},
!     {"static", S_STATIC, 0},
!     {"struct", SQL_STRUCT, 0},
!     {"to", TO, 0},
!     {"typedef", S_TYPEDEF, 0},
!     {"union", UNION, 0},
!     {"unsigned", SQL_UNSIGNED, 0},
!     {"varchar", VARCHAR, 0},
!     {"volatile", S_VOLATILE, 0},
!     {"year", YEAR_P, 0},
  };


  /*
   * Do a binary search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
! const ScanKeyword *
  ScanCKeywordLookup(const char *text)
  {
!     const ScanKeyword *low = &ScanCKeywords[0];
!     const ScanKeyword *high = &ScanCKeywords[lengthof(ScanCKeywords) - 1];

      while (low <= high)
      {
!         const ScanKeyword *middle;
          int            difference;

          middle = low + (high - low) / 2;
!         difference = strcmp(middle->name, text);
          if (difference == 0)
!             return middle;
          else if (difference < 0)
              low = middle + 1;
          else
              high = middle - 1;
      }

!     return NULL;
  }
--- 14,67 ----
  #include "preproc_extern.h"
  #include "preproc.h"

! /* ScanKeywordList lookup data for C keywords */
! #include "c_kwlist_d.h"

! /* Token codes for C keywords */
! #define PG_KEYWORD(kwname, value) value,
!
! static const uint16 ScanCKeywordTokens[] = {
! #include "c_kwlist.h"
  };

+ #undef PG_KEYWORD
+

  /*
+  * ScanCKeywordLookup - see if a given word is a keyword
+  *
+  * Returns the token value of the keyword, or -1 if no match.
+  *
   * Do a binary search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
! int
  ScanCKeywordLookup(const char *text)
  {
!     const char *kw_string;
!     const uint16 *kw_offsets;
!     const uint16 *low;
!     const uint16 *high;
!
!     kw_string = ScanCKeywords.kw_string;
!     kw_offsets = ScanCKeywords.kw_offsets;
!     low = kw_offsets;
!     high = kw_offsets + (ScanCKeywords.num_keywords - 1);

      while (low <= high)
      {
!         const uint16 *middle;
          int            difference;

          middle = low + (high - low) / 2;
!         difference = strcmp(kw_string + *middle, text);
          if (difference == 0)
!             return ScanCKeywordTokens[middle - kw_offsets];
          else if (difference < 0)
              low = middle + 1;
          else
              high = middle - 1;
      }

!     return -1;
  }
diff --git a/src/interfaces/ecpg/preproc/c_kwlist.h b/src/interfaces/ecpg/preproc/c_kwlist.h
index 0000000..4545505
*** a/src/interfaces/ecpg/preproc/c_kwlist.h
--- b/src/interfaces/ecpg/preproc/c_kwlist.h
***************
*** 0 ****
--- 1,53 ----
+ /*-------------------------------------------------------------------------
+  *
+  * c_kwlist.h
+  *
+  * The keyword lists are kept in their own source files for use by
+  * automatic tools.  The exact representation of a keyword is determined
+  * by the PG_KEYWORD macro, which is not defined in this file; it can
+  * be defined by the caller for special purposes.
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/interfaces/ecpg/preproc/c_kwlist.h
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ /* There is deliberately not an #ifndef C_KWLIST_H here. */
+
+ /*
+  * List of (keyword-name, keyword-token-value) pairs.
+  *
+  * !!WARNING!!: This list must be sorted by ASCII name, because binary
+  *         search is used to locate entries.
+  */
+
+ /* name, value */
+ PG_KEYWORD("VARCHAR", VARCHAR)
+ PG_KEYWORD("auto", S_AUTO)
+ PG_KEYWORD("bool", SQL_BOOL)
+ PG_KEYWORD("char", CHAR_P)
+ PG_KEYWORD("const", S_CONST)
+ PG_KEYWORD("enum", ENUM_P)
+ PG_KEYWORD("extern", S_EXTERN)
+ PG_KEYWORD("float", FLOAT_P)
+ PG_KEYWORD("hour", HOUR_P)
+ PG_KEYWORD("int", INT_P)
+ PG_KEYWORD("long", SQL_LONG)
+ PG_KEYWORD("minute", MINUTE_P)
+ PG_KEYWORD("month", MONTH_P)
+ PG_KEYWORD("register", S_REGISTER)
+ PG_KEYWORD("second", SECOND_P)
+ PG_KEYWORD("short", SQL_SHORT)
+ PG_KEYWORD("signed", SQL_SIGNED)
+ PG_KEYWORD("static", S_STATIC)
+ PG_KEYWORD("struct", SQL_STRUCT)
+ PG_KEYWORD("to", TO)
+ PG_KEYWORD("typedef", S_TYPEDEF)
+ PG_KEYWORD("union", UNION)
+ PG_KEYWORD("unsigned", SQL_UNSIGNED)
+ PG_KEYWORD("varchar", VARCHAR)
+ PG_KEYWORD("volatile", S_VOLATILE)
+ PG_KEYWORD("year", YEAR_P)
diff --git a/src/interfaces/ecpg/preproc/ecpg_keywords.c b/src/interfaces/ecpg/preproc/ecpg_keywords.c
index 37c97e1..4839c37 100644
*** a/src/interfaces/ecpg/preproc/ecpg_keywords.c
--- b/src/interfaces/ecpg/preproc/ecpg_keywords.c
***************
*** 16,97 ****
  #include "preproc_extern.h"
  #include "preproc.h"

! /*
!  * List of (keyword-name, keyword-token-value) pairs.
!  *
!  * !!WARNING!!: This list must be sorted, because binary
!  *         search is used to locate entries.
!  */
! static const ScanKeyword ECPGScanKeywords[] = {
!     /* name, value, category */

!     /*
!      * category is not needed in ecpg, it is only here so we can share the
!      * data structure with the backend
!      */
!     {"allocate", SQL_ALLOCATE, 0},
!     {"autocommit", SQL_AUTOCOMMIT, 0},
!     {"bool", SQL_BOOL, 0},
!     {"break", SQL_BREAK, 0},
!     {"cardinality", SQL_CARDINALITY, 0},
!     {"connect", SQL_CONNECT, 0},
!     {"count", SQL_COUNT, 0},
!     {"datetime_interval_code", SQL_DATETIME_INTERVAL_CODE, 0},
!     {"datetime_interval_precision", SQL_DATETIME_INTERVAL_PRECISION, 0},
!     {"describe", SQL_DESCRIBE, 0},
!     {"descriptor", SQL_DESCRIPTOR, 0},
!     {"disconnect", SQL_DISCONNECT, 0},
!     {"found", SQL_FOUND, 0},
!     {"free", SQL_FREE, 0},
!     {"get", SQL_GET, 0},
!     {"go", SQL_GO, 0},
!     {"goto", SQL_GOTO, 0},
!     {"identified", SQL_IDENTIFIED, 0},
!     {"indicator", SQL_INDICATOR, 0},
!     {"key_member", SQL_KEY_MEMBER, 0},
!     {"length", SQL_LENGTH, 0},
!     {"long", SQL_LONG, 0},
!     {"nullable", SQL_NULLABLE, 0},
!     {"octet_length", SQL_OCTET_LENGTH, 0},
!     {"open", SQL_OPEN, 0},
!     {"output", SQL_OUTPUT, 0},
!     {"reference", SQL_REFERENCE, 0},
!     {"returned_length", SQL_RETURNED_LENGTH, 0},
!     {"returned_octet_length", SQL_RETURNED_OCTET_LENGTH, 0},
!     {"scale", SQL_SCALE, 0},
!     {"section", SQL_SECTION, 0},
!     {"short", SQL_SHORT, 0},
!     {"signed", SQL_SIGNED, 0},
!     {"sqlerror", SQL_SQLERROR, 0},
!     {"sqlprint", SQL_SQLPRINT, 0},
!     {"sqlwarning", SQL_SQLWARNING, 0},
!     {"stop", SQL_STOP, 0},
!     {"struct", SQL_STRUCT, 0},
!     {"unsigned", SQL_UNSIGNED, 0},
!     {"var", SQL_VAR, 0},
!     {"whenever", SQL_WHENEVER, 0},
  };

  /*
   * ScanECPGKeywordLookup - see if a given word is a keyword
   *
!  * Returns a pointer to the ScanKeyword table entry, or NULL if no match.
   * Keywords are matched using the same case-folding rules as in the backend.
   */
! const ScanKeyword *
  ScanECPGKeywordLookup(const char *text)
  {
!     const ScanKeyword *res;

      /* First check SQL symbols defined by the backend. */
!     res = ScanKeywordLookup(text, SQLScanKeywords, NumSQLScanKeywords);
!     if (res)
!         return res;

      /* Try ECPG-specific keywords. */
!     res = ScanKeywordLookup(text, ECPGScanKeywords, lengthof(ECPGScanKeywords));
!     if (res)
!         return res;

!     return NULL;
  }
--- 16,55 ----
  #include "preproc_extern.h"
  #include "preproc.h"

! /* ScanKeywordList lookup data for ECPG keywords */
! #include "ecpg_kwlist_d.h"

! /* Token codes for ECPG keywords */
! #define PG_KEYWORD(kwname, value) value,
!
! static const uint16 ECPGScanKeywordTokens[] = {
! #include "ecpg_kwlist.h"
  };

+ #undef PG_KEYWORD
+
+
  /*
   * ScanECPGKeywordLookup - see if a given word is a keyword
   *
!  * Returns the token value of the keyword, or -1 if no match.
!  *
   * Keywords are matched using the same case-folding rules as in the backend.
   */
! int
  ScanECPGKeywordLookup(const char *text)
  {
!     int            kwnum;

      /* First check SQL symbols defined by the backend. */
!     kwnum = ScanKeywordLookup(text, &ScanKeywords);
!     if (kwnum >= 0)
!         return SQLScanKeywordTokens[kwnum];

      /* Try ECPG-specific keywords. */
!     kwnum = ScanKeywordLookup(text, &ScanECPGKeywords);
!     if (kwnum >= 0)
!         return ECPGScanKeywordTokens[kwnum];

!     return -1;
  }
diff --git a/src/interfaces/ecpg/preproc/ecpg_kwlist.h b/src/interfaces/ecpg/preproc/ecpg_kwlist.h
index 0000000..97ef254
*** a/src/interfaces/ecpg/preproc/ecpg_kwlist.h
--- b/src/interfaces/ecpg/preproc/ecpg_kwlist.h
***************
*** 0 ****
--- 1,68 ----
+ /*-------------------------------------------------------------------------
+  *
+  * ecpg_kwlist.h
+  *
+  * The keyword lists are kept in their own source files for use by
+  * automatic tools.  The exact representation of a keyword is determined
+  * by the PG_KEYWORD macro, which is not defined in this file; it can
+  * be defined by the caller for special purposes.
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/interfaces/ecpg/preproc/ecpg_kwlist.h
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ /* There is deliberately not an #ifndef ECPG_KWLIST_H here. */
+
+ /*
+  * List of (keyword-name, keyword-token-value) pairs.
+  *
+  * !!WARNING!!: This list must be sorted by ASCII name, because binary
+  *         search is used to locate entries.
+  */
+
+ /* name, value */
+ PG_KEYWORD("allocate", SQL_ALLOCATE)
+ PG_KEYWORD("autocommit", SQL_AUTOCOMMIT)
+ PG_KEYWORD("bool", SQL_BOOL)
+ PG_KEYWORD("break", SQL_BREAK)
+ PG_KEYWORD("cardinality", SQL_CARDINALITY)
+ PG_KEYWORD("connect", SQL_CONNECT)
+ PG_KEYWORD("count", SQL_COUNT)
+ PG_KEYWORD("datetime_interval_code", SQL_DATETIME_INTERVAL_CODE)
+ PG_KEYWORD("datetime_interval_precision", SQL_DATETIME_INTERVAL_PRECISION)
+ PG_KEYWORD("describe", SQL_DESCRIBE)
+ PG_KEYWORD("descriptor", SQL_DESCRIPTOR)
+ PG_KEYWORD("disconnect", SQL_DISCONNECT)
+ PG_KEYWORD("found", SQL_FOUND)
+ PG_KEYWORD("free", SQL_FREE)
+ PG_KEYWORD("get", SQL_GET)
+ PG_KEYWORD("go", SQL_GO)
+ PG_KEYWORD("goto", SQL_GOTO)
+ PG_KEYWORD("identified", SQL_IDENTIFIED)
+ PG_KEYWORD("indicator", SQL_INDICATOR)
+ PG_KEYWORD("key_member", SQL_KEY_MEMBER)
+ PG_KEYWORD("length", SQL_LENGTH)
+ PG_KEYWORD("long", SQL_LONG)
+ PG_KEYWORD("nullable", SQL_NULLABLE)
+ PG_KEYWORD("octet_length", SQL_OCTET_LENGTH)
+ PG_KEYWORD("open", SQL_OPEN)
+ PG_KEYWORD("output", SQL_OUTPUT)
+ PG_KEYWORD("reference", SQL_REFERENCE)
+ PG_KEYWORD("returned_length", SQL_RETURNED_LENGTH)
+ PG_KEYWORD("returned_octet_length", SQL_RETURNED_OCTET_LENGTH)
+ PG_KEYWORD("scale", SQL_SCALE)
+ PG_KEYWORD("section", SQL_SECTION)
+ PG_KEYWORD("short", SQL_SHORT)
+ PG_KEYWORD("signed", SQL_SIGNED)
+ PG_KEYWORD("sqlerror", SQL_SQLERROR)
+ PG_KEYWORD("sqlprint", SQL_SQLPRINT)
+ PG_KEYWORD("sqlwarning", SQL_SQLWARNING)
+ PG_KEYWORD("stop", SQL_STOP)
+ PG_KEYWORD("struct", SQL_STRUCT)
+ PG_KEYWORD("unsigned", SQL_UNSIGNED)
+ PG_KEYWORD("var", SQL_VAR)
+ PG_KEYWORD("whenever", SQL_WHENEVER)
diff --git a/src/interfaces/ecpg/preproc/keywords.c b/src/interfaces/ecpg/preproc/keywords.c
index 12409e9..0380409 100644
*** a/src/interfaces/ecpg/preproc/keywords.c
--- b/src/interfaces/ecpg/preproc/keywords.c
***************
*** 17,40 ****

  /*
   * This is much trickier than it looks.  We are #include'ing kwlist.h
!  * but the "value" numbers that go into the table are from preproc.h
!  * not the backend's gram.h.  Therefore this table will recognize all
!  * keywords known to the backend, but will supply the token numbers used
   * by ecpg's grammar, which is what we need.  The ecpg grammar must
   * define all the same token names the backend does, else we'll get
   * undefined-symbol failures in this compile.
   */

- #include "common/keywords.h"
-
  #include "preproc_extern.h"
  #include "preproc.h"


! #define PG_KEYWORD(a,b,c) {a,b,c},
!
! const ScanKeyword SQLScanKeywords[] = {
  #include "parser/kwlist.h"
  };

! const int    NumSQLScanKeywords = lengthof(SQLScanKeywords);
--- 17,38 ----

  /*
   * This is much trickier than it looks.  We are #include'ing kwlist.h
!  * but the token numbers that go into the table are from preproc.h
!  * not the backend's gram.h.  Therefore this token table will match
!  * the ScanKeywords table supplied from common/keywords.c, including all
!  * keywords known to the backend, but it will supply the token numbers used
   * by ecpg's grammar, which is what we need.  The ecpg grammar must
   * define all the same token names the backend does, else we'll get
   * undefined-symbol failures in this compile.
   */

  #include "preproc_extern.h"
  #include "preproc.h"

+ #define PG_KEYWORD(kwname, value, category) value,

! const uint16 SQLScanKeywordTokens[] = {
  #include "parser/kwlist.h"
  };

! #undef PG_KEYWORD
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index a60564c..3131f5f 100644
*** a/src/interfaces/ecpg/preproc/pgc.l
--- b/src/interfaces/ecpg/preproc/pgc.l
*************** cppline            {space}*#([^i][A-Za-z]*|{if}|{
*** 920,938 ****
                  }

  {identifier}    {
-                     const ScanKeyword  *keyword;
-
                      if (!isdefine())
                      {
                          /* Is it an SQL/ECPG keyword? */
!                         keyword = ScanECPGKeywordLookup(yytext);
!                         if (keyword != NULL)
!                             return keyword->value;

                          /* Is it a C keyword? */
!                         keyword = ScanCKeywordLookup(yytext);
!                         if (keyword != NULL)
!                             return keyword->value;

                          /*
                           * None of the above.  Return it as an identifier.
--- 920,938 ----
                  }

  {identifier}    {
                      if (!isdefine())
                      {
+                         int        kwvalue;
+
                          /* Is it an SQL/ECPG keyword? */
!                         kwvalue = ScanECPGKeywordLookup(yytext);
!                         if (kwvalue >= 0)
!                             return kwvalue;

                          /* Is it a C keyword? */
!                         kwvalue = ScanCKeywordLookup(yytext);
!                         if (kwvalue >= 0)
!                             return kwvalue;

                          /*
                           * None of the above.  Return it as an identifier.
*************** cppline            {space}*#([^i][A-Za-z]*|{if}|{
*** 1010,1021 ****
                          return CPP_LINE;
                      }
  <C>{identifier}        {
-                         const ScanKeyword        *keyword;
-
                          /*
                           * Try to detect a function name:
                           * look for identifiers at the global scope
!                          * keep the last identifier before the first '(' and '{' */
                          if (braces_open == 0 && parenths_open == 0)
                          {
                              if (current_function)
--- 1010,1020 ----
                          return CPP_LINE;
                      }
  <C>{identifier}        {
                          /*
                           * Try to detect a function name:
                           * look for identifiers at the global scope
!                          * keep the last identifier before the first '(' and '{'
!                          */
                          if (braces_open == 0 && parenths_open == 0)
                          {
                              if (current_function)
*************** cppline            {space}*#([^i][A-Za-z]*|{if}|{
*** 1026,1034 ****
                          /* however, some defines have to be taken care of for compatibility */
                          if ((!INFORMIX_MODE || !isinformixdefine()) && !isdefine())
                          {
!                             keyword = ScanCKeywordLookup(yytext);
!                             if (keyword != NULL)
!                                 return keyword->value;
                              else
                              {
                                  base_yylval.str = mm_strdup(yytext);
--- 1025,1035 ----
                          /* however, some defines have to be taken care of for compatibility */
                          if ((!INFORMIX_MODE || !isinformixdefine()) && !isdefine())
                          {
!                             int        kwvalue;
!
!                             kwvalue = ScanCKeywordLookup(yytext);
!                             if (kwvalue >= 0)
!                                 return kwvalue;
                              else
                              {
                                  base_yylval.str = mm_strdup(yytext);
diff --git a/src/interfaces/ecpg/preproc/preproc_extern.h b/src/interfaces/ecpg/preproc/preproc_extern.h
index 13eda67..9746780 100644
*** a/src/interfaces/ecpg/preproc/preproc_extern.h
--- b/src/interfaces/ecpg/preproc/preproc_extern.h
*************** extern struct when when_error,
*** 59,66 ****
  extern struct ECPGstruct_member *struct_member_list[STRUCT_DEPTH];

  /* Globals from keywords.c */
! extern const ScanKeyword SQLScanKeywords[];
! extern const int NumSQLScanKeywords;

  /* functions */

--- 59,65 ----
  extern struct ECPGstruct_member *struct_member_list[STRUCT_DEPTH];

  /* Globals from keywords.c */
! extern const uint16 SQLScanKeywordTokens[];

  /* functions */

*************** extern void check_indicator(struct ECPGt
*** 102,109 ****
  extern void remove_typedefs(int);
  extern void remove_variables(int);
  extern struct variable *new_variable(const char *, struct ECPGtype *, int);
! extern const ScanKeyword *ScanCKeywordLookup(const char *);
! extern const ScanKeyword *ScanECPGKeywordLookup(const char *text);
  extern void parser_init(void);
  extern int    filtered_base_yylex(void);

--- 101,108 ----
  extern void remove_typedefs(int);
  extern void remove_variables(int);
  extern struct variable *new_variable(const char *, struct ECPGtype *, int);
! extern int    ScanCKeywordLookup(const char *text);
! extern int    ScanECPGKeywordLookup(const char *text);
  extern void parser_init(void);
  extern int    filtered_base_yylex(void);

diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore
index ff6ac96..3ab9a22 100644
*** a/src/pl/plpgsql/src/.gitignore
--- b/src/pl/plpgsql/src/.gitignore
***************
*** 1,5 ****
--- 1,7 ----
  /pl_gram.c
  /pl_gram.h
+ /pl_reserved_kwlist_d.h
+ /pl_unreserved_kwlist_d.h
  /plerrcodes.h
  /log/
  /results/
diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile
index 25a5a9d..9dd4a74 100644
*** a/src/pl/plpgsql/src/Makefile
--- b/src/pl/plpgsql/src/Makefile
*************** REGRESS_OPTS = --dbname=$(PL_TESTDB)
*** 29,34 ****
--- 29,36 ----
  REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \
      plpgsql_cache plpgsql_transaction plpgsql_varprops

+ GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl
+
  all: all-lib

  # Shared library stuff
*************** uninstall-headers:
*** 61,66 ****
--- 63,69 ----

  # Force these dependencies to be known even without dependency info built:
  pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h
+ pl_scanner.o: pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h

  # See notes in src/backend/parser/Makefile about the following two rules
  pl_gram.h: pl_gram.c
*************** pl_gram.c: BISONFLAGS += -d
*** 72,77 ****
--- 75,87 ----
  plerrcodes.h: $(top_srcdir)/src/backend/utils/errcodes.txt generate-plerrcodes.pl
      $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@

+ # generate keyword headers for the scanner
+ pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST)
+     $(PERL) $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $<
+
+ pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST)
+     $(PERL) $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $<
+

  check: submake
      $(pg_regress_check) $(REGRESS_OPTS) $(REGRESS)
*************** submake:
*** 84,96 ****
      $(MAKE) -C $(top_builddir)/src/test/regress pg_regress$(X)


! distprep: pl_gram.h pl_gram.c plerrcodes.h

! # pl_gram.c, pl_gram.h and plerrcodes.h are in the distribution tarball,
! # so they are not cleaned here.
  clean distclean: clean-lib
      rm -f $(OBJS)
      rm -rf $(pg_regress_clean_files)

  maintainer-clean: distclean
!     rm -f pl_gram.c pl_gram.h plerrcodes.h
--- 94,107 ----
      $(MAKE) -C $(top_builddir)/src/test/regress pg_regress$(X)


! distprep: pl_gram.h pl_gram.c plerrcodes.h pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h

! # pl_gram.c, pl_gram.h, plerrcodes.h, pl_reserved_kwlist_d.h, and
! # pl_unreserved_kwlist_d.h are in the distribution tarball, so they
! # are not cleaned here.
  clean distclean: clean-lib
      rm -f $(OBJS)
      rm -rf $(pg_regress_clean_files)

  maintainer-clean: distclean
!     rm -f pl_gram.c pl_gram.h plerrcodes.h pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h
diff --git a/src/pl/plpgsql/src/pl_reserved_kwlist.h b/src/pl/plpgsql/src/pl_reserved_kwlist.h
index ...5c2e0c1 .
*** a/src/pl/plpgsql/src/pl_reserved_kwlist.h
--- b/src/pl/plpgsql/src/pl_reserved_kwlist.h
***************
*** 0 ****
--- 1,53 ----
+ /*-------------------------------------------------------------------------
+  *
+  * pl_reserved_kwlist.h
+  *
+  * The keyword lists are kept in their own source files for use by
+  * automatic tools.  The exact representation of a keyword is determined
+  * by the PG_KEYWORD macro, which is not defined in this file; it can
+  * be defined by the caller for special purposes.
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/pl/plpgsql/src/pl_reserved_kwlist.h
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ /* There is deliberately not an #ifndef PL_RESERVED_KWLIST_H here. */
+
+ /*
+  * List of (keyword-name, keyword-token-value) pairs.
+  *
+  * Be careful not to put the same word in both lists.
+  *
+  * !!WARNING!!: This list must be sorted by ASCII name, because binary
+  *         search is used to locate entries.
+  */
+
+ /* name, value */
+ PG_KEYWORD("all", K_ALL)
+ PG_KEYWORD("begin", K_BEGIN)
+ PG_KEYWORD("by", K_BY)
+ PG_KEYWORD("case", K_CASE)
+ PG_KEYWORD("declare", K_DECLARE)
+ PG_KEYWORD("else", K_ELSE)
+ PG_KEYWORD("end", K_END)
+ PG_KEYWORD("execute", K_EXECUTE)
+ PG_KEYWORD("for", K_FOR)
+ PG_KEYWORD("foreach", K_FOREACH)
+ PG_KEYWORD("from", K_FROM)
+ PG_KEYWORD("if", K_IF)
+ PG_KEYWORD("in", K_IN)
+ PG_KEYWORD("into", K_INTO)
+ PG_KEYWORD("loop", K_LOOP)
+ PG_KEYWORD("not", K_NOT)
+ PG_KEYWORD("null", K_NULL)
+ PG_KEYWORD("or", K_OR)
+ PG_KEYWORD("strict", K_STRICT)
+ PG_KEYWORD("then", K_THEN)
+ PG_KEYWORD("to", K_TO)
+ PG_KEYWORD("using", K_USING)
+ PG_KEYWORD("when", K_WHEN)
+ PG_KEYWORD("while", K_WHILE)
diff --git a/src/pl/plpgsql/src/pl_scanner.c b/src/pl/plpgsql/src/pl_scanner.c
index 8340628..c260438 100644
*** a/src/pl/plpgsql/src/pl_scanner.c
--- b/src/pl/plpgsql/src/pl_scanner.c
***************
*** 22,37 ****
  #include "pl_gram.h"            /* must be after parser/scanner.h */


- #define PG_KEYWORD(a,b,c) {a,b,c},
-
-
  /* Klugy flag to tell scanner how to look up identifiers */
  IdentifierLookup plpgsql_IdentifierLookup = IDENTIFIER_LOOKUP_NORMAL;

  /*
   * A word about keywords:
   *
!  * We keep reserved and unreserved keywords in separate arrays.  The
   * reserved keywords are passed to the core scanner, so they will be
   * recognized before (and instead of) any variable name.  Unreserved words
   * are checked for separately, usually after determining that the identifier
--- 22,36 ----
  #include "pl_gram.h"            /* must be after parser/scanner.h */


  /* Klugy flag to tell scanner how to look up identifiers */
  IdentifierLookup plpgsql_IdentifierLookup = IDENTIFIER_LOOKUP_NORMAL;

  /*
   * A word about keywords:
   *
!  * We keep reserved and unreserved keywords in separate headers.  Be careful
!  * not to put the same word in both headers.  Also be sure that pl_gram.y's
!  * unreserved_keyword production agrees with the unreserved header.  The
   * reserved keywords are passed to the core scanner, so they will be
   * recognized before (and instead of) any variable name.  Unreserved words
   * are checked for separately, usually after determining that the identifier
*************** IdentifierLookup plpgsql_IdentifierLooku
*** 57,186 ****
   * BEGIN BY DECLARE EXECUTE FOREACH IF LOOP STRICT WHILE
   */

! /*
!  * Lists of keyword (name, token-value, category) entries.
!  *
!  * !!WARNING!!: These lists must be sorted by ASCII name, because binary
!  *         search is used to locate entries.
!  *
!  * Be careful not to put the same word in both lists.  Also be sure that
!  * pl_gram.y's unreserved_keyword production agrees with the second list.
!  */

! static const ScanKeyword reserved_keywords[] = {
!     PG_KEYWORD("all", K_ALL, RESERVED_KEYWORD)
!     PG_KEYWORD("begin", K_BEGIN, RESERVED_KEYWORD)
!     PG_KEYWORD("by", K_BY, RESERVED_KEYWORD)
!     PG_KEYWORD("case", K_CASE, RESERVED_KEYWORD)
!     PG_KEYWORD("declare", K_DECLARE, RESERVED_KEYWORD)
!     PG_KEYWORD("else", K_ELSE, RESERVED_KEYWORD)
!     PG_KEYWORD("end", K_END, RESERVED_KEYWORD)
!     PG_KEYWORD("execute", K_EXECUTE, RESERVED_KEYWORD)
!     PG_KEYWORD("for", K_FOR, RESERVED_KEYWORD)
!     PG_KEYWORD("foreach", K_FOREACH, RESERVED_KEYWORD)
!     PG_KEYWORD("from", K_FROM, RESERVED_KEYWORD)
!     PG_KEYWORD("if", K_IF, RESERVED_KEYWORD)
!     PG_KEYWORD("in", K_IN, RESERVED_KEYWORD)
!     PG_KEYWORD("into", K_INTO, RESERVED_KEYWORD)
!     PG_KEYWORD("loop", K_LOOP, RESERVED_KEYWORD)
!     PG_KEYWORD("not", K_NOT, RESERVED_KEYWORD)
!     PG_KEYWORD("null", K_NULL, RESERVED_KEYWORD)
!     PG_KEYWORD("or", K_OR, RESERVED_KEYWORD)
!     PG_KEYWORD("strict", K_STRICT, RESERVED_KEYWORD)
!     PG_KEYWORD("then", K_THEN, RESERVED_KEYWORD)
!     PG_KEYWORD("to", K_TO, RESERVED_KEYWORD)
!     PG_KEYWORD("using", K_USING, RESERVED_KEYWORD)
!     PG_KEYWORD("when", K_WHEN, RESERVED_KEYWORD)
!     PG_KEYWORD("while", K_WHILE, RESERVED_KEYWORD)
! };

! static const int num_reserved_keywords = lengthof(reserved_keywords);

! static const ScanKeyword unreserved_keywords[] = {
!     PG_KEYWORD("absolute", K_ABSOLUTE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("alias", K_ALIAS, UNRESERVED_KEYWORD)
!     PG_KEYWORD("array", K_ARRAY, UNRESERVED_KEYWORD)
!     PG_KEYWORD("assert", K_ASSERT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("backward", K_BACKWARD, UNRESERVED_KEYWORD)
!     PG_KEYWORD("call", K_CALL, UNRESERVED_KEYWORD)
!     PG_KEYWORD("close", K_CLOSE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("collate", K_COLLATE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("column", K_COLUMN, UNRESERVED_KEYWORD)
!     PG_KEYWORD("column_name", K_COLUMN_NAME, UNRESERVED_KEYWORD)
!     PG_KEYWORD("commit", K_COMMIT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("constant", K_CONSTANT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("constraint", K_CONSTRAINT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("constraint_name", K_CONSTRAINT_NAME, UNRESERVED_KEYWORD)
!     PG_KEYWORD("continue", K_CONTINUE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("current", K_CURRENT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("cursor", K_CURSOR, UNRESERVED_KEYWORD)
!     PG_KEYWORD("datatype", K_DATATYPE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("debug", K_DEBUG, UNRESERVED_KEYWORD)
!     PG_KEYWORD("default", K_DEFAULT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("detail", K_DETAIL, UNRESERVED_KEYWORD)
!     PG_KEYWORD("diagnostics", K_DIAGNOSTICS, UNRESERVED_KEYWORD)
!     PG_KEYWORD("do", K_DO, UNRESERVED_KEYWORD)
!     PG_KEYWORD("dump", K_DUMP, UNRESERVED_KEYWORD)
!     PG_KEYWORD("elseif", K_ELSIF, UNRESERVED_KEYWORD)
!     PG_KEYWORD("elsif", K_ELSIF, UNRESERVED_KEYWORD)
!     PG_KEYWORD("errcode", K_ERRCODE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("error", K_ERROR, UNRESERVED_KEYWORD)
!     PG_KEYWORD("exception", K_EXCEPTION, UNRESERVED_KEYWORD)
!     PG_KEYWORD("exit", K_EXIT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("fetch", K_FETCH, UNRESERVED_KEYWORD)
!     PG_KEYWORD("first", K_FIRST, UNRESERVED_KEYWORD)
!     PG_KEYWORD("forward", K_FORWARD, UNRESERVED_KEYWORD)
!     PG_KEYWORD("get", K_GET, UNRESERVED_KEYWORD)
!     PG_KEYWORD("hint", K_HINT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("import", K_IMPORT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("info", K_INFO, UNRESERVED_KEYWORD)
!     PG_KEYWORD("insert", K_INSERT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("is", K_IS, UNRESERVED_KEYWORD)
!     PG_KEYWORD("last", K_LAST, UNRESERVED_KEYWORD)
!     PG_KEYWORD("log", K_LOG, UNRESERVED_KEYWORD)
!     PG_KEYWORD("message", K_MESSAGE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("message_text", K_MESSAGE_TEXT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("move", K_MOVE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("next", K_NEXT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("no", K_NO, UNRESERVED_KEYWORD)
!     PG_KEYWORD("notice", K_NOTICE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("open", K_OPEN, UNRESERVED_KEYWORD)
!     PG_KEYWORD("option", K_OPTION, UNRESERVED_KEYWORD)
!     PG_KEYWORD("perform", K_PERFORM, UNRESERVED_KEYWORD)
!     PG_KEYWORD("pg_context", K_PG_CONTEXT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("pg_datatype_name", K_PG_DATATYPE_NAME, UNRESERVED_KEYWORD)
!     PG_KEYWORD("pg_exception_context", K_PG_EXCEPTION_CONTEXT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("pg_exception_detail", K_PG_EXCEPTION_DETAIL, UNRESERVED_KEYWORD)
!     PG_KEYWORD("pg_exception_hint", K_PG_EXCEPTION_HINT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("print_strict_params", K_PRINT_STRICT_PARAMS, UNRESERVED_KEYWORD)
!     PG_KEYWORD("prior", K_PRIOR, UNRESERVED_KEYWORD)
!     PG_KEYWORD("query", K_QUERY, UNRESERVED_KEYWORD)
!     PG_KEYWORD("raise", K_RAISE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("relative", K_RELATIVE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("reset", K_RESET, UNRESERVED_KEYWORD)
!     PG_KEYWORD("return", K_RETURN, UNRESERVED_KEYWORD)
!     PG_KEYWORD("returned_sqlstate", K_RETURNED_SQLSTATE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("reverse", K_REVERSE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("rollback", K_ROLLBACK, UNRESERVED_KEYWORD)
!     PG_KEYWORD("row_count", K_ROW_COUNT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("rowtype", K_ROWTYPE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("schema", K_SCHEMA, UNRESERVED_KEYWORD)
!     PG_KEYWORD("schema_name", K_SCHEMA_NAME, UNRESERVED_KEYWORD)
!     PG_KEYWORD("scroll", K_SCROLL, UNRESERVED_KEYWORD)
!     PG_KEYWORD("set", K_SET, UNRESERVED_KEYWORD)
!     PG_KEYWORD("slice", K_SLICE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("sqlstate", K_SQLSTATE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("stacked", K_STACKED, UNRESERVED_KEYWORD)
!     PG_KEYWORD("table", K_TABLE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("table_name", K_TABLE_NAME, UNRESERVED_KEYWORD)
!     PG_KEYWORD("type", K_TYPE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("use_column", K_USE_COLUMN, UNRESERVED_KEYWORD)
!     PG_KEYWORD("use_variable", K_USE_VARIABLE, UNRESERVED_KEYWORD)
!     PG_KEYWORD("variable_conflict", K_VARIABLE_CONFLICT, UNRESERVED_KEYWORD)
!     PG_KEYWORD("warning", K_WARNING, UNRESERVED_KEYWORD)
  };

! static const int num_unreserved_keywords = lengthof(unreserved_keywords);

  /*
   * This macro must recognize all tokens that can immediately precede a
--- 56,77 ----
   * BEGIN BY DECLARE EXECUTE FOREACH IF LOOP STRICT WHILE
   */

! /* ScanKeywordList lookup data for PL/pgSQL keywords */
! #include "pl_reserved_kwlist_d.h"
! #include "pl_unreserved_kwlist_d.h"

! /* Token codes for PL/pgSQL keywords */
! #define PG_KEYWORD(kwname, value) value,

! static const uint16 ReservedPLKeywordTokens[] = {
! #include "pl_reserved_kwlist.h"
! };

! static const uint16 UnreservedPLKeywordTokens[] = {
! #include "pl_unreserved_kwlist.h"
  };

! #undef PG_KEYWORD

  /*
   * This macro must recognize all tokens that can immediately precede a
*************** plpgsql_yylex(void)
*** 256,262 ****
  {
      int            tok1;
      TokenAuxData aux1;
!     const ScanKeyword *kw;

      tok1 = internal_yylex(&aux1);
      if (tok1 == IDENT || tok1 == PARAM)
--- 147,153 ----
  {
      int            tok1;
      TokenAuxData aux1;
!     int            kwnum;

      tok1 = internal_yylex(&aux1);
      if (tok1 == IDENT || tok1 == PARAM)
*************** plpgsql_yylex(void)
*** 333,344 ****
                                         &aux1.lval.word))
                      tok1 = T_DATUM;
                  else if (!aux1.lval.word.quoted &&
!                          (kw = ScanKeywordLookup(aux1.lval.word.ident,
!                                                  unreserved_keywords,
!                                                  num_unreserved_keywords)))
                  {
!                     aux1.lval.keyword = kw->name;
!                     tok1 = kw->value;
                  }
                  else
                      tok1 = T_WORD;
--- 224,235 ----
                                         &aux1.lval.word))
                      tok1 = T_DATUM;
                  else if (!aux1.lval.word.quoted &&
!                          (kwnum = ScanKeywordLookup(aux1.lval.word.ident,
!                                                     &UnreservedPLKeywords)) >= 0)
                  {
!                     aux1.lval.keyword = GetScanKeyword(kwnum,
!                                                        &UnreservedPLKeywords);
!                     tok1 = UnreservedPLKeywordTokens[kwnum];
                  }
                  else
                      tok1 = T_WORD;
*************** plpgsql_yylex(void)
*** 375,386 ****
                                     &aux1.lval.word))
                  tok1 = T_DATUM;
              else if (!aux1.lval.word.quoted &&
!                      (kw = ScanKeywordLookup(aux1.lval.word.ident,
!                                              unreserved_keywords,
!                                              num_unreserved_keywords)))
              {
!                 aux1.lval.keyword = kw->name;
!                 tok1 = kw->value;
              }
              else
                  tok1 = T_WORD;
--- 266,277 ----
                                     &aux1.lval.word))
                  tok1 = T_DATUM;
              else if (!aux1.lval.word.quoted &&
!                      (kwnum = ScanKeywordLookup(aux1.lval.word.ident,
!                                                 &UnreservedPLKeywords)) >= 0)
              {
!                 aux1.lval.keyword = GetScanKeyword(kwnum,
!                                                    &UnreservedPLKeywords);
!                 tok1 = UnreservedPLKeywordTokens[kwnum];
              }
              else
                  tok1 = T_WORD;
*************** plpgsql_token_is_unreserved_keyword(int
*** 497,505 ****
  {
      int            i;

!     for (i = 0; i < num_unreserved_keywords; i++)
      {
!         if (unreserved_keywords[i].value == token)
              return true;
      }
      return false;
--- 388,396 ----
  {
      int            i;

!     for (i = 0; i < lengthof(UnreservedPLKeywordTokens); i++)
      {
!         if (UnreservedPLKeywordTokens[i] == token)
              return true;
      }
      return false;
*************** plpgsql_scanner_init(const char *str)
*** 696,702 ****
  {
      /* Start up the core scanner */
      yyscanner = scanner_init(str, &core_yy,
!                              reserved_keywords, num_reserved_keywords);

      /*
       * scanorig points to the original string, which unlike the scanner's
--- 587,593 ----
  {
      /* Start up the core scanner */
      yyscanner = scanner_init(str, &core_yy,
!                              &ReservedPLKeywords, ReservedPLKeywordTokens);

      /*
       * scanorig points to the original string, which unlike the scanner's
diff --git a/src/pl/plpgsql/src/pl_unreserved_kwlist.h b/src/pl/plpgsql/src/pl_unreserved_kwlist.h
index ...ef2aea0 .
*** a/src/pl/plpgsql/src/pl_unreserved_kwlist.h
--- b/src/pl/plpgsql/src/pl_unreserved_kwlist.h
***************
*** 0 ****
--- 1,111 ----
+ /*-------------------------------------------------------------------------
+  *
+  * pl_unreserved_kwlist.h
+  *
+  * The keyword lists are kept in their own source files for use by
+  * automatic tools.  The exact representation of a keyword is determined
+  * by the PG_KEYWORD macro, which is not defined in this file; it can
+  * be defined by the caller for special purposes.
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * src/pl/plpgsql/src/pl_unreserved_kwlist.h
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ /* There is deliberately not an #ifndef PL_UNRESERVED_KWLIST_H here. */
+
+ /*
+  * List of (keyword-name, keyword-token-value) pairs.
+  *
+  * Be careful not to put the same word in both lists.  Also be sure that
+  * pl_gram.y's unreserved_keyword production agrees with this list.
+  *
+  * !!WARNING!!: This list must be sorted by ASCII name, because binary
+  *         search is used to locate entries.
+  */
+
+ /* name, value */
+ PG_KEYWORD("absolute", K_ABSOLUTE)
+ PG_KEYWORD("alias", K_ALIAS)
+ PG_KEYWORD("array", K_ARRAY)
+ PG_KEYWORD("assert", K_ASSERT)
+ PG_KEYWORD("backward", K_BACKWARD)
+ PG_KEYWORD("call", K_CALL)
+ PG_KEYWORD("close", K_CLOSE)
+ PG_KEYWORD("collate", K_COLLATE)
+ PG_KEYWORD("column", K_COLUMN)
+ PG_KEYWORD("column_name", K_COLUMN_NAME)
+ PG_KEYWORD("commit", K_COMMIT)
+ PG_KEYWORD("constant", K_CONSTANT)
+ PG_KEYWORD("constraint", K_CONSTRAINT)
+ PG_KEYWORD("constraint_name", K_CONSTRAINT_NAME)
+ PG_KEYWORD("continue", K_CONTINUE)
+ PG_KEYWORD("current", K_CURRENT)
+ PG_KEYWORD("cursor", K_CURSOR)
+ PG_KEYWORD("datatype", K_DATATYPE)
+ PG_KEYWORD("debug", K_DEBUG)
+ PG_KEYWORD("default", K_DEFAULT)
+ PG_KEYWORD("detail", K_DETAIL)
+ PG_KEYWORD("diagnostics", K_DIAGNOSTICS)
+ PG_KEYWORD("do", K_DO)
+ PG_KEYWORD("dump", K_DUMP)
+ PG_KEYWORD("elseif", K_ELSIF)
+ PG_KEYWORD("elsif", K_ELSIF)
+ PG_KEYWORD("errcode", K_ERRCODE)
+ PG_KEYWORD("error", K_ERROR)
+ PG_KEYWORD("exception", K_EXCEPTION)
+ PG_KEYWORD("exit", K_EXIT)
+ PG_KEYWORD("fetch", K_FETCH)
+ PG_KEYWORD("first", K_FIRST)
+ PG_KEYWORD("forward", K_FORWARD)
+ PG_KEYWORD("get", K_GET)
+ PG_KEYWORD("hint", K_HINT)
+ PG_KEYWORD("import", K_IMPORT)
+ PG_KEYWORD("info", K_INFO)
+ PG_KEYWORD("insert", K_INSERT)
+ PG_KEYWORD("is", K_IS)
+ PG_KEYWORD("last", K_LAST)
+ PG_KEYWORD("log", K_LOG)
+ PG_KEYWORD("message", K_MESSAGE)
+ PG_KEYWORD("message_text", K_MESSAGE_TEXT)
+ PG_KEYWORD("move", K_MOVE)
+ PG_KEYWORD("next", K_NEXT)
+ PG_KEYWORD("no", K_NO)
+ PG_KEYWORD("notice", K_NOTICE)
+ PG_KEYWORD("open", K_OPEN)
+ PG_KEYWORD("option", K_OPTION)
+ PG_KEYWORD("perform", K_PERFORM)
+ PG_KEYWORD("pg_context", K_PG_CONTEXT)
+ PG_KEYWORD("pg_datatype_name", K_PG_DATATYPE_NAME)
+ PG_KEYWORD("pg_exception_context", K_PG_EXCEPTION_CONTEXT)
+ PG_KEYWORD("pg_exception_detail", K_PG_EXCEPTION_DETAIL)
+ PG_KEYWORD("pg_exception_hint", K_PG_EXCEPTION_HINT)
+ PG_KEYWORD("print_strict_params", K_PRINT_STRICT_PARAMS)
+ PG_KEYWORD("prior", K_PRIOR)
+ PG_KEYWORD("query", K_QUERY)
+ PG_KEYWORD("raise", K_RAISE)
+ PG_KEYWORD("relative", K_RELATIVE)
+ PG_KEYWORD("reset", K_RESET)
+ PG_KEYWORD("return", K_RETURN)
+ PG_KEYWORD("returned_sqlstate", K_RETURNED_SQLSTATE)
+ PG_KEYWORD("reverse", K_REVERSE)
+ PG_KEYWORD("rollback", K_ROLLBACK)
+ PG_KEYWORD("row_count", K_ROW_COUNT)
+ PG_KEYWORD("rowtype", K_ROWTYPE)
+ PG_KEYWORD("schema", K_SCHEMA)
+ PG_KEYWORD("schema_name", K_SCHEMA_NAME)
+ PG_KEYWORD("scroll", K_SCROLL)
+ PG_KEYWORD("set", K_SET)
+ PG_KEYWORD("slice", K_SLICE)
+ PG_KEYWORD("sqlstate", K_SQLSTATE)
+ PG_KEYWORD("stacked", K_STACKED)
+ PG_KEYWORD("table", K_TABLE)
+ PG_KEYWORD("table_name", K_TABLE_NAME)
+ PG_KEYWORD("type", K_TYPE)
+ PG_KEYWORD("use_column", K_USE_COLUMN)
+ PG_KEYWORD("use_variable", K_USE_VARIABLE)
+ PG_KEYWORD("variable_conflict", K_VARIABLE_CONFLICT)
+ PG_KEYWORD("warning", K_WARNING)
diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl
index ...eb5ed65 .
*** a/src/tools/gen_keywordlist.pl
--- b/src/tools/gen_keywordlist.pl
***************
*** 0 ****
--- 1,148 ----
+ #----------------------------------------------------------------------
+ #
+ # gen_keywordlist.pl
+ #    Perl script that transforms a list of keywords into a ScanKeywordList
+ #    data structure that can be passed to ScanKeywordLookup().
+ #
+ # The input is a C header file containing a series of macro calls
+ #    PG_KEYWORD("keyword", ...)
+ # Lines not starting with PG_KEYWORD are ignored.  The keywords are
+ # implicitly numbered 0..N-1 in order of appearance in the header file.
+ # Currently, the keywords are required to appear in ASCII order.
+ #
+ # The output is a C header file that defines a "const ScanKeywordList"
+ # variable named according to the -v switch ("ScanKeywords" by default).
+ # The variable is marked "static" unless the -e switch is given.
+ #
+ #
+ # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ # Portions Copyright (c) 1994, Regents of the University of California
+ #
+ # src/tools/gen_keywordlist.pl
+ #
+ #----------------------------------------------------------------------
+
+ use strict;
+ use warnings;
+ use Getopt::Long;
+
+ my $output_path = '';
+ my $extern = 0;
+ my $varname = 'ScanKeywords';
+
+ GetOptions(
+     'output:s' => \$output_path,
+     'extern'   => \$extern,
+     'varname:s' => \$varname) || usage();
+
+ my $kw_input_file = shift @ARGV || die "No input file.\n";
+
+ # Make sure output_path ends in a slash if needed.
+ if ($output_path ne '' && substr($output_path, -1) ne '/')
+ {
+     $output_path .= '/';
+ }
+
+ $kw_input_file =~ /(\w+)\.h$/ || die "Input file must be named something.h.\n";
+ my $base_filename = $1 . '_d';
+ my $kw_def_file = $output_path . $base_filename . '.h';
+
+ open(my $kif, '<', $kw_input_file) || die "$kw_input_file: $!\n";
+ open(my $kwdef, '>', $kw_def_file) || die "$kw_def_file: $!\n";
+
+ # Opening boilerplate for keyword definition header.
+ printf $kwdef <<EOM, $base_filename, uc $base_filename, uc $base_filename;
+ /*-------------------------------------------------------------------------
+  *
+  * %s.h
+  *    List of keywords represented as a ScanKeywordList.
+  *
+  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  * NOTES
+  *  ******************************
+  *  *** DO NOT EDIT THIS FILE! ***
+  *  ******************************
+  *
+  *  It has been GENERATED by src/tools/gen_keywordlist.pl
+  *
+  *-------------------------------------------------------------------------
+  */
+
+ #ifndef %s_H
+ #define %s_H
+
+ #include "common/kwlookup.h"
+
+ EOM
+
+ # Parse input file for keyword names.
+ my @keywords;
+ while (<$kif>)
+ {
+     if (/^PG_KEYWORD\("(\w+)"/)
+     {
+         push @keywords, $1;
+     }
+ }
+
+ # Error out if the keyword names are not in ASCII order.
+ for my $i (0..$#keywords - 1)
+ {
+     die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n|
+       if ($keywords[$i] cmp $keywords[$i + 1]) >= 0;
+ }
+
+ # Emit the string containing all the keywords.
+
+ printf $kwdef qq|static const char %s_kw_string[] =\n\t"|, $varname;
+ print $kwdef join qq|\\0"\n\t"|, @keywords;
+ print $kwdef qq|";\n\n|;
+
+ # Emit an array of numerical offsets which will be used to index into the
+ # keyword string.
+
+ printf $kwdef "static const uint16 %s_kw_offsets[] = {\n", $varname;
+
+ my $offset = 0;
+ foreach my $name (@keywords)
+ {
+     print $kwdef "\t$offset,\n";
+
+     # Calculate the cumulative offset of the next keyword,
+     # taking into account the null terminator.
+     $offset += length($name) + 1;
+ }
+
+ print $kwdef "};\n\n";
+
+ # Emit a macro defining the number of keywords.
+
+ printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords;
+
+ # Emit the struct that wraps all this lookup info into one variable.
+
+ print $kwdef "static " if !$extern;
+ printf $kwdef "const ScanKeywordList %s = {\n", $varname;
+ printf $kwdef qq|\t%s_kw_string,\n|, $varname;
+ printf $kwdef qq|\t%s_kw_offsets,\n|, $varname;
+ printf $kwdef qq|\t%s_NUM_KEYWORDS\n|, uc $varname;
+ print $kwdef "};\n\n";
+
+ printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename;
+
+
+ sub usage
+ {
+     die <<EOM;
+ Usage: gen_keywordlist.pl [--output/-o <path>] [--varname/-v <varname>] [--extern/-e] input_file
+     --output   Output directory (default '.')
+     --varname  Name for ScanKeywordList variable (default 'ScanKeywords')
+     --extern   Allow the ScanKeywordList variable to be globally visible
+
+ gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList.
+ The output filename is derived from the input file by inserting _d,
+ for example kwlist_d.h is produced from kwlist.h.
+ EOM
+ }
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 2921d19..56192f1 100644
*** a/src/tools/msvc/Mkvcbuild.pm
--- b/src/tools/msvc/Mkvcbuild.pm
*************** sub mkvcbuild
*** 118,124 ****

      our @pgcommonallfiles = qw(
        base64.c config_info.c controldata_utils.c exec.c file_perm.c ip.c
!       keywords.c link-canary.c md5.c
        pg_lzcompress.c pgfnames.c psprintf.c relpath.c rmtree.c
        saslprep.c scram-common.c string.c unicode_norm.c username.c
        wait_error.c);
--- 118,124 ----

      our @pgcommonallfiles = qw(
        base64.c config_info.c controldata_utils.c exec.c file_perm.c ip.c
!       keywords.c kwlookup.c link-canary.c md5.c
        pg_lzcompress.c pgfnames.c psprintf.c relpath.c rmtree.c
        saslprep.c scram-common.c string.c unicode_norm.c username.c
        wait_error.c);
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index eb2346b..937bf18 100644
*** a/src/tools/msvc/Solution.pm
--- b/src/tools/msvc/Solution.pm
*************** sub GenerateFiles
*** 410,415 ****
--- 410,451 ----
      }

      if (IsNewer(
+             'src/common/kwlist_d.h',
+             'src/include/parser/kwlist.h'))
+     {
+         print "Generating kwlist_d.h...\n";
+         system('perl src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h');
+     }
+
+     if (IsNewer(
+             'src/pl/plpgsql/src/pl_reserved_kwlist_d.h',
+             'src/pl/plpgsql/src/pl_reserved_kwlist.h')
+         || IsNewer(
+             'src/pl/plpgsql/src/pl_unreserved_kwlist_d.h',
+             'src/pl/plpgsql/src/pl_unreserved_kwlist.h'))
+     {
+         print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n";
+         chdir('src/pl/plpgsql/src');
+         system('perl ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h');
+         system('perl ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h');
+         chdir('../../../..');
+     }
+
+     if (IsNewer(
+             'src/interfaces/ecpg/preproc/c_kwlist_d.h',
+             'src/interfaces/ecpg/preproc/c_kwlist.h')
+         || IsNewer(
+             'src/interfaces/ecpg/preproc/ecpg_kwlist_d.h',
+             'src/interfaces/ecpg/preproc/ecpg_kwlist.h'))
+     {
+         print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n";
+         chdir('src/interfaces/ecpg/preproc');
+         system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h');
+         system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h');
+         chdir('../../../..');
+     }
+
+     if (IsNewer(
              'src/interfaces/ecpg/preproc/preproc.y',
              'src/backend/parser/gram.y'))
      {
diff --git a/src/tools/msvc/clean.bat b/src/tools/msvc/clean.bat
index 7a23a2b..069d6eb 100755
*** a/src/tools/msvc/clean.bat
--- b/src/tools/msvc/clean.bat
*************** if %DIST%==1 if exist src\pl\tcl\pltcler
*** 64,69 ****
--- 64,74 ----
  if %DIST%==1 if exist src\backend\utils\sort\qsort_tuple.c del /q src\backend\utils\sort\qsort_tuple.c
  if %DIST%==1 if exist src\bin\psql\sql_help.c del /q src\bin\psql\sql_help.c
  if %DIST%==1 if exist src\bin\psql\sql_help.h del /q src\bin\psql\sql_help.h
+ if %DIST%==1 if exist src\common\kwlist_d.h del /q src\common\kwlist_d.h
+ if %DIST%==1 if exist src\pl\plpgsql\src\pl_reserved_kwlist_d.h del /q src\pl\plpgsql\src\pl_reserved_kwlist_d.h
+ if %DIST%==1 if exist src\pl\plpgsql\src\pl_unreserved_kwlist_d.h del /q src\pl\plpgsql\src\pl_unreserved_kwlist_d.h
+ if %DIST%==1 if exist src\interfaces\ecpg\preproc\c_kwlist_d.h del /q src\interfaces\ecpg\preproc\c_kwlist_d.h
+ if %DIST%==1 if exist src\interfaces\ecpg\preproc\ecpg_kwlist_d.h del /q src\interfaces\ecpg\preproc\ecpg_kwlist_d.h
  if %DIST%==1 if exist src\interfaces\ecpg\preproc\preproc.y del /q src\interfaces\ecpg\preproc\preproc.y
  if %DIST%==1 if exist src\backend\catalog\postgres.bki del /q src\backend\catalog\postgres.bki
  if %DIST%==1 if exist src\backend\catalog\postgres.description del /q src\backend\catalog\postgres.description

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
David Rowley
Date:
On Sat, 5 Jan 2019 at 09:20, John Naylor <jcnaylor@gmail.com> wrote:
>
> On 1/3/19, Joerg Sonnenberger <joerg@bec.de> wrote:
> > Hello John,
> > I was pointed at your patch on IRC and decided to look into adding my
> > own pieces. What I can provide you is a fast perfect hash function
> > generator.  I've attached a sample hash function based on the current
> > main keyword list. hash() essentially gives you the number of the only
> > possible match, a final strcmp/memcmp is still necessary to verify that
> > it is an actual keyword though. The |0x20 can be dropped if all cases
> > have pre-lower-cased the input already. This would replace the binary
> > search in the lookup functions. Returning offsets directly would be easy
> > as well. That allows writing a single string where each entry is prefixed
> > with a type mask, the token id, the length of the keyword and the actual
> > keyword text. Does that sound useful to you?
>
> Judging by previous responses, there is still interest in using
> perfect hash functions, so thanks for this. I'm not knowledgeable
> enough to judge its implementation, so I'll leave that for others.

Well, I'm quite impressed by the resulting hash function. The
resulting hash value can be used directly to index the existing
440-element ScanKeywords[] array (way better than the 1815-element
array Andrew got from gperf).  If we also happened to store the length
of each keyword in that array, then we could compare lengths after
hashing.  If the lengths match, a memcmp() would confirm the match,
which should be a little cheaper than a strcmp(), and on 64-bit
machines we should be able to store the length for free.  If the
lengths differ, it's not a keyword.

It may also save some cycles to determine the input word's length at
the same time as lowering it.

The keyword length could also be easily determined by changing the
PG_KEYWORD macro to become:

#define PG_KEYWORD(a,b,c) {a,b,c,sizeof(a)-1},

after, of course, adding a new field to the ScanKeyword struct.

What I'm most interested in is how long it took to generate the hash
function in hash2.c?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Joerg Sonnenberger
Date:
On Mon, Jan 07, 2019 at 03:11:55AM +1300, David Rowley wrote:
> What I'm most interested in is how long it took to generate the hash
> function in hash2.c?

It's within the noise floor of time(1) on my laptop, i.e. ~1ms.

Joerg


Joerg Sonnenberger <joerg@bec.de> writes:
> On Mon, Jan 07, 2019 at 03:11:55AM +1300, David Rowley wrote:
>> What I'm most interested in is how long it took to generate the hash
>> function in hash2.c?

> It's within the noise floor of time(1) on my laptop, i.e. ~1ms.

I decided to do some simple performance measurements to see if we're
actually getting any useful results here.  I set up a test case that
just runs raw_parser in a tight loop:

    while (count-- > 0)
    {
        List       *parsetree_list;
        MemoryContext oldcontext;

        oldcontext = MemoryContextSwitchTo(mycontext);
        parsetree_list = pg_parse_query(query_string);
        MemoryContextSwitchTo(oldcontext);
        MemoryContextReset(mycontext);
        CHECK_FOR_INTERRUPTS();
    }

and exercised it on the contents of information_schema.sql.
I think that's a reasonably representative test case considering
that we're only examining the speed of the flex+bison stages.
(Since it's mostly DDL, including parse analysis might be a bit
unlike normal workloads, but for raw parsing it should be fine.)

On my workstation, in a non-cassert build, HEAD requires about 4700 ms
for 1000 iterations (with maybe 1% cross-run variation).

Applying the v8 patch I posted yesterday, the time drops to ~4520 ms
or about a 4% savings.  So not a lot, but it's pretty repeatable,
and it shows that reducing the cache footprint of keyword lookup
is worth something.

I then tried plastering in Joerg's example hash function, as in the
attached delta patch on top of v8.  This is *not* a usable patch;
it breaks plpgsql and ecpg, because ScanKeywordLookup no longer
works for non-core keyword sets.  But that doesn't matter for the
information_schema test case, and on that I find the runtime drops
to ~3820 ms, or 19% better than HEAD.  So this is definitely an
idea worth pursuing.

Some additional thoughts:

* It's too bad that the hash function doesn't have a return convention
that allows distinguishing "couldn't possibly match any keyword" from
"might match keyword 0".  I imagine a lot of the zero entries in its
hashtable could be interpreted as the former, so at the cost of perhaps
a couple more if-tests we could skip work at the caller.  As this patch
stands, we could only skip the strcmp() so it might not be worth the
trouble --- but if we use Joerg's |0x20 hack then we could hash before
downcasing, allowing us to skip the downcasing step if the word couldn't
possibly be a keyword.  Likely that would be worth the trouble.

* We should extend the ScanKeywordList representation to include a
field holding the longest keyword length in the table, which
gen_keywordlist.pl would have no trouble providing.  Then we could
skip downcasing and/or hashing for any word longer than that, replacing
the current NAMEDATALEN test, and thereby putting a tight bound on
the cost of downcasing and/or hashing.

* If we do hash first, then we could replace the downcasing loop and
strcmp with an integrated loop that downcases and compares one
character at a time, removing the need for the NAMEDATALEN-sized
buffer variable.

* I think it matters to the speed of the hashing loop that the
magic multipliers are hard-coded.  (Examining the assembly code
shows that, at least on my x86_64 hardware, gcc prefers to use
shift-and-add sequences here instead of multiply instructions.)
So we probably can't have inlined hashing code --- I imagine the
hash generator needs the flexibility to pick different values of
those multipliers.  I envision making this work by having
gen_keywordlist.pl emit a function definition for the hash step and
include a function pointer to it in ScanKeywordList.  That extra
function call will make things fractionally slower than what I have
here, but I don't think it will change the conclusions any.  This
approach would also greatly alleviate the concern I had yesterday
about ecpg's c_keywords.c having a second copy of the hashing code;
what it would have is its own generated function, which isn't much
of a problem.

* Given that the generator's runtime is negligible when coded in C,
I suspect that we might be able to tolerate the speed hit from translating
it to Perl, and frankly I'd much rather do that than cope with the
consequences of including C code in our build process.
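Taken together, the lookup flow these notes describe might look like the sketch below: bound the length, hash the raw input, then verify with an integrated downcase-and-compare loop, with no NAMEDATALEN-sized buffer. The two-keyword table and the length-based "perfect hash" are toy stand-ins for the generated ones, not anything from the patch:

```c
#include <stdint.h>
#include <string.h>

/* Toy stand-ins for the generated keyword data. */
static const char toy_kw_string[] = "and\0select";
static const uint16_t toy_kw_offsets[] = {0, 4};
#define TOY_MAX_KW_LEN 6

static uint32_t
toy_hash(const char *str, size_t len)
{
	(void) str;					/* real generated hash also mixes the bytes */
	return (len == 3) ? 0 : 1;	/* toy: happens to be perfect for this set */
}

static int
keyword_lookup_sketch(const char *str)
{
	size_t		len = strlen(str);
	uint32_t	h;
	const char *kw;

	if (len > TOY_MAX_KW_LEN)
		return -1;				/* too long to be any keyword */
	h = toy_hash(str, len);
	kw = toy_kw_string + toy_kw_offsets[h];
	/* integrated ASCII-only downcase and compare, one char at a time */
	while (*str != '\0')
	{
		char		ch = *str++;

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		if (ch != *kw++)
			return -1;
	}
	return (*kw == '\0') ? (int) h : -1;
}
```

Since a perfect hash names exactly one candidate, the single comparison loop replaces the whole binary search.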

I'm eagerly awaiting seeing Joerg's code, but I think in the
meantime I'm going to go look up NetBSD's nbperf to get an idea
of how painful it might be to do in Perl.  (Now, bearing in mind
that I'm not exactly fluent in Perl, there are probably other
people around here who could produce a better translation ...)

            regards, tom lane

diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c
index db62623..8c55f40 100644
*** a/src/common/kwlookup.c
--- b/src/common/kwlookup.c
***************
*** 17,22 ****
--- 17,149 ----

  #include "common/kwlookup.h"

+ static uint32
+ hash(const void *key, size_t keylen)
+ {
+     static const uint16 g[881] = {
+         0x015b, 0x0000, 0x0070, 0x01b2, 0x0000, 0x0078, 0x0020, 0x0000,
+         0x0000, 0x0193, 0x0000, 0x0000, 0x0000, 0x01ac, 0x0000, 0x0122,
+         0x00b9, 0x0176, 0x013b, 0x0000, 0x0000, 0x0000, 0x0150, 0x0000,
+         0x0000, 0x0000, 0x008b, 0x00ea, 0x00b3, 0x0197, 0x0000, 0x0118,
+         0x012d, 0x0102, 0x0000, 0x0091, 0x0061, 0x008c, 0x0000, 0x0000,
+         0x0144, 0x01b4, 0x0000, 0x0000, 0x01a8, 0x019e, 0x0000, 0x00da,
+         0x0000, 0x0000, 0x0122, 0x0176, 0x00f3, 0x016a, 0x00f4, 0x00c0,
+         0x0111, 0x0000, 0x0103, 0x0028, 0x001a, 0x0180, 0x0000, 0x0000,
+         0x005f, 0x0000, 0x00d9, 0x0000, 0x016d, 0x0000, 0x0170, 0x0007,
+         0x016f, 0x0000, 0x014e, 0x0098, 0x00a8, 0x004b, 0x0000, 0x0056,
+         0x0000, 0x0121, 0x0012, 0x0102, 0x0192, 0x0000, 0x00f2, 0x0066,
+         0x0000, 0x003a, 0x0000, 0x0000, 0x0144, 0x0000, 0x0000, 0x0133,
+         0x0067, 0x0169, 0x0000, 0x0000, 0x0152, 0x0122, 0x0000, 0x0058,
+         0x0135, 0x0045, 0x0193, 0x00d2, 0x007e, 0x0000, 0x00ae, 0x012c,
+         0x0000, 0x0000, 0x0000, 0x0000, 0x0124, 0x0000, 0x0046, 0x0018,
+         0x0000, 0x00ba, 0x00d1, 0x004a, 0x0000, 0x0000, 0x0000, 0x0000,
+         0x0000, 0x001f, 0x0000, 0x0101, 0x0000, 0x0000, 0x0000, 0x01b5,
+         0x016e, 0x0173, 0x008a, 0x0000, 0x0173, 0x000b, 0x0000, 0x00d5,
+         0x005e, 0x0000, 0x00ac, 0x0000, 0x0000, 0x0000, 0x01a1, 0x0000,
+         0x0000, 0x0127, 0x0000, 0x005e, 0x0000, 0x016f, 0x0000, 0x012b,
+         0x01a4, 0x01b4, 0x0000, 0x0000, 0x003a, 0x0000, 0x0000, 0x00f5,
+         0x00b1, 0x0003, 0x0123, 0x001b, 0x0000, 0x004f, 0x0000, 0x0000,
+         0x0000, 0x0000, 0x0000, 0x007a, 0x0000, 0x0000, 0x0000, 0x0000,
+         0x00c2, 0x00a2, 0x00b9, 0x0000, 0x00cb, 0x0000, 0x00d2, 0x0000,
+         0x0197, 0x0121, 0x0000, 0x00d6, 0x0107, 0x0000, 0x0000, 0x0000,
+         0x0000, 0x0000, 0x0000, 0x0165, 0x00df, 0x0121, 0x0000, 0x0000,
+         0x0000, 0x0000, 0x0000, 0x019b, 0x0000, 0x01ad, 0x0000, 0x014f,
+         0x018d, 0x0000, 0x015f, 0x0168, 0x0000, 0x0199, 0x0000, 0x0000,
+         0x0000, 0x00a1, 0x0000, 0x0000, 0x0109, 0x0000, 0x0000, 0x01a6,
+         0x0097, 0x0000, 0x0018, 0x0000, 0x00d1, 0x0000, 0x0000, 0x0000,
+         0x0187, 0x0018, 0x0000, 0x00aa, 0x0000, 0x0000, 0x0000, 0x0000,
+         0x0136, 0x0063, 0x00b8, 0x0000, 0x0067, 0x0114, 0x0000, 0x0000,
+         0x0151, 0x0000, 0x0000, 0x0000, 0x00bf, 0x0000, 0x0000, 0x0000,
+         0x01b4, 0x00d4, 0x0000, 0x0006, 0x017e, 0x0167, 0x003a, 0x017f,
+         0x0183, 0x00c9, 0x01a2, 0x0000, 0x0000, 0x0153, 0x00ce, 0x0000,
+         0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0051, 0x0000, 0x0086,
+         0x0000, 0x0083, 0x0137, 0x0000, 0x0000, 0x0050, 0x0000, 0x00d7,
+         0x0000, 0x0000, 0x0000, 0x0129, 0x00f1, 0x0000, 0x009b, 0x01a7,
+         0x0000, 0x00b4, 0x0000, 0x00e0, 0x0046, 0x0025, 0x0000, 0x0000,
+         0x0000, 0x0144, 0x0000, 0x01a5, 0x0044, 0x0096, 0x0078, 0x0166,
+         0x0000, 0x0000, 0x0000, 0x0143, 0x0000, 0x00b8, 0x0000, 0x009e,
+         0x0000, 0x008c, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x00fe,
+         0x0000, 0x0000, 0x0037, 0x0057, 0x0000, 0x00c3, 0x0000, 0x0000,
+         0x0000, 0x00bf, 0x014b, 0x0069, 0x00ce, 0x0000, 0x019d, 0x007f,
+         0x0186, 0x0000, 0x0119, 0x0015, 0x0000, 0x000e, 0x0113, 0x0139,
+         0x008e, 0x01ab, 0x0000, 0x005c, 0x0000, 0x0095, 0x0000, 0x019d,
+         0x0000, 0x0195, 0x0036, 0x0000, 0x0000, 0x00e0, 0x0146, 0x0000,
+         0x0033, 0x0000, 0x0000, 0x0035, 0x0000, 0x0000, 0x0000, 0x0000,
+         0x00d2, 0x0000, 0x0000, 0x0000, 0x0000, 0x004c, 0x00f0, 0x0000,
+         0x0119, 0x00bd, 0x0000, 0x0000, 0x0031, 0x0117, 0x00b4, 0x0000,
+         0x00f8, 0x0000, 0x0055, 0x0000, 0x0170, 0x0000, 0x0000, 0x0000,
+         0x00e4, 0x00b5, 0x01b5, 0x0024, 0x0000, 0x01a5, 0x0000, 0x0000,
+         0x0000, 0x0000, 0x0151, 0x0000, 0x00cc, 0x0000, 0x0000, 0x0150,
+         0x00f3, 0x0071, 0x00d0, 0x0085, 0x0140, 0x0000, 0x00ae, 0x0000,
+         0x00c4, 0x01a8, 0x0000, 0x0091, 0x0180, 0x0057, 0x0072, 0x0000,
+         0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x002a, 0x0000,
+         0x0112, 0x003d, 0x017f, 0x0088, 0x0000, 0x0158, 0x0046, 0x0101,
+         0x0000, 0x0000, 0x00ea, 0x0000, 0x0000, 0x00b2, 0x0149, 0x0000,
+         0x007c, 0x0107, 0x0161, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+         0x0000, 0x0169, 0x0000, 0x0118, 0x0091, 0x0043, 0x0064, 0x0000,
+         0x0000, 0x0194, 0x0000, 0x0000, 0x00e9, 0x0000, 0x0000, 0x0000,
+         0x005e, 0x0000, 0x0029, 0x0000, 0x0000, 0x0000, 0x003c, 0x0000,
+         0x0000, 0x008b, 0x0000, 0x0000, 0x00fd, 0x002d, 0x0184, 0x0000,
+         0x0000, 0x016a, 0x006f, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+         0x0000, 0x0000, 0x0000, 0x0000, 0x01a0, 0x0000, 0x003b, 0x0000,
+         0x006d, 0x0000, 0x016a, 0x0000, 0x01b2, 0x00c9, 0x0094, 0x0181,
+         0x018b, 0x0000, 0x0199, 0x00c7, 0x017e, 0x0000, 0x0000, 0x0160,
+         0x0000, 0x0000, 0x0175, 0x0000, 0x0000, 0x006a, 0x008d, 0x0000,
+         0x00ed, 0x0000, 0x00b7, 0x0000, 0x0000, 0x0107, 0x00f9, 0x0000,
+         0x0173, 0x0137, 0x0000, 0x0185, 0x0114, 0x006c, 0x0000, 0x0000,
+         0x00f4, 0x0189, 0x0000, 0x0000, 0x0102, 0x0000, 0x00d5, 0x0000,
+         0x0000, 0x015a, 0x0000, 0x00de, 0x0000, 0x0000, 0x0000, 0x0000,
+         0x0000, 0x0000, 0x0000, 0x00f0, 0x00e1, 0x0000, 0x0000, 0x01b6,
+         0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0197, 0x0000, 0x0154,
+         0x004a, 0x018d, 0x0000, 0x00c3, 0x0000, 0x0000, 0x0000, 0x0000,
+         0x00c1, 0x0189, 0x001c, 0x0000, 0x00a6, 0x0000, 0x0000, 0x00a4,
+         0x0000, 0x0000, 0x0000, 0x00ed, 0x0000, 0x0173, 0x0169, 0x00d2,
+         0x0117, 0x0000, 0x009b, 0x0000, 0x014e, 0x0000, 0x00ac, 0x0000,
+         0x008e, 0x0121, 0x0104, 0x0179, 0x0000, 0x01a5, 0x0103, 0x0000,
+         0x001b, 0x0000, 0x0000, 0x01a8, 0x00ba, 0x010c, 0x0000, 0x0000,
+         0x010e, 0x00ab, 0x0062, 0x0000, 0x0000, 0x0154, 0x0122, 0x013f,
+         0x015f, 0x0000, 0x00f5, 0x01a9, 0x017b, 0x01a4, 0x0000, 0x0000,
+         0x0040, 0x0004, 0x019b, 0x0000, 0x00e3, 0x010d, 0x015b, 0x0000,
+         0x0104, 0x0000, 0x0000, 0x00b2, 0x00e8, 0x0000, 0x0000, 0x0000,
+         0x0065, 0x0062, 0x007a, 0x0000, 0x0065, 0x008d, 0x0000, 0x0085,
+         0x0000, 0x0000, 0x0000, 0x00af, 0x0104, 0x01ab, 0x0040, 0x00a8,
+         0x0000, 0x0000, 0x0000, 0x0159, 0x0000, 0x0000, 0x0000, 0x0000,
+         0x0000, 0x0189, 0x017b, 0x0077, 0x0000, 0x00ea, 0x00d7, 0x007f,
+         0x0000, 0x00ae, 0x0047, 0x0000, 0x0163, 0x0000, 0x0000, 0x0157,
+         0x0178, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x014d, 0x0009,
+         0x0000, 0x0000, 0x00b6, 0x0000, 0x0000, 0x0000, 0x0000, 0x0192,
+         0x01b1, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+         0x0053, 0x012b, 0x00f6, 0x0096, 0x0000, 0x0141, 0x0000, 0x0000,
+         0x0026, 0x0044, 0x00ce, 0x0061, 0x0199, 0x0000, 0x016b, 0x0156,
+         0x011d, 0x0000, 0x0038, 0x008c, 0x00c8, 0x0000, 0x0000, 0x0002,
+         0x0000, 0x01a1, 0x0000, 0x001e, 0x0000, 0x0000, 0x00bc, 0x00ab,
+         0x0000, 0x0183, 0x0085, 0x0000, 0x0000, 0x010c, 0x0000, 0x01a5,
+         0x0120, 0x0000, 0x0000, 0x0000, 0x0000, 0x0135, 0x0079, 0x0000,
+         0x01ae, 0x0028, 0x0000, 0x0000, 0x014a, 0x0000, 0x00dd, 0x0000,
+         0x0000, 0x00d8, 0x00de, 0x0075, 0x0000, 0x0000, 0x0021, 0x0099,
+         0x0000, 0x00f7, 0x0000, 0x0000, 0x0000, 0x0046, 0x0000, 0x0010,
+         0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0116,
+         0x0000, 0x0000, 0x0000, 0x0000, 0x01a8, 0x0000, 0x0000, 0x0000,
+         0x004c, 0x0000, 0x00b7, 0x0000, 0x013f, 0x003c, 0x0000, 0x0000,
+         0x006d, 0x007f, 0x0181, 0x0000, 0x0013, 0x0000, 0x0180, 0x0000,
+         0x0000, 0x0000, 0x0000, 0x0000, 0x0136, 0x000d, 0x0000, 0x0000,
+         0x0000, 0x0082, 0x0000, 0x0000, 0x00cf, 0x00c3, 0x0000, 0x0000,
+         0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0127,
+         0x0000, 0x0013, 0x019c, 0x0000, 0x0024, 0x00bd, 0x017e, 0x0000,
+         0x00b8, 0x002e, 0x012c, 0x0007, 0x0000, 0x0000, 0x00f7, 0x0000,
+         0x0048,
+     };
+
+     const unsigned char *k = key;
+     uint32_t a, b;
+
+     a = b = 0x0U;
+     while (keylen--) {
+         a = a * 31 + (k[keylen]|0x20);
+         b = b * 37 + (k[keylen]|0x20);
+     }
+     return (g[a % 881] + g[b % 881]) % 440;
+ }

  /*
   * ScanKeywordLookup - see if a given word is a keyword
*************** ScanKeywordLookup(const char *text,
*** 41,50 ****
      int            len,
                  i;
      char        word[NAMEDATALEN];
!     const char *kw_string;
!     const uint16 *kw_offsets;
!     const uint16 *low;
!     const uint16 *high;

      len = strlen(text);
      /* We assume all keywords are shorter than NAMEDATALEN. */
--- 168,175 ----
      int            len,
                  i;
      char        word[NAMEDATALEN];
!     const char *kw;
!     uint32 h;

      len = strlen(text);
      /* We assume all keywords are shorter than NAMEDATALEN. */
*************** ScanKeywordLookup(const char *text,
*** 66,91 ****
      word[len] = '\0';

      /*
!      * Now do a binary search using plain strcmp() comparison.
       */
!     kw_string = keywords->kw_string;
!     kw_offsets = keywords->kw_offsets;
!     low = kw_offsets;
!     high = kw_offsets + (keywords->num_keywords - 1);
!     while (low <= high)
!     {
!         const uint16 *middle;
!         int            difference;
!
!         middle = low + (high - low) / 2;
!         difference = strcmp(kw_string + *middle, word);
!         if (difference == 0)
!             return middle - kw_offsets;
!         else if (difference < 0)
!             low = middle + 1;
!         else
!             high = middle - 1;
!     }
!
      return -1;
  }
--- 191,201 ----
      word[len] = '\0';

      /*
!      * Now do a hash search using plain strcmp() comparison.
       */
!     h = hash(word, len);
!     kw = keywords->kw_string + keywords->kw_offsets[h];
!     if (strcmp(word, kw) == 0)
!         return h;
      return -1;
  }

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Joerg Sonnenberger
Date:
On Sun, Jan 06, 2019 at 02:29:05PM -0500, Tom Lane wrote:
> * It's too bad that the hash function doesn't have a return convention
> that allows distinguishing "couldn't possibly match any keyword" from
> "might match keyword 0".  I imagine a lot of the zero entries in its
> hashtable could be interpreted as the former, so at the cost of perhaps
> a couple more if-tests we could skip work at the caller.  As this patch
> stands, we could only skip the strcmp() so it might not be worth the
> trouble --- but if we use Joerg's |0x20 hack then we could hash before
> downcasing, allowing us to skip the downcasing step if the word couldn't
> possibly be a keyword.  Likely that would be worth the trouble.

The hash function itself doesn't have enough data in it to know whether
it's a match or not. A strcmp or memcmp at the end will always be
necessary if you don't already know that it is a keyword.

> * We should extend the ScanKeywordList representation to include a
> field holding the longest keyword length in the table, which
> gen_keywordlist.pl would have no trouble providing.  Then we could
> skip downcasing and/or hashing for any word longer than that, replacing
> the current NAMEDATALEN test, and thereby putting a tight bound on
> the cost of downcasing and/or hashing.

Correct, possibly even have an array for each class of keywords.

> * If we do hash first, then we could replace the downcasing loop and
> strcmp with an integrated loop that downcases and compares one
> character at a time, removing the need for the NAMEDATALEN-sized
> buffer variable.

This is also an option. Assuming that the character set for keywords
doesn't change (letter or _), one 64-bit bit test per input character
would ensure that the |0x20 hack gives correct result for comparing as
well. Any other input would be an instant mismatch. If digits are valid
keyword characters, it would be two tests.
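One reading of that single 64-bit test is sketched below. After c|0x20, every character that may legally appear in a keyword (ASCII letters and '_') lands in the byte range 0x40..0x7f, so one bitmask over that 64-byte window classifies the input character; anything outside the class is an instant mismatch. The exact mask and window here are assumptions for illustration, not code from this thread, and the final strcmp/memcmp is still needed:

```c
#include <stdbool.h>
#include <stdint.h>

static bool
is_keyword_char(unsigned char c)
{
	unsigned int x = c | 0x20;

	/* bits 33..58 cover 'a'..'z'; bit 63 covers '_' | 0x20 (0x7f) */
	static const uint64_t mask =
		(((UINT64_C(1) << 26) - 1) << ('a' - 0x40)) | (UINT64_C(1) << 0x3f);

	if (x < 0x40 || x > 0x7f)
		return false;			/* can't be a keyword character */
	return (mask >> (x - 0x40)) & 1;
}
```

Digits fall below 0x40 after the OR, so as noted above they would need a second test if they were legal keyword characters.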

> * I think it matters to the speed of the hashing loop that the
> magic multipliers are hard-coded.  (Examining the assembly code
> shows that, at least on my x86_64 hardware, gcc prefers to use
> shift-and-add sequences here instead of multiply instructions.)

Right, that's one of the reasons for choosing them. The other is that it
gives decent mixing for ASCII-only input. At the moment, they are
hard-coded.

> So we probably can't have inlined hashing code --- I imagine the
> hash generator needs the flexibility to pick different values of
> those multipliers.

Right now, only the initial values are randomized. Picking a different
set of hash functions is possible, but something that should be done only
if there is an actual need. That is what I meant when I said stronger
mixing might be necessary for "annoying" keyword additions.

> I envision making this work by having
> gen_keywordlist.pl emit a function definition for the hash step and
> include a function pointer to it in ScanKeywordList.  That extra
> function call will make things fractionally slower than what I have
> here, but I don't think it will change the conclusions any.  This
> approach would also greatly alleviate the concern I had yesterday
> about ecpg's c_keywords.c having a second copy of the hashing code;
> what it would have is its own generated function, which isn't much
> of a problem.

There are two ways of dealing with it:
(1) Have one big hash table with all the various keywords and a class
mask stored. If there is enough overlap between the keyword tables, it
can significantly reduce the amount of space needed. In terms of code
complexity, it adds one class check at the end, i.e. a bitmap test.
(2) Build independent hash tables for each input class. A bit simpler to
manage, but can result in a bit of wasted space.

From the generator side, it doesn't matter which choice is taken.
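Option (1) could be sketched roughly as follows; the type name, bit assignments, and token values are illustrative assumptions:

```c
#include <stdint.h>

/* Hypothetical keyword-class bits for a single shared table. */
#define KWCLASS_CORE	(1 << 0)
#define KWCLASS_ECPG	(1 << 1)

typedef struct SharedKeyword
{
	const char *name;
	uint16_t	token;
	uint8_t		classes;	/* bitmask of keyword sets this word belongs to */
} SharedKeyword;

static const SharedKeyword shared_kws[] = {
	{"select", 100, KWCLASS_CORE | KWCLASS_ECPG},
	{"connect", 101, KWCLASS_ECPG},
};

/*
 * After the perfect hash plus string compare identify an entry, one extra
 * bitmap test decides whether the word counts as a keyword for the scanner
 * class doing the lookup.
 */
static int
keyword_in_class(const SharedKeyword *kw, uint8_t wanted_class)
{
	return (kw->classes & wanted_class) != 0;
}
```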

> * Given that the generator's runtime is negligible when coded in C,
> I suspect that we might be able to tolerate the speed hit from translating
> it to Perl, and frankly I'd much rather do that than cope with the
> consequences of including C code in our build process.

I'm just not fluent enough in Perl to be much help for that, but I can
sit down and write a trivial Python version of it :) There are a couple
of changes that are useful to have in this context, e.g. the ability to
directly provide the offsets in the result table to allow dropping the
index -> offset table completely.

Joerg


Joerg Sonnenberger <joerg@bec.de> writes:
> On Sun, Jan 06, 2019 at 02:29:05PM -0500, Tom Lane wrote:
>> So we probably can't have inlined hashing code --- I imagine the
>> hash generator needs the flexibility to pick different values of
>> those multipliers.

> Right now, only the initial values are randomized. Picking a different
> set of hash functions is possible, but something that should be done only
> if there is an actual need. That is what I meant when I said stronger
> mixing might be necessary for "annoying" keyword additions.

Hmm.  I'm still leaning towards using generated, out-of-line hash
functions though, because then we could have a generator switch
indicating whether to apply the |0x20 case coercion or not.
(I realize that we could blow off that consideration and use a
case-insensitive hash function all the time, but it seems cleaner
to me not to make assumptions about how variable the hash function
parameters will need to be.)

> There are two ways for dealing with it:
> (1) Have one big hash table with all the various keywords and a class
> mask stored. If there is enough overlap between the keyword tables, it
> can significantly reduce the amount of space needed. In terms of code
> complexity, it adds one class check at the end, i.e. a bitmap test.

No, this would be a bad idea IMO, because it makes the core, plpgsql,
and ecpg keyword sets all interdependent.  If you add a keyword to any
one of those and forget to rebuild the other components, you've got trouble.
Maybe we could make that reliable, but I don't think it's worth fooling
with for hypothetical benefit.  Also, it'd make the net space usage more
not less, because each of those executables/shlibs would contain copies
of all the keywords for the other ones' needs.

            regards, tom lane


Joerg Sonnenberger <joerg@bec.de> writes:
> On Sun, Jan 06, 2019 at 02:29:05PM -0500, Tom Lane wrote:
>> * We should extend the ScanKeywordList representation to include a
>> field holding the longest keyword length in the table, which
>> gen_keywordlist.pl would have no trouble providing.  Then we could
>> skip downcasing and/or hashing for any word longer than that, replacing
>> the current NAMEDATALEN test, and thereby putting a tight bound on
>> the cost of downcasing and/or hashing.

> Correct, possibly even have an array for each class of keywords.

I added that change to v8 and noted a further small improvement in my
test case.  That probably says something about the prevalence of long
identifiers in information_schema.sql ;-), but anyway we can figure
it's not a net loss.

I've pushed that version (v8 + max_kw_len); if the buildfarm doesn't
fall over, we can move on with looking at hashing.

I took a quick look through the NetBSD nbperf sources at

http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/nbperf/

and I concur with your judgment that we could manage translating
that into Perl, especially if we only implement the parts we need.
I'm curious what further changes you've made locally, and what
parameters you were using.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On 1/6/19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've pushed that version (v8 + max_kw_len); if the buildfarm doesn't
> fall over, we can move on with looking at hashing.

Thank you. The API adjustment looks good, and I'm glad that splitting
out the aux info led to even further cleanups.

-John Naylor


I wrote:
> I took a quick look through the NetBSD nbperf sources at
> http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/nbperf/
> and I concur with your judgment that we could manage translating
> that into Perl, especially if we only implement the parts we need.

Here's an implementation of that, using the hash functions you showed
upthread.  The speed of the Perl script seems to be pretty acceptable;
less than 100ms to handle the main SQL keyword list, on my machine.
Yeah, the C version might be less than 1ms, but I don't think that
we need to put up with non-Perl build tooling for that.

Using the same test case as before (parsing information_schema.sql),
I get runtimes around 3560 ms, a shade better than my jury-rigged
prototype.

Probably there's a lot to be criticized about the Perl style below;
anybody feel a need to rewrite it?

            regards, tom lane

diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c
index d72842e..445b99d 100644
*** a/src/common/kwlookup.c
--- b/src/common/kwlookup.c
***************
*** 35,94 ****
   * receive a different case-normalization mapping.
   */
  int
! ScanKeywordLookup(const char *text,
                    const ScanKeywordList *keywords)
  {
!     int            len,
!                 i;
!     char        word[NAMEDATALEN];
!     const char *kw_string;
!     const uint16 *kw_offsets;
!     const uint16 *low;
!     const uint16 *high;
!
!     len = strlen(text);

      if (len > keywords->max_kw_len)
!         return -1;                /* too long to be any keyword */
!
!     /* We assume all keywords are shorter than NAMEDATALEN. */
!     Assert(len < NAMEDATALEN);

      /*
!      * Apply an ASCII-only downcasing.  We must not use tolower() since it may
!      * produce the wrong translation in some locales (eg, Turkish).
       */
!     for (i = 0; i < len; i++)
!     {
!         char        ch = text[i];
!
!         if (ch >= 'A' && ch <= 'Z')
!             ch += 'a' - 'A';
!         word[i] = ch;
!     }
!     word[len] = '\0';

      /*
!      * Now do a binary search using plain strcmp() comparison.
       */
!     kw_string = keywords->kw_string;
!     kw_offsets = keywords->kw_offsets;
!     low = kw_offsets;
!     high = kw_offsets + (keywords->num_keywords - 1);
!     while (low <= high)
      {
!         const uint16 *middle;
!         int            difference;

!         middle = low + (high - low) / 2;
!         difference = strcmp(kw_string + *middle, word);
!         if (difference == 0)
!             return middle - kw_offsets;
!         else if (difference < 0)
!             low = middle + 1;
!         else
!             high = middle - 1;
      }

!     return -1;
  }
--- 35,81 ----
   * receive a different case-normalization mapping.
   */
  int
! ScanKeywordLookup(const char *str,
                    const ScanKeywordList *keywords)
  {
!     size_t        len;
!     uint32        h;
!     const char *kw;

+     /*
+      * Reject immediately if too long to be any keyword.  This saves useless
+      * hashing and downcasing work on long strings.
+      */
+     len = strlen(str);
      if (len > keywords->max_kw_len)
!         return -1;

      /*
!      * Compute the hash function.  We assume it was generated to produce
!      * case-insensitive results.  Since it's a perfect hash, we need only
!      * match to the specific keyword it identifies.
       */
!     h = keywords->hash(str, len);

      /*
!      * Compare character-by-character to see if we have a match, applying an
!      * ASCII-only downcasing to the input characters.  We must not use
!      * tolower() since it may produce the wrong translation in some locales
!      * (eg, Turkish).
       */
!     kw = GetScanKeyword(h, keywords);
!     while (*str != '\0')
      {
!         char        ch = *str++;

!         if (ch >= 'A' && ch <= 'Z')
!             ch += 'a' - 'A';
!         if (ch != *kw++)
!             return -1;
      }
+     if (*kw != '\0')
+         return -1;

!     /* Success! */
!     return h;
  }
diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h
index 39efb35..a6609ee 100644
*** a/src/include/common/kwlookup.h
--- b/src/include/common/kwlookup.h
***************
*** 14,19 ****
--- 14,22 ----
  #ifndef KWLOOKUP_H
  #define KWLOOKUP_H

+ /* Hash function used by ScanKeywordLookup */
+ typedef uint32 (*ScanKeywordHashFunc) (const void *key, size_t keylen);
+
  /*
   * This struct contains the data needed by ScanKeywordLookup to perform a
   * search within a set of keywords.  The contents are typically generated by
*************** typedef struct ScanKeywordList
*** 23,28 ****
--- 26,32 ----
  {
      const char *kw_string;        /* all keywords in order, separated by \0 */
      const uint16 *kw_offsets;    /* offsets to the start of each keyword */
+     ScanKeywordHashFunc hash;    /* perfect hash function for keywords */
      int            num_keywords;    /* number of keywords */
      int            max_kw_len;        /* length of longest keyword */
  } ScanKeywordList;
diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile
index b5b74a3..abfe3cc 100644
*** a/src/interfaces/ecpg/preproc/Makefile
--- b/src/interfaces/ecpg/preproc/Makefile
*************** preproc.y: ../../../backend/parser/gram.
*** 57,63 ****

  # generate keyword headers
  c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $<

  ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST)
      $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $<
--- 57,63 ----

  # generate keyword headers
  c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords --case $<

  ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST)
      $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $<
diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c
index 38ddf6f..3493915 100644
*** a/src/interfaces/ecpg/preproc/c_keywords.c
--- b/src/interfaces/ecpg/preproc/c_keywords.c
***************
*** 9,16 ****
   */
  #include "postgres_fe.h"

- #include <ctype.h>
-
  #include "preproc_extern.h"
  #include "preproc.h"

--- 9,14 ----
*************** static const uint16 ScanCKeywordTokens[]
*** 32,70 ****
   *
   * Returns the token value of the keyword, or -1 if no match.
   *
!  * Do a binary search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
  int
! ScanCKeywordLookup(const char *text)
  {
!     const char *kw_string;
!     const uint16 *kw_offsets;
!     const uint16 *low;
!     const uint16 *high;

!     if (strlen(text) > ScanCKeywords.max_kw_len)
!         return -1;                /* too long to be any keyword */

!     kw_string = ScanCKeywords.kw_string;
!     kw_offsets = ScanCKeywords.kw_offsets;
!     low = kw_offsets;
!     high = kw_offsets + (ScanCKeywords.num_keywords - 1);

!     while (low <= high)
!     {
!         const uint16 *middle;
!         int            difference;

!         middle = low + (high - low) / 2;
!         difference = strcmp(kw_string + *middle, text);
!         if (difference == 0)
!             return ScanCKeywordTokens[middle - kw_offsets];
!         else if (difference < 0)
!             low = middle + 1;
!         else
!             high = middle - 1;
!     }

      return -1;
  }
--- 30,63 ----
   *
   * Returns the token value of the keyword, or -1 if no match.
   *
!  * Do a hash search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
  int
! ScanCKeywordLookup(const char *str)
  {
!     size_t        len;
!     uint32        h;
!     const char *kw;

!     /*
!      * Reject immediately if too long to be any keyword.  This saves useless
!      * hashing work on long strings.
!      */
!     len = strlen(str);
!     if (len > ScanCKeywords.max_kw_len)
!         return -1;

!     /*
!      * Compute the hash function.  Since it's a perfect hash, we need only
!      * match to the specific keyword it identifies.
!      */
!     h = ScanCKeywords_hash_func(str, len);

!     kw = GetScanKeyword(h, &ScanCKeywords);

!     if (strcmp(kw, str) == 0)
!         return ScanCKeywordTokens[h];

      return -1;
  }
diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl
index d764aff..241ba51 100644
*** a/src/tools/gen_keywordlist.pl
--- b/src/tools/gen_keywordlist.pl
***************
*** 14,19 ****
--- 14,25 ----
  # variable named according to the -v switch ("ScanKeywords" by default).
  # The variable is marked "static" unless the -e switch is given.
  #
+ # ScanKeywordList uses hash-based lookup, so this script also selects
+ # a minimal perfect hash function for the keyword set, and emits a
+ # static hash function that is referenced in the ScanKeywordList struct.
+ # The hash function is case-insensitive unless --case is specified.
+ # Note that case insensitivity assumes all-ASCII keywords!
+ #
  #
  # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
  # Portions Copyright (c) 1994, Regents of the University of California
*************** use Getopt::Long;
*** 28,39 ****

  my $output_path = '';
  my $extern = 0;
  my $varname = 'ScanKeywords';

  GetOptions(
!     'output:s' => \$output_path,
!     'extern'   => \$extern,
!     'varname:s' => \$varname) || usage();

  my $kw_input_file = shift @ARGV || die "No input file.\n";

--- 34,47 ----

  my $output_path = '';
  my $extern = 0;
+ my $case_sensitive = 0;
  my $varname = 'ScanKeywords';

  GetOptions(
!     'output:s'       => \$output_path,
!     'extern'         => \$extern,
!     'case-sensitive' => \$case_sensitive,
!     'varname:s'      => \$varname) || usage();

  my $kw_input_file = shift @ARGV || die "No input file.\n";

*************** while (<$kif>)
*** 87,93 ****
--- 95,116 ----
      }
  }

+ # When being case-insensitive, insist that the input be all-lower-case.
+ if (!$case_sensitive)
+ {
+     foreach my $kw (@keywords)
+     {
+         die qq|The keyword "$kw" is not lower-case in $kw_input_file\n|
+           if ($kw ne lc $kw);
+     }
+ }
+
  # Error out if the keyword names are not in ASCII order.
+ #
+ # While this isn't really necessary with hash-based lookup, it's still
+ # helpful because it provides a cheap way to reject duplicate keywords.
+ # Also, insisting on sorted order ensures that code that scans the keyword
+ # table linearly will see the keywords in a canonical order.
  for my $i (0..$#keywords - 1)
  {
      die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n|
*************** print $kwdef "};\n\n";
*** 128,139 ****
--- 151,167 ----

  printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords;

+ # Emit the definition of the hash function.
+
+ construct_hash_function();
+
  # Emit the struct that wraps all this lookup info into one variable.

  print $kwdef "static " if !$extern;
  printf $kwdef "const ScanKeywordList %s = {\n", $varname;
  printf $kwdef qq|\t%s_kw_string,\n|, $varname;
  printf $kwdef qq|\t%s_kw_offsets,\n|, $varname;
+ printf $kwdef qq|\t%s_hash_func,\n|, $varname;
  printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname;
  printf $kwdef qq|\t%d\n|, $max_len;
  print $kwdef "};\n\n";
*************** print $kwdef "};\n\n";
*** 141,146 ****
--- 169,433 ----
  printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename;


+ # This code constructs a minimal perfect hash function for the given
+ # keyword set, using an algorithm described in
+ # "An optimal algorithm for generating minimal perfect hash functions"
+ # by Czech, Havas and Majewski in Information Processing Letters,
+ # 43(5):256-264, October 1992.
+ # This implementation is loosely based on NetBSD's "nbperf",
+ # which was written by Joerg Sonnenberger.
+
+ # At runtime, we'll compute two simple hash functions of the input word,
+ # and use them to index into a mapping table.  The hash functions are just
+ # multiply-and-add in uint32 arithmetic, with different multipliers but
+ # the same initial seed.
+
+ # Calculate a hash function as the run-time code will do.
+ # If we are making a case-insensitive hash function, we implement that
+ # by OR'ing 0x20 into each byte of the key.  This correctly transforms
+ # upper-case ASCII into lower-case ASCII, while not changing digits or
+ # dollar signs.  It might reduce our ability to discriminate other
+ # characters, but not by very much, and typically keywords wouldn't
+ # contain other characters anyway.
+ sub calc_hash
+ {
+     my ($keyword, $mult, $seed) = @_;
+
+     my $result = $seed;
+     for my $c (split //, $keyword)
+     {
+         my $cn = ord($c);
+         $cn |= 0x20 if !$case_sensitive;
+         $result = ($result * $mult + $cn) % 4294967296;
+     }
+     return $result;
+ }
+
+ # Attempt to construct a mapping table for the minimal perfect hash function.
+ # Returns a nonempty integer array if successful, else an empty array.
+ sub construct_hash_table
+ {
+     # Parameters of the hash functions are passed in.
+     my ($hash_mult1, $hash_mult2, $hash_seed) = @_;
+
+     # This algorithm is based on a graph whose edges correspond to the
+     # keywords and whose vertices correspond to entries of the mapping table.
+     # A keyword edge links the two vertices whose indexes are the outputs of
+     # the two hash functions for that keyword.  For K keywords, the mapping
+     # table must have at least 2*K+1 entries, guaranteeing that there's at
+     # least one unused entry.
+     my $nedges = scalar @keywords;    # number of edges
+     my $nverts = 2 * $nedges + 1;     # number of vertices
+
+     # Initialize the array of edges.
+     my @E = ();
+     foreach my $kw (@keywords)
+     {
+         # Calculate hashes for this keyword.
+         # The hashes are immediately reduced modulo the mapping table size.
+         my $hash1 = calc_hash($kw, $hash_mult1, $hash_seed) % $nverts;
+         my $hash2 = calc_hash($kw, $hash_mult2, $hash_seed) % $nverts;
+
+         # If the two hashes are the same for any keyword, we have to fail
+         # since this edge would itself form a cycle in the graph.
+         return () if $hash1 == $hash2;
+
+         # Add the edge for this keyword, initially populating just
+         # its "left" and "right" fields, which are the hash values.
+         push @E,
+           {
+             left   => $hash1,
+             right  => $hash2,
+             l_prev => undef,
+             l_next => undef,
+             r_prev => undef,
+             r_next => undef
+           };
+     }
+
+     # Initialize the array of vertices, marking them all "unused".
+     my @V = ();
+     for (my $i = 0; $i < $nverts; $i++)
+     {
+         push @V, { l_edge => undef, r_edge => undef };
+     }
+
+     # Attach each edge to a chain of the edges using its vertices.
+     # At completion, each vertex's l_edge field, if not undef,
+     # points to one edge having that vertex as left end, and the
+     # remaining such edges are chained through l_next/l_prev links.
+     # Likewise for right ends with r_edge and r_next/r_prev.
+     for (my $i = 0; $i < $nedges; $i++)
+     {
+         my $v = $E[$i]{left};
+         $E[ $V[$v]{l_edge} ]{l_prev} = $i if (defined $V[$v]{l_edge});
+         $E[$i]{l_next} = $V[$v]{l_edge};
+         $V[$v]{l_edge} = $i;
+
+         $v = $E[$i]{right};
+         $E[ $V[$v]{r_edge} ]{r_prev} = $i if (defined $V[$v]{r_edge});
+         $E[$i]{r_next} = $V[$v]{r_edge};
+         $V[$v]{r_edge} = $i;
+     }
+
+     # Now we attempt to prove the graph acyclic.
+     # A cycle-free graph is either empty or has some vertex of degree 1.
+     # Removing the edge attached to that vertex doesn't change this property,
+     # so doing that repeatedly will reduce the size of the graph.
+     # If the graph is empty at the end of the process, it was acyclic.
+     # We track the order of edge removal so that the next phase can process
+     # them in reverse order of removal.
+     my @output_order = ();
+
+     for (my $i = 0; $i < $nverts; $i++)
+     {
+         my $v = $i;
+         # If vertex v is of degree 1 (i.e. exactly 1 edge connects to it),
+         # remove that edge, and then consider the edge's other vertex to see
+         # if it is now of degree 1.  The inner loop repeats until reaching a
+         # vertex not of degree 1.
+         for (;;)
+         {
+             # If it's of degree 0, stop.
+             last if (!defined $V[$v]{l_edge} && !defined $V[$v]{r_edge});
+             # Can't be degree 1 if both sides have edges.
+             last if (defined $V[$v]{l_edge} && defined $V[$v]{r_edge});
+             # Check relevant side.
+             my ($e, $v2);
+             if (defined $V[$v]{l_edge})
+             {
+                 $e = $V[$v]{l_edge};
+                 # If there are more entries in the chain, v is not of degree 1.
+                 last if (defined $E[$e]{l_next});
+                 # OK, unlink e from v ...
+                 $V[$v]{l_edge} = undef;
+                 # ... and from its right-side vertex.
+                 $v2 = $E[$e]{right};
+                 if (defined $E[$e]{r_prev})
+                 {
+                     $E[ $E[$e]{r_prev} ]{r_next} = $E[$e]{r_next};
+                 }
+                 else
+                 {
+                     $V[$v2]{r_edge} = $E[$e]{r_next};
+                 }
+                 if (defined $E[$e]{r_next})
+                 {
+                     $E[ $E[$e]{r_next} ]{r_prev} = $E[$e]{r_prev};
+                 }
+             }
+             else
+             {
+                 $e = $V[$v]{r_edge};
+                 # If there are more entries in the chain, v is not of degree 1.
+                 last if (defined $E[$e]{r_next});
+                 # OK, unlink e from v ...
+                 $V[$v]{r_edge} = undef;
+                 # ... and from its left-side vertex.
+                 $v2 = $E[$e]{left};
+                 if (defined $E[$e]{l_prev})
+                 {
+                     $E[ $E[$e]{l_prev} ]{l_next} = $E[$e]{l_next};
+                 }
+                 else
+                 {
+                     $V[$v2]{l_edge} = $E[$e]{l_next};
+                 }
+                 if (defined $E[$e]{l_next})
+                 {
+                     $E[ $E[$e]{l_next} ]{l_prev} = $E[$e]{l_prev};
+                 }
+             }
+
+             # Push e onto the front of the output-order list.
+             unshift @output_order, $e;
+             # Consider v2 on next iteration of inner loop.
+             $v = $v2;
+         }
+     }
+
+     # We succeeded only if all edges were removed from the graph.
+     return () if (scalar(@output_order) != $nedges);
+
+     # OK, build the hash table of size $nverts.
+     my @hashtab = (0) x $nverts;
+     # We need a "visited" flag array in this step, too.
+     my @visited = (0) x $nverts;
+
+     # The idea is that for any keyword, the sum of the hash table entries for
+     # its first and second hash values, reduced mod $nedges, is the desired
+     # output (i.e., the keyword number).  By assigning hash table values in
+     # the selected edge order, we can guarantee that that's true.
+     foreach my $e (@output_order)
+     {
+         my $l = $E[$e]{left};
+         my $r = $E[$e]{right};
+         if (!$visited[$l])
+         {
+             $hashtab[$l] = ($nedges + $e - $hashtab[$r]) % $nedges;
+         }
+         else
+         {
+             die "oops, doubly used hashtab entry" if $visited[$r];
+             $hashtab[$r] = ($nedges + $e - $hashtab[$l]) % $nedges;
+         }
+         $visited[$l] = 1;
+         $visited[$r] = 1;
+     }
+
+     return @hashtab;
+ }
+
+ sub construct_hash_function
+ {
+     # Try different hash function parameters until we find a set that works
+     # for these keywords.  In principle we might need to change multipliers,
+     # but these two multipliers are chosen to be cheap to calculate via
+     # shift-and-add, so don't change them except at great need.
+     my $hash_mult1 = 31;
+     my $hash_mult2 = 37;
+
+     # We just try successive hash seed values until we find one that works.
+     # (Commonly, random seeds are tried, but we want reproducible results
+     # from this program so we don't do that.)
+     my $hash_seed;
+     my @hashtab;
+     for ($hash_seed = 0;; $hash_seed++)
+     {
+         @hashtab = construct_hash_table($hash_mult1, $hash_mult2, $hash_seed);
+         last if @hashtab;
+     }
+
+     # OK, emit the hash function definition including the hash table.
+     printf $kwdef "static uint32\n";
+     printf $kwdef "%s_hash_func(const void *key, size_t keylen)\n{\n",
+       $varname;
+     printf $kwdef "\tstatic const uint16 h[%d] = {\n", scalar(@hashtab);
+     for (my $i = 0; $i < scalar(@hashtab); $i++)
+     {
+         printf $kwdef "%s0x%04x,%s",
+           ($i % 8 == 0 ? "\t\t" : " "),
+           $hashtab[$i],
+           ($i % 8 == 7 ? "\n" : "");
+     }
+     printf $kwdef "\n" if (scalar(@hashtab) % 8 != 0);
+     printf $kwdef "\t};\n\n";
+     printf $kwdef "\tconst unsigned char *k = key;\n";
+     printf $kwdef "\tuint32\t\ta = %d;\n",   $hash_seed;
+     printf $kwdef "\tuint32\t\tb = %d;\n\n", $hash_seed;
+     printf $kwdef "\twhile (keylen--)\n\t{\n";
+     printf $kwdef "\t\tunsigned char c = *k++";
+     printf $kwdef " | 0x20" if !$case_sensitive;
+     printf $kwdef ";\n\n";
+     printf $kwdef "\t\ta = a * %d + c;\n", $hash_mult1;
+     printf $kwdef "\t\tb = b * %d + c;\n", $hash_mult2;
+     printf $kwdef "\t}\n";
+     printf $kwdef "\treturn (h[a %% %d] + h[b %% %d]) %% %d;\n",
+       scalar(@hashtab), scalar(@hashtab), scalar(@keywords);
+     printf $kwdef "}\n\n";
+ }
+
+
  sub usage
  {
      die <<EOM;
*************** Usage: gen_keywordlist.pl [--output/-o <
*** 148,153 ****
--- 435,441 ----
      --output   Output directory (default '.')
      --varname  Name for ScanKeywordList variable (default 'ScanKeywords')
      --extern   Allow the ScanKeywordList variable to be globally visible
+     --case     Keyword matching is to be case-sensitive

  gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList.
  The output filename is derived from the input file by inserting _d,
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 937bf18..ed603fc 100644
*** a/src/tools/msvc/Solution.pm
--- b/src/tools/msvc/Solution.pm
*************** sub GenerateFiles
*** 440,446 ****
      {
          print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n";
          chdir('src/interfaces/ecpg/preproc');
!         system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h');
          system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h');
          chdir('../../../..');
      }
--- 440,446 ----
      {
          print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n";
          chdir('src/interfaces/ecpg/preproc');
!         system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords --case c_kwlist.h');
          system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h');
          chdir('../../../..');
      }

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2019-01-07 16:11:04 -0500, Tom Lane wrote:
> I wrote:
> > I took a quick look through the NetBSD nbperf sources at
> > http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/nbperf/
> > and I concur with your judgment that we could manage translating
> > that into Perl, especially if we only implement the parts we need.
> 
> Here's an implementation of that, using the hash functions you showed
> upthread.  The speed of the Perl script seems to be pretty acceptable;
> less than 100ms to handle the main SQL keyword list, on my machine.
> Yeah, the C version might be less than 1ms, but I don't think that
> we need to put up with non-Perl build tooling for that.
> 
> Using the same test case as before (parsing information_schema.sql),
> I get runtimes around 3560 ms, a shade better than my jury-rigged
> prototype.
> 
> Probably there's a lot to be criticized about the Perl style below;
> anybody feel a need to rewrite it?

Hm, shouldn't we extract the perfect hash generation into a perl module
or such? It seems that there's plenty other possible uses for it.

Greetings,

Andres Freund


I wrote:
> Probably there's a lot to be criticized about the Perl style below;
> anybody feel a need to rewrite it?

Here's a somewhat better version.  I realized that I was being too
slavishly tied to the data structures used in the C version; in Perl
it's easier to manage the lists of edges as hashes.  I can't see any
need to distinguish left and right edge sets, either, so this just
has one such hash per vertex.
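For illustration, here is a rough Python sketch (not the patch itself) of that simplified representation: each vertex keeps a single set of incident edge numbers, and the acyclicity check repeatedly peels off edges at degree-1 vertices, succeeding only if every edge gets removed:

```python
def peel(nverts, edges):
    """edges: list of (v1, v2) vertex pairs, one per keyword.
    Returns the edges in reverse order of removal if the graph is
    acyclic, else None (the caller would then try another hash seed)."""
    incident = [set() for _ in range(nverts)]
    for e, (a, b) in enumerate(edges):
        if a == b:                    # self-loop: trivially a cycle
            return None
        incident[a].add(e)
        incident[b].add(e)

    removed = []
    for start in range(nverts):
        v = start
        # Peel edges off while v has exactly one incident edge.
        while len(incident[v]) == 1:
            (e,) = incident[v]        # the sole incident edge
            incident[v].discard(e)
            a, b = edges[e]
            v2 = b if v == a else a   # the edge's other endpoint
            incident[v2].discard(e)
            removed.append(e)
            v = v2

    # Acyclic iff every edge was removed.
    return removed[::-1] if len(removed) == len(edges) else None
```

The reversed order is what the table-assignment phase consumes, matching the `unshift @output_order, $e` in the Perl version.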

Also, it seems to me that we *can* make intelligent use of unused
hashtable entries to exit early on many non-keyword inputs.  The reason
the existing code fails to do so is that it computes the sums and
differences of hashtable entries in unsigned modulo arithmetic; but if
we make the hashtable entries signed, we can set them up as exact
differences and drop the final modulo operation in the hash function.
Then, any out-of-range sum must indicate an input that is not a keyword
(because it is touching a pair of hashtable entries that didn't go
together) and we can exit early from the caller.  This in turn lets us
mark unused hashtable entries with large values to ensure that sums
involving them will be out of range.
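A tiny numeric sketch of that signed-entries scheme (illustrative only; the vertex pairs and removal order are hard-coded rather than computed from real hashes): entries are assigned as exact signed differences in reverse removal order, never-touched entries get a large sentinel, and lookup is just the sum of two entries plus a range check:

```python
def build_signed_table(nverts, edges, rev_removal_order):
    """edges[e] = (v1, v2) for keyword number e.  Assign table values
    so that table[v1] + table[v2] == e exactly (no final modulo);
    entries never touched get a large sentinel so that any sum
    involving them is out of range."""
    UNUSED = 0x7FFF
    table = [0] * nverts
    visited = [False] * nverts
    for e in rev_removal_order:
        a, b = edges[e]
        if not visited[a]:
            table[a] = e - table[b]     # exact signed difference
        else:
            assert not visited[b], "doubly used table entry"
            table[b] = e - table[a]
        visited[a] = visited[b] = True
    for v in range(nverts):
        if not visited[v]:
            table[v] = UNUSED           # poisons any sum touching it
    return table

def lookup(table, nkeywords, h1, h2):
    idx = table[h1] + table[h2]
    # An out-of-range sum means the input is certainly not a keyword,
    # so the caller can skip the string comparison entirely.
    return idx if 0 <= idx < nkeywords else -1
```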

A weak spot in that argument is that it's not entirely clear how
large the differences can get --- with an unlucky series of
collisions, maybe they could get large enough to overflow int16?
I don't think that's likely for the size of problem this script
is going to encounter, so I just put in an error check for it.
But it could do with closer analysis before deciding that this
is a general-purpose solution.

            regards, tom lane

diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c
index d72842e..9dc1fee 100644
*** a/src/common/kwlookup.c
--- b/src/common/kwlookup.c
***************
*** 35,94 ****
   * receive a different case-normalization mapping.
   */
  int
! ScanKeywordLookup(const char *text,
                    const ScanKeywordList *keywords)
  {
!     int            len,
!                 i;
!     char        word[NAMEDATALEN];
!     const char *kw_string;
!     const uint16 *kw_offsets;
!     const uint16 *low;
!     const uint16 *high;
!
!     len = strlen(text);

      if (len > keywords->max_kw_len)
!         return -1;                /* too long to be any keyword */
!
!     /* We assume all keywords are shorter than NAMEDATALEN. */
!     Assert(len < NAMEDATALEN);

      /*
!      * Apply an ASCII-only downcasing.  We must not use tolower() since it may
!      * produce the wrong translation in some locales (eg, Turkish).
       */
!     for (i = 0; i < len; i++)
!     {
!         char        ch = text[i];

!         if (ch >= 'A' && ch <= 'Z')
!             ch += 'a' - 'A';
!         word[i] = ch;
!     }
!     word[len] = '\0';

      /*
!      * Now do a binary search using plain strcmp() comparison.
       */
!     kw_string = keywords->kw_string;
!     kw_offsets = keywords->kw_offsets;
!     low = kw_offsets;
!     high = kw_offsets + (keywords->num_keywords - 1);
!     while (low <= high)
      {
!         const uint16 *middle;
!         int            difference;

!         middle = low + (high - low) / 2;
!         difference = strcmp(kw_string + *middle, word);
!         if (difference == 0)
!             return middle - kw_offsets;
!         else if (difference < 0)
!             low = middle + 1;
!         else
!             high = middle - 1;
      }

!     return -1;
  }
--- 35,89 ----
   * receive a different case-normalization mapping.
   */
  int
! ScanKeywordLookup(const char *str,
                    const ScanKeywordList *keywords)
  {
!     size_t        len;
!     int            h;
!     const char *kw;

+     /*
+      * Reject immediately if too long to be any keyword.  This saves useless
+      * hashing and downcasing work on long strings.
+      */
+     len = strlen(str);
      if (len > keywords->max_kw_len)
!         return -1;

      /*
!      * Compute the hash function.  We assume it was generated to produce
!      * case-insensitive results.  Since it's a perfect hash, we need only
!      * match to the specific keyword it identifies.
       */
!     h = keywords->hash(str, len);

!     /*
!      * An out-of-range result implies no match.  (This can happen for
!      * non-keyword inputs because the hash function will sum two unrelated
!      * hashtable entries.)
!      */
!     if (h < 0 || h >= keywords->num_keywords)
!         return -1;

      /*
!      * Compare character-by-character to see if we have a match, applying an
!      * ASCII-only downcasing to the input characters.  We must not use
!      * tolower() since it may produce the wrong translation in some locales
!      * (eg, Turkish).
       */
!     kw = GetScanKeyword(h, keywords);
!     while (*str != '\0')
      {
!         char        ch = *str++;

!         if (ch >= 'A' && ch <= 'Z')
!             ch += 'a' - 'A';
!         if (ch != *kw++)
!             return -1;
      }
+     if (*kw != '\0')
+         return -1;

!     /* Success! */
!     return h;
  }
diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h
index 39efb35..dbff367 100644
*** a/src/include/common/kwlookup.h
--- b/src/include/common/kwlookup.h
***************
*** 14,19 ****
--- 14,22 ----
  #ifndef KWLOOKUP_H
  #define KWLOOKUP_H

+ /* Hash function used by ScanKeywordLookup */
+ typedef int (*ScanKeywordHashFunc) (const void *key, size_t keylen);
+
  /*
   * This struct contains the data needed by ScanKeywordLookup to perform a
   * search within a set of keywords.  The contents are typically generated by
*************** typedef struct ScanKeywordList
*** 23,28 ****
--- 26,32 ----
  {
      const char *kw_string;        /* all keywords in order, separated by \0 */
      const uint16 *kw_offsets;    /* offsets to the start of each keyword */
+     ScanKeywordHashFunc hash;    /* perfect hash function for keywords */
      int            num_keywords;    /* number of keywords */
      int            max_kw_len;        /* length of longest keyword */
  } ScanKeywordList;
diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile
index b5b74a3..abfe3cc 100644
*** a/src/interfaces/ecpg/preproc/Makefile
--- b/src/interfaces/ecpg/preproc/Makefile
*************** preproc.y: ../../../backend/parser/gram.
*** 57,63 ****

  # generate keyword headers
  c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $<

  ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST)
      $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $<
--- 57,63 ----

  # generate keyword headers
  c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords --case $<

  ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST)
      $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $<
diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c
index 38ddf6f..387298b 100644
*** a/src/interfaces/ecpg/preproc/c_keywords.c
--- b/src/interfaces/ecpg/preproc/c_keywords.c
***************
*** 9,16 ****
   */
  #include "postgres_fe.h"

- #include <ctype.h>
-
  #include "preproc_extern.h"
  #include "preproc.h"

--- 9,14 ----
*************** static const uint16 ScanCKeywordTokens[]
*** 32,70 ****
   *
   * Returns the token value of the keyword, or -1 if no match.
   *
!  * Do a binary search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
  int
! ScanCKeywordLookup(const char *text)
  {
!     const char *kw_string;
!     const uint16 *kw_offsets;
!     const uint16 *low;
!     const uint16 *high;

!     if (strlen(text) > ScanCKeywords.max_kw_len)
!         return -1;                /* too long to be any keyword */

!     kw_string = ScanCKeywords.kw_string;
!     kw_offsets = ScanCKeywords.kw_offsets;
!     low = kw_offsets;
!     high = kw_offsets + (ScanCKeywords.num_keywords - 1);

!     while (low <= high)
!     {
!         const uint16 *middle;
!         int            difference;

!         middle = low + (high - low) / 2;
!         difference = strcmp(kw_string + *middle, text);
!         if (difference == 0)
!             return ScanCKeywordTokens[middle - kw_offsets];
!         else if (difference < 0)
!             low = middle + 1;
!         else
!             high = middle - 1;
!     }

      return -1;
  }
--- 30,71 ----
   *
   * Returns the token value of the keyword, or -1 if no match.
   *
!  * Do a hash search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
  int
! ScanCKeywordLookup(const char *str)
  {
!     size_t        len;
!     int            h;
!     const char *kw;

!     /*
!      * Reject immediately if too long to be any keyword.  This saves useless
!      * hashing work on long strings.
!      */
!     len = strlen(str);
!     if (len > ScanCKeywords.max_kw_len)
!         return -1;

!     /*
!      * Compute the hash function.  Since it's a perfect hash, we need only
!      * match to the specific keyword it identifies.
!      */
!     h = ScanCKeywords_hash_func(str, len);

!     /*
!      * An out-of-range result implies no match.  (This can happen for
!      * non-keyword inputs because the hash function will sum two unrelated
!      * hashtable entries.)
!      */
!     if (h < 0 || h >= ScanCKeywords.num_keywords)
!         return -1;

!     kw = GetScanKeyword(h, &ScanCKeywords);
!
!     if (strcmp(kw, str) == 0)
!         return ScanCKeywordTokens[h];

      return -1;
  }
diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl
index d764aff..f15afc1 100644
*** a/src/tools/gen_keywordlist.pl
--- b/src/tools/gen_keywordlist.pl
***************
*** 14,19 ****
--- 14,25 ----
  # variable named according to the -v switch ("ScanKeywords" by default).
  # The variable is marked "static" unless the -e switch is given.
  #
+ # ScanKeywordList uses hash-based lookup, so this script also selects
+ # a minimal perfect hash function for the keyword set, and emits a
+ # static hash function that is referenced in the ScanKeywordList struct.
+ # The hash function is case-insensitive unless --case is specified.
+ # Note that case insensitivity assumes all-ASCII keywords!
+ #
  #
  # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
  # Portions Copyright (c) 1994, Regents of the University of California
*************** use Getopt::Long;
*** 28,39 ****

  my $output_path = '';
  my $extern = 0;
  my $varname = 'ScanKeywords';

  GetOptions(
!     'output:s' => \$output_path,
!     'extern'   => \$extern,
!     'varname:s' => \$varname) || usage();

  my $kw_input_file = shift @ARGV || die "No input file.\n";

--- 34,47 ----

  my $output_path = '';
  my $extern = 0;
+ my $case_sensitive = 0;
  my $varname = 'ScanKeywords';

  GetOptions(
!     'output:s'       => \$output_path,
!     'extern'         => \$extern,
!     'case-sensitive' => \$case_sensitive,
!     'varname:s'      => \$varname) || usage();

  my $kw_input_file = shift @ARGV || die "No input file.\n";

*************** while (<$kif>)
*** 87,93 ****
--- 95,116 ----
      }
  }

+ # When being case-insensitive, insist that the input be all-lower-case.
+ if (!$case_sensitive)
+ {
+     foreach my $kw (@keywords)
+     {
+         die qq|The keyword "$kw" is not lower-case in $kw_input_file\n|
+           if ($kw ne lc $kw);
+     }
+ }
+
  # Error out if the keyword names are not in ASCII order.
+ #
+ # While this isn't really necessary with hash-based lookup, it's still
+ # helpful because it provides a cheap way to reject duplicate keywords.
+ # Also, insisting on sorted order ensures that code that scans the keyword
+ # table linearly will see the keywords in a canonical order.
  for my $i (0..$#keywords - 1)
  {
      die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n|
*************** print $kwdef "};\n\n";
*** 128,139 ****
--- 151,167 ----

  printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords;

+ # Emit the definition of the hash function.
+
+ construct_hash_function();
+
  # Emit the struct that wraps all this lookup info into one variable.

  print $kwdef "static " if !$extern;
  printf $kwdef "const ScanKeywordList %s = {\n", $varname;
  printf $kwdef qq|\t%s_kw_string,\n|, $varname;
  printf $kwdef qq|\t%s_kw_offsets,\n|, $varname;
+ printf $kwdef qq|\t%s_hash_func,\n|, $varname;
  printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname;
  printf $kwdef qq|\t%d\n|, $max_len;
  print $kwdef "};\n\n";
*************** print $kwdef "};\n\n";
*** 141,146 ****
--- 169,392 ----
  printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename;


+ # This code constructs a minimal perfect hash function for the given
+ # keyword set, using an algorithm described in
+ # "An optimal algorithm for generating minimal perfect hash functions"
+ # by Czech, Havas and Majewski in Information Processing Letters,
+ # 43(5):256-264, October 1992.
+ # This implementation is loosely based on NetBSD's "nbperf",
+ # which was written by Joerg Sonnenberger.
+
+ # At runtime, we'll compute two simple hash functions of the input word,
+ # and use them to index into a mapping table.  The hash functions are just
+ # multiply-and-add in uint32 arithmetic, with different multipliers but
+ # the same initial seed.
+
+ # Calculate a hash function as the run-time code will do.
+ # If we are making a case-insensitive hash function, we implement that
+ # by OR'ing 0x20 into each byte of the key.  This correctly transforms
+ # upper-case ASCII into lower-case ASCII, while not changing digits or
+ # dollar signs.  (It does change '_', else we could just skip adjusting
+ # $cn here at all, for typical keyword strings.)
+ sub calc_hash
+ {
+     my ($keyword, $mult, $seed) = @_;
+
+     my $result = $seed;
+     for my $c (split //, $keyword)
+     {
+         my $cn = ord($c);
+         $cn |= 0x20 if !$case_sensitive;
+         $result = ($result * $mult + $cn) % 4294967296;
+     }
+     return $result;
+ }
+
+ # Attempt to construct a mapping table for the minimal perfect hash function.
+ # Returns a nonempty integer array if successful, else an empty array.
+ sub construct_hash_table
+ {
+     # Parameters of the hash functions are passed in.
+     my ($hash_mult1, $hash_mult2, $hash_seed) = @_;
+
+     # This algorithm is based on a graph whose edges correspond to the
+     # keywords and whose vertices correspond to entries of the mapping table.
+     # A keyword edge links the two vertices whose indexes are the outputs of
+     # the two hash functions for that keyword.  For K keywords, the mapping
+     # table must have at least 2*K+1 entries, guaranteeing that there's at
+     # least one unused entry.
+     my $nedges = scalar @keywords;    # number of edges
+     my $nverts = 2 * $nedges + 1;     # number of vertices
+
+     # Initialize the array of edges.
+     my @E = ();
+     foreach my $kw (@keywords)
+     {
+         # Calculate hashes for this keyword.
+         # The hashes are immediately reduced modulo the mapping table size.
+         my $hash1 = calc_hash($kw, $hash_mult1, $hash_seed) % $nverts;
+         my $hash2 = calc_hash($kw, $hash_mult2, $hash_seed) % $nverts;
+
+         # If the two hashes are the same for any keyword, we have to fail
+         # since this edge would itself form a cycle in the graph.
+         return () if $hash1 == $hash2;
+
+         # Add the edge for this keyword.
+         push @E, { left => $hash1, right => $hash2 };
+     }
+
+     # Initialize the array of vertices, giving them all empty lists
+     # of associated edges.  (The lists will be hashes of edge numbers.)
+     my @V = ();
+     for (my $v = 0; $v < $nverts; $v++)
+     {
+         push @V, { edges => {} };
+     }
+
+     # Insert each edge in the lists of edges using its vertices.
+     for (my $e = 0; $e < $nedges; $e++)
+     {
+         my $v = $E[$e]{left};
+         $V[$v]{edges}->{$e} = 1;
+
+         $v = $E[$e]{right};
+         $V[$v]{edges}->{$e} = 1;
+     }
+
+     # Now we attempt to prove the graph acyclic.
+     # A cycle-free graph is either empty or has some vertex of degree 1.
+     # Removing the edge attached to that vertex doesn't change this property,
+     # so doing that repeatedly will reduce the size of the graph.
+     # If the graph is empty at the end of the process, it was acyclic.
+     # We track the order of edge removal so that the next phase can process
+     # them in reverse order of removal.
+     my @output_order = ();
+
+     # Consider each vertex as a possible starting point for edge-removal.
+     for (my $startv = 0; $startv < $nverts; $startv++)
+     {
+         my $v = $startv;
+
+         # If vertex v is of degree 1 (i.e. exactly 1 edge connects to it),
+         # remove that edge, and then consider the edge's other vertex to see
+         # if it is now of degree 1.  The inner loop repeats until reaching a
+         # vertex not of degree 1.
+         while (scalar(keys(%{ $V[$v]{edges} })) == 1)
+         {
+             # Unlink its only edge.
+             my $e = (keys(%{ $V[$v]{edges} }))[0];
+             delete($V[$v]{edges}->{$e});
+
+             # Unlink the edge from its other vertex, too.
+             my $v2 = $E[$e]{left};
+             $v2 = $E[$e]{right} if ($v2 == $v);
+             delete($V[$v2]{edges}->{$e});
+
+             # Push e onto the front of the output-order list.
+             unshift @output_order, $e;
+
+             # Consider v2 on next iteration of inner loop.
+             $v = $v2;
+         }
+     }
+
+     # We succeeded only if all edges were removed from the graph.
+     return () if (scalar(@output_order) != $nedges);
+
+     # OK, build the hash table of size $nverts.
+     my @hashtab = (0) x $nverts;
+     # We need a "visited" flag array in this step, too.
+     my @visited = (0) x $nverts;
+
+     # The idea is that for any keyword, the sum of the hash table entries
+     # for its first and second hash values is the desired output (i.e., the
+     # keyword number).  By assigning hash table values in the selected edge
+     # order, we can guarantee that that's true.
+     foreach my $e (@output_order)
+     {
+         my $l = $E[$e]{left};
+         my $r = $E[$e]{right};
+         if (!$visited[$l])
+         {
+             # $hashtab[$r] might be zero, or some previously assigned value.
+             $hashtab[$l] = $e - $hashtab[$r];
+         }
+         else
+         {
+             die "oops, doubly used hashtab entry" if $visited[$r];
+             # $hashtab[$l] might be zero, or some previously assigned value.
+             $hashtab[$r] = $e - $hashtab[$l];
+         }
+         # Now freeze both of these hashtab entries.
+         $visited[$l] = 1;
+         $visited[$r] = 1;
+     }
+
+     # Check that the results fit in int16.  (With very large keyword sets, we
+     # might need to allow wider hashtable entries; but that day is far away.)
+     # Then set any unused hash table entries to 0x7FFF.  For reasonable
+     # keyword counts, that will ensure that any hash sum involving such an
+     # entry will be out-of-range, allowing the caller to exit early.
+     for (my $v = 0; $v < $nverts; $v++)
+     {
+         die "oops, hashtab entry overflow"
+           if $hashtab[$v] < -32767 || $hashtab[$v] > 32767;
+         $hashtab[$v] = 0x7FFF if !$visited[$v];
+     }
+
+     return @hashtab;
+ }
+
+ sub construct_hash_function
+ {
+     # Try different hash function parameters until we find a set that works
+     # for these keywords.  In principle we might need to change multipliers,
+     # but these two multipliers are chosen to be cheap to calculate via
+     # shift-and-add, so don't change them except at great need.
+     my $hash_mult1 = 31;
+     my $hash_mult2 = 37;
+
+     # We just try successive hash seed values until we find one that works.
+     # (Commonly, random seeds are tried, but we want reproducible results
+     # from this program so we don't do that.)
+     my $hash_seed;
+     my @hashtab;
+     for ($hash_seed = 0;; $hash_seed++)
+     {
+         @hashtab = construct_hash_table($hash_mult1, $hash_mult2, $hash_seed);
+         last if @hashtab;
+     }
+     my $nhash = scalar(@hashtab);
+
+     # OK, emit the hash function definition including the hash table.
+     printf $kwdef "static int\n";
+     printf $kwdef "%s_hash_func(const void *key, size_t keylen)\n{\n",
+       $varname;
+     printf $kwdef "\tstatic const int16 h[%d] = {\n", $nhash;
+     for (my $i = 0; $i < $nhash; $i++)
+     {
+         printf $kwdef "%s%6d,%s",
+           ($i % 8 == 0 ? "\t\t" : " "),
+           $hashtab[$i],
+           ($i % 8 == 7 ? "\n" : "");
+     }
+     printf $kwdef "\n" if ($nhash % 8 != 0);
+     printf $kwdef "\t};\n\n";
+     printf $kwdef "\tconst unsigned char *k = key;\n";
+     printf $kwdef "\tuint32\t\ta = %d;\n",   $hash_seed;
+     printf $kwdef "\tuint32\t\tb = %d;\n\n", $hash_seed;
+     printf $kwdef "\twhile (keylen--)\n\t{\n";
+     printf $kwdef "\t\tunsigned char c = *k++";
+     printf $kwdef " | 0x20" if !$case_sensitive;
+     printf $kwdef ";\n\n";
+     printf $kwdef "\t\ta = a * %d + c;\n", $hash_mult1;
+     printf $kwdef "\t\tb = b * %d + c;\n", $hash_mult2;
+     printf $kwdef "\t}\n";
+     printf $kwdef "\treturn h[a %% %d] + h[b %% %d];\n", $nhash, $nhash;
+     printf $kwdef "}\n\n";
+ }
+
+
  sub usage
  {
      die <<EOM;
*************** Usage: gen_keywordlist.pl [--output/-o <
*** 148,153 ****
--- 394,400 ----
      --output   Output directory (default '.')
      --varname  Name for ScanKeywordList variable (default 'ScanKeywords')
      --extern   Allow the ScanKeywordList variable to be globally visible
+     --case     Keyword matching is to be case-sensitive

  gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList.
  The output filename is derived from the input file by inserting _d,
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 937bf18..ed603fc 100644
*** a/src/tools/msvc/Solution.pm
--- b/src/tools/msvc/Solution.pm
*************** sub GenerateFiles
*** 440,446 ****
      {
          print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n";
          chdir('src/interfaces/ecpg/preproc');
!         system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h');
          system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h');
          chdir('../../../..');
      }
--- 440,446 ----
      {
          print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n";
          chdir('src/interfaces/ecpg/preproc');
!         system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords --case c_kwlist.h');
          system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h');
          chdir('../../../..');
      }

Andres Freund <andres@anarazel.de> writes:
> Hm, shouldn't we extract the perfect hash generation into a perl module
> or such? It seems that there's plenty other possible uses for it.

Such as?  But in any case, that sounds like a task for someone with
more sense of Perl style than I have.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2019-01-07 19:37:51 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Hm, shouldn't we extract the perfect hash generation into a perl module
> > or such? It seems that there's plenty other possible uses for it.
>
> Such as?

Builtin functions for one, which we'd swatted down last time round due
to gperf's deficiencies. But I think there's plenty more potential,
e.g. it'd make sense from a performance POV to use a perfect hash
function for locks on builtin objects (the hashtable for lookups therein
shows up prominently in a fair number of profiles, and they are a large
percentage of the acquisitions). I'm certain there's plenty more, I've
not thought too much about it.


> But in any case, that sounds like a task for someone with
> more sense of Perl style than I have.

John, any chance you could help out with that... :)

Greetings,

Andres Freund


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andrew Dunstan
Date:
On 1/7/19 7:52 PM, Andres Freund wrote:
> Hi,
>
> On 2019-01-07 19:37:51 -0500, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> Hm, shouldn't we extract the perfect hash generation into a perl module
>>> or such? It seems that there's plenty other possible uses for it.
>> Such as?
> Builtin functions for one, which we'd swatted down last time round due
> to gperf's deficiencies. But I think there's plenty more potential,
> e.g. it'd make sense from a performance POV to use a perfect hash
> function for locks on builtin objects (the hashtable for lookups therein
> shows up prominently in a fair number of profiles, and they are a large
> percentage of the acquisitions). I'm certain there's plenty more, I've
> not thought too much about it.
>


Yeah, this is pretty neat.



>> But in any case, that sounds like a task for someone with
>> more sense of Perl style than I have.
> John, any chance you could help out with that... :)
>

If he doesn't I will.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan
<andrew.dunstan@2ndquadrant.com> wrote:
> On 1/7/19 7:52 PM, Andres Freund wrote:
> > Builtin functions for one, which we'd swatted down last time round due
> > to gperf's deficiencies.

Do you mean the fmgr table?

> >> But in any case, that sounds like a task for someone with
> >> more sense of Perl style than I have.
> > John, any chance you could help out with that... :)
>
> If he doesn't I will.

I'll take a crack at separating into a module.  I'll wait a bit in
case there are any stylistic suggestions on the patch as it stands.


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2019-01-08 13:41:16 -0500, John Naylor wrote:
> On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
> > On 1/7/19 7:52 PM, Andres Freund wrote:
> > > Builtin functions for one, which we'd swatted down last time round due
> > > to gperf's deficiencies.
> 
> Do you mean the fmgr table?

Not the entire fmgr table, but just the builtin oid index, generated by
the following section:

# Create the fmgr_builtins table, collect data for fmgr_builtin_oid_index
print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n";
my %bmap;
$bmap{'t'} = 'true';
$bmap{'f'} = 'false';
my @fmgr_builtin_oid_index;
my $fmgr_count = 0;
foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr)
{
    print $tfh
      "  { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }";

    $fmgr_builtin_oid_index[ $s->{oid} ] = $fmgr_count++;

    if ($fmgr_count <= $#fmgr)
    {
        print $tfh ",\n";
    }
    else
    {
        print $tfh "\n";
    }
}
print $tfh "};\n";

print $tfh qq|
const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin));
|;

The generated fmgr_builtin_oid_index is pretty sparse, and a denser
hashtable might, e.g., be more efficient from a cache perspective.

Greetings,

Andres Freund


John Naylor <john.naylor@2ndquadrant.com> writes:
> On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan
> <andrew.dunstan@2ndquadrant.com> wrote:
>> If he doesn't I will.

> I'll take a crack at separating into a module.  I'll wait a bit in
> case there are any stylistic suggestions on the patch as it stands.

I had a go at that myself.  I'm sure there's plenty to criticize in
the result, but at least it passes make check-world ;-)

I resolved the worry I had last night about the range of table values
by putting in logic to check the range and choose a suitable table
element type.  There are a couple of existing calls where we manage
to fit the hashtable elements into int8 that way; of course, by
definition that doesn't save a whole lot of space since such tables
couldn't have many elements, but it seems cleaner anyway.

            regards, tom lane

diff --git a/src/common/Makefile b/src/common/Makefile
index 317b071..d0c2b97 100644
*** a/src/common/Makefile
--- b/src/common/Makefile
*************** OBJS_FRONTEND = $(OBJS_COMMON) fe_memuti
*** 63,68 ****
--- 63,73 ----
  OBJS_SHLIB = $(OBJS_FRONTEND:%.o=%_shlib.o)
  OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o)

+ # where to find gen_keywordlist.pl and subsidiary files
+ TOOLSDIR = $(top_srcdir)/src/tools
+ GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl
+ GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm
+
  all: libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a

  distprep: kwlist_d.h
*************** libpgcommon_srv.a: $(OBJS_SRV)
*** 118,125 ****
      $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@

  # generate SQL keyword lookup table to be included into keywords*.o.
! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(top_srcdir)/src/tools/gen_keywordlist.pl
!     $(PERL) $(top_srcdir)/src/tools/gen_keywordlist.pl --extern $<

  # Dependencies of keywords*.o need to be managed explicitly to make sure
  # that you don't get broken parsing code, even in a non-enable-depend build.
--- 123,130 ----
      $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@

  # generate SQL keyword lookup table to be included into keywords*.o.
! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(GEN_KEYWORDLIST_DEPS)
!     $(GEN_KEYWORDLIST) --extern $<

  # Dependencies of keywords*.o need to be managed explicitly to make sure
  # that you don't get broken parsing code, even in a non-enable-depend build.
diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c
index d72842e..9dc1fee 100644
*** a/src/common/kwlookup.c
--- b/src/common/kwlookup.c
***************
*** 35,94 ****
   * receive a different case-normalization mapping.
   */
  int
! ScanKeywordLookup(const char *text,
                    const ScanKeywordList *keywords)
  {
!     int            len,
!                 i;
!     char        word[NAMEDATALEN];
!     const char *kw_string;
!     const uint16 *kw_offsets;
!     const uint16 *low;
!     const uint16 *high;
!
!     len = strlen(text);

      if (len > keywords->max_kw_len)
!         return -1;                /* too long to be any keyword */
!
!     /* We assume all keywords are shorter than NAMEDATALEN. */
!     Assert(len < NAMEDATALEN);

      /*
!      * Apply an ASCII-only downcasing.  We must not use tolower() since it may
!      * produce the wrong translation in some locales (eg, Turkish).
       */
!     for (i = 0; i < len; i++)
!     {
!         char        ch = text[i];

!         if (ch >= 'A' && ch <= 'Z')
!             ch += 'a' - 'A';
!         word[i] = ch;
!     }
!     word[len] = '\0';

      /*
!      * Now do a binary search using plain strcmp() comparison.
       */
!     kw_string = keywords->kw_string;
!     kw_offsets = keywords->kw_offsets;
!     low = kw_offsets;
!     high = kw_offsets + (keywords->num_keywords - 1);
!     while (low <= high)
      {
!         const uint16 *middle;
!         int            difference;

!         middle = low + (high - low) / 2;
!         difference = strcmp(kw_string + *middle, word);
!         if (difference == 0)
!             return middle - kw_offsets;
!         else if (difference < 0)
!             low = middle + 1;
!         else
!             high = middle - 1;
      }

!     return -1;
  }
--- 35,89 ----
   * receive a different case-normalization mapping.
   */
  int
! ScanKeywordLookup(const char *str,
                    const ScanKeywordList *keywords)
  {
!     size_t        len;
!     int            h;
!     const char *kw;

+     /*
+      * Reject immediately if too long to be any keyword.  This saves useless
+      * hashing and downcasing work on long strings.
+      */
+     len = strlen(str);
      if (len > keywords->max_kw_len)
!         return -1;

      /*
!      * Compute the hash function.  We assume it was generated to produce
!      * case-insensitive results.  Since it's a perfect hash, we need only
!      * match to the specific keyword it identifies.
       */
!     h = keywords->hash(str, len);

!     /*
!      * An out-of-range result implies no match.  (This can happen for
!      * non-keyword inputs because the hash function will sum two unrelated
!      * hashtable entries.)
!      */
!     if (h < 0 || h >= keywords->num_keywords)
!         return -1;

      /*
!      * Compare character-by-character to see if we have a match, applying an
!      * ASCII-only downcasing to the input characters.  We must not use
!      * tolower() since it may produce the wrong translation in some locales
!      * (eg, Turkish).
       */
!     kw = GetScanKeyword(h, keywords);
!     while (*str != '\0')
      {
!         char        ch = *str++;

!         if (ch >= 'A' && ch <= 'Z')
!             ch += 'a' - 'A';
!         if (ch != *kw++)
!             return -1;
      }
+     if (*kw != '\0')
+         return -1;

!     /* Success! */
!     return h;
  }
diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h
index 39efb35..dbff367 100644
*** a/src/include/common/kwlookup.h
--- b/src/include/common/kwlookup.h
***************
*** 14,19 ****
--- 14,22 ----
  #ifndef KWLOOKUP_H
  #define KWLOOKUP_H

+ /* Hash function used by ScanKeywordLookup */
+ typedef int (*ScanKeywordHashFunc) (const void *key, size_t keylen);
+
  /*
   * This struct contains the data needed by ScanKeywordLookup to perform a
   * search within a set of keywords.  The contents are typically generated by
*************** typedef struct ScanKeywordList
*** 23,28 ****
--- 26,32 ----
  {
      const char *kw_string;        /* all keywords in order, separated by \0 */
      const uint16 *kw_offsets;    /* offsets to the start of each keyword */
+     ScanKeywordHashFunc hash;    /* perfect hash function for keywords */
      int            num_keywords;    /* number of keywords */
      int            max_kw_len;        /* length of longest keyword */
  } ScanKeywordList;
diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile
index b5b74a3..6c02f97 100644
*** a/src/interfaces/ecpg/preproc/Makefile
--- b/src/interfaces/ecpg/preproc/Makefile
*************** OBJS=    preproc.o pgc.o type.o ecpg.o outp
*** 28,34 ****
      keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \
      $(WIN32RES)

! GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl

  # Suppress parallel build to avoid a bug in GNU make 3.82
  # (see comments in ../Makefile)
--- 28,37 ----
      keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \
      $(WIN32RES)

! # where to find gen_keywordlist.pl and subsidiary files
! TOOLSDIR = $(top_srcdir)/src/tools
! GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl
! GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm

  # Suppress parallel build to avoid a bug in GNU make 3.82
  # (see comments in ../Makefile)
*************** preproc.y: ../../../backend/parser/gram.
*** 56,66 ****
      $(PERL) $(srcdir)/check_rules.pl $(srcdir) $<

  # generate keyword headers
! c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $<

! ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $<

  # Force these dependencies to be known even without dependency info built:
  ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h
--- 59,69 ----
      $(PERL) $(srcdir)/check_rules.pl $(srcdir) $<

  # generate keyword headers
! c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST_DEPS)
!     $(GEN_KEYWORDLIST) --varname ScanCKeywords --case $<

! ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST_DEPS)
!     $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $<

  # Force these dependencies to be known even without dependency info built:
  ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h
diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c
index 38ddf6f..387298b 100644
*** a/src/interfaces/ecpg/preproc/c_keywords.c
--- b/src/interfaces/ecpg/preproc/c_keywords.c
***************
*** 9,16 ****
   */
  #include "postgres_fe.h"

- #include <ctype.h>
-
  #include "preproc_extern.h"
  #include "preproc.h"

--- 9,14 ----
*************** static const uint16 ScanCKeywordTokens[]
*** 32,70 ****
   *
   * Returns the token value of the keyword, or -1 if no match.
   *
!  * Do a binary search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
  int
! ScanCKeywordLookup(const char *text)
  {
!     const char *kw_string;
!     const uint16 *kw_offsets;
!     const uint16 *low;
!     const uint16 *high;

!     if (strlen(text) > ScanCKeywords.max_kw_len)
!         return -1;                /* too long to be any keyword */

!     kw_string = ScanCKeywords.kw_string;
!     kw_offsets = ScanCKeywords.kw_offsets;
!     low = kw_offsets;
!     high = kw_offsets + (ScanCKeywords.num_keywords - 1);

!     while (low <= high)
!     {
!         const uint16 *middle;
!         int            difference;

!         middle = low + (high - low) / 2;
!         difference = strcmp(kw_string + *middle, text);
!         if (difference == 0)
!             return ScanCKeywordTokens[middle - kw_offsets];
!         else if (difference < 0)
!             low = middle + 1;
!         else
!             high = middle - 1;
!     }

      return -1;
  }
--- 30,71 ----
   *
   * Returns the token value of the keyword, or -1 if no match.
   *
!  * Do a hash search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
  int
! ScanCKeywordLookup(const char *str)
  {
!     size_t        len;
!     int            h;
!     const char *kw;

!     /*
!      * Reject immediately if too long to be any keyword.  This saves useless
!      * hashing work on long strings.
!      */
!     len = strlen(str);
!     if (len > ScanCKeywords.max_kw_len)
!         return -1;

!     /*
!      * Compute the hash function.  Since it's a perfect hash, we need only
!      * match to the specific keyword it identifies.
!      */
!     h = ScanCKeywords_hash_func(str, len);

!     /*
!      * An out-of-range result implies no match.  (This can happen for
!      * non-keyword inputs because the hash function will sum two unrelated
!      * hashtable entries.)
!      */
!     if (h < 0 || h >= ScanCKeywords.num_keywords)
!         return -1;

!     kw = GetScanKeyword(h, &ScanCKeywords);
!
!     if (strcmp(kw, str) == 0)
!         return ScanCKeywordTokens[h];

      return -1;
  }
diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile
index 9dd4a74..8a0f294 100644
*** a/src/pl/plpgsql/src/Makefile
--- b/src/pl/plpgsql/src/Makefile
*************** REGRESS_OPTS = --dbname=$(PL_TESTDB)
*** 29,35 ****
  REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \
      plpgsql_cache plpgsql_transaction plpgsql_varprops

! GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl

  all: all-lib

--- 29,38 ----
  REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \
      plpgsql_cache plpgsql_transaction plpgsql_varprops

! # where to find gen_keywordlist.pl and subsidiary files
! TOOLSDIR = $(top_srcdir)/src/tools
! GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl
! GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm

  all: all-lib

*************** plerrcodes.h: $(top_srcdir)/src/backend/
*** 76,86 ****
      $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@

  # generate keyword headers for the scanner
! pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $<

! pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $<


  check: submake
--- 79,89 ----
      $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@

  # generate keyword headers for the scanner
! pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST_DEPS)
!     $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $<

! pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST_DEPS)
!     $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $<


  check: submake
diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm
index ...34d55cf .
*** a/src/tools/PerfectHash.pm
--- b/src/tools/PerfectHash.pm
***************
*** 0 ****
--- 1,336 ----
+ #----------------------------------------------------------------------
+ #
+ # PerfectHash.pm
+ #    Perl module that constructs minimal perfect hash functions
+ #
+ # This code constructs a minimal perfect hash function for the given
+ # set of keys, using an algorithm described in
+ # "An optimal algorithm for generating minimal perfect hash functions"
+ # by Czech, Havas and Majewski in Information Processing Letters,
+ # 43(5):256-264, October 1992.
+ # This implementation is loosely based on NetBSD's "nbperf",
+ # which was written by Joerg Sonnenberger.
+ #
+ # The resulting hash function is perfect in the sense that if the presented
+ # key is one of the original set, it will return the key's index in the set
+ # (in range 0..N-1).  However, the caller must still verify the match,
+ # as false positives are possible.  Also, the hash function may return
+ # values that are out of range (negative, or >= N).  This indicates that
+ # the presented key is definitely not in the set.
+ #
+ #
+ # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ # Portions Copyright (c) 1994, Regents of the University of California
+ #
+ # src/tools/PerfectHash.pm
+ #
+ #----------------------------------------------------------------------
+
+ package PerfectHash;
+
+ use strict;
+ use warnings;
+ use Exporter 'import';
+
+ our @EXPORT_OK = qw(
+   generate_hash_function
+ );
+
+
+ # At runtime, we'll compute two simple hash functions of the input key,
+ # and use them to index into a mapping table.  The hash functions are just
+ # multiply-and-add in uint32 arithmetic, with different multipliers but
+ # the same initial seed.  All the complexity in this module is concerned
+ # with selecting hash parameters that will work and building the mapping
+ # table.
+
+ # We support making case-insensitive hash functions, though this only
+ # works for a strict-ASCII interpretation of case insensitivity,
+ # ie, A-Z maps onto a-z and nothing else.
+ my $case_insensitive = 0;
+
+
+ #
+ # Construct a C function implementing a perfect hash for the given keys.
+ # The C function definition is returned as a string.
+ #
+ # The keys can be any set of Perl strings; it is caller's responsibility
+ # that there not be any duplicates.  (Note that the "strings" can be
+ # binary data, but endianness is the caller's problem.)
+ #
+ # The name to use for the function is caller-specified, but its signature
+ # is always "int f(const void *key, size_t keylen)".  The caller may
+ # prepend "static " to the result string if it wants a static function.
+ #
+ # If $ci is true, the function is case-insensitive, for the limited idea
+ # of case-insensitivity explained above.
+ #
+ sub generate_hash_function
+ {
+     my ($keys_ref, $funcname, $ci) = @_;
+
+     # It's not worth passing this around as a parameter; just use a global.
+     $case_insensitive = $ci;
+
+     # Try different hash function parameters until we find a set that works
+     # for these keys.  In principle we might need to change multipliers,
+     # but these two multipliers are chosen to be cheap to calculate via
+     # shift-and-add, so don't change them except at great need.
+     my $hash_mult1 = 31;
+     my $hash_mult2 = 37;
+
+     # We just try successive hash seed values until we find one that works.
+     # (Commonly, random seeds are tried, but we want reproducible results
+     # from this program so we don't do that.)
+     my $hash_seed;
+     my @subresult;
+     for ($hash_seed = 0;; $hash_seed++)
+     {
+         @subresult =
+           _construct_hash_table($keys_ref, $hash_mult1, $hash_mult2,
+             $hash_seed);
+         last if @subresult;
+     }
+
+     # Extract info from the function result array.
+     my $elemtype = $subresult[0];
+     my @hashtab  = @{ $subresult[1] };
+     my $nhash    = scalar(@hashtab);
+
+     # OK, construct the hash function definition including the hash table.
+     my $f = '';
+     $f .= sprintf "int\n";
+     $f .= sprintf "%s(const void *key, size_t keylen)\n{\n", $funcname;
+     $f .= sprintf "\tstatic const %s h[%d] = {\n", $elemtype, $nhash;
+     for (my $i = 0; $i < $nhash; $i++)
+     {
+         $f .= sprintf "%s%6d,%s",
+           ($i % 8 == 0 ? "\t\t" : " "),
+           $hashtab[$i],
+           ($i % 8 == 7 ? "\n" : "");
+     }
+     $f .= sprintf "\n" if ($nhash % 8 != 0);
+     $f .= sprintf "\t};\n\n";
+     $f .= sprintf "\tconst unsigned char *k = key;\n";
+     $f .= sprintf "\tuint32\t\ta = %d;\n",   $hash_seed;
+     $f .= sprintf "\tuint32\t\tb = %d;\n\n", $hash_seed;
+     $f .= sprintf "\twhile (keylen--)\n\t{\n";
+     $f .= sprintf "\t\tunsigned char c = *k++";
+     $f .= sprintf " | 0x20" if $case_insensitive;    # see comment below
+     $f .= sprintf ";\n\n";
+     $f .= sprintf "\t\ta = a * %d + c;\n", $hash_mult1;
+     $f .= sprintf "\t\tb = b * %d + c;\n", $hash_mult2;
+     $f .= sprintf "\t}\n";
+     $f .= sprintf "\treturn h[a %% %d] + h[b %% %d];\n", $nhash, $nhash;
+     $f .= sprintf "}\n";
+
+     return $f;
+ }
+
+
+ # Calculate a hash function as the run-time code will do.
+ #
+ # If we are making a case-insensitive hash function, we implement that
+ # by OR'ing 0x20 into each byte of the key.  This correctly transforms
+ # upper-case ASCII into lower-case ASCII, while not changing digits or
+ # dollar signs.  (It does change '_', else we could just skip adjusting
+ # $cn here at all, for typical keyword strings.)
+ sub _calc_hash
+ {
+     my ($key, $mult, $seed) = @_;
+
+     my $result = $seed;
+     for my $c (split //, $key)
+     {
+         my $cn = ord($c);
+         $cn |= 0x20 if $case_insensitive;
+         $result = ($result * $mult + $cn) % 4294967296;
+     }
+     return $result;
+ }
+
+
+ # Attempt to construct a mapping table for a minimal perfect hash function
+ # for the given keys, using the specified hash parameters.
+ #
+ # Returns an array containing the mapping table element type name as the
+ # first element, and a ref to an array of the table values as the second.
+ #
+ # Returns an empty array on failure; then caller should choose different
+ # hash parameter(s) and try again.
+ sub _construct_hash_table
+ {
+     my ($keys_ref, $hash_mult1, $hash_mult2, $hash_seed) = @_;
+     my @keys = @{$keys_ref};
+
+     # This algorithm is based on a graph whose edges correspond to the
+     # keys and whose vertices correspond to entries of the mapping table.
+     # A key edge links the two vertices whose indexes are the outputs of
+     # the two hash functions for that key.  For K keys, the mapping
+     # table must have at least 2*K+1 entries, guaranteeing that there's at
+     # least one unused entry.  (In principle, larger mapping tables make it
+     # easier to find a workable hash and increase the number of inputs that
+     # can be rejected due to touching unused hashtable entries.  In practice,
+     # neither effect seems strong enough to justify using a larger table.)
+     my $nedges = scalar @keys;       # number of edges
+     my $nverts = 2 * $nedges + 1;    # number of vertices
+
+     # Initialize the array of edges.
+     my @E = ();
+     foreach my $kw (@keys)
+     {
+         # Calculate hashes for this key.
+         # The hashes are immediately reduced modulo the mapping table size.
+         my $hash1 = _calc_hash($kw, $hash_mult1, $hash_seed) % $nverts;
+         my $hash2 = _calc_hash($kw, $hash_mult2, $hash_seed) % $nverts;
+
+         # If the two hashes are the same for any key, we have to fail
+         # since this edge would itself form a cycle in the graph.
+         return () if $hash1 == $hash2;
+
+         # Add the edge for this key.
+         push @E, { left => $hash1, right => $hash2 };
+     }
+
+     # Initialize the array of vertices, giving them all empty lists
+     # of associated edges.  (The lists will be hashes of edge numbers.)
+     my @V = ();
+     for (my $v = 0; $v < $nverts; $v++)
+     {
+         push @V, { edges => {} };
+     }
+
+     # Insert each edge in the lists of edges using its vertices.
+     for (my $e = 0; $e < $nedges; $e++)
+     {
+         my $v = $E[$e]{left};
+         $V[$v]{edges}->{$e} = 1;
+
+         $v = $E[$e]{right};
+         $V[$v]{edges}->{$e} = 1;
+     }
+
+     # Now we attempt to prove the graph acyclic.
+     # A cycle-free graph is either empty or has some vertex of degree 1.
+     # Removing the edge attached to that vertex doesn't change this property,
+     # so doing that repeatedly will reduce the size of the graph.
+     # If the graph is empty at the end of the process, it was acyclic.
+     # We track the order of edge removal so that the next phase can process
+     # them in reverse order of removal.
+     my @output_order = ();
+
+     # Consider each vertex as a possible starting point for edge-removal.
+     for (my $startv = 0; $startv < $nverts; $startv++)
+     {
+         my $v = $startv;
+
+         # If vertex v is of degree 1 (i.e. exactly 1 edge connects to it),
+         # remove that edge, and then consider the edge's other vertex to see
+         # if it is now of degree 1.  The inner loop repeats until reaching a
+         # vertex not of degree 1.
+         while (scalar(keys(%{ $V[$v]{edges} })) == 1)
+         {
+             # Unlink its only edge.
+             my $e = (keys(%{ $V[$v]{edges} }))[0];
+             delete($V[$v]{edges}->{$e});
+
+             # Unlink the edge from its other vertex, too.
+             my $v2 = $E[$e]{left};
+             $v2 = $E[$e]{right} if ($v2 == $v);
+             delete($V[$v2]{edges}->{$e});
+
+             # Push e onto the front of the output-order list.
+             unshift @output_order, $e;
+
+             # Consider v2 on next iteration of inner loop.
+             $v = $v2;
+         }
+     }
+
+     # We succeeded only if all edges were removed from the graph.
+     return () if (scalar(@output_order) != $nedges);
+
+     # OK, build the hash table of size $nverts.
+     my @hashtab = (0) x $nverts;
+     # We need a "visited" flag array in this step, too.
+     my @visited = (0) x $nverts;
+
+     # The idea is that for any key, the sum of the hash table entries
+     # for its first and second hash values is the desired output (i.e., the
+     # key number).  By assigning hash table values in the selected edge
+     # order, we can guarantee that that's true.
+     foreach my $e (@output_order)
+     {
+         my $l = $E[$e]{left};
+         my $r = $E[$e]{right};
+         if (!$visited[$l])
+         {
+             # $hashtab[$r] might be zero, or some previously assigned value.
+             $hashtab[$l] = $e - $hashtab[$r];
+         }
+         else
+         {
+             die "oops, doubly used hashtab entry" if $visited[$r];
+             # $hashtab[$l] might be zero, or some previously assigned value.
+             $hashtab[$r] = $e - $hashtab[$l];
+         }
+         # Now freeze both of these hashtab entries.
+         $visited[$l] = 1;
+         $visited[$r] = 1;
+     }
+
+     # Detect range of values needed in hash table.
+     my $hmin = $nedges;
+     my $hmax = 0;
+     for (my $v = 0; $v < $nverts; $v++)
+     {
+         $hmin = $hashtab[$v] if $hashtab[$v] < $hmin;
+         $hmax = $hashtab[$v] if $hashtab[$v] > $hmax;
+     }
+
+     # Choose width of hashtable entries.  In addition to the actual values,
+     # we need to be able to store a flag for unused entries, and we wish to
+     # have the property that adding any other entry value to the flag gives
+     # an out-of-range result (>= $nedges).
+     my $elemtype;
+     my $unused_flag;
+
+     if (   $hmin >= -0x7F
+         && $hmax <= 0x7F
+         && $hmin + 0x7F >= $nedges)
+     {
+         # int8 will work
+         $elemtype    = 'int8';
+         $unused_flag = 0x7F;
+     }
+     elsif ($hmin >= -0x7FFF
+         && $hmax <= 0x7FFF
+         && $hmin + 0x7FFF >= $nedges)
+     {
+         # int16 will work
+         $elemtype    = 'int16';
+         $unused_flag = 0x7FFF;
+     }
+     elsif ($hmin >= -0x7FFFFFFF
+         && $hmax <= 0x7FFFFFFF
+         && $hmin + 0x3FFFFFFF >= $nedges)
+     {
+         # int32 will work
+         $elemtype    = 'int32';
+         $unused_flag = 0x3FFFFFFF;
+     }
+     else
+     {
+         die "hash table values too wide";
+     }
+
+     # Set any unvisited hashtable entries to $unused_flag.
+     for (my $v = 0; $v < $nverts; $v++)
+     {
+         $hashtab[$v] = $unused_flag if !$visited[$v];
+     }
+
+     return ($elemtype, \@hashtab);
+ }
+
+ 1;
diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl
index d764aff..e912c3e 100644
*** a/src/tools/gen_keywordlist.pl
--- b/src/tools/gen_keywordlist.pl
***************
*** 14,19 ****
--- 14,25 ----
  # variable named according to the -v switch ("ScanKeywords" by default).
  # The variable is marked "static" unless the -e switch is given.
  #
+ # ScanKeywordList uses hash-based lookup, so this script also selects
+ # a minimal perfect hash function for the keyword set, and emits a
+ # static hash function that is referenced in the ScanKeywordList struct.
+ # The hash function is case-insensitive unless --case is specified.
+ # Note that case insensitivity assumes all-ASCII keywords!
+ #
  #
  # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
  # Portions Copyright (c) 1994, Regents of the University of California
***************
*** 25,39 ****
  use strict;
  use warnings;
  use Getopt::Long;

  my $output_path = '';
  my $extern = 0;
  my $varname = 'ScanKeywords';

  GetOptions(
!     'output:s' => \$output_path,
!     'extern'   => \$extern,
!     'varname:s' => \$varname) || usage();

  my $kw_input_file = shift @ARGV || die "No input file.\n";

--- 31,48 ----
  use strict;
  use warnings;
  use Getopt::Long;
+ use PerfectHash;

  my $output_path = '';
  my $extern = 0;
+ my $case_sensitive = 0;
  my $varname = 'ScanKeywords';

  GetOptions(
!     'output:s'       => \$output_path,
!     'extern'         => \$extern,
!     'case-sensitive' => \$case_sensitive,
!     'varname:s'      => \$varname) || usage();

  my $kw_input_file = shift @ARGV || die "No input file.\n";

*************** while (<$kif>)
*** 87,93 ****
--- 96,117 ----
      }
  }

+ # When being case-insensitive, insist that the input be all-lower-case.
+ if (!$case_sensitive)
+ {
+     foreach my $kw (@keywords)
+     {
+         die qq|The keyword "$kw" is not lower-case in $kw_input_file\n|
+           if ($kw ne lc $kw);
+     }
+ }
+
  # Error out if the keyword names are not in ASCII order.
+ #
+ # While this isn't really necessary with hash-based lookup, it's still
+ # helpful because it provides a cheap way to reject duplicate keywords.
+ # Also, insisting on sorted order ensures that code that scans the keyword
+ # table linearly will see the keywords in a canonical order.
  for my $i (0..$#keywords - 1)
  {
      die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n|
*************** print $kwdef "};\n\n";
*** 128,142 ****

  printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords;

  # Emit the struct that wraps all this lookup info into one variable.

! print $kwdef "static " if !$extern;
  printf $kwdef "const ScanKeywordList %s = {\n", $varname;
  printf $kwdef qq|\t%s_kw_string,\n|, $varname;
  printf $kwdef qq|\t%s_kw_offsets,\n|, $varname;
  printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname;
  printf $kwdef qq|\t%d\n|, $max_len;
! print $kwdef "};\n\n";

  printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename;

--- 152,176 ----

  printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords;

+ # Emit the definition of the hash function.
+
+ my $funcname = $varname . "_hash_func";
+
+ my $f = PerfectHash::generate_hash_function(\@keywords,
+     $funcname, !$case_sensitive);
+
+ printf $kwdef qq|static %s\n|, $f;
+
  # Emit the struct that wraps all this lookup info into one variable.

! printf $kwdef "static " if !$extern;
  printf $kwdef "const ScanKeywordList %s = {\n", $varname;
  printf $kwdef qq|\t%s_kw_string,\n|, $varname;
  printf $kwdef qq|\t%s_kw_offsets,\n|, $varname;
+ printf $kwdef qq|\t%s,\n|, $funcname;
  printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname;
  printf $kwdef qq|\t%d\n|, $max_len;
! printf $kwdef "};\n\n";

  printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename;

*************** Usage: gen_keywordlist.pl [--output/-o <
*** 148,153 ****
--- 182,188 ----
      --output   Output directory (default '.')
      --varname  Name for ScanKeywordList variable (default 'ScanKeywords')
      --extern   Allow the ScanKeywordList variable to be globally visible
+     --case     Keyword matching is to be case-sensitive

  gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList.
  The output filename is derived from the input file by inserting _d,
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 937bf18..8f54e45 100644
*** a/src/tools/msvc/Solution.pm
--- b/src/tools/msvc/Solution.pm
*************** sub GenerateFiles
*** 414,420 ****
              'src/include/parser/kwlist.h'))
      {
          print "Generating kwlist_d.h...\n";
!         system('perl src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h');
      }

      if (IsNewer(
--- 414,420 ----
              'src/include/parser/kwlist.h'))
      {
          print "Generating kwlist_d.h...\n";
!         system('perl -I src/tools src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h');
      }

      if (IsNewer(
*************** sub GenerateFiles
*** 426,433 ****
      {
          print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n";
          chdir('src/pl/plpgsql/src');
!         system('perl ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h');
!         system('perl ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h');
          chdir('../../../..');
      }

--- 426,433 ----
      {
          print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n";
          chdir('src/pl/plpgsql/src');
!         system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h');
!         system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h');
          chdir('../../../..');
      }

*************** sub GenerateFiles
*** 440,447 ****
      {
          print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n";
          chdir('src/interfaces/ecpg/preproc');
!         system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h');
!         system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h');
          chdir('../../../..');
      }

--- 440,447 ----
      {
          print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n";
          chdir('src/interfaces/ecpg/preproc');
!         system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ScanCKeywords --case c_kwlist.h');
!         system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h');
          chdir('../../../..');
      }


Andres Freund <andres@anarazel.de> writes:
> On 2019-01-08 13:41:16 -0500, John Naylor wrote:
>> Do you mean the fmgr table?

> Not the entire fmgr table, but just the builtin oid index, generated by
> the following section:
> ...
> The generated fmgr_builtin_oid_index is pretty sparse, and a more dense
> hashtable might e.g. be more efficient from a cache perspective.

I experimented with this, but TBH I think it's a dead loss.  We currently
have 2768 built-in functions, so the perfect hash table requires 5537
int16 entries, which is not *that* much less than the 10000 entries
that are in fmgr_builtin_oid_index presently.  When you consider the
extra cycles needed to do the hashing, and the fact that you have to
touch (usually) two cache lines not one in the lookup table, it's hard
to see how this could net out as a win performance-wise.

Also, I fail to understand why fmgr_builtin_oid_index has 10000 entries
anyway.  We could easily have fmgrtab.c expose the last actually assigned
builtin function OID (presently 6121) and make the index array only
that big, which just about eliminates the space advantage completely.
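
A quick back-of-the-envelope check of the sizes being compared here
(the constants are the figures quoted in this message; sketch only):

```python
# Byte sizes of the competing lookup structures, using the figures above.
n_builtins = 2768         # built-in functions at the time of writing
first_genbki = 10000      # FirstGenbkiObjectId
last_builtin_oid = 6121   # highest assigned builtin function OID

hash_entries = 2 * n_builtins + 1       # CHM mapping table needs 2*K+1 slots
perfect_hash_bytes = hash_entries * 2   # int16 entries
current_index_bytes = first_genbki * 2  # today's uint16 fmgr_builtin_oid_index
shrunk_index_bytes = (last_builtin_oid + 1) * 2

print(perfect_hash_bytes, current_index_bytes, shrunk_index_bytes)
```

That is 11074 bytes for the hash table against 20000 today, but only
about 1.1 kB less than the 12244 bytes of a shrunk index array.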

BTW, I found out while trying this that Joerg's fear of the hash
multipliers being too simplistic is valid: the perfect hash generator
failed until I changed them.  I picked a larger value that should be
just as easy to use for shift-and-add purposes.

            regards, tom lane

diff --git a/src/backend/utils/Gen_fmgrtab.pl b/src/backend/utils/Gen_fmgrtab.pl
index cafe408..9fceb60 100644
*** a/src/backend/utils/Gen_fmgrtab.pl
--- b/src/backend/utils/Gen_fmgrtab.pl
*************** use Catalog;
*** 18,23 ****
--- 18,24 ----

  use strict;
  use warnings;
+ use PerfectHash;

  # Collect arguments
  my @input_files;
*************** foreach my $s (sort { $a->{oid} <=> $b->
*** 219,237 ****
      print $pfh "extern Datum $s->{prosrc}(PG_FUNCTION_ARGS);\n";
  }

! # Create the fmgr_builtins table, collect data for fmgr_builtin_oid_index
  print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n";
  my %bmap;
  $bmap{'t'} = 'true';
  $bmap{'f'} = 'false';
! my @fmgr_builtin_oid_index;
  my $fmgr_count = 0;
  foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr)
  {
      print $tfh
        "  { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }";

!     $fmgr_builtin_oid_index[ $s->{oid} ] = $fmgr_count++;

      if ($fmgr_count <= $#fmgr)
      {
--- 220,244 ----
      print $pfh "extern Datum $s->{prosrc}(PG_FUNCTION_ARGS);\n";
  }

! # Create the fmgr_builtins table, collect data for hash function
  print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n";
  my %bmap;
  $bmap{'t'} = 'true';
  $bmap{'f'} = 'false';
! my @fmgr_builtin_oids;
! my $prev_oid = 0;
  my $fmgr_count = 0;
  foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr)
  {
      print $tfh
        "  { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }";

!     die "duplicate OIDs" if $s->{oid} <= $prev_oid;
!     $prev_oid = $s->{oid};
!
!     push @fmgr_builtin_oids, pack("n", $s->{oid});
!
!     $fmgr_count++;

      if ($fmgr_count <= $#fmgr)
      {
*************** print $tfh "};\n";
*** 246,283 ****

  print $tfh qq|
  const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin));
- |;

-
- # Create fmgr_builtins_oid_index table.
- #
- # Note that the array has to be filled up to FirstGenbkiObjectId,
- # as we can't rely on zero initialization as 0 is a valid mapping.
- print $tfh qq|
- const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId] = {
  |;

- for (my $i = 0; $i < $FirstGenbkiObjectId; $i++)
- {
-     my $oid = $fmgr_builtin_oid_index[$i];

!     # fmgr_builtin_oid_index is sparse, map nonexistant functions to
!     # InvalidOidBuiltinMapping
!     if (not defined $oid)
!     {
!         $oid = 'InvalidOidBuiltinMapping';
!     }

!     if ($i + 1 == $FirstGenbkiObjectId)
!     {
!         print $tfh "  $oid\n";
!     }
!     else
!     {
!         print $tfh "  $oid,\n";
!     }
! }
! print $tfh "};\n";


  # And add the file footers.
--- 253,267 ----

  print $tfh qq|
  const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin));

  |;


! # Create perfect hash function for searching fmgr_builtin by OID.

! print $tfh PerfectHash::generate_hash_function(\@fmgr_builtin_oids,
!                            "fmgr_builtin_oid_hash",
!                            0);


  # And add the file footers.
diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
index b41649f..ad93032 100644
*** a/src/backend/utils/fmgr/fmgr.c
--- b/src/backend/utils/fmgr/fmgr.c
*************** extern Datum fmgr_security_definer(PG_FU
*** 72,92 ****
  static const FmgrBuiltin *
  fmgr_isbuiltin(Oid id)
  {
!     uint16        index;

      /* fast lookup only possible if original oid still assigned */
      if (id >= FirstGenbkiObjectId)
          return NULL;

      /*
!      * Lookup function data. If there's a miss in that range it's likely a
!      * nonexistant function, returning NULL here will trigger an ERROR later.
       */
!     index = fmgr_builtin_oid_index[id];
!     if (index == InvalidOidBuiltinMapping)
          return NULL;

!     return &fmgr_builtins[index];
  }

  /*
--- 72,103 ----
  static const FmgrBuiltin *
  fmgr_isbuiltin(Oid id)
  {
!     const FmgrBuiltin *result;
!     uint16        hashkey;
!     int index;

      /* fast lookup only possible if original oid still assigned */
      if (id >= FirstGenbkiObjectId)
          return NULL;

      /*
!      * Lookup function data.  The hash key for this is the low-order 16 bits
!      * of the OID, in network byte order.
       */
!     hashkey = htons(id);
!     index = fmgr_builtin_oid_hash(&hashkey, sizeof(hashkey));
!
!     /* Out-of-range hash result means definitely no match */
!     if (index < 0 || index >= fmgr_nbuiltins)
          return NULL;

!     result = &fmgr_builtins[index];
!
!     /* We have to verify the match, though */
!     if (id != result->foid)
!         return NULL;
!
!     return result;
  }

  /*
diff --git a/src/include/utils/fmgrtab.h b/src/include/utils/fmgrtab.h
index a778f88..f27aff5 100644
*** a/src/include/utils/fmgrtab.h
--- b/src/include/utils/fmgrtab.h
*************** extern const FmgrBuiltin fmgr_builtins[]
*** 36,46 ****

  extern const int fmgr_nbuiltins;    /* number of entries in table */

! /*
!  * Mapping from a builtin function's oid to the index in the fmgr_builtins
!  * array.
!  */
! #define InvalidOidBuiltinMapping PG_UINT16_MAX
! extern const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId];

  #endif                            /* FMGRTAB_H */
--- 36,41 ----

  extern const int fmgr_nbuiltins;    /* number of entries in table */

! extern int fmgr_builtin_oid_hash(const void *key, size_t keylen);

  #endif                            /* FMGRTAB_H */
diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm
index 34d55cf..862357b 100644
*** a/src/tools/PerfectHash.pm
--- b/src/tools/PerfectHash.pm
*************** sub generate_hash_function
*** 77,83 ****
      # but these two multipliers are chosen to be cheap to calculate via
      # shift-and-add, so don't change them except at great need.
      my $hash_mult1 = 31;
!     my $hash_mult2 = 37;

      # We just try successive hash seed values until we find one that works.
      # (Commonly, random seeds are tried, but we want reproducible results
--- 77,83 ----
      # but these two multipliers are chosen to be cheap to calculate via
      # shift-and-add, so don't change them except at great need.
      my $hash_mult1 = 31;
!     my $hash_mult2 = 1029;

      # We just try successive hash seed values until we find one that works.
      # (Commonly, random seeds are tried, but we want reproducible results
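
The replacement multiplier keeps the shift-and-add property mentioned
above; a quick check of the decompositions (Python used here just to
verify the arithmetic, since the C compiler performs this strength
reduction on its own):

```python
# The multipliers used by the generated hash functions decompose into a
# handful of shifts and adds, so "x * mult" stays cheap even without a
# fast hardware multiplier.
def mult31(x):
    return (x << 5) - x                # 31 = 32 - 1

def mult1029(x):
    return (x << 10) + (x << 2) + x    # 1029 = 1024 + 4 + 1

def mult2053(x):
    return (x << 11) + (x << 2) + x    # 2053 = 2048 + 4 + 1
```

(2053, the multiplier appearing in the standalone PerfectHash.pm later
in the thread, has the same 2^k + 4 + 1 shape as 1029.)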

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On Tue, Jan 8, 2019 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I'll take a crack at separating into a module.  I'll wait a bit in
> > case there are any stylistic suggestions on the patch as it stands.
>
> I had a go at that myself.  I'm sure there's plenty to criticize in
> the result, but at least it passes make check-world ;-)

Just a couple comments about the module:

-If you qualify the function name with its module as you did
(PerfectHash::generate_hash_function), you don't have to export the
function into the caller's namespace, so you can skip the @EXPORT_OK
setting. Most of our modules don't export.

-There is a bit of a cognitive clash between $case_sensitive in
gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each
make sense in their own file, but might it be worth using one or the
other?

-As for the graph algorithm, I'd have to play with it to understand
how it works.


In the committed keyword patch, I noticed that in common/keywords.c,
the array length is defined with

ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS]

but other keyword arrays just have ...[]. Is there a reason for the difference?


John Naylor <john.naylor@2ndquadrant.com> writes:
> Just a couple comments about the module:

> -If you qualify the function name with its module as you did
> (PerfectHash::generate_hash_function), you don't have to export the
> function into the caller's namespace, so you can skip the @EXPORT_OK
> setting. Most of our modules don't export.

OK by me.  I was more concerned about hiding the stuff that isn't
supposed to be exported.

> -There is a bit of a cognitive clash between $case_sensitive in
> gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each
> make sense in their own file, but might it be worth using one or the
> other?

Yeah, dunno.  It seems to make sense for the command-line-level default of
gen_keywordlist.pl to be "case insensitive", since most users want that.
But that surely shouldn't be the default in PerfectHash.pm, and I'm not
very sure how to reconcile the discrepancy.

> In the committed keyword patch, I noticed that in common/keywords.c,
> the array length is defined with
> ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS]
> but other keyword arrays just have ...[]. Is there a reason for the difference?

The length macro was readily available there so I used it.  AFAIR
that wasn't true elsewhere, though I might've missed something.
It's pretty much just belt-and-suspenders coding anyway, since all
those arrays are machine generated ...

            regards, tom lane


John Naylor <john.naylor@2ndquadrant.com> writes:
> -As for the graph algorithm, I'd have to play with it to understand
> how it works.

I improved the comment about why the hash table entry assignment
works.  One thing I'm not clear about myself is

    # A cycle-free graph is either empty or has some vertex of degree 1.

That sounds like a standard graph theory result, but I'm not familiar
with a proof for it.
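
One standard argument, counting an edgeless graph as empty:

```latex
Let $P = v_0 v_1 \cdots v_k$ be a simple path of maximum length in a
finite acyclic graph with at least one edge, so $k \ge 1$.  Every
neighbor of $v_0$ lies on $P$, since otherwise $P$ could be extended.
If $\deg(v_0) \ge 2$, then $v_0$ is adjacent to some $v_i$ with
$i \ge 2$, and $v_0 v_1 \cdots v_i v_0$ is a cycle, a contradiction.
Hence $\deg(v_0) = 1$.
```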

            regards, tom lane
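
For anyone who wants to play with the graph construction, here is a
compact Python sketch of the same CHM scheme.  It is illustrative only,
not a port of the committed Perl: table values are folded modulo the key
count instead of being emitted as signed entries with an unused-slot
flag, so the early rejection of unseen keys is lost.

```python
# Illustrative CHM (Czech/Havas/Majewski) construction: keys become graph
# edges between their two hash values; if the graph is acyclic, peeling
# degree-1 vertices yields an order in which table values can be assigned
# so that g[h1(k)] + g[h2(k)] recovers k's index.

def chm_hash(key, mult, seed, nverts):
    h = seed
    for c in key.encode("ascii"):
        h = (h * mult + (c | 0x20)) & 0xFFFFFFFF  # case-insensitive, ASCII only
    return h % nverts

def build_table(keys, mult1=31, mult2=37, seed=0):
    n = len(keys)
    nverts = 2 * n + 1  # guarantees at least one unused vertex
    edges = []
    for k in keys:
        a = chm_hash(k, mult1, seed, nverts)
        b = chm_hash(k, mult2, seed, nverts)
        if a == b:      # self-loop: this seed cannot work
            return None
        edges.append((a, b))
    incident = [set() for _ in range(nverts)]
    for e, (a, b) in enumerate(edges):
        incident[a].add(e)
        incident[b].add(e)
    order = []          # edges in removal order
    for start in range(nverts):
        v = start
        while len(incident[v]) == 1:   # peel degree-1 vertices
            e = incident[v].pop()
            a, b = edges[e]
            other = b if v == a else a
            incident[other].discard(e)
            order.append(e)
            v = other
    if len(order) != n:                # a cycle survived; try another seed
        return None
    g = [0] * nverts
    visited = [False] * nverts
    for e in reversed(order):          # reverse removal order
        a, b = edges[e]
        if not visited[a]:
            g[a] = (e - g[b]) % n
        else:
            g[b] = (e - g[a]) % n
        visited[a] = visited[b] = True
    return g

def lookup(keys, g, key, mult1=31, mult2=37, seed=0):
    nverts = len(g)
    i = (g[chm_hash(key, mult1, seed, nverts)]
         + g[chm_hash(key, mult2, seed, nverts)]) % len(keys)
    return i if keys[i] == key.lower() else None  # caller must verify the match

KEYWORDS = ["abort", "between", "cascade", "delete", "explain"]
table, good_seed = None, None
for s in range(1000):   # reproducible seed search, as in the Perl module
    table = build_table(KEYWORDS, seed=s)
    if table is not None:
        good_seed = s
        break
```

With the table built, (g[h1(key)] + g[h2(key)]) mod N lands exactly on
the key's index, and a final comparison confirms the match, just as
fmgr_isbuiltin() and the keyword lookup must do.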

#----------------------------------------------------------------------
#
# PerfectHash.pm
#    Perl module that constructs minimal perfect hash functions
#
# This code constructs a minimal perfect hash function for the given
# set of keys, using an algorithm described in
# "An optimal algorithm for generating minimal perfect hash functions"
# by Czech, Havas and Majewski in Information Processing Letters,
# 43(5):256-264, October 1992.
# This implementation is loosely based on NetBSD's "nbperf",
# which was written by Joerg Sonnenberger.
#
# The resulting hash function is perfect in the sense that if the presented
# key is one of the original set, it will return the key's index in the set
# (in range 0..N-1).  However, the caller must still verify the match,
# as false positives are possible.  Also, the hash function may return
# values that are out of range (negative, or >= N).  This indicates that
# the presented key is definitely not in the set.
#
#
# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
# Portions Copyright (c) 1994, Regents of the University of California
#
# src/tools/PerfectHash.pm
#
#----------------------------------------------------------------------

package PerfectHash;

use strict;
use warnings;


# At runtime, we'll compute two simple hash functions of the input key,
# and use them to index into a mapping table.  The hash functions are just
# multiply-and-add in uint32 arithmetic, with different multipliers but
# the same initial seed.  All the complexity in this module is concerned
# with selecting hash parameters that will work and building the mapping
# table.

# We support making case-insensitive hash functions, though this only
# works for a strict-ASCII interpretation of case insensitivity,
# ie, A-Z maps onto a-z and nothing else.
my $case_insensitive = 0;


#
# Construct a C function implementing a perfect hash for the given keys.
# The C function definition is returned as a string.
#
# The keys can be any set of Perl strings; it is caller's responsibility
# that there not be any duplicates.  (Note that the "strings" can be
# binary data, but endianness is the caller's problem.)
#
# The name to use for the function is caller-specified, but its signature
# is always "int f(const void *key, size_t keylen)".  The caller may
# prepend "static " to the result string if it wants a static function.
#
# If $ci is true, the function is case-insensitive, for the limited idea
# of case-insensitivity explained above.
#
sub generate_hash_function
{
    my ($keys_ref, $funcname, $ci) = @_;

    # It's not worth passing this around as a parameter; just use a global.
    $case_insensitive = $ci;

    # Try different hash function parameters until we find a set that works
    # for these keys.  In principle we might need to change multipliers,
    # but these two multipliers are chosen to be primes that are cheap to
    # calculate via shift-and-add, so don't change them without care.
    my $hash_mult1 = 31;
    my $hash_mult2 = 2053;

    # We just try successive hash seed values until we find one that works.
    # (Commonly, random seeds are tried, but we want reproducible results
    # from this program so we don't do that.)
    my $hash_seed;
    my @subresult;
    for ($hash_seed = 0; $hash_seed < 1000; $hash_seed++)
    {
        @subresult =
          _construct_hash_table($keys_ref, $hash_mult1, $hash_mult2,
            $hash_seed);
        last if @subresult;
    }

    # Choke if we didn't succeed in a reasonable number of tries.
    die "failed to generate perfect hash" if !@subresult;

    # Extract info from the function result array.
    my $elemtype = $subresult[0];
    my @hashtab  = @{ $subresult[1] };
    my $nhash    = scalar(@hashtab);

    # OK, construct the hash function definition including the hash table.
    my $f = '';
    $f .= sprintf "int\n";
    $f .= sprintf "%s(const void *key, size_t keylen)\n{\n", $funcname;
    $f .= sprintf "\tstatic const %s h[%d] = {\n", $elemtype, $nhash;
    for (my $i = 0; $i < $nhash; $i++)
    {
        $f .= sprintf "%s%6d,%s",
          ($i % 8 == 0 ? "\t\t" : " "),
          $hashtab[$i],
          ($i % 8 == 7 ? "\n" : "");
    }
    $f .= sprintf "\n" if ($nhash % 8 != 0);
    $f .= sprintf "\t};\n\n";
    $f .= sprintf "\tconst unsigned char *k = key;\n";
    $f .= sprintf "\tuint32\t\ta = %d;\n",   $hash_seed;
    $f .= sprintf "\tuint32\t\tb = %d;\n\n", $hash_seed;
    $f .= sprintf "\twhile (keylen--)\n\t{\n";
    $f .= sprintf "\t\tunsigned char c = *k++";
    $f .= sprintf " | 0x20" if $case_insensitive;    # see comment below
    $f .= sprintf ";\n\n";
    $f .= sprintf "\t\ta = a * %d + c;\n", $hash_mult1;
    $f .= sprintf "\t\tb = b * %d + c;\n", $hash_mult2;
    $f .= sprintf "\t}\n";
    $f .= sprintf "\treturn h[a %% %d] + h[b %% %d];\n", $nhash, $nhash;
    $f .= sprintf "}\n";

    return $f;
}


# Calculate a hash function as the run-time code will do.
#
# If we are making a case-insensitive hash function, we implement that
# by OR'ing 0x20 into each byte of the key.  This correctly transforms
# upper-case ASCII into lower-case ASCII, while not changing digits or
# dollar signs.  (It does change '_', else we could just skip adjusting
# $cn here at all, for typical keyword strings.)
sub _calc_hash
{
    my ($key, $mult, $seed) = @_;

    my $result = $seed;
    for my $c (split //, $key)
    {
        my $cn = ord($c);
        $cn |= 0x20 if $case_insensitive;
        $result = ($result * $mult + $cn) % 4294967296;
    }
    return $result;
}


# Attempt to construct a mapping table for a minimal perfect hash function
# for the given keys, using the specified hash parameters.
#
# Returns an array containing the mapping table element type name as the
# first element, and a ref to an array of the table values as the second.
#
# Returns an empty array on failure; then caller should choose different
# hash parameter(s) and try again.
sub _construct_hash_table
{
    my ($keys_ref, $hash_mult1, $hash_mult2, $hash_seed) = @_;
    my @keys = @{$keys_ref};

    # This algorithm is based on a graph whose edges correspond to the
    # keys and whose vertices correspond to entries of the mapping table.
    # A key's edge links the two vertices whose indexes are the outputs of
    # the two hash functions for that key.  For K keys, the mapping
    # table must have at least 2*K+1 entries, guaranteeing that there's at
    # least one unused entry.  (In principle, larger mapping tables make it
    # easier to find a workable hash and increase the number of inputs that
    # can be rejected due to touching unused hashtable entries.  In practice,
    # neither effect seems strong enough to justify using a larger table.)
    my $nedges = scalar @keys;       # number of edges
    my $nverts = 2 * $nedges + 1;    # number of vertices

    # Initialize the array of edges.
    my @E = ();
    foreach my $kw (@keys)
    {
        # Calculate hashes for this key.
        # The hashes are immediately reduced modulo the mapping table size.
        my $hash1 = _calc_hash($kw, $hash_mult1, $hash_seed) % $nverts;
        my $hash2 = _calc_hash($kw, $hash_mult2, $hash_seed) % $nverts;

        # If the two hashes are the same for any key, we have to fail
        # since this edge would itself form a cycle in the graph.
        return () if $hash1 == $hash2;

        # Add the edge for this key.
        push @E, { left => $hash1, right => $hash2 };
    }

    # Initialize the array of vertices, giving them all empty lists
    # of associated edges.  (The lists will be hashes of edge numbers.)
    my @V = ();
    for (my $v = 0; $v < $nverts; $v++)
    {
        push @V, { edges => {} };
    }

    # Insert each edge in the lists of edges using its vertices.
    for (my $e = 0; $e < $nedges; $e++)
    {
        my $v = $E[$e]{left};
        $V[$v]{edges}->{$e} = 1;

        $v = $E[$e]{right};
        $V[$v]{edges}->{$e} = 1;
    }

    # Now we attempt to prove the graph acyclic.
    # A cycle-free graph is either empty or has some vertex of degree 1.
    # Removing the edge attached to that vertex doesn't change this property,
    # so doing that repeatedly will reduce the size of the graph.
    # If the graph is empty at the end of the process, it was acyclic.
    # We track the order of edge removal so that the next phase can process
    # them in reverse order of removal.
    my @output_order = ();

    # Consider each vertex as a possible starting point for edge-removal.
    for (my $startv = 0; $startv < $nverts; $startv++)
    {
        my $v = $startv;

        # If vertex v is of degree 1 (i.e. exactly 1 edge connects to it),
        # remove that edge, and then consider the edge's other vertex to see
        # if it is now of degree 1.  The inner loop repeats until reaching a
        # vertex not of degree 1.
        while (scalar(keys(%{ $V[$v]{edges} })) == 1)
        {
            # Unlink its only edge.
            my $e = (keys(%{ $V[$v]{edges} }))[0];
            delete($V[$v]{edges}->{$e});

            # Unlink the edge from its other vertex, too.
            my $v2 = $E[$e]{left};
            $v2 = $E[$e]{right} if ($v2 == $v);
            delete($V[$v2]{edges}->{$e});

            # Push e onto the front of the output-order list.
            unshift @output_order, $e;

            # Consider v2 on next iteration of inner loop.
            $v = $v2;
        }
    }

    # We succeeded only if all edges were removed from the graph.
    return () if (scalar(@output_order) != $nedges);

    # OK, build the hash table of size $nverts.
    my @hashtab = (0) x $nverts;
    # We need a "visited" flag array in this step, too.
    my @visited = (0) x $nverts;

    # The goal is that for any key, the sum of the hash table entries for
    # its first and second hash values is the desired output (i.e., the key
    # number).  By assigning hash table values in the selected edge order,
    # we can guarantee that that's true.  This works because the edge first
    # removed from the graph (and hence last to be visited here) must have
    # at least one vertex it shared with no other edge; hence it will have at
    # least one vertex (hashtable entry) still unvisited when we reach it here,
    # and we can assign that unvisited entry a value that makes the sum come
    # out as we wish.  By induction, the same holds for all the other edges.
    foreach my $e (@output_order)
    {
        my $l = $E[$e]{left};
        my $r = $E[$e]{right};
        if (!$visited[$l])
        {
            # $hashtab[$r] might be zero, or some previously assigned value.
            $hashtab[$l] = $e - $hashtab[$r];
        }
        else
        {
            die "oops, doubly used hashtab entry" if $visited[$r];
            # $hashtab[$l] might be zero, or some previously assigned value.
            $hashtab[$r] = $e - $hashtab[$l];
        }
        # Now freeze both of these hashtab entries.
        $visited[$l] = 1;
        $visited[$r] = 1;
    }

    # Detect range of values needed in hash table.
    my $hmin = $nedges;
    my $hmax = 0;
    for (my $v = 0; $v < $nverts; $v++)
    {
        $hmin = $hashtab[$v] if $hashtab[$v] < $hmin;
        $hmax = $hashtab[$v] if $hashtab[$v] > $hmax;
    }

    # Choose width of hashtable entries.  In addition to the actual values,
    # we need to be able to store a flag for unused entries, and we wish to
    # have the property that adding any other entry value to the flag gives
    # an out-of-range result (>= $nedges).
    my $elemtype;
    my $unused_flag;

    if (   $hmin >= -0x7F
        && $hmax <= 0x7F
        && $hmin + 0x7F >= $nedges)
    {
        # int8 will work
        $elemtype    = 'int8';
        $unused_flag = 0x7F;
    }
    elsif ($hmin >= -0x7FFF
        && $hmax <= 0x7FFF
        && $hmin + 0x7FFF >= $nedges)
    {
        # int16 will work
        $elemtype    = 'int16';
        $unused_flag = 0x7FFF;
    }
    elsif ($hmin >= -0x7FFFFFFF
        && $hmax <= 0x7FFFFFFF
        && $hmin + 0x3FFFFFFF >= $nedges)
    {
        # int32 will work
        $elemtype    = 'int32';
        $unused_flag = 0x3FFFFFFF;
    }
    else
    {
        die "hash table values too wide";
    }

    # Set any unvisited hashtable entries to $unused_flag.
    for (my $v = 0; $v < $nverts; $v++)
    {
        $hashtab[$v] = $unused_flag if !$visited[$v];
    }

    return ($elemtype, \@hashtab);
}

1;

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Joerg Sonnenberger
Date:
On Tue, Jan 08, 2019 at 05:53:25PM -0500, Tom Lane wrote:
> John Naylor <john.naylor@2ndquadrant.com> writes:
> > -As for the graph algorithm, I'd have to play with it to understand
> > how it works.
> 
> I improved the comment about how come the hash table entry assignment
> works.  One thing I'm not clear about myself is
> 
>     # A cycle-free graph is either empty or has some vertex of degree 1.
> 
> That sounds like a standard graph theory result, but I'm not familiar
> with a proof for it.

Let's assume the graph is acyclic and non-empty, and that all vertices
have degree > 1. Pick any vertex and construct a path starting from it.
The vertex is connected to at least one other vertex, so follow that
edge. Again, the next vertex must be connected to at least one more
vertex, and the path can't go back to the starting point (since that
would be a cycle). Each subsequent vertex must still have another
connection, and it can't lead back to any already visited vertex.
Continue until you run out of vertices; since the graph is finite, that
is a contradiction.

Joerg


I wrote:
> John Naylor <john.naylor@2ndquadrant.com> writes:
>> -There is a bit of a cognitive clash between $case_sensitive in
>> gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each
>> make sense in their own file, but might it be worth using one or the
>> other?

> Yeah, dunno.  It seems to make sense for the command-line-level default of
> gen_keywordlist.pl to be "case insensitive", since most users want that.
> But that surely shouldn't be the default in PerfectHash.pm, and I'm not
> very sure how to reconcile the discrepancy.

Working on the fmgr-oid-lookup idea gave me the thought that
PerfectHash.pm ought to support fixed-length keys.  Rather than start
adding random parameters to the function, I borrowed an idea from
PostgresNode.pm and made the options be keyword-style parameters.  Now
the impedance mismatch about case sensitivity is handled with 

my $f = PerfectHash::generate_hash_function(\@keywords, $funcname,
    case_insensitive => !$case_sensitive);

which is at least a little clearer than before, though I'm not sure
if it entirely solves the problem.

Also, in view of finding that the original multiplier choices failed
on the fmgr oid problem, I spent a little effort making the code
able to try more combinations of hash multipliers and seeds.  It'd
be nice to have some theory rather than just heuristics about what
will work, though ...

Barring objections or further review, I plan to push this soon.

            regards, tom lane

diff --git a/src/common/Makefile b/src/common/Makefile
index 317b071..d0c2b97 100644
*** a/src/common/Makefile
--- b/src/common/Makefile
*************** OBJS_FRONTEND = $(OBJS_COMMON) fe_memuti
*** 63,68 ****
--- 63,73 ----
  OBJS_SHLIB = $(OBJS_FRONTEND:%.o=%_shlib.o)
  OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o)

+ # where to find gen_keywordlist.pl and subsidiary files
+ TOOLSDIR = $(top_srcdir)/src/tools
+ GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl
+ GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm
+
  all: libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a

  distprep: kwlist_d.h
*************** libpgcommon_srv.a: $(OBJS_SRV)
*** 118,125 ****
      $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@

  # generate SQL keyword lookup table to be included into keywords*.o.
! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(top_srcdir)/src/tools/gen_keywordlist.pl
!     $(PERL) $(top_srcdir)/src/tools/gen_keywordlist.pl --extern $<

  # Dependencies of keywords*.o need to be managed explicitly to make sure
  # that you don't get broken parsing code, even in a non-enable-depend build.
--- 123,130 ----
      $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@

  # generate SQL keyword lookup table to be included into keywords*.o.
! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(GEN_KEYWORDLIST_DEPS)
!     $(GEN_KEYWORDLIST) --extern $<

  # Dependencies of keywords*.o need to be managed explicitly to make sure
  # that you don't get broken parsing code, even in a non-enable-depend build.
diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c
index d72842e..6545480 100644
*** a/src/common/kwlookup.c
--- b/src/common/kwlookup.c
***************
*** 35,94 ****
   * receive a different case-normalization mapping.
   */
  int
! ScanKeywordLookup(const char *text,
                    const ScanKeywordList *keywords)
  {
!     int            len,
!                 i;
!     char        word[NAMEDATALEN];
!     const char *kw_string;
!     const uint16 *kw_offsets;
!     const uint16 *low;
!     const uint16 *high;
!
!     len = strlen(text);

      if (len > keywords->max_kw_len)
!         return -1;                /* too long to be any keyword */
!
!     /* We assume all keywords are shorter than NAMEDATALEN. */
!     Assert(len < NAMEDATALEN);

      /*
!      * Apply an ASCII-only downcasing.  We must not use tolower() since it may
!      * produce the wrong translation in some locales (eg, Turkish).
       */
!     for (i = 0; i < len; i++)
!     {
!         char        ch = text[i];

!         if (ch >= 'A' && ch <= 'Z')
!             ch += 'a' - 'A';
!         word[i] = ch;
!     }
!     word[len] = '\0';

      /*
!      * Now do a binary search using plain strcmp() comparison.
       */
!     kw_string = keywords->kw_string;
!     kw_offsets = keywords->kw_offsets;
!     low = kw_offsets;
!     high = kw_offsets + (keywords->num_keywords - 1);
!     while (low <= high)
      {
!         const uint16 *middle;
!         int            difference;

!         middle = low + (high - low) / 2;
!         difference = strcmp(kw_string + *middle, word);
!         if (difference == 0)
!             return middle - kw_offsets;
!         else if (difference < 0)
!             low = middle + 1;
!         else
!             high = middle - 1;
      }

!     return -1;
  }
--- 35,85 ----
   * receive a different case-normalization mapping.
   */
  int
! ScanKeywordLookup(const char *str,
                    const ScanKeywordList *keywords)
  {
!     size_t        len;
!     int            h;
!     const char *kw;

+     /*
+      * Reject immediately if too long to be any keyword.  This saves useless
+      * hashing and downcasing work on long strings.
+      */
+     len = strlen(str);
      if (len > keywords->max_kw_len)
!         return -1;

      /*
!      * Compute the hash function.  We assume it was generated to produce
!      * case-insensitive results.  Since it's a perfect hash, we need only
!      * match to the specific keyword it identifies.
       */
!     h = keywords->hash(str, len);

!     /* An out-of-range result implies no match */
!     if (h < 0 || h >= keywords->num_keywords)
!         return -1;

      /*
!      * Compare character-by-character to see if we have a match, applying an
!      * ASCII-only downcasing to the input characters.  We must not use
!      * tolower() since it may produce the wrong translation in some locales
!      * (eg, Turkish).
       */
!     kw = GetScanKeyword(h, keywords);
!     while (*str != '\0')
      {
!         char        ch = *str++;

!         if (ch >= 'A' && ch <= 'Z')
!             ch += 'a' - 'A';
!         if (ch != *kw++)
!             return -1;
      }
+     if (*kw != '\0')
+         return -1;

!     /* Success! */
!     return h;
  }
diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h
index 39efb35..dbff367 100644
*** a/src/include/common/kwlookup.h
--- b/src/include/common/kwlookup.h
***************
*** 14,19 ****
--- 14,22 ----
  #ifndef KWLOOKUP_H
  #define KWLOOKUP_H

+ /* Hash function used by ScanKeywordLookup */
+ typedef int (*ScanKeywordHashFunc) (const void *key, size_t keylen);
+
  /*
   * This struct contains the data needed by ScanKeywordLookup to perform a
   * search within a set of keywords.  The contents are typically generated by
*************** typedef struct ScanKeywordList
*** 23,28 ****
--- 26,32 ----
  {
      const char *kw_string;        /* all keywords in order, separated by \0 */
      const uint16 *kw_offsets;    /* offsets to the start of each keyword */
+     ScanKeywordHashFunc hash;    /* perfect hash function for keywords */
      int            num_keywords;    /* number of keywords */
      int            max_kw_len;        /* length of longest keyword */
  } ScanKeywordList;
diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile
index b5b74a3..6c02f97 100644
*** a/src/interfaces/ecpg/preproc/Makefile
--- b/src/interfaces/ecpg/preproc/Makefile
*************** OBJS=    preproc.o pgc.o type.o ecpg.o outp
*** 28,34 ****
      keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \
      $(WIN32RES)

! GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl

  # Suppress parallel build to avoid a bug in GNU make 3.82
  # (see comments in ../Makefile)
--- 28,37 ----
      keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \
      $(WIN32RES)

! # where to find gen_keywordlist.pl and subsidiary files
! TOOLSDIR = $(top_srcdir)/src/tools
! GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl
! GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm

  # Suppress parallel build to avoid a bug in GNU make 3.82
  # (see comments in ../Makefile)
*************** preproc.y: ../../../backend/parser/gram.
*** 56,66 ****
      $(PERL) $(srcdir)/check_rules.pl $(srcdir) $<

  # generate keyword headers
! c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $<

! ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $<

  # Force these dependencies to be known even without dependency info built:
  ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h
--- 59,69 ----
      $(PERL) $(srcdir)/check_rules.pl $(srcdir) $<

  # generate keyword headers
! c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST_DEPS)
!     $(GEN_KEYWORDLIST) --varname ScanCKeywords --case $<

! ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST_DEPS)
!     $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $<

  # Force these dependencies to be known even without dependency info built:
  ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h
diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c
index 38ddf6f..80aa7d5 100644
*** a/src/interfaces/ecpg/preproc/c_keywords.c
--- b/src/interfaces/ecpg/preproc/c_keywords.c
***************
*** 9,16 ****
   */
  #include "postgres_fe.h"

- #include <ctype.h>
-
  #include "preproc_extern.h"
  #include "preproc.h"

--- 9,14 ----
*************** static const uint16 ScanCKeywordTokens[]
*** 32,70 ****
   *
   * Returns the token value of the keyword, or -1 if no match.
   *
!  * Do a binary search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
  int
! ScanCKeywordLookup(const char *text)
  {
!     const char *kw_string;
!     const uint16 *kw_offsets;
!     const uint16 *low;
!     const uint16 *high;

!     if (strlen(text) > ScanCKeywords.max_kw_len)
!         return -1;                /* too long to be any keyword */

!     kw_string = ScanCKeywords.kw_string;
!     kw_offsets = ScanCKeywords.kw_offsets;
!     low = kw_offsets;
!     high = kw_offsets + (ScanCKeywords.num_keywords - 1);

!     while (low <= high)
!     {
!         const uint16 *middle;
!         int            difference;

!         middle = low + (high - low) / 2;
!         difference = strcmp(kw_string + *middle, text);
!         if (difference == 0)
!             return ScanCKeywordTokens[middle - kw_offsets];
!         else if (difference < 0)
!             low = middle + 1;
!         else
!             high = middle - 1;
!     }

      return -1;
  }
--- 30,67 ----
   *
   * Returns the token value of the keyword, or -1 if no match.
   *
!  * Do a hash search using plain strcmp() comparison.  This is much like
   * ScanKeywordLookup(), except we want case-sensitive matching.
   */
  int
! ScanCKeywordLookup(const char *str)
  {
!     size_t        len;
!     int            h;
!     const char *kw;

!     /*
!      * Reject immediately if too long to be any keyword.  This saves useless
!      * hashing work on long strings.
!      */
!     len = strlen(str);
!     if (len > ScanCKeywords.max_kw_len)
!         return -1;

!     /*
!      * Compute the hash function.  Since it's a perfect hash, we need only
!      * match to the specific keyword it identifies.
!      */
!     h = ScanCKeywords_hash_func(str, len);

!     /* An out-of-range result implies no match */
!     if (h < 0 || h >= ScanCKeywords.num_keywords)
!         return -1;

!     kw = GetScanKeyword(h, &ScanCKeywords);
!
!     if (strcmp(kw, str) == 0)
!         return ScanCKeywordTokens[h];

      return -1;
  }
diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile
index f5958d1..cc1c261 100644
*** a/src/pl/plpgsql/src/Makefile
--- b/src/pl/plpgsql/src/Makefile
*************** REGRESS_OPTS = --dbname=$(PL_TESTDB)
*** 29,35 ****
  REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \
      plpgsql_cache plpgsql_transaction plpgsql_trigger plpgsql_varprops

! GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl

  all: all-lib

--- 29,38 ----
  REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \
      plpgsql_cache plpgsql_transaction plpgsql_trigger plpgsql_varprops

! # where to find gen_keywordlist.pl and subsidiary files
! TOOLSDIR = $(top_srcdir)/src/tools
! GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl
! GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm

  all: all-lib

*************** plerrcodes.h: $(top_srcdir)/src/backend/
*** 76,86 ****
      $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@

  # generate keyword headers for the scanner
! pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $<

! pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST)
!     $(PERL) $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $<


  check: submake
--- 79,89 ----
      $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@

  # generate keyword headers for the scanner
! pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST_DEPS)
!     $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $<

! pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST_DEPS)
!     $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $<


  check: submake
diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm
index ...12223fa .
*** a/src/tools/PerfectHash.pm
--- b/src/tools/PerfectHash.pm
***************
*** 0 ****
--- 1,375 ----
+ #----------------------------------------------------------------------
+ #
+ # PerfectHash.pm
+ #    Perl module that constructs minimal perfect hash functions
+ #
+ # This code constructs a minimal perfect hash function for the given
+ # set of keys, using an algorithm described in
+ # "An optimal algorithm for generating minimal perfect hash functions"
+ # by Czech, Havas and Majewski in Information Processing Letters,
+ # 43(5):256-264, October 1992.
+ # This implementation is loosely based on NetBSD's "nbperf",
+ # which was written by Joerg Sonnenberger.
+ #
+ # The resulting hash function is perfect in the sense that if the presented
+ # key is one of the original set, it will return the key's index in the set
+ # (in range 0..N-1).  However, the caller must still verify the match,
+ # as false positives are possible.  Also, the hash function may return
+ # values that are out of range (negative or >= N), due to summing unrelated
+ # hashtable entries.  This indicates that the presented key is definitely
+ # not in the set.
+ #
+ #
+ # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ # Portions Copyright (c) 1994, Regents of the University of California
+ #
+ # src/tools/PerfectHash.pm
+ #
+ #----------------------------------------------------------------------
+
+ package PerfectHash;
+
+ use strict;
+ use warnings;
+
+
+ # At runtime, we'll compute two simple hash functions of the input key,
+ # and use them to index into a mapping table.  The hash functions are just
+ # multiply-and-add in uint32 arithmetic, with different multipliers and
+ # initial seeds.  All the complexity in this module is concerned with
+ # selecting hash parameters that will work and building the mapping table.
+
+ # We support making case-insensitive hash functions, though this only
+ # works for a strict-ASCII interpretation of case insensitivity,
+ # ie, A-Z maps onto a-z and nothing else.
+ my $case_insensitive = 0;
+
+
+ #
+ # Construct a C function implementing a perfect hash for the given keys.
+ # The C function definition is returned as a string.
+ #
+ # The keys should be passed as an array reference.  They can be any set
+ # of Perl strings; it is caller's responsibility that there not be any
+ # duplicates.  (Note that the "strings" can be binary data, but hashing
+ # e.g. OIDs has endianness hazards that callers must overcome.)
+ #
+ # The name to use for the function is specified as the second argument.
+ # It will be a global function by default, but the caller may prepend
+ # "static " to the result string if it wants a static function.
+ #
+ # Additional options can be specified as keyword-style arguments:
+ #
+ # case_insensitive => bool
+ # If specified as true, the hash function is case-insensitive, for the
+ # limited idea of case-insensitivity explained above.
+ #
+ # fixed_key_length => N
+ # If specified, all keys are assumed to have length N bytes, and the
+ # hash function signature will be just "int f(const void *key)"
+ # rather than "int f(const void *key, size_t keylen)".
+ #
+ sub generate_hash_function
+ {
+     my ($keys_ref, $funcname, %options) = @_;
+
+     # It's not worth passing this around as a parameter; just use a global.
+     $case_insensitive = $options{case_insensitive} || 0;
+
+     # Try different hash function parameters until we find a set that works
+     # for these keys.  The multipliers are chosen to be primes that are cheap
+     # to calculate via shift-and-add, so don't change them without care.
+     # (Commonly, random seeds are tried, but we want reproducible results
+     # from this program so we don't do that.)
+     my $hash_mult1 = 31;
+     my $hash_mult2;
+     my $hash_seed1;
+     my $hash_seed2;
+     my @subresult;
+   FIND_PARAMS:
+     foreach (127, 257, 521, 1033, 2053)
+     {
+         $hash_mult2 = $_;    # "foreach $hash_mult2" doesn't work
+         for ($hash_seed1 = 0; $hash_seed1 < 10; $hash_seed1++)
+         {
+             for ($hash_seed2 = 0; $hash_seed2 < 10; $hash_seed2++)
+             {
+                 @subresult = _construct_hash_table(
+                     $keys_ref,   $hash_mult1, $hash_mult2,
+                     $hash_seed1, $hash_seed2);
+                 last FIND_PARAMS if @subresult;
+             }
+         }
+     }
+
+     # Choke if we couldn't find a workable set of parameters.
+     die "failed to generate perfect hash" if !@subresult;
+
+     # Extract info from _construct_hash_table's result array.
+     my $elemtype = $subresult[0];
+     my @hashtab  = @{ $subresult[1] };
+     my $nhash    = scalar(@hashtab);
+
+     # OK, construct the hash function definition including the hash table.
+     my $f = '';
+     $f .= sprintf "int\n";
+     if (defined $options{fixed_key_length})
+     {
+         $f .= sprintf "%s(const void *key)\n{\n", $funcname;
+     }
+     else
+     {
+         $f .= sprintf "%s(const void *key, size_t keylen)\n{\n", $funcname;
+     }
+     $f .= sprintf "\tstatic const %s h[%d] = {\n", $elemtype, $nhash;
+     for (my $i = 0; $i < $nhash; $i++)
+     {
+         $f .= sprintf "%s%6d,%s",
+           ($i % 8 == 0 ? "\t\t" : " "),
+           $hashtab[$i],
+           ($i % 8 == 7 ? "\n" : "");
+     }
+     $f .= sprintf "\n" if ($nhash % 8 != 0);
+     $f .= sprintf "\t};\n\n";
+     $f .= sprintf "\tconst unsigned char *k = key;\n";
+     $f .= sprintf "\tsize_t\t\tkeylen = %d;\n", $options{fixed_key_length}
+       if (defined $options{fixed_key_length});
+     $f .= sprintf "\tuint32\t\ta = %d;\n",   $hash_seed1;
+     $f .= sprintf "\tuint32\t\tb = %d;\n\n", $hash_seed2;
+     $f .= sprintf "\twhile (keylen--)\n\t{\n";
+     $f .= sprintf "\t\tunsigned char c = *k++";
+     $f .= sprintf " | 0x20" if $case_insensitive;    # see comment below
+     $f .= sprintf ";\n\n";
+     $f .= sprintf "\t\ta = a * %d + c;\n", $hash_mult1;
+     $f .= sprintf "\t\tb = b * %d + c;\n", $hash_mult2;
+     $f .= sprintf "\t}\n";
+     $f .= sprintf "\treturn h[a %% %d] + h[b %% %d];\n", $nhash, $nhash;
+     $f .= sprintf "}\n";
+
+     return $f;
+ }
+
+
+ # Calculate a hash function as the run-time code will do.
+ #
+ # If we are making a case-insensitive hash function, we implement that
+ # by OR'ing 0x20 into each byte of the key.  This correctly transforms
+ # upper-case ASCII into lower-case ASCII, while not changing digits or
+ # dollar signs.  (It does change '_', else we could just skip adjusting
+ # $cn here at all, for typical keyword strings.)
+ sub _calc_hash
+ {
+     my ($key, $mult, $seed) = @_;
+
+     my $result = $seed;
+     for my $c (split //, $key)
+     {
+         my $cn = ord($c);
+         $cn |= 0x20 if $case_insensitive;
+         $result = ($result * $mult + $cn) % 4294967296;
+     }
+     return $result;
+ }
+
+
+ # Attempt to construct a mapping table for a minimal perfect hash function
+ # for the given keys, using the specified hash parameters.
+ #
+ # Returns an array containing the mapping table element type name as the
+ # first element, and a ref to an array of the table values as the second.
+ #
+ # Returns an empty array on failure; then caller should choose different
+ # hash parameter(s) and try again.
+ sub _construct_hash_table
+ {
+     my ($keys_ref, $hash_mult1, $hash_mult2, $hash_seed1, $hash_seed2) = @_;
+     my @keys = @{$keys_ref};
+
+     # This algorithm is based on a graph whose edges correspond to the
+     # keys and whose vertices correspond to entries of the mapping table.
+     # A key's edge links the two vertices whose indexes are the outputs of
+     # the two hash functions for that key.  For K keys, the mapping
+     # table must have at least 2*K+1 entries, guaranteeing that there's at
+     # least one unused entry.  (In principle, larger mapping tables make it
+     # easier to find a workable hash and increase the number of inputs that
+     # can be rejected due to touching unused hashtable entries.  In practice,
+     # neither effect seems strong enough to justify using a larger table.)
+     my $nedges = scalar @keys;       # number of edges
+     my $nverts = 2 * $nedges + 1;    # number of vertices
+
+     # However, it would be very bad if $nverts were exactly equal to either
+     # $hash_mult1 or $hash_mult2: effectively, that hash function would be
+     # sensitive to only the last byte of each key.  Cases where $nverts is a
+     # multiple of either multiplier likewise lose information.  (But $nverts
+     # can't actually divide them, if they've been intelligently chosen as
+     # primes.)  We can avoid such problems by adjusting the table size.
+     while ($nverts % $hash_mult1 == 0
+         || $nverts % $hash_mult2 == 0)
+     {
+         $nverts++;
+     }
+
+     # Initialize the array of edges.
+     my @E = ();
+     foreach my $kw (@keys)
+     {
+         # Calculate hashes for this key.
+         # The hashes are immediately reduced modulo the mapping table size.
+         my $hash1 = _calc_hash($kw, $hash_mult1, $hash_seed1) % $nverts;
+         my $hash2 = _calc_hash($kw, $hash_mult2, $hash_seed2) % $nverts;
+
+         # If the two hashes are the same for any key, we have to fail
+         # since this edge would itself form a cycle in the graph.
+         return () if $hash1 == $hash2;
+
+         # Add the edge for this key.
+         push @E, { left => $hash1, right => $hash2 };
+     }
+
+     # Initialize the array of vertices, giving them all empty lists
+     # of associated edges.  (The lists will be hashes of edge numbers.)
+     my @V = ();
+     for (my $v = 0; $v < $nverts; $v++)
+     {
+         push @V, { edges => {} };
+     }
+
+     # Insert each edge in the lists of edges using its vertices.
+     for (my $e = 0; $e < $nedges; $e++)
+     {
+         my $v = $E[$e]{left};
+         $V[$v]{edges}->{$e} = 1;
+
+         $v = $E[$e]{right};
+         $V[$v]{edges}->{$e} = 1;
+     }
+
+     # Now we attempt to prove the graph acyclic.
+     # A cycle-free graph is either empty or has some vertex of degree 1.
+     # Removing the edge attached to that vertex doesn't change this property,
+     # so doing that repeatedly will reduce the size of the graph.
+     # If the graph is empty at the end of the process, it was acyclic.
+     # We track the order of edge removal so that the next phase can process
+     # them in reverse order of removal.
+     my @output_order = ();
+
+     # Consider each vertex as a possible starting point for edge-removal.
+     for (my $startv = 0; $startv < $nverts; $startv++)
+     {
+         my $v = $startv;
+
+         # If vertex v is of degree 1 (i.e. exactly 1 edge connects to it),
+         # remove that edge, and then consider the edge's other vertex to see
+         # if it is now of degree 1.  The inner loop repeats until reaching a
+         # vertex not of degree 1.
+         while (scalar(keys(%{ $V[$v]{edges} })) == 1)
+         {
+             # Unlink its only edge.
+             my $e = (keys(%{ $V[$v]{edges} }))[0];
+             delete($V[$v]{edges}->{$e});
+
+             # Unlink the edge from its other vertex, too.
+             my $v2 = $E[$e]{left};
+             $v2 = $E[$e]{right} if ($v2 == $v);
+             delete($V[$v2]{edges}->{$e});
+
+             # Push e onto the front of the output-order list.
+             unshift @output_order, $e;
+
+             # Consider v2 on next iteration of inner loop.
+             $v = $v2;
+         }
+     }
+
+     # We succeeded only if all edges were removed from the graph.
+     return () if (scalar(@output_order) != $nedges);
+
+     # OK, build the hash table of size $nverts.
+     my @hashtab = (0) x $nverts;
+     # We need a "visited" flag array in this step, too.
+     my @visited = (0) x $nverts;
+
+     # The goal is that for any key, the sum of the hash table entries for
+     # its first and second hash values is the desired output (i.e., the key
+     # number).  By assigning hash table values in the selected edge order,
+     # we can guarantee that that's true.  This works because the edge first
+     # removed from the graph (and hence last to be visited here) must have
+     # at least one vertex it shared with no other edge; hence it will have at
+     # least one vertex (hashtable entry) still unvisited when we reach it here,
+     # and we can assign that unvisited entry a value that makes the sum come
+     # out as we wish.  By induction, the same holds for all the other edges.
+     foreach my $e (@output_order)
+     {
+         my $l = $E[$e]{left};
+         my $r = $E[$e]{right};
+         if (!$visited[$l])
+         {
+             # $hashtab[$r] might be zero, or some previously assigned value.
+             $hashtab[$l] = $e - $hashtab[$r];
+         }
+         else
+         {
+             die "oops, doubly used hashtab entry" if $visited[$r];
+             # $hashtab[$l] might be zero, or some previously assigned value.
+             $hashtab[$r] = $e - $hashtab[$l];
+         }
+         # Now freeze both of these hashtab entries.
+         $visited[$l] = 1;
+         $visited[$r] = 1;
+     }
+
+     # Detect range of values needed in hash table.
+     my $hmin = $nedges;
+     my $hmax = 0;
+     for (my $v = 0; $v < $nverts; $v++)
+     {
+         $hmin = $hashtab[$v] if $hashtab[$v] < $hmin;
+         $hmax = $hashtab[$v] if $hashtab[$v] > $hmax;
+     }
+
+     # Choose width of hashtable entries.  In addition to the actual values,
+     # we need to be able to store a flag for unused entries, and we wish to
+     # have the property that adding any other entry value to the flag gives
+     # an out-of-range result (>= $nedges).
+     my $elemtype;
+     my $unused_flag;
+
+     if (   $hmin >= -0x7F
+         && $hmax <= 0x7F
+         && $hmin + 0x7F >= $nedges)
+     {
+         # int8 will work
+         $elemtype    = 'int8';
+         $unused_flag = 0x7F;
+     }
+     elsif ($hmin >= -0x7FFF
+         && $hmax <= 0x7FFF
+         && $hmin + 0x7FFF >= $nedges)
+     {
+         # int16 will work
+         $elemtype    = 'int16';
+         $unused_flag = 0x7FFF;
+     }
+     elsif ($hmin >= -0x7FFFFFFF
+         && $hmax <= 0x7FFFFFFF
+         && $hmin + 0x3FFFFFFF >= $nedges)
+     {
+         # int32 will work
+         $elemtype    = 'int32';
+         $unused_flag = 0x3FFFFFFF;
+     }
+     else
+     {
+         die "hash table values too wide";
+     }
+
+     # Set any unvisited hashtable entries to $unused_flag.
+     for (my $v = 0; $v < $nverts; $v++)
+     {
+         $hashtab[$v] = $unused_flag if !$visited[$v];
+     }
+
+     return ($elemtype, \@hashtab);
+ }
+
+ 1;
diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl
index d764aff..2744e1d 100644
*** a/src/tools/gen_keywordlist.pl
--- b/src/tools/gen_keywordlist.pl
***************
*** 14,19 ****
--- 14,25 ----
  # variable named according to the -v switch ("ScanKeywords" by default).
  # The variable is marked "static" unless the -e switch is given.
  #
+ # ScanKeywordList uses hash-based lookup, so this script also selects
+ # a minimal perfect hash function for the keyword set, and emits a
+ # static hash function that is referenced in the ScanKeywordList struct.
+ # The hash function is case-insensitive unless --case is specified.
+ # Note that case insensitivity assumes all-ASCII keywords!
+ #
  #
  # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
  # Portions Copyright (c) 1994, Regents of the University of California
***************
*** 25,39 ****
  use strict;
  use warnings;
  use Getopt::Long;

  my $output_path = '';
  my $extern = 0;
  my $varname = 'ScanKeywords';

  GetOptions(
!     'output:s' => \$output_path,
!     'extern'   => \$extern,
!     'varname:s' => \$varname) || usage();

  my $kw_input_file = shift @ARGV || die "No input file.\n";

--- 31,48 ----
  use strict;
  use warnings;
  use Getopt::Long;
+ use PerfectHash;

  my $output_path = '';
  my $extern = 0;
+ my $case_sensitive = 0;
  my $varname = 'ScanKeywords';

  GetOptions(
!     'output:s'       => \$output_path,
!     'extern'         => \$extern,
!     'case-sensitive' => \$case_sensitive,
!     'varname:s'      => \$varname) || usage();

  my $kw_input_file = shift @ARGV || die "No input file.\n";

*************** while (<$kif>)
*** 87,93 ****
--- 96,117 ----
      }
  }

+ # When being case-insensitive, insist that the input be all-lower-case.
+ if (!$case_sensitive)
+ {
+     foreach my $kw (@keywords)
+     {
+         die qq|The keyword "$kw" is not lower-case in $kw_input_file\n|
+           if ($kw ne lc $kw);
+     }
+ }
+
  # Error out if the keyword names are not in ASCII order.
+ #
+ # While this isn't really necessary with hash-based lookup, it's still
+ # helpful because it provides a cheap way to reject duplicate keywords.
+ # Also, insisting on sorted order ensures that code that scans the keyword
+ # table linearly will see the keywords in a canonical order.
  for my $i (0..$#keywords - 1)
  {
      die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n|
*************** print $kwdef "};\n\n";
*** 128,142 ****

  printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords;

  # Emit the struct that wraps all this lookup info into one variable.

! print $kwdef "static " if !$extern;
  printf $kwdef "const ScanKeywordList %s = {\n", $varname;
  printf $kwdef qq|\t%s_kw_string,\n|, $varname;
  printf $kwdef qq|\t%s_kw_offsets,\n|, $varname;
  printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname;
  printf $kwdef qq|\t%d\n|, $max_len;
! print $kwdef "};\n\n";

  printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename;

--- 152,176 ----

  printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords;

+ # Emit the definition of the hash function.
+
+ my $funcname = $varname . "_hash_func";
+
+ my $f = PerfectHash::generate_hash_function(\@keywords, $funcname,
+     case_insensitive => !$case_sensitive);
+
+ printf $kwdef qq|static %s\n|, $f;
+
  # Emit the struct that wraps all this lookup info into one variable.

! printf $kwdef "static " if !$extern;
  printf $kwdef "const ScanKeywordList %s = {\n", $varname;
  printf $kwdef qq|\t%s_kw_string,\n|, $varname;
  printf $kwdef qq|\t%s_kw_offsets,\n|, $varname;
+ printf $kwdef qq|\t%s,\n|, $funcname;
  printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname;
  printf $kwdef qq|\t%d\n|, $max_len;
! printf $kwdef "};\n\n";

  printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename;

*************** Usage: gen_keywordlist.pl [--output/-o <
*** 148,153 ****
--- 182,188 ----
      --output   Output directory (default '.')
      --varname  Name for ScanKeywordList variable (default 'ScanKeywords')
      --extern   Allow the ScanKeywordList variable to be globally visible
+     --case     Keyword matching is to be case-sensitive

  gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList.
  The output filename is derived from the input file by inserting _d,
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 937bf18..8f54e45 100644
*** a/src/tools/msvc/Solution.pm
--- b/src/tools/msvc/Solution.pm
*************** sub GenerateFiles
*** 414,420 ****
              'src/include/parser/kwlist.h'))
      {
          print "Generating kwlist_d.h...\n";
!         system('perl src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h');
      }

      if (IsNewer(
--- 414,420 ----
              'src/include/parser/kwlist.h'))
      {
          print "Generating kwlist_d.h...\n";
!         system('perl -I src/tools src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h');
      }

      if (IsNewer(
*************** sub GenerateFiles
*** 426,433 ****
      {
          print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n";
          chdir('src/pl/plpgsql/src');
!         system('perl ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h');
!         system('perl ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h');
          chdir('../../../..');
      }

--- 426,433 ----
      {
          print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n";
          chdir('src/pl/plpgsql/src');
!         system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h');
!         system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h');
          chdir('../../../..');
      }

*************** sub GenerateFiles
*** 440,447 ****
      {
          print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n";
          chdir('src/interfaces/ecpg/preproc');
!         system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h');
!         system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h');
          chdir('../../../..');
      }

--- 440,447 ----
      {
          print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n";
          chdir('src/interfaces/ecpg/preproc');
!         system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ScanCKeywords --case c_kwlist.h');
!         system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h');
          chdir('../../../..');
      }


I wrote:
> Also, I fail to understand why fmgr_builtin_oid_index has 10000 entries
> anyway.  We could easily have fmgrtab.c expose the last actually assigned
> builtin function OID (presently 6121) and make the index array only
> that big, which just about eliminates the space advantage completely.

Concretely, like the attached.

We could make the index table still smaller if we wanted to reassign
a couple dozen high-numbered functions down to lower OIDs, but I dunno
if it's worth the trouble.  It certainly isn't from a performance
standpoint, because those unused entry ranges will never be touched
in normal usage; but it'd make the server executable a couple KB smaller.

            regards, tom lane

diff --git a/src/backend/utils/Gen_fmgrtab.pl b/src/backend/utils/Gen_fmgrtab.pl
index cafe408..f970940 100644
*** a/src/backend/utils/Gen_fmgrtab.pl
--- b/src/backend/utils/Gen_fmgrtab.pl
*************** foreach my $datfile (@input_files)
*** 80,90 ****
      $catalog_data{$catname} = Catalog::ParseData($datfile, $schema, 0);
  }

- # Fetch some values for later.
- my $FirstGenbkiObjectId =
-   Catalog::FindDefinedSymbol('access/transam.h', $include_path,
-     'FirstGenbkiObjectId');
-
  # Collect certain fields from pg_proc.dat.
  my @fmgr = ();

--- 80,85 ----
*************** my %bmap;
*** 225,230 ****
--- 220,226 ----
  $bmap{'t'} = 'true';
  $bmap{'f'} = 'false';
  my @fmgr_builtin_oid_index;
+ my $last_builtin_oid = 0;
  my $fmgr_count = 0;
  foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr)
  {
*************** foreach my $s (sort { $a->{oid} <=> $b->
*** 232,237 ****
--- 228,234 ----
        "  { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }";

      $fmgr_builtin_oid_index[ $s->{oid} ] = $fmgr_count++;
+     $last_builtin_oid = $s->{oid};

      if ($fmgr_count <= $#fmgr)
      {
*************** foreach my $s (sort { $a->{oid} <=> $b->
*** 244,274 ****
  }
  print $tfh "};\n";

! print $tfh qq|
  const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin));
! |;


  # Create fmgr_builtins_oid_index table.
! #
! # Note that the array has to be filled up to FirstGenbkiObjectId,
! # as we can't rely on zero initialization as 0 is a valid mapping.
! print $tfh qq|
! const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId] = {
! |;

! for (my $i = 0; $i < $FirstGenbkiObjectId; $i++)
  {
      my $oid = $fmgr_builtin_oid_index[$i];

!     # fmgr_builtin_oid_index is sparse, map nonexistant functions to
      # InvalidOidBuiltinMapping
      if (not defined $oid)
      {
          $oid = 'InvalidOidBuiltinMapping';
      }

!     if ($i + 1 == $FirstGenbkiObjectId)
      {
          print $tfh "  $oid\n";
      }
--- 241,270 ----
  }
  print $tfh "};\n";

! printf $tfh qq|
  const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin));
!
! const Oid fmgr_last_builtin_oid = %u;
! |, $last_builtin_oid;


  # Create fmgr_builtins_oid_index table.
! printf $tfh qq|
! const uint16 fmgr_builtin_oid_index[%u] = {
! |, $last_builtin_oid + 1;

! for (my $i = 0; $i <= $last_builtin_oid; $i++)
  {
      my $oid = $fmgr_builtin_oid_index[$i];

!     # fmgr_builtin_oid_index is sparse, map nonexistent functions to
      # InvalidOidBuiltinMapping
      if (not defined $oid)
      {
          $oid = 'InvalidOidBuiltinMapping';
      }

!     if ($i == $last_builtin_oid)
      {
          print $tfh "  $oid\n";
      }
diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
index b41649f..506eeef 100644
*** a/src/backend/utils/fmgr/fmgr.c
--- b/src/backend/utils/fmgr/fmgr.c
*************** fmgr_isbuiltin(Oid id)
*** 75,86 ****
      uint16        index;

      /* fast lookup only possible if original oid still assigned */
!     if (id >= FirstGenbkiObjectId)
          return NULL;

      /*
       * Lookup function data. If there's a miss in that range it's likely a
!      * nonexistant function, returning NULL here will trigger an ERROR later.
       */
      index = fmgr_builtin_oid_index[id];
      if (index == InvalidOidBuiltinMapping)
--- 75,86 ----
      uint16        index;

      /* fast lookup only possible if original oid still assigned */
!     if (id > fmgr_last_builtin_oid)
          return NULL;

      /*
       * Lookup function data. If there's a miss in that range it's likely a
!      * nonexistent function, returning NULL here will trigger an ERROR later.
       */
      index = fmgr_builtin_oid_index[id];
      if (index == InvalidOidBuiltinMapping)
diff --git a/src/include/utils/fmgrtab.h b/src/include/utils/fmgrtab.h
index a778f88..e981f34 100644
*** a/src/include/utils/fmgrtab.h
--- b/src/include/utils/fmgrtab.h
*************** extern const FmgrBuiltin fmgr_builtins[]
*** 36,46 ****

  extern const int fmgr_nbuiltins;    /* number of entries in table */

  /*
!  * Mapping from a builtin function's oid to the index in the fmgr_builtins
!  * array.
   */
  #define InvalidOidBuiltinMapping PG_UINT16_MAX
! extern const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId];

  #endif                            /* FMGRTAB_H */
--- 36,48 ----

  extern const int fmgr_nbuiltins;    /* number of entries in table */

+ extern const Oid fmgr_last_builtin_oid; /* highest function OID in table */
+
  /*
!  * Mapping from a builtin function's OID to its index in the fmgr_builtins
!  * array.  This is indexed from 0 through fmgr_last_builtin_oid.
   */
  #define InvalidOidBuiltinMapping PG_UINT16_MAX
! extern const uint16 fmgr_builtin_oid_index[];

  #endif                            /* FMGRTAB_H */

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Alvaro Herrera
Date:
On 2019-Jan-09, Tom Lane wrote:

> We could make the index table still smaller if we wanted to reassign
> a couple dozen high-numbered functions down to lower OIDs, but I dunno
> if it's worth the trouble.  It certainly isn't from a performance
> standpoint, because those unused entry ranges will never be touched
> in normal usage; but it'd make the server executable a couple KB smaller.

Or two couples KB smaller, if we abandoned the idea that pg_proc OIDs
must not collide with those in any other catalog, and we renumbered all
functions to start at OID 1 or so.  duplicate_oids would complain about
that, though, I suppose ... and nobody who has ever hardcoded a function
OID would love this idea much.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2019-01-09 14:44:24 -0500, Tom Lane wrote:
> I wrote:
> > Also, I fail to understand why fmgr_builtin_oid_index has 10000 entries
> > anyway.  We could easily have fmgrtab.c expose the last actually assigned
> > builtin function OID (presently 6121) and make the index array only
> > that big, which just about eliminates the space advantage completely.
> 
> Concretely, like the attached.

Seems like a good improvement.


> We could make the index table still smaller if we wanted to reassign
> a couple dozen high-numbered functions down to lower OIDs, but I dunno
> if it's worth the trouble.  It certainly isn't from a performance
> standpoint, because those unused entry ranges will never be touched
> in normal usage; but it'd make the server executable a couple KB smaller.

Probably indeed not worth it. I'm not 100% convinced on the performance
POV, but in contrast to the earlier binary search, either approach is
fast enough that it's probably hard to measure any difference.


> diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
> index b41649f..506eeef 100644
> --- a/src/backend/utils/fmgr/fmgr.c
> +++ b/src/backend/utils/fmgr/fmgr.c
> @@ -75,12 +75,12 @@ fmgr_isbuiltin(Oid id)
>      uint16        index;
>  
>      /* fast lookup only possible if original oid still assigned */
> -    if (id >= FirstGenbkiObjectId)
> +    if (id > fmgr_last_builtin_oid)
>          return NULL;

An extern reference here will make the code a bit less efficient, but
it's probably not worth generating a header with a define for it
instead...

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2019-01-09 14:44:24 -0500, Tom Lane wrote:
>> /* fast lookup only possible if original oid still assigned */
>> -    if (id >= FirstGenbkiObjectId)
>> +    if (id > fmgr_last_builtin_oid)
>>         return NULL;

> An extern reference here will make the code a bit less efficient, but
> it's probably not worth generating a header with a define for it
> instead...

Yeah, also that would be significantly more fragile, in that it'd
be hard to be sure where that OID had propagated to when rebuilding.
We haven't chosen to make fmgr_nbuiltins a #define either, and I think
it's best to treat this the same way.

            regards, tom lane


Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Jan-09, Tom Lane wrote:
>> We could make the index table still smaller if we wanted to reassign
>> a couple dozen high-numbered functions down to lower OIDs, but I dunno
>> if it's worth the trouble.  It certainly isn't from a performance
>> standpoint, because those unused entry ranges will never be touched
>> in normal usage; but it'd make the server executable a couple KB smaller.

> Or two couples KB smaller, if we abandoned the idea that pg_proc OIDs
> must not collide with those in any other catalog, and we renumbered all
> functions to start at OID 1 or so.  duplicate_oids would complain about
> that, though, I suppose ... and nobody who has ever hardcoded a function
> OID would love this idea much.

I think that'd be a nonstarter for commonly-used functions.  I'm guessing
that pg_replication_origin_create() and so on, which are the immediate
problem, haven't been around long enough or get used often enough for
someone to have hard-coded their OIDs.  But I could be wrong.

(Speaking of which, I've been wondering for awhile if libpq ought not
obtain the OIDs of lo_create and friends by #including fmgroids.h
instead of doing a runtime query on every connection.  If we did that,
we'd be forever giving up the option to renumber them ... but do you
really want to bet that nobody else has done this already in some
other client code?)

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Andres Freund
Date:
Hi,

On 2019-01-09 15:03:35 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2019-Jan-09, Tom Lane wrote:
> >> We could make the index table still smaller if we wanted to reassign
> >> a couple dozen high-numbered functions down to lower OIDs, but I dunno
> >> if it's worth the trouble.  It certainly isn't from a performance
> >> standpoint, because those unused entry ranges will never be touched
> >> in normal usage; but it'd make the server executable a couple KB smaller.
> 
> > Or two couples KB smaller, if we abandoned the idea that pg_proc OIDs
> > must not collide with those in any other catalog, and we renumbered all
> > functions to start at OID 1 or so.  duplicate_oids would complain about
> > that, though, I suppose ... and nobody who has ever hardcoded a function
> > OID would love this idea much.
> 
> I think that'd be a nonstarter for commonly-used functions.  I'm guessing
> that pg_replication_origin_create() and so on, which are the immediate
> problem, haven't been around long enough or get used often enough for
> someone to have hard-coded their OIDs.  But I could be wrong.

I don't think it's likely that it'd be useful to hardcode them, and
therefore hope that nobody would do so.

I personally feel limited sympathy for people hardcoding oids across
major versions. The benefits of making pg easier to maintain and more
efficient seem higher than allowing for that.


> (Speaking of which, I've been wondering for awhile if libpq ought not
> obtain the OIDs of lo_create and friends by #including fmgroids.h
> instead of doing a runtime query on every connection.  If we did that,
> we'd be forever giving up the option to renumber them ... but do you
> really want to bet that nobody else has done this already in some
> other client code?)

I'm not enthusiastic about that. I kinda hope we're going to evolve that
interface further, which'd make it version dependent anyway (we don't
require all of them right now...). And it's not that expensive to query
their oids once.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2019-01-09 15:03:35 -0500, Tom Lane wrote:
>> (Speaking of which, I've been wondering for awhile if libpq ought not
>> obtain the OIDs of lo_create and friends by #including fmgroids.h
>> instead of doing a runtime query on every connection.  If we did that,
>> we'd be forever giving up the option to renumber them ... but do you
>> really want to bet that nobody else has done this already in some
>> other client code?)

> I'm not enthusiastic about that. I kinda hope we're going to evolve that
> interface further, which'd make it version dependent anyway (we don't
> require all of them right now...). And it's not that expensive to query
> their oids once.

Version dependency doesn't seem like much of an argument: we'd just teach
libpq to pay attention to the server version, which it knows anyway (and
uses for other purposes already).

But this is a bit off-topic for this thread, perhaps.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Joerg Sonnenberger
Date:
On Wed, Jan 09, 2019 at 02:04:15PM -0500, Tom Lane wrote:
> Also, in view of finding that the original multiplier choices failed
> on the fmgr oid problem, I spent a little effort making the code
> able to try more combinations of hash multipliers and seeds.  It'd
> be nice to have some theory rather than just heuristics about what
> will work, though ...

The theory is that the code needs two families of pair-wise independent
hash functions to give an O(1) expected number of tries. For practical
purposes, the Jenkins hash (or any cryptographic hash function) easily
qualifies. The downside is that those hash functions are very expensive.
A multiplicative hash, especially one using the high bits of the result
(e.g. the top 32 bits of a 32x32->64 multiplication), also qualifies,
but for string input it would need a pair of constants per word. So the
choice of a hash function family for a performance-sensitive part is a
heuristic in the sense of trying to get away with as simple a function
as possible.
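To make that concrete, here is a minimal sketch (mine, not code from this
thread) of one member of such a multiplicative hash family, keeping the
top 32 bits of a 32x32->64 multiplication at each step. The multiplier
0x9E3779B1 is just a commonly used odd constant, not one chosen by any
generator; varying the seed or multiplier yields the other family members
a generator could try.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of a multiplicative string hash: mix in one byte, then keep the
 * high half of a 32x32->64 multiplication.  The constant 0x9E3779B1 is an
 * arbitrary odd multiplier; a family of functions is obtained by varying
 * the seed (and/or the multiplier).
 */
static uint32_t
mult_hash(const char *key, size_t len, uint32_t seed)
{
	uint32_t	h = seed;

	for (size_t i = 0; i < len; i++)
		h = (uint32_t) (((uint64_t) (h ^ (unsigned char) key[i]) *
						 UINT64_C(0x9E3779B1)) >> 32);
	return h;
}
```

Note that taking the *high* half of the product is what makes the cheap
multiply usable here: the low bits of a multiplication mix poorly.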

Joerg


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On Wed, Jan 9, 2019 at 2:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > John Naylor <john.naylor@2ndquadrant.com> writes:
> >> -There is a bit of a cognitive clash between $case_sensitive in
> >> gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each
> >> make sense in their own file, but might it be worth using one or the
> >> other?

> Working on the fmgr-oid-lookup idea gave me the thought that
> PerfectHash.pm ought to support fixed-length keys.  Rather than start
> adding random parameters to the function, I borrowed an idea from
> PostgresNode.pm and made the options be keyword-style parameters.  Now
> the impedance mismatch about case sensitivity is handled with
>
> my $f = PerfectHash::generate_hash_function(\@keywords, $funcname,
>         case_insensitive => !$case_sensitive);
>
> which is at least a little clearer than before, though I'm not sure
> if it entirely solves the problem.

It's a bit clearer, but thinking about this some more, it makes sense
for gen_keywordlist.pl to use $case_insensitive, because right now
every instance of the var is "!$case_sensitive". In the attached (on
top of v4), I change the command line option to --citext, and add the
ability to negate it within the option, as '--no-citext'. It's kind of
a double negative for the C-keywords invocation, but we can have the
option for both cases, so we don't need to worry about what the
default is (which is case_insensitive=1).

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On Tue, Jan 8, 2019 at 5:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> John Naylor <john.naylor@2ndquadrant.com> writes:
> > In the committed keyword patch, I noticed that in common/keywords.c,
> > the array length is defined with
> > ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS]
> > but other keyword arrays just have ...[]. Is there a reason for the difference?
>
> The length macro was readily available there so I used it.  AFAIR
> that wasn't true elsewhere, though I might've missed something.
> It's pretty much just belt-and-suspenders coding anyway, since all
> those arrays are machine generated ...

I tried using the available num_keywords macro in plpgsql and it
worked fine, but it makes the lines really long. Alternatively, as in
the attached, we could remove the single use of the core macro and
maybe add comments to the generated magic numbers.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On Wed, Jan 9, 2019 at 2:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [patch to shrink oid index]

It would help maintain its newfound svelteness if we warned when a
higher OID was assigned, as in the attached. I used 6200 as a soft
limit, but that could be any similar value.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

John Naylor <john.naylor@2ndquadrant.com> writes:
> On Wed, Jan 9, 2019 at 2:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [patch to shrink oid index]

> It would help maintain its newfound svelteness if we warned when a
> higher OID was assigned, as in the attached. I used 6200 as a soft
> limit, but that could be any similar value.

I think the reason we have this issue is that people tend to use
high OIDs during development of a patch, so that their elbows won't
be joggled by unrelated changes.  Then sometimes they forget to
renumber them down before committing.  A warning like this would
lead to lots of noise during the development stage, which nobody
would thank us for.  If we could find a way to notice this only
when we were about to commit, it'd be good .. but I don't have an
idea about a nice way to do that.  (No, I don't want a commit hook
on gitmaster; that's warning too late, which is not better.)

            regards, tom lane


John Naylor <john.naylor@2ndquadrant.com> writes:
> On Wed, Jan 9, 2019 at 2:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now the impedance mismatch about case sensitivity is handled with
>> my $f = PerfectHash::generate_hash_function(\@keywords, $funcname,
>>         case_insensitive => !$case_sensitive);
>> which is at least a little clearer than before, though I'm not sure
>> if it entirely solves the problem.

> It's a bit clearer, but thinking about this some more, it makes sense
> for gen_keywordlist.pl to use $case_insensitive, because right now
> every instance of the var is "!$case_sensitive". In the attached (on
> top of v4), I change the command line option to --citext, and add the
> ability to negate it within the option, as '--no-citext'. It's kind of
> a double negative for the C-keywords invocation, but we can have the
> option for both cases, so we don't need to worry about what the
> default is (which is case_insensitive=1).

Ah, I didn't realize that Getopt allows having a boolean option
defaulting to "on".  That makes it more practical to do something here.

I'm not in love with "citext" as the option name, though ... that has
little to recommend it except brevity, which is not a property we
really need here.  We could go with "[no-]case-insensitive", perhaps.
Or "[no-]case-fold", which is at least a little shorter and less
double-negative-y.

            regards, tom lane


John Naylor <john.naylor@2ndquadrant.com> writes:
> On Tue, Jan 8, 2019 at 5:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The length macro was readily available there so I used it.  AFAIR
>> that wasn't true elsewhere, though I might've missed something.
>> It's pretty much just belt-and-suspenders coding anyway, since all
>> those arrays are machine generated ...

> I tried using the available num_keywords macro in plpgsql and it
> worked fine, but it makes the lines really long. Alternatively, as in
> the attached, we could remove the single use of the core macro and
> maybe add comments to the generated magic numbers.

Meh, I'm not excited about removing the option just because there's
only one use of it now.  There might be more-compelling uses later.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
John Naylor
Date:
On Wed, Jan 9, 2019 at 5:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> really need here.  We could go with "[no-]case-insensitive", perhaps.
> Or "[no-]case-fold", which is at least a little shorter and less
> double-negative-y.

I'd be in favor of --[no-]case-fold.

On Tue, Jan 8, 2019 at 5:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I improved the comment about how come the hash table entry assignment
> works.

I've gone over the algorithm in more detail and I don't see any nicer
way to write it. This comment in PerfectHash.pm:

(It does change '_', else we could just skip adjusting
# $cn here at all, for typical keyword strings.)

...seems a bit out of place in the module, because of its reference to
keywords, which right now are of interest only to its single caller.
Maybe a bit of context would help here. (I also don't quite understand
why we could hypothetically skip the adjustment.)

Lastly, the keyword headers still have a dire warning about ASCII
order and binary search. Those could be softened to match the comment
in gen_keywordlist.pl.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


John Naylor <john.naylor@2ndquadrant.com> writes:
> On Wed, Jan 9, 2019 at 5:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> really need here.  We could go with "[no-]case-insensitive", perhaps.
>> Or "[no-]case-fold", which is at least a little shorter and less
>> double-negative-y.

> I'd be in favor of --[no-]case-fold.

Yeah, I like that better too; I've been having to stop and think
every time as to which direction is which with the [in]sensitive
terminology.  I'll make it "case-fold" throughout.

> This comment in PerfectHash.pm:

> (It does change '_', else we could just skip adjusting
> # $cn here at all, for typical keyword strings.)

> ...seems a bit out of place in the module, because of its reference to
> keywords, of interest right now to its only caller. Maybe a bit of
> context here. (I also don't quite understand why we could
> hypothetically skip the adjustment.)

Were it not for the underscore case, we could plausibly assume that
the supplied keywords are already all-lower-case and don't need any
further folding.  But I agree that this comment is probably more
confusing than helpful; it's easier just to see that the code is
applying the same transform as the runtime lookup will do.
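
Concretely (a quick standalone illustration, not code from the patch):

```c
#include <assert.h>

/*
 * The per-byte case-folding transform used by the lookup.  OR-ing in
 * 0x20 lower-cases ASCII letters and leaves already-lower-case letters
 * alone, but '_' (0x5F) becomes 0x7F -- which is why the transform
 * can't simply be skipped for typical keyword strings.
 */
static int
fold_char(int c)
{
	return c | 0x20;
}
```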

> Lastly, the keyword headers still have a dire warning about ASCII
> order and binary search. Those could be softened to match the comment
> in gen_keywordlist.pl.

Agreed, will do.

Thanks for reviewing!

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Joel Jacobson
Date:
Many thanks for working on this; amazing work. It's really nice that you made it a separate, reusable Perl module.

The generated hash function reads one character at a time.
I've seen a performance trick in other hash functions [1]
to instead read multiple bytes in each iteration,
and then handle the remaining bytes after the loop.

[1] https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h#L29


I've done some testing and it looks like a ~30% speed-up of the generated
ScanKeywords_hash_func() function would be possible.

If you think this approach is promising, I would be happy to prepare a patch for it,
but I wanted to check with the project first: has this idea already been considered
and ruled out for some technical reason I've failed to see?

For this to work you would need to use larger constants for $hash_mult1 and $hash_mult2, though.
I've successfully used these values:
$hash_mult1 = 0x2c1b3c6d
$hash_mult2 = (0x297a2d39, 0x85ebca6b, 0xc2b2ae35, 0x7feb352d, 0x846ca68b)

Here is the idea:

Generated C code:

  for (; keylen >= 4; keylen -= 4, k += 4)
  {
    uint32_t v;
    memcpy(&v, k, 4);
    v |= 0x20202020;
    a = a * 739982445 + v;
    b = b * 2246822507 + v;
  }
  uint32_t v = 0;
  switch (keylen)
  {
  case 3:
    memcpy(&v, k, 3);
    v |= 0x202020;
    break;
  case 2:
    memcpy(&v, k, 2);
    v |= 0x2020;
    break;
  case 1:
    memcpy(&v, k, 1);
    v |= 0x20;
    break;
  }
  a = a * 739982445 + v;
  b = b * 2246822507 + v;
  return h[a % 883] + h[b % 883];

(Reading 8 bytes at a time instead would perhaps be a win, since some keywords are quite long.)
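
In case it's useful, here is a self-contained sketch of the word-at-a-time loop. The multipliers are the ones from the generated code above, but the seeds (1, 2) are arbitrary placeholders and the final `a ^ b` combine stands in for the real `h[a % N] + h[b % N]` table lookup, since seeds and table come from the generator:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Standalone sketch of the word-at-a-time hash loop.  Note that the
 * memcpy loads make the result depend on host byte order; see the
 * endianness discussion below.
 */
static uint32_t
hash_word_at_a_time(const char *k, size_t keylen)
{
	uint32_t	a = 1;		/* placeholder seed */
	uint32_t	b = 2;		/* placeholder seed */
	uint32_t	v;

	for (; keylen >= 4; keylen -= 4, k += 4)
	{
		memcpy(&v, k, 4);	/* alignment-safe 4-byte load */
		v |= 0x20202020;	/* case-fold all four bytes at once */
		a = a * 739982445u + v;
		b = b * 2246822507u + v;
	}

	/* fold in the 0-3 remaining bytes (v stays 0 if none remain) */
	v = 0;
	memcpy(&v, k, keylen);
	if (keylen > 0)
		v |= 0x20202020 >> (8 * (4 - keylen));	/* 0x20, 0x2020 or 0x202020 */
	a = a * 739982445u + v;
	b = b * 2246822507u + v;
	return a ^ b;			/* stand-in for h[a % N] + h[b % N] */
}
```

Called as hash_word_at_a_time(kw, strlen(kw)); the case-folding property (upper- and lower-case spellings hash identically) holds regardless of the placeholder seeds.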

Perl code:

sub _calc_hash
{
	my ($key, $mult, $seed) = @_;

	my $result = $seed;
	my $i      = 0;
	my $keylen = length($key);

	for (; $keylen >= 4; $keylen -= 4, $i += 4)
	{
		my $cn =
			(ord(substr($key, $i + 0, 1)) << 0) |
			(ord(substr($key, $i + 1, 1)) << 8) |
			(ord(substr($key, $i + 2, 1)) << 16) |
			(ord(substr($key, $i + 3, 1)) << 24);
		$cn |= 0x20202020 if $case_fold;
		$result = ($result * $mult + $cn) % 4294967296;
	}

	my $cn = 0;
	if ($keylen == 3)
	{
		$cn =
			(ord(substr($key, $i + 0, 1)) << 0) |
			(ord(substr($key, $i + 1, 1)) << 8) |
			(ord(substr($key, $i + 2, 1)) << 16);
		$cn |= 0x202020 if $case_fold;
	}
	elsif ($keylen == 2)
	{
		$cn =
			(ord(substr($key, $i + 0, 1)) << 0) |
			(ord(substr($key, $i + 1, 1)) << 8);
		$cn |= 0x2020 if $case_fold;
	}
	elsif ($keylen == 1)
	{
		$cn = ord(substr($key, $i + 0, 1));
		$cn |= 0x20 if $case_fold;
	}
	$result = ($result * $mult + $cn) % 4294967296;

	return $result;
}



Joel Jacobson <joel@trustly.com> writes:
> I've seen a performance trick in other hash functions [1]
> to instead read multiple bytes in each iteration,
> and then handle the remaining bytes after the loop.
> [1] https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h#L29

I can't get very excited about this, seeing that we're only going to
be hashing short strings.  I don't really believe your 30% number
for short strings; and even if I did, there's no evidence that the
hash functions are worth any further optimization in terms of our
overall performance.

Also, as best I can tell, the approach you propose would result
in an endianness dependence, meaning we'd have to have separate
lookup tables for BE and LE machines.  That's not a dealbreaker
perhaps, but it is certainly another point on the "it's not worth it"
side of the argument.

            regards, tom lane


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

From
Joel Jacobson
Date:
On Wed, Mar 20, 2019 at 9:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Joel Jacobson <joel@trustly.com> writes:
> I've seen a performance trick in other hash functions [1]
> to instead read multiple bytes in each iteration,
> and then handle the remaining bytes after the loop.
> [1] https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h#L29

I can't get very excited about this, seeing that we're only going to
be hashing short strings.  I don't really believe your 30% number
for short strings; and even if I did, there's no evidence that the
hash functions are worth any further optimization in terms of our
overall performance.

I went ahead and tested this approach anyway, since I need this algorithm in a completely different project.

The benchmark below shows stats for three different keywords per length, compiled with -O2:

$ c++ -O2 -std=c++14 -o bench_perfect_hash bench_perfect_hash.cc
$ ./bench_perfect_hash

keyword              length char-a-time (ns) word-a-time (ns) diff (frac)
as                   2      3.30             2.62             -0.21
at                   2      3.54             2.66             -0.25
by                   2      3.30             2.59             -0.22
add                  3      4.01             3.15             -0.21
all                  3      4.04             3.11             -0.23
and                  3      3.84             3.11             -0.19
also                 4      4.50             3.17             -0.30
both                 4      4.49             3.06             -0.32
call                 4      4.95             3.42             -0.31
abort                5      6.09             4.02             -0.34
admin                5      5.26             3.65             -0.31
after                5      5.18             3.76             -0.27
access               6      5.97             3.91             -0.34
action               6      5.86             3.89             -0.34
always               6      6.10             3.77             -0.38
analyse              7      6.67             4.64             -0.30
analyze              7      7.09             4.87             -0.31
between              7      7.02             4.66             -0.34
absolute             8      7.49             3.82             -0.49
backward             8      7.13             3.88             -0.46
cascaded             8      7.23             4.17             -0.42
aggregate            9      8.04             4.49             -0.44
assertion            9      7.98             4.52             -0.43
attribute            9      8.03             4.44             -0.45
assignment           10     8.58             4.67             -0.46
asymmetric           10     9.07             4.57             -0.50
checkpoint           10     9.15             4.53             -0.51
constraints          11     9.58             5.14             -0.46
insensitive          11     9.62             5.30             -0.45
publication          11     10.30            5.60             -0.46
concurrently         12     10.36            4.81             -0.54
current_date         12     11.17            5.48             -0.51
current_role         12     11.15            5.10             -0.54
authorization        13     11.87            5.50             -0.54
configuration        13     11.50            5.51             -0.52
xmlattributes        13     11.72            5.66             -0.52
current_schema       14     12.17            5.58             -0.54
localtimestamp       14     11.78            5.46             -0.54
characteristics      15     12.77            5.97             -0.53
current_catalog      15     12.65            5.87             -0.54
current_timestamp    17     14.19            6.12             -0.57
 
Also, as best I can tell, the approach you propose would result
in an endianness dependence, meaning we'd have to have separate
lookup tables for BE and LE machines.  That's not a dealbreaker
perhaps, but it is certainly another point on the "it's not worth it"
side of the argument.

I can see how the same problem has been worked around in e.g. pg_crc32.h:

#ifdef WORDS_BIGENDIAN
#define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
#else
#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
#endif

So I used the same trick in PerfectHash.pm:

$f .= sprintf "#ifdef WORDS_BIGENDIAN\n";
$f .= sprintf "\t\tc4 = pg_bswap32(c4);\n";
$f .= sprintf "#endif\n";
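
The normalization can also be tested in isolation. A minimal sketch, where bswap32() is a portable stand-in for pg_bswap32() and the runtime endianness check stands in for the compile-time WORDS_BIGENDIAN conditional:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for pg_bswap32(). */
static uint32_t
bswap32(uint32_t x)
{
	return ((x & 0x000000ff) << 24) |
		((x & 0x0000ff00) << 8) |
		((x & 0x00ff0000) >> 8) |
		((x & 0xff000000) >> 24);
}

/*
 * Load 4 bytes and normalize to the little-endian interpretation, so
 * the hash (and hence the generated lookup tables) comes out the same
 * on big- and little-endian hosts.
 */
static uint32_t
load32_le(const void *p)
{
	uint32_t	v;
	uint32_t	one = 1;

	memcpy(&v, p, 4);
	if (*(const unsigned char *) &one != 1)		/* big-endian host */
		v = bswap32(v);
	return v;
}
```

The word-at-a-time loop would then call load32_le(k) instead of doing a bare memcpy.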

I've also tried to measure the overall effect by hacking postgres.c:

+       struct timespec start, stop;
+       clock_gettime(CLOCK_REALTIME, &start);
+       for (int i = 0; i < 100000; i++)
+       {
+               List       *parsetree_list2;
+               MemoryContext oldcontext2;
+
+               oldcontext2 = MemoryContextSwitchTo(MessageContext);
+               parsetree_list2 = pg_parse_query(query_string);
+               MemoryContextSwitchTo(oldcontext2);
+//             MemoryContextReset(MessageContext);
+               CHECK_FOR_INTERRUPTS();
+       }
+       clock_gettime(CLOCK_REALTIME, &stop);
+       printf("Bench: %f\n", (stop.tv_sec - start.tv_sec) + (double) (stop.tv_nsec - start.tv_nsec) / 1000000000L);

I measured the time for a big query found here: https://wiki.postgresql.org/wiki/Index_Maintenance

I might be doing something wrong, but it looks like the overall effect is a ~3% improvement.
