Thread: reducing the footprint of ScanKeyword (was Re: Large writable variables)
On 10/15/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2018-10-15 16:36:26 -0400, Tom Lane wrote:
>>> We could possibly fix these by changing the data structure so that
>>> what's in a ScanKeywords entry is an offset into some giant string
>>> constant somewhere. No idea how that would affect performance, but
>>> I do notice that we could reduce the sizeof(ScanKeyword), which can't
>>> hurt.
>
>> Yea, that might even help performancewise. Alternatively we could change
>> ScanKeyword to store the keyword name inline, but that'd be a measurable
>> size increase...
>
> Yeah. It also seems like doing it this way would improve locality of
> access: the pieces of the giant string would presumably be in the same
> order as the ScanKeywords entries, whereas with the current setup,
> who knows where the compiler has put 'em or in what order.
>
> We'd need some tooling to generate the constants that way, though;
> I can't see how to make it directly from kwlist.h.

A few months ago I was looking into faster search algorithms for
ScanKeywordLookup(), so this is interesting to me. While an optimal
full replacement would be a lot of work, the above ideas are much less
invasive and would still have some benefit. Unless anyone intends to
work on this, I'd like to flesh out the offset-into-giant-string
approach a bit further:

Since there are several callers of the current approach that don't use
the core keyword list, we'd have to keep the existing struct and lookup
function to keep the complexity manageable. Once we have an
offset-based struct and function, it makes sense to use it for all
searches of core keywords. This includes not only the core scanner, but
also adt/ruleutils.c, fe_utils/string_utils.c, and
ecpg/preproc/keywords.c.

There would need to be a header with offsets replacing name strings,
generated from parser/kwlist.h, maybe kwlist_offset.h. It'd probably be
convenient if it was emitted into the common/ dir. The giant string
would likely need its own header (kwlist_string.h?).

Since PL/pgSQL uses the core scanner, we'd need to use offsets in its
reserved_keywords[], too. Those don't change much, so we can probably
get away with hard-coding the offsets and the giant string in that
case. (If that's not acceptable, we could separate that out to
pl_reserved_kwlist.h and reuse the above tooling to generate
pl_reserved_kwlist_{offset,string}.h, but that's more complex.)

The rest should be just a SMOP. Any issues I left out?

-John Naylor
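To make the offset-into-giant-string idea concrete, here is a minimal
sketch of a binary search over offsets into one concatenated string.
The names and the three-keyword sample data are illustrative
assumptions, not the API any patch in this thread actually proposes:

    #include <string.h>

    /* Generated giant string: keywords concatenated with NUL separators. */
    static const char kw_string[] =
        "abort\0"
        "absolute\0"
        "access";

    /* Generated start offset of each keyword within kw_string. */
    static const unsigned short kw_offsets[] = {0, 6, 15};

    #define NUM_KEYWORDS (sizeof(kw_offsets) / sizeof(kw_offsets[0]))

    /* Return the keyword's index in kw_offsets, or -1 if not a keyword. */
    static int
    kw_lookup(const char *text)
    {
        int     low = 0;
        int     high = NUM_KEYWORDS - 1;

        while (low <= high)
        {
            int     middle = low + (high - low) / 2;
            int     cmp = strcmp(text, kw_string + kw_offsets[middle]);

            if (cmp == 0)
                return middle;
            else if (cmp < 0)
                high = middle - 1;
            else
                low = middle + 1;
        }
        return -1;
    }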
John Naylor <jcnaylor@gmail.com> writes:
> A few months ago I was looking into faster search algorithms for
> ScanKeywordLookup(), so this is interesting to me. While an optimal
> full replacement would be a lot of work, the above ideas are much less
> invasive and would still have some benefit. Unless anyone intends to
> work on this, I'd like to flesh out the offset-into-giant-string
> approach a bit further:

Have at it...

> Since PL/pgSQL uses the core scanner, we'd need to use offsets in its
> reserved_keywords[], too. Those don't change much, so we can probably
> get away with hard-coding the offsets and the giant string in that
> case. (If that's not acceptable, we could separate that out to
> pl_reserved_kwlist.h and reuse the above tooling to generate
> pl_reserved_kwlist_{offset,string}.h, but that's more complex.)

plpgsql isn't as stable as all that: people propose new syntax for it
all the time. I do not think a hand-maintained array would be pleasant
at all.

Also, wouldn't we also adopt this technology for its unreserved
keywords, too?

			regards, tom lane
On 12/17/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> John Naylor <jcnaylor@gmail.com> writes:
>> Since PL/pgSQL uses the core scanner, we'd need to use offsets in its
>> reserved_keywords[], too. Those don't change much, so we can probably
>> get away with hard-coding the offsets and the giant string in that
>> case. (If that's not acceptable, we could separate that out to
>> pl_reserved_kwlist.h and reuse the above tooling to generate
>> pl_reserved_kwlist_{offset,string}.h, but that's more complex.)
>
> plpgsql isn't as stable as all that: people propose new syntax for it
> all the time. I do not think a hand-maintained array would be pleasant
> at all.

Okay.

> Also, wouldn't we also adopt this technology for its unreserved keywords,
> too?

We wouldn't be forced to, but there might be other reasons to do so.
Were you thinking of code consistency (within pl_scanner.c or
globally)? Or something else?

If we did adopt this setup for plpgsql unreserved keywords,
ecpg/preproc/ecpg_keywords.c and ecpg/preproc/c_keywords.c would be
left using the current ScanKeyword struct for search. Using offset
search for all 5 types of keywords would be globally consistent, but it
also means additional headers, generated headers, and makefile rules.

-John Naylor
John Naylor <jcnaylor@gmail.com> writes:
> On 12/17/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also, wouldn't we also adopt this technology for its unreserved keywords,
>> too?

> We wouldn't be forced to, but there might be other reasons to do so.
> Were you thinking of code consistency (within pl_scanner.c or
> globally)? Or something else?

> If we did adopt this setup for plpgsql unreserved keywords,
> ecpg/preproc/ecpg_keywords.c and ecpg/preproc/c_keywords.c would be
> left using the current ScanKeyword struct for search. Using offset
> search for all 5 types of keywords would be globally consistent, but
> it also means additional headers, generated headers, and makefile
> rules.

I'd be kind of inclined to convert all uses of ScanKeyword to the new
way, if only for consistency's sake. On the other hand, I'm not the one
volunteering to do the work.

			regards, tom lane
On 12/18/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd be kind of inclined to convert all uses of ScanKeyword to the new way,
> if only for consistency's sake. On the other hand, I'm not the one
> volunteering to do the work.

That's reasonable, as long as the design is nailed down first. Along
those lines, attached is a heavily WIP patch that only touches plpgsql
unreserved keywords, to test out the new methodology in a limited area.
After settling APIs and name/directory bikeshedding, I'll move on to
the other four keyword types.

There's a new Perl script, src/common/gen_keywords.pl, which takes
pl_unreserved_kwlist.h as input and outputs
pl_unreserved_kwlist_offset.h and pl_unreserved_kwlist_string.h. The
output headers are not installed or symlinked anywhere. Since the input
keyword lists will never be #included directly, they might be better as
.txt files, like errcodes.txt. If we went that far, we might also
remove the PG_KEYWORD macros (they'd still be in the output files) and
rename parser/kwlist.h to common/core_kwlist.txt. There's also a case
for not changing things unnecessarily, especially if there's ever a new
reason to include the base keyword list directly.

To keep the other keyword types functional, I had to add a separate new
struct ScanKeywordOffset and new function ScanKeywordLookupOffset(), so
the patch is a bit messier than the final will be. With a 4-byte
offset, ScanKeywordOffset is 8 bytes, down from 12, and is now a power
of 2.

I used the global .gitignore, but maybe that's an abuse of it.

"make check" passes, but I don't know how well it stresses keyword use.
I'll create a commitfest entry soon.

-John Naylor
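The generated output itself isn't shown in the thread, so here is a
guess at its shape: the offset header would keep the PG_KEYWORD entries
but with name strings replaced by offsets, and the string header would
hold the concatenated names. File contents and symbol names below are
hypothetical:

    /* pl_unreserved_kwlist_offset.h (hypothetical generated output) */
    PG_KEYWORD(0, K_ABSOLUTE, UNRESERVED_KEYWORD)
    PG_KEYWORD(9, K_ALIAS, UNRESERVED_KEYWORD)
    ...

    /* pl_unreserved_kwlist_string.h (hypothetical generated output) */
    static const char pl_unreserved_kw_string[] =
        "absolute\0"
        "alias";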
>>>>> "John" == John Naylor <jcnaylor@gmail.com> writes: > On 12/18/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'd be kind of inclined to convert all uses of ScanKeyword to the >> new way, if only for consistency's sake. On the other hand, I'm not >> the one volunteering to do the work. John> That's reasonable, as long as the design is nailed down first. John> Along those lines, attached is a heavily WIP patch that only John> touches plpgsql unreserved keywords, to test out the new John> methodology in a limited area. After settling APIs and John> name/directory bikeshedding, I'll move on to the other four John> keyword types. Is there any particular reason not to go further and use a perfect hash function for the lookup, rather than binary search? -- Andrew (irc:RhodiumToad)
Hi,

On 2018-12-20 00:54:39 +0000, Andrew Gierth wrote:
> >>>>> "John" == John Naylor <jcnaylor@gmail.com> writes:
>
> > On 12/18/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I'd be kind of inclined to convert all uses of ScanKeyword to the
> >> new way, if only for consistency's sake. On the other hand, I'm not
> >> the one volunteering to do the work.
>
> John> That's reasonable, as long as the design is nailed down first.
> John> Along those lines, attached is a heavily WIP patch that only
> John> touches plpgsql unreserved keywords, to test out the new
> John> methodology in a limited area. After settling APIs and
> John> name/directory bikeshedding, I'll move on to the other four
> John> keyword types.
>
> Is there any particular reason not to go further and use a perfect hash
> function for the lookup, rather than binary search?

The last time I looked into perfect hash functions, it wasn't easy to
find a generator that competed with a decent normal hashtable (in
particular gperf's are very unconvincing). The added tooling is a
concern imo. OTOH, we're comparing not with a hashtable, but a binary
search, where the latter will usually lose. Wonder if we shouldn't
generate a serialized non-perfect hashtable instead. The lookup code
for a read-only hashtable without concern for adversarial input is
pretty trivial.

Greetings,

Andres Freund
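As a sketch of how trivial such a serialized read-only table's lookup
side could be, reusing kw_string/kw_offsets from the earlier sketch:
the hash function, table size, and the generator that would fill
kw_hashtab are all assumptions here, not anything posted to the thread.

    #include <string.h>

    #define KW_TABLE_SIZE 512   /* generator picks this; > number of keywords */

    /* Illustrative hash (FNV-1a); a real generator might choose differently. */
    static unsigned int
    kw_hash(const char *text)
    {
        unsigned int h = 2166136261u;

        for (; *text; text++)
            h = (h ^ (unsigned char) *text) * 16777619u;
        return h;
    }

    /* Emitted by the generator: keyword indexes, -1 for empty slots. */
    extern const short kw_hashtab[KW_TABLE_SIZE];

    static int
    kw_hash_lookup(const char *text)
    {
        unsigned int h = kw_hash(text) % KW_TABLE_SIZE;

        /* Linear probing; the generator leaves enough empty slots. */
        for (;;)
        {
            short   kwnum = kw_hashtab[h];

            if (kwnum < 0)
                return -1;      /* hit an empty slot: not a keyword */
            if (strcmp(text, kw_string + kw_offsets[kwnum]) == 0)
                return kwnum;
            h = (h + 1) % KW_TABLE_SIZE;
        }
    }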
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> Is there any particular reason not to go further and use a perfect hash
> function for the lookup, rather than binary search?

Tooling? I seem to recall having looked at gperf and deciding that it
pretty much sucked, so it's not real clear to me what we would use.

			regards, tom lane
On 12/19/18, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> Is there any particular reason not to go further and use a perfect hash
> function for the lookup, rather than binary search?

When I was investigating faster algorithms, I ruled out gperf based on
discussions in the archives. The approach here has modest goals and
shouldn't be too invasive. With the makefile support and separate
keyword files in place, that'll be one less thing to do if we ever
decide to replace binary search. The giant string will likely be useful
as well.

Since we're on the subject, I think some kind of trie would be ideal
performance-wise, but a large amount of work. The nice thing about a
trie is that it can be faster than a hash table for a key miss. I found
a paper that described some space-efficient trie variations [1], but
we'd likely have to code the algorithm and a way to emit a C code
representation of it. I've found some libraries, but that would have
more of the same difficulties in practicality that gperf had.

[1] https://infoscience.epfl.ch/record/64394/files/triesearches.pdf

-John Naylor
John Naylor <jcnaylor@gmail.com> writes:
> On 12/18/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'd be kind of inclined to convert all uses of ScanKeyword to the new way,
>> if only for consistency's sake. On the other hand, I'm not the one
>> volunteering to do the work.

> That's reasonable, as long as the design is nailed down first. Along
> those lines, attached is a heavily WIP patch that only touches plpgsql
> unreserved keywords, to test out the new methodology in a limited
> area. After settling APIs and name/directory bikeshedding, I'll move
> on to the other four keyword types.

Let the bikeshedding begin ...

> There's a new Perl script, src/common/gen_keywords.pl,

I'd be inclined to put the script in src/tools, I think. IMO src/common
is for code that actually gets built into our executables.

> which takes
> pl_unreserved_kwlist.h as input and outputs
> pl_unreserved_kwlist_offset.h and pl_unreserved_kwlist_string.h.

I wonder whether we'd not be better off producing just one output file,
in which we have the offsets emitted as PG_KEYWORD macros and then the
giant string emitted as a macro definition, ie something like

#define PG_KEYWORD_STRING \
	"absolute\0" \
	"alias\0" \
	...

That simplifies the Makefile-hacking, at least, and it possibly gives
callers more flexibility about what they actually want to do with the
string.

> The output headers are not installed or symlinked anywhere. Since the
> input keyword lists will never be #included directly, they might be
> better as .txt files, like errcodes.txt. If we went that far, we might
> also remove the PG_KEYWORD macros (they'd still be in the output
> files) and rename parser/kwlist.h to common/core_kwlist.txt. There's
> also a case for not changing things unnecessarily, especially if
> there's ever a new reason to include the base keyword list directly.

I'm for "not change things unnecessarily". People might well be
scraping the keyword list out of parser/kwlist.h for other purposes
right now --- indeed, it's defined the way it is exactly to let people
do that. I don't see a good reason to force them to redo whatever
tooling they have that depends on that. So let's build kwlist_offsets.h
alongside that, but not change kwlist.h itself.

> To keep the other keyword types functional, I had to add a separate
> new struct ScanKeywordOffset and new function
> ScanKeywordLookupOffset(), so the patch is a bit messier than the
> final will be.

Check.

> I used the global .gitignore, but maybe that's an abuse of it.

Yeah, I'd say it is.

> +# TODO: Error out if the keyword names are not in ASCII order.

+many for including such a check.

Also note that we don't require people to have Perl installed when
building from a tarball. Therefore, these derived headers must get
built during "make distprep" and removed by maintainer-clean but not
distclean. I think this also has some implications for VPATH builds,
but as long as you follow the pattern used for other derived header
files (e.g. fmgroids.h), you should be fine.

			regards, tom lane
On 12/20/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd be inclined to put the script in src/tools, I think. IMO src/common
> is for code that actually gets built into our executables.

Done.

>> which takes
>> pl_unreserved_kwlist.h as input and outputs
>> pl_unreserved_kwlist_offset.h and pl_unreserved_kwlist_string.h.
>
> I wonder whether we'd not be better off producing just one output
> file, in which we have the offsets emitted as PG_KEYWORD macros
> and then the giant string emitted as a macro definition, ie
> something like
>
> #define PG_KEYWORD_STRING \
>	"absolute\0" \
>	"alias\0" \
>	...
>
> That simplifies the Makefile-hacking, at least, and it possibly gives
> callers more flexibility about what they actually want to do with the
> string.

Okay, I tried that. Since the script is turning one header into
another, I borrowed the "*_d.h" nomenclature from the catalogs. Using a
single file required some #ifdef hacks in the output file. Maybe
there's a cleaner way to do this, but I don't know what it is.

Using a single file also gave me another idea: Take value and category
out of ScanKeyword, and replace them with an index into another array
containing those, which will only be accessed in the event of a hit.
That would shrink ScanKeyword to 4 bytes (offset, index), further
increasing locality of reference. Might not be worth it, but I can try
it after moving on to the core scanner.

> I'm for "not change things unnecessarily". People might well be
> scraping the keyword list out of parser/kwlist.h for other purposes
> right now --- indeed, it's defined the way it is exactly to let
> people do that. I don't see a good reason to force them to redo
> whatever tooling they have that depends on that. So let's build
> kwlist_offsets.h alongside that, but not change kwlist.h itself.

Done.

>> I used the global .gitignore, but maybe that's an abuse of it.
>
> Yeah, I'd say it is.

Moved.

>> +# TODO: Error out if the keyword names are not in ASCII order.
>
> +many for including such a check.

Done.

> Also note that we don't require people to have Perl installed when
> building from a tarball. Therefore, these derived headers must get
> built during "make distprep" and removed by maintainer-clean but
> not distclean. I think this also has some implications for VPATH
> builds, but as long as you follow the pattern used for other
> derived header files (e.g. fmgroids.h), you should be fine.

Done. I also blindly added support for MSVC.

-John Naylor
John Naylor <jcnaylor@gmail.com> writes:
> Using a single file also gave me another idea: Take value and category
> out of ScanKeyword, and replace them with an index into another array
> containing those, which will only be accessed in the event of a hit.
> That would shrink ScanKeyword to 4 bytes (offset, index), further
> increasing locality of reference. Might not be worth it, but I can try
> it after moving on to the core scanner.

I like that idea a *lot*, actually, because it offers the opportunity
to decouple this mechanism from all assumptions about what the
auxiliary data for a keyword is. Basically, we'd redefine
ScanKeywordLookup as having the API "given a string, return a keyword
index if it is a keyword, -1 if it isn't"; then the caller would use
the keyword index to look up the auxiliary data in a table that it
owns, and ScanKeywordLookup doesn't know about at all.

So that leads to a design like this: the master data is in a header
that's just like kwlist.h is today, except now we are thinking of
PG_KEYWORD as an N-argument macro, not necessarily exactly 3 arguments.
The Perl script reads that, paying attention only to the first argument
of the macro calls, and outputs a file containing, say,

static const uint16 kw_offsets[] = { 0, 6, 15, ... };

static const char kw_strings[] =
	"abort\0"
	"absolute\0"
	...
	;

(it'd be a good idea to have a switch that allows specifying the prefix
of these constant names). Then ScanKeywordLookup has the signature

int ScanKeywordLookup(const char *string_to_lookup,
                      const char *kw_strings,
                      const uint16 *kw_offsets,
                      int num_keywords);

and a file using this stuff looks something like

/* Payload data for keywords */
typedef struct MyKeyword
{
	int16		value;
	int16		category;
} MyKeyword;

#define PG_KEYWORD(kwname, value, category) {value, category},

static const MyKeyword MyKeywords[] = {
#include "kwlist.h"
};

/* String lookup table for keywords */
#include "kwlist_d.h"

/* Lookup code looks about like this: */

	kwnum = ScanKeywordLookup(str,
	                          kw_strings,
	                          kw_offsets,
	                          lengthof(kw_offsets));
	if (kwnum >= 0)
		... look into MyKeywords[kwnum] for info ...

Aside from being arguably better from the locality-of-reference
standpoint, this gets us out of the weird ifdef'ing you've got in the
v2 patch. The kwlist_d.h headers can be very ordinary headers.

			regards, tom lane
Hi,

On 2018-12-22 12:20:00 -0500, Tom Lane wrote:
> John Naylor <jcnaylor@gmail.com> writes:
> > Using a single file also gave me another idea: Take value and category
> > out of ScanKeyword, and replace them with an index into another array
> > containing those, which will only be accessed in the event of a hit.
> > That would shrink ScanKeyword to 4 bytes (offset, index), further
> > increasing locality of reference. Might not be worth it, but I can try
> > it after moving on to the core scanner.
>
> I like that idea a *lot*, actually, because it offers the opportunity
> to decouple this mechanism from all assumptions about what the
> auxiliary data for a keyword is.

OTOH, it doubles or triples the number of cachelines accessed when
encountering a keyword. The fraction of keywords to not-keywords in SQL
makes me wonder whether that makes it a good deal.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2018-12-22 12:20:00 -0500, Tom Lane wrote:
>> I like that idea a *lot*, actually, because it offers the opportunity
>> to decouple this mechanism from all assumptions about what the
>> auxiliary data for a keyword is.

> OTOH, it doubles or triples the number of cachelines accessed when
> encountering a keyword.

Compared to what? The current situation in that regard is a mess. Also,
AFAICS this proposal involves the least amount of data touched during
the lookup phase of anything we've discussed, so I do not even accept
that your criticism is correct. One extra cacheline fetch to get the
aux data for a particular keyword after the search is not going to tip
the scales away from this being a win.

			regards, tom lane
On 12/22/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> John Naylor <jcnaylor@gmail.com> writes:
>> Using a single file also gave me another idea: Take value and category
>> out of ScanKeyword, and replace them with an index into another array
>> containing those, which will only be accessed in the event of a hit.
>> That would shrink ScanKeyword to 4 bytes (offset, index), further
>> increasing locality of reference. Might not be worth it, but I can try
>> it after moving on to the core scanner.
>
> I like that idea a *lot*, actually, because it offers the opportunity
> to decouple this mechanism from all assumptions about what the
> auxiliary data for a keyword is.

Okay, in that case I went ahead and did it for WIP v3.

> (it'd be a good idea to have a switch that allows specifying the
> prefix of these constant names).

Done as an optional switch, and tested, but not yet used in favor of
the previous method as a fallback. I'll probably do it in the final
version to keep lines below 80, and to add 'core_' to the core keyword
vars.

> /* Payload data for keywords */
> typedef struct MyKeyword
> {
>	int16		value;
>	int16		category;
> } MyKeyword;

I tweaked this a bit to

typedef struct ScanKeywordAux
{
	int16		value;			/* grammar's token code */
	char		category;		/* see codes above */
} ScanKeywordAux;

It seems that category was only 2 bytes to make ScanKeyword a power of
2 (of course that was on 32-bit machines and doesn't hold true
anymore). Using char will save another few hundred bytes in the core
scanner. Since we're only accessing this once per identifier, we may
not need to worry so much about memory alignment.

> Aside from being arguably better from the locality-of-reference
> standpoint, this gets us out of the weird ifdef'ing you've got in
> the v2 patch. The kwlist_d.h headers can be very ordinary headers.

Yeah, that's a nice (and for me unexpected) bonus.

-John Naylor
On Wed, Dec 19, 2018 at 8:01 PM Andres Freund <andres@anarazel.de> wrote:
> The last time I looked into perfect hash functions, it wasn't easy to
> find a generator that competed with a decent normal hashtable (in
> particular gperf's are very unconvincing). The added tooling is a
> concern imo. OTOH, we're comparing not with a hashtable, but a binary
> search, where the latter will usually lose. Wonder if we shouldn't
> generate a serialized non-perfect hashtable instead. The lookup code for
> a read-only hashtable without concern for adversarial input is pretty
> trivial.

I wonder if we could do something really simple like a lookup based on
the first character of the scan keyword. It looks to me like there are
440 keywords right now, and the most common starting letter is 'c',
which is the first letter of 51 keywords. So dispatching based on the
first letter clips at least 3 steps off the binary search. I don't know
whether that's enough to be worthwhile, but it's probably pretty simple
to implement.

I'm not sure that I understand quite what you have in mind for a
serialized non-perfect hashtable. Are you thinking that we'd just
construct a simplehash and serialize it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Dec 19, 2018 at 8:01 PM Andres Freund <andres@anarazel.de> wrote:
>> The last time I looked into perfect hash functions, it wasn't easy to
>> find a generator that competed with a decent normal hashtable (in
>> particular gperf's are very unconvincing). The added tooling is a
>> concern imo. OTOH, we're comparing not with a hashtable, but a binary
>> search, where the latter will usually lose. Wonder if we shouldn't
>> generate a serialized non-perfect hashtable instead. The lookup code for
>> a read-only hashtable without concern for adversarial input is pretty
>> trivial.

> I wonder if we could do something really simple like a lookup based on
> the first character of the scan keyword. It looks to me like there are
> 440 keywords right now, and the most common starting letter is 'c',
> which is the first letter of 51 keywords. So dispatching based on the
> first letter clips at least 3 steps off the binary search. I don't
> know whether that's enough to be worthwhile, but it's probably pretty
> simple to implement.

I think there's a lot of goalpost-moving going on here. The original
idea was to trim the physical size of the data structure, as stated in
the thread subject, and just reap whatever cache benefits we got along
the way from that. I am dubious that we actually have any performance
problem in this code that needs a big dollop of added complexity to
fix.

In my hands, the only part of the low-level parsing code that commonly
shows up as interesting in profiles is the Bison engine. That's
probably because the grammar tables are circa half a megabyte and blow
out cache pretty badly :-(. I don't know of any way to make that
better, unfortunately. I suspect that it's just going to get worse,
because people keep submitting additions to the grammar.

			regards, tom lane
On Wed, Dec 26, 2018 at 11:22:39AM -0500, Tom Lane wrote:
> In my hands, the only part of the low-level parsing code that
> commonly shows up as interesting in profiles is the Bison engine.

Should we be considering others? As I understand it, steps have been
made in this field since yacc was originally designed. Is LALR actually
suitable for languages like SQL, or is it just there for historical
reasons?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Wed, Dec 26, 2018 at 11:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think there's a lot of goalpost-moving going on here. The original
> idea was to trim the physical size of the data structure, as stated
> in the thread subject, and just reap whatever cache benefits we got
> along the way from that. I am dubious that we actually have any
> performance problem in this code that needs a big dollop of added
> complexity to fix.

I have seen ScanKeywordLookup show up in profiles quite often and
fairly prominently -- like several percent of total runtime. I'm not
trying to impose requirements on John's patch, and I agree that
reducing the physical size of the structure is a good step whether
anything else is done or not. However, I don't see that as a reason to
shut down further discussion of other possible improvements. If his
patch makes this disappear from profiles, cool, but if it doesn't, then
sooner or later somebody's going to want to do more.

FWIW, my bet is this helps but isn't enough to get rid of the problem
completely. A 9-step binary search has got to be slower than a really
well-optimized hash table lookup. In a perfect world the latter touches
the cache line containing the keyword -- which presumably is already in
cache since we just scanned it -- then computes a hash value without
touching any other cache lines -- and then goes straight to the right
entry. So it touches ONE new cache line. That might be a level of
optimization that's hard to achieve in practice, but I don't think it's
crazy to want to get there.

> In my hands, the only part of the low-level parsing code that commonly
> shows up as interesting in profiles is the Bison engine. That's probably
> because the grammar tables are circa half a megabyte and blow out cache
> pretty badly :-(. I don't know of any way to make that better,
> unfortunately. I suspect that it's just going to get worse, because
> people keep submitting additions to the grammar.

I'm kinda surprised that you haven't seen ScanKeywordLookup() in there,
but I agree with you that the size of the main parser tables is a real
issue, and that there's no easy solution. At various times there has
been discussion of using some other parser generator, and I've also
toyed with the idea of writing one specifically for PostgreSQL.
Unfortunately, it seems like bison is all but unmaintained; the
alternatives are immature and have limited adoption and limited
community; and writing something from scratch is a ton of work. :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
David Fetter <david@fetter.org> writes:
> On Wed, Dec 26, 2018 at 11:22:39AM -0500, Tom Lane wrote:
>> In my hands, the only part of the low-level parsing code that
>> commonly shows up as interesting in profiles is the Bison engine.

> Should we be considering others?

We've looked around before, IIRC, and not really seen any arguably
better tools.

			regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> I'm kinda surprised that you haven't seen ScanKeywordLookup() in
> there, but I agree with you that the size of the main parser tables is
> a real issue, and that there's no easy solution. At various times
> there has been discussion of using some other parser generator, and
> I've also toyed with the idea of writing one specifically for
> PostgreSQL. Unfortunately, it seems like bison is all but
> unmaintained; the alternatives are immature and have limited adoption
> and limited community; and writing something from scratch is a ton of
> work. :-(

Yeah, and also: SQL is a damn big and messy language, and so it's not
very clear that it's really bison's fault that it's slow to parse. We
might do a ton of work to implement an alternative, and then find
ourselves no better off.

			regards, tom lane
Hi,

On 2018-12-26 11:50:18 -0500, Robert Haas wrote:
> On Wed, Dec 26, 2018 at 11:22 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I think there's a lot of goalpost-moving going on here. The original
> > idea was to trim the physical size of the data structure, as stated
> > in the thread subject, and just reap whatever cache benefits we got
> > along the way from that. I am dubious that we actually have any
> > performance problem in this code that needs a big dollop of added
> > complexity to fix.
>
> I have seen ScanKeywordLookup show up in profiles quite often and
> fairly prominently -- like several percent of total runtime. I'm not
> trying to impose requirements on John's patch, and I agree that
> reducing the physical size of the structure is a good step whether
> anything else is done or not. However, I don't see that as a reason to
> shut down further discussion of other possible improvements. If his
> patch makes this disappear from profiles, cool, but if it doesn't,
> then sooner or later somebody's going to want to do more.

I agree. And most of the patch would be a pre-requisite for anything
more elaborate anyway.

> FWIW, my bet is this helps but isn't enough to get rid of the problem
> completely. A 9-step binary search has got to be slower than a really
> well-optimized hash table lookup.

Yea, at least with a non-optimized layout. If we'd used a binary search
optimized lookup order it might be different, but probably at best
equivalent to a good hashtable.

> > In my hands, the only part of the low-level parsing code that commonly
> > shows up as interesting in profiles is the Bison engine. That's probably
> > because the grammar tables are circa half a megabyte and blow out cache
> > pretty badly :-(. I don't know of any way to make that better,
> > unfortunately. I suspect that it's just going to get worse, because
> > people keep submitting additions to the grammar.
>
> I'm kinda surprised that you haven't seen ScanKeywordLookup() in
> there, but I agree with you that the size of the main parser tables is
> a real issue, and that there's no easy solution. At various times
> there has been discussion of using some other parser generator, and
> I've also toyed with the idea of writing one specifically for
> PostgreSQL. Unfortunately, it seems like bison is all but
> unmaintained; the alternatives are immature and have limited adoption
> and limited community; and writing something from scratch is a ton of
> work. :-(

My bet is, and has been for quite a while, that we'll have to go for a
hand-written recursive descent type parser. They can be *substantially*
faster, and performance isn't as affected by the grammar size. And,
about as important, they also allow for a lot more heuristics around
grammar errors - I do think we'll soon have to do better than to throw
a generic syntax error for the cases where the grammar doesn't match at
all.

Greetings,

Andres Freund
Hi,

On 2018-12-26 10:45:11 -0500, Robert Haas wrote:
> I'm not sure that I understand quite what you have in mind for a
> serialized non-perfect hashtable. Are you thinking that we'd just
> construct a simplehash and serialize it?

I was basically thinking that we'd have the perl script implement a
simple hash and put the keyword (pointers) into an array, handling
conflicts with the simplest linear probing thinkable. As there's never
a need for modifications, that ought to be fairly simple.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> My bet is, and has been for quite a while, that we'll have to go for a
> hand-written recursive descent type parser.

I will state right up front that that will happen over my dead body.

It's impossible to write correct RD parsers by hand for any but the
most trivial, conflict-free languages, and what we have got to deal
with is certainly neither of those; moreover, it's a constantly moving
target. We'd be buying into an endless landscape of parser bugs if we
go that way. It's *not* worth it.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2018-12-26 10:45:11 -0500, Robert Haas wrote:
>> I'm not sure that I understand quite what you have in mind for a
>> serialized non-perfect hashtable. Are you thinking that we'd just
>> construct a simplehash and serialize it?

> I was basically thinking that we'd have the perl script implement a
> simple hash and put the keyword (pointers) into an array, handling
> conflicts with the simplest linear probing thinkable. As there's never a
> need for modifications, that ought to be fairly simple.

I think it was Knuth who said that when you use hashing, you are
putting a great deal of faith in the average case, because the worst
case is terrible. The applicability of that to this problem is that if
you hit a bad case (say, a long collision chain affecting some common
keywords) you could end up with poor performance that affects a lot of
people for a long time. And our keyword list is not so static that you
could prove once that the behavior is OK and then forget about it. So
I'm suspicious of proposals to use simplistic hashing here.

There might well be some value in Robert's idea of keying off the first
letter to get rid of the first few binary-search steps, not least
because those steps are particularly terrible from a cache-footprint
perspective. I'm not sold on doing anything significantly more invasive
than that.

			regards, tom lane
On 12/26/18, Robert Haas <robertmhaas@gmail.com> wrote:
> I wonder if we could do something really simple like a lookup based on
> the first character of the scan keyword. It looks to me like there are
> 440 keywords right now, and the most common starting letter is 'c',
> which is the first letter of 51 keywords. So dispatching based on the
> first letter clips at least 3 steps off the binary search. I don't
> know whether that's enough to be worthwhile, but it's probably pretty
> simple to implement.

Using radix tree structures for the top couple of node levels is a
known technique to optimize tries that need to be more space-efficient
at lower levels, so this has precedent. In this case there would be a
space trade-off of

	(alphabet size, rounded up)
	  * (size of index to lower boundary + size of index to upper boundary)
	= 32 * (2 + 2) = 128 bytes

which is pretty small compared to what we'll save by offset-based
lookup. On average, there'd be 4.1 binary search steps, which is nice.

I agree it'd be fairly simple to do, and might raise the bar for doing
anything more complex.

-John Naylor
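As a concrete sketch of that per-letter layout, reusing kw_string and
kw_offsets from the first sketch: the struct and field names are
hypothetical, and this assumes the input has already been downcased, as
the scanner does before keyword lookup.

    /* Per-letter slice of the keyword array; 4 bytes per letter. */
    typedef struct KeywordRange
    {
        unsigned short low;     /* index of first keyword for this letter */
        unsigned short high;    /* index of last keyword, or low - 1 if none */
    } KeywordRange;

    /* Emitted by the generator, one entry per letter 'a'..'z'. */
    extern const KeywordRange kw_ranges[26];

    static int
    kw_lookup_ranged(const char *text)
    {
        unsigned char c = (unsigned char) text[0];
        int     low,
                high;

        if (c < 'a' || c > 'z')
            return -1;          /* assumes every keyword starts with a letter */
        low = kw_ranges[c - 'a'].low;
        high = kw_ranges[c - 'a'].high;

        /* binary search as before, but only over this letter's slice */
        while (low <= high)
        {
            int     middle = low + (high - low) / 2;
            int     cmp = strcmp(text, kw_string + kw_offsets[middle]);

            if (cmp == 0)
                return middle;
            else if (cmp < 0)
                high = middle - 1;
            else
                low = middle + 1;
        }
        return -1;
    }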
Hi,

On 2018-12-26 14:03:57 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > My bet is, and has been for quite a while, that we'll have to go for a
> > hand-written recursive descent type parser.
>
> I will state right up front that that will happen over my dead body.
>
> It's impossible to write correct RD parsers by hand for any but the most
> trivial, conflict-free languages, and what we have got to deal with
> is certainly neither of those; moreover, it's a constantly moving target.
> We'd be buying into an endless landscape of parser bugs if we go that way.
> It's *not* worth it.

It's not exactly new that people end up moving from bison to recursive
descent parsers once they hit the performance problems and want to give
better error messages. E.g. both gcc and clang have hand-written
recursive-descent parsers for C and C++ these days. I don't buy that
we're unable to write a descent parser that way.

What I *do* buy is that it's more problematic for the design of our SQL
dialect, because the use of bison often uncovers ambiguities in new
extensions of the language. And I don't really have a good idea how to
handle that.

Greetings,

Andres Freund
On 12/26/18, John Naylor <jcnaylor@gmail.com> wrote:
> On 12/26/18, Robert Haas <robertmhaas@gmail.com> wrote:
>> I wonder if we could do something really simple like a lookup based on
>> the first character of the scan keyword. It looks to me like there are
>> 440 keywords right now, and the most common starting letter is 'c',
>> which is the first letter of 51 keywords. So dispatching based on the
>> first letter clips at least 3 steps off the binary search. I don't
>> know whether that's enough to be worthwhile, but it's probably pretty
>> simple to implement.

> I agree it'd be fairly simple to do, and might raise the bar for doing
> anything more complex.

I went ahead and did this for v4, but split out into a separate patch.

In addition, I used a heuristic to bypass binary search for the most
common keywords. Normally, the middle value is computed mathematically,
but I found that in each range of keywords beginning with the same
letter, there is often 1 or 2 common keywords that are good first
guesses, such as select, from, join, limit, where. I taught the lookup
to try those first, and then compute subsequent steps the usual way.

Barring additional bikeshedding on 0001, I'll plan on implementing
offset-based lookup for the other keyword types and retire the old
ScanKeyword. Once that's done, we can benchmark and compare with the
optimizations in 0002.

-John Naylor
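A minimal sketch of that seeded-first-probe heuristic, extending the
hypothetical KeywordRange from the earlier sketch with one extra field;
again, this is an illustration of the idea, not the patch's code:

    typedef struct KeywordRange
    {
        unsigned short low;
        unsigned short high;
        unsigned short first_probe; /* index of a frequently-seen keyword;
                                     * must lie within [low, high] */
    } KeywordRange;

    static int
    kw_lookup_seeded(const char *text, const KeywordRange *r)
    {
        int     low = r->low;
        int     high = r->high;
        int     middle = r->first_probe;    /* seeded first guess */

        while (low <= high)
        {
            int     cmp = strcmp(text, kw_string + kw_offsets[middle]);

            if (cmp == 0)
                return middle;
            else if (cmp < 0)
                high = middle - 1;
            else
                low = middle + 1;
            middle = low + (high - low) / 2;    /* later steps bisect normally */
        }
        return -1;
    }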
Andres Freund <andres@anarazel.de> writes:
> On 2018-12-26 14:03:57 -0500, Tom Lane wrote:
>> It's impossible to write correct RD parsers by hand for any but the most
>> trivial, conflict-free languages, and what we have got to deal with
>> is certainly neither of those; moreover, it's a constantly moving target.
>> We'd be buying into an endless landscape of parser bugs if we go that way.
>> It's *not* worth it.

> It's not exactly new that people end up moving from bison to recursive
> descent parsers once they hit the performance problems and want to give
> better error messages. E.g. both gcc and clang have hand-written
> recursive-descent parsers for C and C++ these days.

Note that they are dealing with fixed language definitions.
Furthermore, there's no need to worry about whether that code has to be
hacked on by less-than-expert people. Neither condition applies to us.

The thing that most concerns me about not using a grammar tool of some
sort is that with handwritten RD, it's very easy to get into situations
where you've "defined" (well, implemented, because you never did have a
formal definition) a language that is ambiguous, admitting of more than
one valid parse interpretation. You won't find out until someone files
a bug report complaining that some apparently-valid statement isn't
doing what they expect. At that point you are in a world of hurt,
because it's too late to fix it without changing the language
definition and thus creating user-visible compatibility breakage.

Now bison isn't perfect in this regard, because you can shoot yourself
in the foot with ill-considered precedence specifications (and we've
done so ;-(), but it is light-years more likely to detect ambiguous
grammar up-front than any handwritten parser logic is. If we had a tool
that proved a BNF grammar non-ambiguous and then wrote an RD parser for
it, that'd be fine with me --- but we need a tool, not somebody
claiming he can write an error-free RD parser for an arbitrary
language. My position is that anyone claiming that is just plain
deluded.

I also do not buy your unsupported-by-any-evidence claim that the error
reports would be better. I've worked on RD parsers in the past, and
they're not really better, at least not without expending enormous
amounts of effort --- and run-time cycles --- specifically on the error
reporting aspect. Again, I don't see that happening for us.

> I don't buy that we're unable to write a descent parser that way.

I do not think that we could write one for the current state of the PG
grammar without an investment of effort so large that it's not going to
happen. Even if such a parser were to spring fully armed from
somebody's forehead, we absolutely cannot expect that it would continue
to work correctly after non-wizard contributors modify it.

			regards, tom lane
On 12/27/18 12:12 PM, Tom Lane wrote:
>> I don't buy that we're unable to write a descent parser that way.
> I do not think that we could write one for the current state of the
> PG grammar without an investment of effort so large that it's not
> going to happen. Even if such a parser were to spring fully armed
> from somebody's forehead, we absolutely cannot expect that it would
> continue to work correctly after non-wizard contributors modify it.

I just did a quick survey of generator tools. Unfortunately, the best
candidate alternative (ANTLR) no longer supports generating plain C
code. I don't know of another tool that is well maintained, supports C,
and generates top-down parsers. Twenty-five years ago or so I wrote a
top-down table-driven parser generator, but that was in another
country, and besides, the wench is dead.

There are well-known techniques (see Section 4.4 of the Dragon Book, if
you have a copy) for formal analysis of grammars to determine
predictive parser action. They aren't hard, and the tables they produce
are typically much smaller than those used for LALR parsers. Still,
probably not for the faint of heart.

The tools that have moved to using hand-cut RD parsers have done so
precisely because they get a significant performance benefit from doing
so. RD parsers are not terribly hard to write. Yes, the JSON grammar is
tiny, but I think I wrote the basics of the RD parser we use for JSON
in about an hour. I think arguing that our hacker base is not competent
to maintain such a thing for the SQL grammar is wrong. We successfully
maintain vastly more complex pieces of code.

Having said all that, I don't intend to spend any time on implementing
an alternative parser. It would as you say involve a heck of a lot of
time, which I don't have. It would be a fine academic research project
for some student.

A smaller project might be to see if we can replace the binary keyword
search in ScanKeyword with a perfect hashing function generated by
gperf, or something similar. I had a quick look at that, too.
Unfortunately the smallest hash table I could generate for our 440
symbols had 1815 entries, so I'm not sure how well that would work.
Worth investigating, though.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
John Naylor <jcnaylor@gmail.com> writes:
> Barring additional bikeshedding on 0001, I'll plan on implementing
> offset-based lookup for the other keyword types and retire the old
> ScanKeyword. Once that's done, we can benchmark and compare with the
> optimizations in 0002.

Sounds like a plan. Assorted minor bikeshedding on v4-0001 (just from
eyeballing it, I didn't test it):

+/* Like ScanKeywordLookup, but uses offsets into a keyword string. */
+int
+ScanKeywordLookupOffset(const char *string_to_lookup,
+                        const char *kw_strings,

Not really "like" it, since the return value is totally different and
so is the representation of the keyword list. I realize that your plan
is probably to get rid of ScanKeywordLookup and then adapt the latter's
comment for this code, but don't forget that you need to adjust said
comment.

+/* Payload data for keywords */
+typedef struct ScanKeywordAux
+{
+	int16		value;			/* grammar's token code */
+	char		category;		/* see codes above */
+} ScanKeywordAux;

There isn't really any point in changing category to "char", because
alignment considerations will mandate that sizeof(ScanKeywordAux) be a
multiple of 2 anyway. With some compilers we could get around that with
a pragma to force non-aligned storage, but doing so would be a net loss
on most non-Intel architectures.

If you really are hot about saving that other 440 bytes, the way to do
it would be to drop the struct entirely and use two parallel arrays, an
int16[] for value and a char[] (or better uint8[]) for category. Those
would be filled by reading kwlist.h twice with different definitions
for PG_KEYWORD. Not sure it's worth the trouble though --- in
particular, not clear that it's a win from the standpoint of number of
cache lines touched.

diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore
@@ -1,3 +1,4 @@
+/*kwlist_d.h

Not a fan of using wildcards in .gitignore files, at least not when
there's just one or two files you intend to match.

 # Force these dependencies to be known even without dependency info built:
-pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h
+pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h pl_unreserved_kwlist_d.h

Hm, do we really need any more than pl_scanner.o to depend on that
header?

+/* FIXME: Have to redefine this symbol for the WIP. */
+#undef PG_KEYWORD
+#define PG_KEYWORD(kwname, value, category) {value, category},
+
+static const ScanKeywordAux unreserved_keywords[] = {
+#include "pl_unreserved_kwlist.h"
 };

The category isn't useful for this keyword list, so couldn't you just
make this an array of uint16 values?

diff --git a/src/pl/plpgsql/src/pl_unreserved_kwlist.h b/src/pl/plpgsql/src/pl_unreserved_kwlist.h

+/* name, value, category */
+PG_KEYWORD("absolute", K_ABSOLUTE, UNRESERVED_KEYWORD)

Likewise, I'd just have these be two-argument macros. There's no reason
for the various kwlist.h headers to agree on the number of payload
arguments for PG_KEYWORD.

diff --git a/src/tools/gen_keywords.pl b/src/tools/gen_keywords.pl

+	elsif ($arg =~ /^-o/)
+	{
+		$output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV;
+	}

My perl-fu is not great, but it looks like this will accept arguments
like "-ofilename", which is a style I don't like at all. I'd rather
either insist on the filename being separate or write the switch like
"-o=filename". Also, project style when taking both forms is usually
more like

	-o filename
	--output=filename

+$kw_input_file =~ /((\w*)kwlist)\.h/;
+my $base_filename = $1;
+$prefix = $2 if !defined $prefix;

Hmm, what happens if the input filename does not end with "kwlist.h"?

+# Parse keyword header for names.
+my @keywords;
+while (<$kif>)
+{
+	if (/^PG_KEYWORD\("(\w+)",\s*\w+,\s*\w+\)/)

This is assuming more than it should about the number of arguments for
PG_KEYWORD, as well as what's in them. I think it'd be sufficient to
match like this:

	if (/^PG_KEYWORD\("(\w+)",/)

+Options:
+    -o               output path
+    -p               optional prefix for generated data structures

This usage message is pretty vague about how you write the options (cf
gripe above).

I looked very briefly at v4-0002, and I'm not very convinced about the
"middle" aspect of that optimization. It seems unmaintainable, plus
you've not exhibited how the preferred keywords would get selected in
the first place (wiring them into the Perl script is surely not
acceptable). If you want to pursue that, please separate it into an
0002 that just adds the letter-range aspect and then an 0003 that adds
the "middle" business on top. Then we can do testing to see whether
either of those ideas are worthwhile.

			regards, tom lane
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> RD parsers are not terribly hard to write.

Sure, as long as they are for grammars that are (a) small, (b) static,
and (c) LL(1), which is strictly weaker than the LALR(1) grammar class
that bison can handle. We already have a whole lot of constructs that
are at the edges of what bison can handle, which makes me dubious that
an RD parser could be built at all without a lot of performance-eating
lookahead and/or backtracking.

> A smaller project might be to see if we can replace the binary keyword
> search in ScanKeyword with a perfect hashing function generated by
> gperf, or something similar. I had a quick look at that, too.

Yeah, we've looked at gperf before, eg

https://www.postgresql.org/message-id/20170927183156.jqzcsy7ocjcbdnmo@alap3.anarazel.de

Perhaps it'd be a win but I'm not very convinced.

I don't know much about the theory of perfect hashing, but I wonder if
we could just roll our own tool for that. Since we're not dealing with
extremely large keyword sets, perhaps brute force search for a set of
multipliers for a hash computation like

	(char[0] * some_prime + char[1] * some_other_prime ...) mod table_size

would work.

			regards, tom lane
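As a toy illustration of that brute-force idea: the following
standalone C program searches for per-position multipliers that hash a
made-up four-keyword list without collisions. The real tool would
presumably live in the Perl build tooling and read kwlist.h; the
multiplier range, table size, and keyword sample here are arbitrary.

    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_SIZE 16       /* must be >= number of keywords */
    #define MAX_LEN 16          /* positions beyond this are ignored */

    static const char *keywords[] = {"abort", "absolute", "access", "select"};
    #define NUM_KW ((int) (sizeof(keywords) / sizeof(keywords[0])))

    static unsigned int
    hash_kw(const char *kw, const unsigned int *mult)
    {
        unsigned int h = 0;

        for (int i = 0; kw[i] != '\0' && i < MAX_LEN; i++)
            h += (unsigned char) kw[i] * mult[i];
        return h % TABLE_SIZE;
    }

    int
    main(void)
    {
        unsigned int mult[MAX_LEN];

        for (int attempt = 1; attempt <= 100000; attempt++)
        {
            char    used[TABLE_SIZE] = {0};
            int     ok = 1;

            /* Try a fresh set of small odd multipliers, one per position. */
            for (int i = 0; i < MAX_LEN; i++)
                mult[i] = (unsigned int) (rand() % 1000) * 2 + 1;

            for (int i = 0; i < NUM_KW; i++)
            {
                unsigned int h = hash_kw(keywords[i], mult);

                if (used[h])
                {
                    ok = 0;     /* collision: reject this multiplier set */
                    break;
                }
                used[h] = 1;
            }
            if (ok)
            {
                printf("collision-free multipliers found after %d tries\n",
                       attempt);
                return 0;
            }
        }
        printf("no collision-free set found\n");
        return 1;
    }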
On 12/27/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> diff --git a/src/tools/gen_keywords.pl b/src/tools/gen_keywords.pl
>
> +	elsif ($arg =~ /^-o/)
> +	{
> +		$output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV;
> +	}
>
> My perl-fu is not great, but it looks like this will accept arguments
> like "-ofilename", which is a style I don't like at all. I'd rather
> either insist on the filename being separate or write the switch like
> "-o=filename".

This style was cargo-culted from the catalog scripts. I can settle on
just the first form if you like.

> +$kw_input_file =~ /((\w*)kwlist)\.h/;
> +my $base_filename = $1;
> +$prefix = $2 if !defined $prefix;
>
> Hmm, what happens if the input filename does not end with "kwlist.h"?

If that's a maintainability hazard, I can force every invocation to
provide a prefix instead.

> I looked very briefly at v4-0002, and I'm not very convinced about
> the "middle" aspect of that optimization. It seems unmaintainable,
> plus you've not exhibited how the preferred keywords would get selected
> in the first place (wiring them into the Perl script is surely not
> acceptable).

What if the second argument of the macro held this info? Something
like:

PG_KEYWORD("security", FULL_SEARCH, SECURITY, UNRESERVED_KEYWORD)
PG_KEYWORD("select", OPTIMIZE, SELECT, RESERVED_KEYWORD)

with a warning emitted if more than one keyword per range has OPTIMIZE.
That would require all keyword lists to have that second argument, but
selecting a preferred keyword would be optional.

-John Naylor
John Naylor <jcnaylor@gmail.com> writes:
> On 12/27/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> +$kw_input_file =~ /((\w*)kwlist)\.h/;
>> +my $base_filename = $1;
>> +$prefix = $2 if !defined $prefix;
>>
>> Hmm, what happens if the input filename does not end with "kwlist.h"?

> If that's a maintainability hazard, I can force every invocation to
> provide a prefix instead.

I don't mind allowing the prefix to default to empty. What I was
concerned about was that base_filename could end up undefined. Probably
the thing to do is to generate base_filename separately, say by
stripping any initial ".*/" sequence and then substituting '_' for '.'.

>> I looked very briefly at v4-0002, and I'm not very convinced about
>> the "middle" aspect of that optimization. It seems unmaintainable,
>> plus you've not exhibited how the preferred keywords would get selected
>> in the first place (wiring them into the Perl script is surely not
>> acceptable).

> What if the second argument of the macro held this info?

Yeah, you'd have to do something like that. But I'm still concerned
about the maintainability aspect: if we mark say "commit" as the
starting point in the "c" group, future additions or deletions of
keywords starting with "c" might render that an increasingly poor
choice. But most likely nobody would ever notice that the marking was
getting more and more suboptimal.

			regards, tom lane
On 12/27/18 3:00 PM, John Naylor wrote:
> On 12/27/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> diff --git a/src/tools/gen_keywords.pl b/src/tools/gen_keywords.pl
>> +	elsif ($arg =~ /^-o/)
>> +	{
>> +		$output_path = length($arg) > 2 ? substr($arg, 2) : shift @ARGV;
>> +	}
>>
>> My perl-fu is not great, but it looks like this will accept arguments
>> like "-ofilename", which is a style I don't like at all. I'd rather
>> either insist on the filename being separate or write the switch like
>> "-o=filename". Also, project style when taking both forms is usually
>> more like
>>	-o filename
>>	--output=filename
> This style was cargo-culted from the catalog scripts. I can settle on
> just the first form if you like.

I would rather we used the standard perl module Getopt::Long, as
numerous programs we have already do.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 12/27/18 3:00 PM, John Naylor wrote:
>> This style was cargo-culted from the catalog scripts. I can settle on
>> just the first form if you like.

> I would rather we used the standard perl module Getopt::Long, as
> numerous programs we have already do.

Hmm ... grepping finds that used only in

	src/tools/pgindent/pgindent
	src/tools/git_changelog
	src/pl/plperl/text2macro.pl

so I'm not quite sure about the "numerous" claim. Adopting that here
would possibly impose the requirement of having Getopt::Long on some
developers who are getting by without it today. However, that's a
pretty thin argument, and if Getopt::Long is present even in the most
minimal Perl installations then it's certainly moot. On the whole I'm
+1 for this.

Perhaps also, as an independent patch, we should change the catalog
scripts to use Getopt::Long.

			regards, tom lane
On 12/27/18 3:34 PM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On 12/27/18 3:00 PM, John Naylor wrote:
>>> This style was cargo-culted from the catalog scripts. I can settle on
>>> just the first form if you like.
>> I would rather we used the standard perl module Getopt::Long, as
>> numerous programs we have already do.
> Hmm ... grepping finds that used only in
>
>	src/tools/pgindent/pgindent
>	src/tools/git_changelog
>	src/pl/plperl/text2macro.pl
>
> so I'm not quite sure about the "numerous" claim. Adopting that
> here would possibly impose the requirement of having Getopt::Long
> on some developers who are getting by without it today. However,
> that's a pretty thin argument, and if Getopt::Long is present even
> in the most minimal Perl installations then it's certainly moot.

It's bundled separately, but on both systems I looked at it's needed by
the base perl package. I don't recall ever seeing a system where it's
not available. I'm reasonably careful about what packages the buildfarm
requires, and it's used Getopt::Long from day one.

> On the whole I'm +1 for this. Perhaps also, as an independent patch,
> we should change the catalog scripts to use Getopt::Long.

Probably some others, too.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 12/27/18 3:34 PM, Tom Lane wrote: >> ... that's a pretty thin argument, and if Getopt::Long is present even >> in the most minimal Perl installations then it's certainly moot. > It's bundled separately, but on both systems I looked at it's needed by > the base perl package. I don't recall ever seeing a system where it's > not available. I'm reasonably careful about what packages the buildfarm > requires, and it's used Getopt::Long from day one. I poked around a little on my own machines, and I can confirm that Getopt::Long is present in a default Perl install-from-source at least as far back as perl 5.6.1. It's barely conceivable that some packager might omit it from their minimal package, but Red Hat, Apple, NetBSD, and OpenBSD all include it. So it sure looks to me like relying on it should be non-problematic. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Alvaro Herrera
On 2018-Dec-27, Tom Lane wrote: > I poked around a little on my own machines, and I can confirm that > Getopt::Long is present in a default Perl install-from-source at > least as far back as perl 5.6.1. It's barely conceivable that some > packager might omit it from their minimal package, but Red Hat, > Apple, NetBSD, and OpenBSD all include it. So it sure looks to > me like relying on it should be non-problematic. In Debian it's included in package perl-modules-5.24, which packages perl and libperl5.24 depend on. I suppose it's possible to install perl-base and not install perl-modules, but it'd be a really bare-bones machine. I'm not sure it's possible to build Postgres in such a machine. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Dec 27, 2018 at 07:04:41PM -0300, Alvaro Herrera wrote: > On 2018-Dec-27, Tom Lane wrote: > > > I poked around a little on my own machines, and I can confirm that > > Getopt::Long is present in a default Perl install-from-source at > > least as far back as perl 5.6.1. It's barely conceivable that some > > packager might omit it from their minimal package, but Red Hat, > > Apple, NetBSD, and OpenBSD all include it. So it sure looks to > > me like relying on it should be non-problematic. > > In Debian it's included in package perl-modules-5.24, which packages > perl and libperl5.24 depend on. I suppose it's possible to install > perl-base and not install perl-modules, but it'd be a really bare-bones > machine. I'm not sure it's possible to build Postgres in such a > machine.

$ corelist -a Getopt::Long

Data for 2018-11-29
Getopt::Long was first released with perl 5
  5         undef
  5.001     undef
  5.002     2.01
  5.00307   2.04
  5.004     2.10
  5.00405   2.19
  5.005     2.17
  5.00503   2.19
  5.00504   2.20
  [much output elided]

Fortunately, this has been part of Perl core a lot further back than we promise to support for builds, so I think we're clear to use it everywhere we process options. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Andres Freund
On 2018-12-27 14:22:11 -0500, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > > A smaller project might be to see if we can replace the binary keyword > > search in ScanKeyword with a perfect hashing function generated by > > gperf, or something similar. I had a quick look at that, too. > > Yeah, we've looked at gperf before, eg > > https://www.postgresql.org/message-id/20170927183156.jqzcsy7ocjcbdnmo@alap3.anarazel.de > > Perhaps it'd be a win but I'm not very convinced. Note that the tradeoffs mentioned there, from memory, aren't necessarily applicable here. As we're dealing with strings anyway, gperf wanting to deal with strings rather than being able to deal with numbers isn't problematic. > I don't know much about the theory of perfect hashing, but I wonder > if we could just roll our own tool for that. Since we're not dealing > with extremely large keyword sets, perhaps brute force search for a > set of multipliers for a hash computation like > (char[0] * some_prime + char[1] * some_other_prime ...) mod table_size > would work. The usual way to do perfect hashing is to basically have a two-stage hashtable, with the first stage keyed by a "normal" hash function, and the second one disambiguating the values that hash into the same bucket, by additionally keying a hash function with the value in the cell in the intermediate hash table. Determining the parameters in the intermediate table is what takes time. That most perfect hash functions look that way is also a good part of the reason why I doubt it's worthwhile to go there over a simple linear probing hashtable with a good hash function - computing two hash values will usually be worse than linear probing for *small* and *not modified* hashtables. A simple (i.e. slow for large numbers of keys) implementation for generating a perfect hash function isn't particularly hard. E.g. look at the python implementation at http://iswsa.acm.org/mphf/index.html and http://stevehanov.ca/blog/index.php?id=119 for an easy explanation with graphics. Greetings, Andres Freund
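[In code form, the two-stage scheme described above looks roughly like the following minimal C sketch, following the stevehanov.ca write-up linked in the message. The FNV-style mixer, the negative-displacement convention, and all names here are illustrative assumptions, not the output of any actual generator.]

#include <stdint.h>
#include <string.h>

static uint32_t
mix(uint32_t d, const char *key)
{
	if (d == 0)
		d = 0x811c9dc5;			/* FNV-1a offset basis as default seed */
	while (*key)
		d = (d ^ (uint8_t) *key++) * 0x01000193;	/* FNV-1a prime */
	return d;
}

static int
perfect_lookup(const char *word, const int32_t *G,
			   const char *const *slots, uint32_t n)
{
	/* Stage 1: hash picks a bucket; the bucket holds a displacement. */
	int32_t		d = G[mix(0, word) % n];
	uint32_t	slot;

	/* A negative displacement encodes the slot directly (minus one),
	 * saving the second hash for buckets that held a single key. */
	if (d < 0)
		slot = (uint32_t) (-d - 1);
	else
		slot = mix((uint32_t) d, word) % n;	/* Stage 2: re-seeded hash */

	/* The hash names exactly one candidate; strcmp() must confirm it. */
	return strcmp(slots[slot], word) == 0 ? (int) slot : -1;
}

[G[] and slots[] would be emitted offline by the generator; finding displacements that make every slot unique is the part that "takes time" during generation, while the lookup itself stays two hashes plus one strcmp.]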
I think 0001 with complete keyword lookup replacement is in decent enough shape to post. Make check-world passes. A few notes and caveats:

- I added an --extern option to the script for the core keyword headers. This also capitalizes variables.
- ECPG keyword lookup is a bit different in that the ecpg and sql lookup functions are wrapped in a single function rather than called separately within pgc.l. It might be worth untangling that, but I have not done so.
- Some variable names haven't changed even though now they're only referring to token values, which might be confusing.
- I haven't checked if I need to install the generated headers.
- I haven't measured performance or binary size. If anyone is excited enough to do that, great, otherwise I'll do that as time permits.
- There are probably makefile bugs.

Now, on to previous review points: On 12/27/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > +/* Payload data for keywords */ > +typedef struct ScanKeywordAux > +{ > + int16 value; /* grammar's token code */ > + char category; /* see codes above */ > +} ScanKeywordAux; > > There isn't really any point in changing category to "char", because > alignment considerations will mandate that sizeof(ScanKeywordAux) be > a multiple of 2 anyway. With some compilers we could get around that > with a pragma to force non-aligned storage, but doing so would be a > net loss on most non-Intel architectures. Reverted, especially since we can skip the struct entirely for some callers as you pointed out below. > diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore > @@ -1,3 +1,4 @@ > +/*kwlist_d.h > > Not a fan of using wildcards in .gitignore files, at least not when > there's just one or two files you intend to match. Removed. > # Force these dependencies to be known even without dependency info built: > -pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: > plpgsql.h pl_gram.h plerrcodes.h > +pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: > plpgsql.h pl_gram.h plerrcodes.h pl_unreserved_kwlist_d.h > > Hm, do we really need any more than pl_scanner.o to depend on that header? I think you're right, so separated into a new rule. > +# Parse keyword header for names. > +my @keywords; > +while (<$kif>) > +{ > + if (/^PG_KEYWORD\("(\w+)",\s*\w+,\s*\w+\)/) > > This is assuming more than it should about the number of arguments for > PG_KEYWORD, as well as what's in them. I think it'd be sufficient to > match like this: > > if (/^PG_KEYWORD\("(\w+)",/) ...and... > diff --git a/src/pl/plpgsql/src/pl_unreserved_kwlist.h > b/src/pl/plpgsql/src/pl_unreserved_kwlist.h > +/* name, value, category */ > +PG_KEYWORD("absolute", K_ABSOLUTE, UNRESERVED_KEYWORD) > > Likewise, I'd just have these be two-argument macros. There's no reason > for the various kwlist.h headers to agree on the number of payload > arguments for PG_KEYWORD. Both done, however... > +/* FIXME: Have to redefine this symbol for the WIP. */ > +#undef PG_KEYWORD > +#define PG_KEYWORD(kwname, value, category) {value, category}, > + > +static const ScanKeywordAux unreserved_keywords[] = { > +#include "pl_unreserved_kwlist.h" > }; > > The category isn't useful for this keyword list, so couldn't you > just make this an array of uint16 values? Yes, this works for the unreserved keywords. The reserved ones still need the aux struct to work with the core scanner, even though scan.l doesn't reference category either.
This has the consequence that we can't dispense with category, e.g.: PG_KEYWORD("all", K_ALL, RESERVED_KEYWORD) ...unless we do without the struct entirely, but that's not without disadvantages as you mentioned. I decided to export the struct (rather than just int16 for category) to the frontend, even though we have to set the token values to zero, since there might someday be another field of use to the frontend. Also to avoid confusion. > I don't mind allowing the prefix to default to empty. What I was > concerned about was that base_filename could end up undefined. > Probably the thing to do is to generate base_filename separately, > say by stripping any initial ".*/" sequence and then substitute > '_' for '.'. I removed assumptions about the filename. > +Options: > + -o output path > + -p optional prefix for generated data structures > > This usage message is pretty vague about how you write the options > (cf gripe above). I tried it like this:

Usage: gen_keywords.pl [--output/-o <path>] [--prefix/-p <prefix>] input_file
  --output   Output directory
  --prefix   String prepended to var names in the output file

On 12/27/18, Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote: > I would rather we used the standard perl module Getopt::Long, as > numerous programs we have already do. Done. I'll also send a patch later to bring some other scripts in line. -John Naylor
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Andres Freund
Hi, On 2018-12-29 16:59:52 -0500, John Naylor wrote: > I think 0001 with complete keyword lookup replacement is in decent > enough shape to post. Make check-world passes. A few notes and > caveats: I tried to take this for a spin, and for me the build fails because various frontend programs don't have KeywordOffsets/Strings defined, but reference it through various functions exposed to the frontend (like fmtId()). That I see that error but you don't is probably related to me using -fuse-ld=gold in CFLAGS. I can "fix" this by including kwlist_d.h in common/keywords.c regardless of FRONTEND. That also led me to discover that the build dependencies somewhere aren't correctly set up, because I need to force a clean rebuild to trigger the problem again; just changing keywords.c back doesn't trigger the problem. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2018-12-29 16:59:52 -0500, John Naylor wrote: >> I think 0001 with complete keyword lookup replacement is in decent >> enough shape to post. Make check-world passes. A few notes and >> caveats: > I tried to take this for a spin, and for me the build fails because various > frontend programs don't have KeywordOffsets/Strings defined, but reference it > through various functions exposed to the frontend (like fmtId()). That I see > that error but you don't is probably related to me using -fuse-ld=gold in > CFLAGS. I was just about to point out that the cfbot is seeing that too ... regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Joerg Sonnenberger
On Sun, Dec 16, 2018 at 11:50:15AM -0500, John Naylor wrote: > A few months ago I was looking into faster search algorithms for > ScanKeywordLookup(), so this is interesting to me. While an optimal > full replacement would be a lot of work, the above ideas are much less > invasive and would still have some benefit. Unless anyone intends to > work on this, I'd like to flesh out the offset-into-giant-string > approach a bit further: Hello John, I was pointed at your patch on IRC and decided to look into adding my own pieces. What I can provide you is a fast perfect hash function generator. I've attached a sample hash function based on the current main keyword list. hash() essentially gives you the number of the only possible match, a final strcmp/memcmp is still necessary to verify that it is an actual keyword though. The |0x20 can be dropped if all cases have pre-lower-cased the input already. This would replace the binary search in the lookup functions. Returning offsets directly would be easy as well. That allows writing a single string where each entry is prefixed with a type mask, the token id, the length of the keyword and the actual keyword text. Does that sound useful to you? Joerg
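[To make the proposed usage concrete, a caller would feed the candidate word to the generated hash() and confirm the single candidate with one string comparison. A minimal C sketch follows; the signature of hash() and the companion arrays kw_names[]/kw_tokens[] are assumptions for illustration, since the actual generated function was in the attachment.]

#include <stddef.h>
#include <string.h>

#define NUM_KEYWORDS 440			/* hypothetical keyword count */

extern const char *const kw_names[NUM_KEYWORDS];	/* keyword text, in slot order */
extern const int kw_tokens[NUM_KEYWORDS];			/* parser token for each slot */

/* Generated perfect hash; exact signature is an assumption. */
extern unsigned int hash(const char *key, size_t len);

static int
keyword_token(const char *word, size_t len)
{
	unsigned int n = hash(word, len);

	/* hash() proposes exactly one candidate; verify before trusting it. */
	if (n < NUM_KEYWORDS && strcmp(kw_names[n], word) == 0)
		return kw_tokens[n];
	return -1;					/* not a keyword */
}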
I wrote: > Andres Freund <andres@anarazel.de> writes: >> On 2018-12-29 16:59:52 -0500, John Naylor wrote: >>> I think 0001 with complete keyword lookup replacement is in decent >>> enough shape to post. Make check-world passes. A few notes and >>> caveats: >> I tried to take this for a spin, and for me the build fails because various >> frontend programs don't have KeywordOffsets/Strings defined, but reference it >> through various functions exposed to the frontend (like fmtId()). That I see >> that error but you don't is probably related to me using -fuse-ld=gold in >> CFLAGS. > I was just about to point out that the cfbot is seeing that too ... Aside from the possible linkage problem, this will need a minor rebase over 4879a5172, which rearranged some of plpgsql's calls of ScanKeywordLookup. While I don't think it's going to be hard to resolve these issues, I'm wondering where we want to go with this. Is anyone excited about pursuing the perfect-hash-function idea? (Joerg's example function looked pretty neat to me.) If we are going to do that, does it make sense to push this version beforehand? regards, tom lane
On 12/30/18, Andres Freund <andres@anarazel.de> wrote: > I tried to take this for a spin, and for me the build fails because various > frontend programs don't have KeywordOffsets/Strings defined, but reference > it > through various functions exposed to the frontend (like fmtId()). That I > see > that error but you don't is probably related to me using -fuse-ld=gold in > CFLAGS. > > I can "fix" this by including kwlist_d.h in common/keywords.c > regardless of FRONTEND. That also led me to discover that the build > dependencies somewhere aren't correctly set up, because I need to > force a clean rebuild to trigger the problem again; just changing > keywords.c back doesn't trigger the problem. Hmm, that was a typo, and I didn't notice even when I found I had to include kwlist_d.h in ecpg/keywords.c. :-( I've fixed both of those in the attached v6. As far as dependencies go, I'm far from sure I have it up to par. That piece could use some discussion. On 1/4/19, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Aside from the possible linkage problem, this will need a minor rebase > over 4879a5172, which rearranged some of plpgsql's calls of > ScanKeywordLookup. > > While I don't think it's going to be hard to resolve these issues, > I'm wondering where we want to go with this. Is anyone excited > about pursuing the perfect-hash-function idea? (Joerg's example > function looked pretty neat to me.) If we are going to do that, > does it make sense to push this version beforehand? If it does, for v6 I've also done the rebase, updated the copyright year, and fixed an error in MSVC. -John Naylor
On 1/3/19, Joerg Sonnenberger <joerg@bec.de> wrote: > Hello John, > I was pointed at your patch on IRC and decided to look into adding my > own pieces. What I can provide you is a fast perfect hash function > generator. I've attached a sample hash function based on the current > main keyword list. hash() essentially gives you the number of the only > possible match, a final strcmp/memcmp is still necessary to verify that > it is an actual keyword though. The |0x20 can be dropped if all cases > have pre-lower-cased the input already. This would replace the binary > search in the lookup functions. Returning offsets directly would be easy > as well. That allows writing a single string where each entry is prefixed > with a type mask, the token id, the length of the keyword and the actual > keyword text. Does that sound useful to you? Judging by previous responses, there is still interest in using perfect hash functions, so thanks for this. I'm not knowledgeable enough to judge its implementation, so I'll leave that for others. -John Naylor
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Andres Freund
On 2019-01-04 12:26:18 -0500, Tom Lane wrote: > I wrote: > > Andres Freund <andres@anarazel.de> writes: > >> On 2018-12-29 16:59:52 -0500, John Naylor wrote: > >>> I think 0001 with complete keyword lookup replacement is in decent > >>> enough shape to post. Make check-world passes. A few notes and > >>> caveats: > > >> I tried to take this for a spin, and for me the build fails because various > >> frontend programs don't have KeywordOffsets/Strings defined, but reference it > >> through various functions exposed to the frontend (like fmtId()). That I see > >> that error but you don't is probably related to me using -fuse-ld=gold in > >> CFLAGS. > > > I was just about to point out that the cfbot is seeing that too ... > > Aside from the possible linkage problem, this will need a minor rebase > over 4879a5172, which rearranged some of plpgsql's calls of > ScanKeywordLookup. > > While I don't think it's going to be hard to resolve these issues, > I'm wondering where we want to go with this. Is anyone excited > about pursuing the perfect-hash-function idea? (Joerg's example > function looked pretty neat to me.) If we are going to do that, > does it make sense to push this version beforehand? I think it does make sense to push this version beforehand. Most of the code would be needed anyway, so it's not like this is going to cause a lot of churn. Greetings, Andres Freund
John Naylor <jcnaylor@gmail.com> writes: > On 1/3/19, Joerg Sonnenberger <joerg@bec.de> wrote: >> I was pointed at your patch on IRC and decided to look into adding my >> own pieces. What I can provide you is a fast perfect hash function >> generator. I've attached a sample hash function based on the current >> main keyword list. hash() essentially gives you the number of the only >> possible match, a final strcmp/memcmp is still necessary to verify that >> it is an actual keyword though. The |0x20 can be dropped if all cases >> have pre-lower-cased the input already. This would replace the binary >> search in the lookup functions. Returning offsets directly would be easy >> as well. That allows writing a single string where each entry is prefixed >> with a type mask, the token id, the length of the keyword and the actual >> keyword text. Does that sound useful to you? > Judging by previous responses, there is still interest in using > perfect hash functions, so thanks for this. I'm not knowledgeable > enough to judge its implementation, so I'll leave that for others. We haven't actually seen the implementation, so it's hard to judge ;-). The sample hash function certainly looks great. I'm not terribly on board with using |0x20 as a substitute for lower-casing, but that's a minor detail. The foremost questions in my mind are: * What's the generator written in? (if the answer's not "Perl", wedging it into our build processes might be painful) * What license is it under? * Does it always succeed in producing a single-level lookup table? regards, tom lane
Andres Freund <andres@anarazel.de> writes: > On 2019-01-04 12:26:18 -0500, Tom Lane wrote: >> I'm wondering where we want to go with this. Is anyone excited >> about pursuing the perfect-hash-function idea? (Joerg's example >> function looked pretty neat to me.) If we are going to do that, >> does it make sense to push this version beforehand? > I think it does make sense to push this version beforehand. Most of > the code would be needed anyway, so it's not like this is going to > cause a lot of churn. Yeah, I'm leaning in that direction too, first on the grounds of "don't let the perfect be the enemy of the good", and second because if we do end up with perfect hashing, we'd still need a table-generation step. The build infrastructure this adds would support a generator that produces perfect hashes just as well as what this is doing, even if we end up having to whack the API of ScanKeywordLookup around some more. So barring objections, I'll have a look at pushing this, and then we can think about using perfect hashing instead. regards, tom lane
On 12/27/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > If you really are hot about saving that other 440 bytes, the way to > do it would be to drop the struct entirely and use two parallel > arrays, an int16[] for value and a char[] (or better uint8[]) for > category. Those would be filled by reading kwlist.h twice with > different definitions for PG_KEYWORD. Not sure it's worth the > trouble though --- in particular, not clear that it's a win from > the standpoint of number of cache lines touched. Understood. That said, after re-implementing all keyword lookups, I wondered if there'd be a notational benefit to dropping the struct, especially since as yet no caller uses both token and category. It makes pl_scanner.c and its reserved keyword list a bit nicer, and gets rid of the need to force frontend to have 'zero' token numbers, but I'm not sure it's a clear win. I've attached a patch (applies on top of v6), gzipped to avoid confusing the cfbot. -John Naylor
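[The two-parallel-arrays shape discussed here would look roughly as follows: kwlist.h is read twice with different PG_KEYWORD expansions, once per array. The array names are placeholders; the committed patch later in this thread uses the same pattern for its token and category arrays.]

/* uint16/uint8 as typedef'd in PostgreSQL's c.h */
#define PG_KEYWORD(kwname, value, category) value,
static const uint16 keyword_tokens[] = {
#include "parser/kwlist.h"
};
#undef PG_KEYWORD

#define PG_KEYWORD(kwname, value, category) category,
static const uint8 keyword_categories[] = {
#include "parser/kwlist.h"
};
#undef PG_KEYWORD

[Compared with an array of structs, this drops the per-entry padding byte and lets callers that need only one payload touch only that array's cache lines.]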
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Joerg Sonnenberger
On Fri, Jan 04, 2019 at 03:31:11PM -0500, Tom Lane wrote: > John Naylor <jcnaylor@gmail.com> writes: > > On 1/3/19, Joerg Sonnenberger <joerg@bec.de> wrote: > >> I was pointed at your patch on IRC and decided to look into adding my > >> own pieces. What I can provide you is a fast perfect hash function > >> generator. I've attached a sample hash function based on the current > >> main keyword list. hash() essentially gives you the number of the only > >> possible match, a final strcmp/memcmp is still necessary to verify that > >> it is an actual keyword though. The |0x20 can be dropped if all cases > >> have pre-lower-cased the input already. This would replace the binary > >> search in the lookup functions. Returning offsets directly would be easy > >> as well. That allows writing a single string where each entry is prefixed > >> with a type mask, the token id, the length of the keyword and the actual > >> keyword text. Does that sound useful to you? > > > Judging by previous responses, there is still interest in using > > perfect hash functions, so thanks for this. I'm not knowledgeable > > enough to judge its implementation, so I'll leave that for others. > > We haven't actually seen the implementation, so it's hard to judge ;-). It's a temporary hacked version of nbperf in the NetBSD tree. > The sample hash function certainly looks great. I'm not terribly on board > with using |0x20 as a substitute for lower-casing, but that's a minor > detail. Yeah, I've included that part more because I don't know the current use cases enough. If all instances are already doing lower-casing in advance, it is trivial to drop. > The foremost questions in my mind are: > > * What's the generator written in? (if the answer's not "Perl", wedging > it into our build processes might be painful) Plain C, nothing really fancy in it. > * What license is it under? Two clause BSD license. > * Does it always succeed in producing a single-level lookup table? This question is a bit tricky. The short answer is: yes. The longer answer: The chosen hash function in the example is very simple (e.g. just two variations of DJB-style hash), so with that: no, not without potentially fiddling a bit with the hash function if things ever get nasty, like having two keywords that hit a funnel for both variants. The main concern for the choice was to be fast. When using two families of independent hash functions, the generator requires probabilistic linear time in the number of keys. That means with a strong enough hash function like the Jenkins hash used in PG elsewhere, it will succeed very fast. So if it fails on new keywords, making the mixing a bit stronger should be enough. Joerg
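[As a rough idea of what "two variations of DJB-style hash" means in practice: the two family members differ only in their seed and multiplier. The constants below are the classic DJB2 ones, chosen for illustration; they are not necessarily what nbperf emits.]

#include <stdint.h>

static uint32_t
djb_hash(const char *key, uint32_t seed, uint32_t mult)
{
	uint32_t	h = seed;

	while (*key)
		h = h * mult + (uint8_t) *key++;	/* the xor variant uses h*33 ^ c */
	return h;
}

/*
 * The generator evaluates every keyword under two family members, e.g.
 * h1 = djb_hash(word, 5381, 33) and h2 = djb_hash(word, 5381, 31), and
 * searches for intermediate-table parameters that make the combination
 * collision-free.  A "funnel" is a pair of keys colliding under both.
 */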
Joerg Sonnenberger <joerg@bec.de> writes: > On Fri, Jan 04, 2019 at 03:31:11PM -0500, Tom Lane wrote: >> The sample hash function certainly looks great. I'm not terribly on board >> with using |0x20 as a substitute for lower-casing, but that's a minor >> detail. > Yeah, I've included that part more because I don't know the current use > cases enough. If all instances are already doing lower-casing in > advance, it is trivial to drop. I think we probably don't need that, because we'd always need to generate a lower-cased version of the input anyway: either to compare to the potential keyword match, or to use as the normalized identifier if it turns out not to be a keyword. I don't think there are any cases where it's useful to delay downcasing till after the keyword lookup. >> * What's the generator written in? (if the answer's not "Perl", wedging >> it into our build processes might be painful) > Plain C, nothing really fancy in it. That's actually a bigger problem than you might think, because it doesn't fit in very nicely in a cross-compiling build: we might not have any C compiler at hand that generates programs that can execute on the build machine. That's why we prefer Perl for tools that need to execute during the build. However, if the code is pretty small and fast, maybe translating it to Perl is feasible. Or perhaps we could add sufficient autoconfiscation infrastructure to identify a native C compiler. It's not very likely that there isn't one, but it is possible that nothing we learned about the configured target compiler would apply to it :-( >> * What license is it under? > Two clause BSD license. OK, that works, at least. regards, tom lane
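[To illustrate the point about the two downcasing rules being different and hence unshareable: the keyword fold below matches the ASCII-only loop in the patch later in this thread, while the identifier fold is an approximation of what the scanner support code does; see downcase_truncate_identifier() in src/backend/parser/scansup.c for the real rules.]

#include <ctype.h>

/* Keyword matching: ASCII-only fold, locale-independent (the SQL99 rule). */
static char
keyword_downcase(char ch)
{
	if (ch >= 'A' && ch <= 'Z')
		ch += 'a' - 'A';
	return ch;
}

/*
 * Identifier normalization (approximate sketch): high-bit characters may
 * also be folded via the locale's tolower(), so the result can vary with
 * locale, which is exactly why the two paths cannot share one transform.
 */
static unsigned char
identifier_downcase(unsigned char ch)
{
	if (ch >= 'A' && ch <= 'Z')
		ch += 'a' - 'A';
	else if (ch >= 0x80 && isupper(ch))
		ch = (unsigned char) tolower(ch);
	return ch;
}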
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Andres Freund
Hi, On 2019-01-04 16:43:39 -0500, Tom Lane wrote: > Joerg Sonnenberger <joerg@bec.de> writes: > >> * What's the generator written in? (if the answer's not "Perl", wedging > >> it into our build processes might be painful) > > > Plain C, nothing really fancy in it. > > That's actually a bigger problem than you might think, because it > doesn't fit in very nicely in a cross-compiling build: we might not > have any C compiler at hand that generates programs that can execute > on the build machine. That's why we prefer Perl for tools that need > to execute during the build. However, if the code is pretty small > and fast, maybe translating it to Perl is feasible. Or perhaps > we could add sufficient autoconfiscation infrastructure to identify > a native C compiler. It's not very likely that there isn't one, > but it is possible that nothing we learned about the configured > target compiler would apply to it :-( I think it might be ok if we included the output of the generator in the buildtree? Not being able to add keywords while cross-compiling sounds like an acceptable restriction to me. I assume we'd likely grow further users of such a generator over time, and some of the input lists might be big enough that we'd not want to force it to be recomputed on every machine. Greetings, Andres Freund
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Joerg Sonnenberger
On Fri, Jan 04, 2019 at 02:36:15PM -0800, Andres Freund wrote: > Hi, > > On 2019-01-04 16:43:39 -0500, Tom Lane wrote: > > Joerg Sonnenberger <joerg@bec.de> writes: > > >> * What's the generator written in? (if the answer's not "Perl", wedging > > >> it into our build processes might be painful) > > > > > Plain C, nothing really fancy in it. > > > > That's actually a bigger problem than you might think, because it > > doesn't fit in very nicely in a cross-compiling build: we might not > > have any C compiler at hand that generates programs that can execute > > on the build machine. That's why we prefer Perl for tools that need > > to execute during the build. However, if the code is pretty small > > and fast, maybe translating it to Perl is feasible. Or perhaps > > we could add sufficient autoconfiscation infrastructure to identify > > a native C compiler. It's not very likely that there isn't one, > > but it is possible that nothing we learned about the configured > > target compiler would apply to it :-( There is a pre-made autoconf macro for doing the basic glue for CC_FOR_BUILD; it's been used by various projects already, including libXt and friends. > I think it might be ok if we included the output of the generator in the > buildtree? Not being able to add keywords while cross-compiling sounds like > an acceptable restriction to me. I assume we'd likely grow further users > of such a generator over time, and some of the input lists might be big > enough that we'd not want to force it to be recomputed on every machine. This is quite reasonable as well. I wouldn't worry about the size of the input list at all. Processing the Webster dictionary needs something less than 0.4s on my laptop for 235k entries. Joerg
John Naylor <jcnaylor@gmail.com> writes: > [ v6-0001-Use-offset-based-keyword-lookup.patch ] I spent some time hacking on this today, and I think it's committable now, but I'm putting it back up in case anyone wants to have another look (and also so the cfbot can check it on Windows). Given the discussion about possibly switching to perfect hashing, I thought it'd be a good idea to try to make the APIs less dependent on the exact table representation. So in the attached, I created a struct ScanKeywordList that holds all the data ScanKeywordLookup needs, and the generated headers declare variables of that type, and we just pass around a pointer to that instead of passing several different things. I also went ahead with the idea of splitting the category and token data into separate arrays. That allows moving the backend token array out of src/common entirely, which I think is a good thing because of the dependency situation: we no longer need to run the bison build before we can compile src/common/keywords_srv.o. There's one remaining refactoring issue that I think we'd want to consider before trying to jack this up and wheel a perfect-hash lookup under it: where to do the downcasing transform. Right now, ecpg's c_keywords.c has its own copy of the binary-search logic because it doesn't want the downcasing transform that ScanKeywordLookup does. So unless we want it to also have a copy of the hash lookup logic, we need to rearrange that somehow. We could give ScanKeywordLookup a "bool downcase" argument, or we could refactor things so that the downcasing is done by callers if they need it (which many don't). I'm not very sure which of those three alternatives is best. My argument upthread that we could always do the downcasing before keyword lookup now feels a bit shaky, because I was reminded while working on this code that we actually have different downcasing rules for keywords and identifiers (yes, really), so that it's not possible for those code paths to share a downcasing transform. So the idea of moving the keyword-downcasing logic to the callers is likely to not work out quite as nicely as I thought. (This might also mean that I was overly hasty to reject Joerg's |0x20 hack. It's still an ugly hack, but it would save doing the keyword downcasing transform if we don't get a hashcode match...) regards, tom lane diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index e8ef966..9131991 100644 *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *************** fill_in_constant_lengths(pgssJumbleState *** 3075,3082 **** /* initialize the flex scanner --- should match raw_parser() */ yyscanner = scanner_init(query, &yyextra, ! ScanKeywords, ! NumScanKeywords); /* we don't want to re-emit any escape string warnings */ yyextra.escape_string_warning = false; --- 3075,3082 ---- /* initialize the flex scanner --- should match raw_parser() */ yyscanner = scanner_init(query, &yyextra, ! &ScanKeywords, ! ScanKeywordTokens); /* we don't want to re-emit any escape string warnings */ yyextra.escape_string_warning = false; diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c index 7e9b122..4c0c258 100644 *** a/src/backend/parser/parser.c --- b/src/backend/parser/parser.c *************** raw_parser(const char *str) *** 41,47 **** /* initialize the flex scanner */ yyscanner = scanner_init(str, &yyextra.core_yy_extra, ! 
ScanKeywords, NumScanKeywords); /* base_yylex() only needs this much initialization */ yyextra.have_lookahead = false; --- 41,47 ---- /* initialize the flex scanner */ yyscanner = scanner_init(str, &yyextra.core_yy_extra, ! &ScanKeywords, ScanKeywordTokens); /* base_yylex() only needs this much initialization */ yyextra.have_lookahead = false; diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l index fbeb86f..e1cae85 100644 *** a/src/backend/parser/scan.l --- b/src/backend/parser/scan.l *************** bool escape_string_warning = true; *** 67,72 **** --- 67,87 ---- bool standard_conforming_strings = true; /* + * Constant data exported from this file. This array maps from the + * zero-based keyword numbers returned by ScanKeywordLookup to the + * Bison token numbers needed by gram.y. This is exported because + * callers need to pass it to scanner_init, if they are using the + * standard keyword list ScanKeywords. + */ + #define PG_KEYWORD(kwname, value, category) value, + + const uint16 ScanKeywordTokens[] = { + #include "parser/kwlist.h" + }; + + #undef PG_KEYWORD + + /* * Set the type of YYSTYPE. */ #define YYSTYPE core_YYSTYPE *************** other . *** 504,521 **** * We will pass this along as a normal character string, * but preceded with an internally-generated "NCHAR". */ ! const ScanKeyword *keyword; SET_YYLLOC(); yyless(1); /* eat only 'n' this time */ ! keyword = ScanKeywordLookup("nchar", ! yyextra->keywords, ! yyextra->num_keywords); ! if (keyword != NULL) { ! yylval->keyword = keyword->name; ! return keyword->value; } else { --- 519,536 ---- * We will pass this along as a normal character string, * but preceded with an internally-generated "NCHAR". */ ! int kwnum; SET_YYLLOC(); yyless(1); /* eat only 'n' this time */ ! kwnum = ScanKeywordLookup("nchar", ! yyextra->keywordlist); ! if (kwnum >= 0) { ! yylval->keyword = GetScanKeyword(kwnum, ! yyextra->keywordlist); ! return yyextra->keyword_tokens[kwnum]; } else { *************** other . *** 1021,1039 **** {identifier} { ! const ScanKeyword *keyword; char *ident; SET_YYLLOC(); /* Is it a keyword? */ ! keyword = ScanKeywordLookup(yytext, ! yyextra->keywords, ! yyextra->num_keywords); ! if (keyword != NULL) { ! yylval->keyword = keyword->name; ! return keyword->value; } /* --- 1036,1054 ---- {identifier} { ! int kwnum; char *ident; SET_YYLLOC(); /* Is it a keyword? */ ! kwnum = ScanKeywordLookup(yytext, ! yyextra->keywordlist); ! if (kwnum >= 0) { ! yylval->keyword = GetScanKeyword(kwnum, ! yyextra->keywordlist); ! return yyextra->keyword_tokens[kwnum]; } /* *************** scanner_yyerror(const char *message, cor *** 1142,1149 **** core_yyscan_t scanner_init(const char *str, core_yy_extra_type *yyext, ! const ScanKeyword *keywords, ! int num_keywords) { Size slen = strlen(str); yyscan_t scanner; --- 1157,1164 ---- core_yyscan_t scanner_init(const char *str, core_yy_extra_type *yyext, ! const ScanKeywordList *keywordlist, ! const uint16 *keyword_tokens) { Size slen = strlen(str); yyscan_t scanner; *************** scanner_init(const char *str, *** 1153,1160 **** core_yyset_extra(yyext, scanner); ! yyext->keywords = keywords; ! yyext->num_keywords = num_keywords; yyext->backslash_quote = backslash_quote; yyext->escape_string_warning = escape_string_warning; --- 1168,1175 ---- core_yyset_extra(yyext, scanner); ! yyext->keywordlist = keywordlist; ! 
yyext->keyword_tokens = keyword_tokens; yyext->backslash_quote = backslash_quote; yyext->escape_string_warning = escape_string_warning; diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c index 7b69b82..746b7d2 100644 *** a/src/backend/utils/adt/misc.c --- b/src/backend/utils/adt/misc.c *************** pg_get_keywords(PG_FUNCTION_ARGS) *** 417,431 **** funcctx = SRF_PERCALL_SETUP(); ! if (funcctx->call_cntr < NumScanKeywords) { char *values[3]; HeapTuple tuple; /* cast-away-const is ugly but alternatives aren't much better */ ! values[0] = unconstify(char *, ScanKeywords[funcctx->call_cntr].name); ! switch (ScanKeywords[funcctx->call_cntr].category) { case UNRESERVED_KEYWORD: values[1] = "U"; --- 417,433 ---- funcctx = SRF_PERCALL_SETUP(); ! if (funcctx->call_cntr < ScanKeywords.num_keywords) { char *values[3]; HeapTuple tuple; /* cast-away-const is ugly but alternatives aren't much better */ ! values[0] = unconstify(char *, ! GetScanKeyword(funcctx->call_cntr, ! &ScanKeywords)); ! switch (ScanKeywordCategories[funcctx->call_cntr]) { case UNRESERVED_KEYWORD: values[1] = "U"; diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 368eacf..77811f6 100644 *** a/src/backend/utils/adt/ruleutils.c --- b/src/backend/utils/adt/ruleutils.c *************** quote_identifier(const char *ident) *** 10601,10611 **** * Note: ScanKeywordLookup() does case-insensitive comparison, but * that's fine, since we already know we have all-lower-case. */ ! const ScanKeyword *keyword = ScanKeywordLookup(ident, ! ScanKeywords, ! NumScanKeywords); ! if (keyword != NULL && keyword->category != UNRESERVED_KEYWORD) safe = false; } --- 10601,10609 ---- * Note: ScanKeywordLookup() does case-insensitive comparison, but * that's fine, since we already know we have all-lower-case. */ ! int kwnum = ScanKeywordLookup(ident, &ScanKeywords); ! if (kwnum >= 0 && ScanKeywordCategories[kwnum] != UNRESERVED_KEYWORD) safe = false; } diff --git a/src/common/.gitignore b/src/common/.gitignore index ...ffa3284 . *** a/src/common/.gitignore --- b/src/common/.gitignore *************** *** 0 **** --- 1 ---- + /kwlist_d.h diff --git a/src/common/Makefile b/src/common/Makefile index ec8139f..317b071 100644 *** a/src/common/Makefile --- b/src/common/Makefile *************** override CPPFLAGS += -DVAL_LDFLAGS_EX="\ *** 41,51 **** override CPPFLAGS += -DVAL_LDFLAGS_SL="\"$(LDFLAGS_SL)\"" override CPPFLAGS += -DVAL_LIBS="\"$(LIBS)\"" ! override CPPFLAGS := -DFRONTEND $(CPPFLAGS) LIBS += $(PTHREAD_LIBS) OBJS_COMMON = base64.o config_info.o controldata_utils.o exec.o file_perm.o \ ! ip.o keywords.o link-canary.o md5.o pg_lzcompress.o \ pgfnames.o psprintf.o relpath.o \ rmtree.o saslprep.o scram-common.o string.o unicode_norm.o \ username.o wait_error.o --- 41,51 ---- override CPPFLAGS += -DVAL_LDFLAGS_SL="\"$(LDFLAGS_SL)\"" override CPPFLAGS += -DVAL_LIBS="\"$(LIBS)\"" ! override CPPFLAGS := -DFRONTEND -I. -I$(top_srcdir)/src/common $(CPPFLAGS) LIBS += $(PTHREAD_LIBS) OBJS_COMMON = base64.o config_info.o controldata_utils.o exec.o file_perm.o \ ! 
ip.o keywords.o kwlookup.o link-canary.o md5.o pg_lzcompress.o \ pgfnames.o psprintf.o relpath.o \ rmtree.o saslprep.o scram-common.o string.o unicode_norm.o \ username.o wait_error.o *************** OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o) *** 65,70 **** --- 65,72 ---- all: libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a + distprep: kwlist_d.h + # libpgcommon is needed by some contrib install: all installdirs $(INSTALL_STLIB) libpgcommon.a '$(DESTDIR)$(libdir)/libpgcommon.a' *************** libpgcommon_srv.a: $(OBJS_SRV) *** 115,130 **** %_srv.o: %.c %.o $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ ! # Dependencies of keywords.o need to be managed explicitly to make sure ! # that you don't get broken parsing code, even in a non-enable-depend build. ! # Note that gram.h isn't required for the frontend versions of keywords.o. ! $(top_builddir)/src/include/parser/gram.h: $(top_srcdir)/src/backend/parser/gram.y ! $(MAKE) -C $(top_builddir)/src/backend $(top_builddir)/src/include/parser/gram.h ! keywords.o: $(top_srcdir)/src/include/parser/kwlist.h ! keywords_shlib.o: $(top_srcdir)/src/include/parser/kwlist.h ! keywords_srv.o: $(top_builddir)/src/include/parser/gram.h $(top_srcdir)/src/include/parser/kwlist.h ! clean distclean maintainer-clean: rm -f libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a rm -f $(OBJS_FRONTEND) $(OBJS_SHLIB) $(OBJS_SRV) --- 117,134 ---- %_srv.o: %.c %.o $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ ! # generate SQL keyword lookup table to be included into keywords*.o. ! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(top_srcdir)/src/tools/gen_keywordlist.pl ! $(PERL) $(top_srcdir)/src/tools/gen_keywordlist.pl --extern $< ! # Dependencies of keywords*.o need to be managed explicitly to make sure ! # that you don't get broken parsing code, even in a non-enable-depend build. ! keywords.o keywords_shlib.o keywords_srv.o: kwlist_d.h ! # kwlist_d.h is in the distribution tarball, so it is not cleaned here. ! clean distclean: rm -f libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a rm -f $(OBJS_FRONTEND) $(OBJS_SHLIB) $(OBJS_SRV) + + maintainer-clean: distclean + rm -f kwlist_d.h diff --git a/src/common/keywords.c b/src/common/keywords.c index 6f99090..103166c 100644 *** a/src/common/keywords.c --- b/src/common/keywords.c *************** *** 1,7 **** /*------------------------------------------------------------------------- * * keywords.c ! * lexical token lookup for key words in PostgreSQL * * * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group --- 1,7 ---- /*------------------------------------------------------------------------- * * keywords.c ! * PostgreSQL's list of SQL keywords * * * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group *************** *** 19,114 **** #include "postgres_fe.h" #endif ! #ifndef FRONTEND ! ! #include "parser/gramparse.h" ! ! #define PG_KEYWORD(a,b,c) {a,b,c}, - #else ! #include "common/keywords.h" ! /* ! * We don't need the token number for frontend uses, so leave it out to avoid ! * requiring backend headers that won't compile cleanly here. ! */ ! #define PG_KEYWORD(a,b,c) {a,0,c}, ! #endif /* FRONTEND */ ! const ScanKeyword ScanKeywords[] = { #include "parser/kwlist.h" }; ! const int NumScanKeywords = lengthof(ScanKeywords); ! ! ! /* ! * ScanKeywordLookup - see if a given word is a keyword ! * ! * The table to be searched is passed explicitly, so that this can be used ! * to search keyword lists other than the standard list appearing above. ! * ! 
* Returns a pointer to the ScanKeyword table entry, or NULL if no match. ! * ! * The match is done case-insensitively. Note that we deliberately use a ! * dumbed-down case conversion that will only translate 'A'-'Z' into 'a'-'z', ! * even if we are in a locale where tolower() would produce more or different ! * translations. This is to conform to the SQL99 spec, which says that ! * keywords are to be matched in this way even though non-keyword identifiers ! * receive a different case-normalization mapping. ! */ ! const ScanKeyword * ! ScanKeywordLookup(const char *text, ! const ScanKeyword *keywords, ! int num_keywords) ! { ! int len, ! i; ! char word[NAMEDATALEN]; ! const ScanKeyword *low; ! const ScanKeyword *high; ! ! len = strlen(text); ! /* We assume all keywords are shorter than NAMEDATALEN. */ ! if (len >= NAMEDATALEN) ! return NULL; ! ! /* ! * Apply an ASCII-only downcasing. We must not use tolower() since it may ! * produce the wrong translation in some locales (eg, Turkish). ! */ ! for (i = 0; i < len; i++) ! { ! char ch = text[i]; ! ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! word[i] = ch; ! } ! word[len] = '\0'; ! ! /* ! * Now do a binary search using plain strcmp() comparison. ! */ ! low = keywords; ! high = keywords + (num_keywords - 1); ! while (low <= high) ! { ! const ScanKeyword *middle; ! int difference; ! ! middle = low + (high - low) / 2; ! difference = strcmp(middle->name, word); ! if (difference == 0) ! return middle; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; ! } ! ! return NULL; ! } --- 19,37 ---- #include "postgres_fe.h" #endif ! #include "common/keywords.h" ! /* ScanKeywordList lookup data for SQL keywords */ ! #include "kwlist_d.h" ! /* Keyword categories for SQL keywords */ + #define PG_KEYWORD(kwname, value, category) category, ! const uint8 ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS] = { #include "parser/kwlist.h" }; ! #undef PG_KEYWORD diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index ...db62623 . *** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *************** *** 0 **** --- 1,91 ---- + /*------------------------------------------------------------------------- + * + * kwlookup.c + * Key word lookup for PostgreSQL + * + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/common/kwlookup.c + * + *------------------------------------------------------------------------- + */ + #include "c.h" + + #include "common/kwlookup.h" + + + /* + * ScanKeywordLookup - see if a given word is a keyword + * + * The list of keywords to be matched against is passed as a ScanKeywordList. + * + * Returns the keyword number (0..N-1) of the keyword, or -1 if no match. + * Callers typically use the keyword number to index into information + * arrays, but that is no concern of this code. + * + * The match is done case-insensitively. Note that we deliberately use a + * dumbed-down case conversion that will only translate 'A'-'Z' into 'a'-'z', + * even if we are in a locale where tolower() would produce more or different + * translations. This is to conform to the SQL99 spec, which says that + * keywords are to be matched in this way even though non-keyword identifiers + * receive a different case-normalization mapping. 
+ */ + int + ScanKeywordLookup(const char *text, + const ScanKeywordList *keywords) + { + int len, + i; + char word[NAMEDATALEN]; + const char *kw_string; + const uint16 *kw_offsets; + const uint16 *low; + const uint16 *high; + + len = strlen(text); + /* We assume all keywords are shorter than NAMEDATALEN. */ + if (len >= NAMEDATALEN) + return -1; + + /* + * Apply an ASCII-only downcasing. We must not use tolower() since it may + * produce the wrong translation in some locales (eg, Turkish). + */ + for (i = 0; i < len; i++) + { + char ch = text[i]; + + if (ch >= 'A' && ch <= 'Z') + ch += 'a' - 'A'; + word[i] = ch; + } + word[len] = '\0'; + + /* + * Now do a binary search using plain strcmp() comparison. + */ + kw_string = keywords->kw_string; + kw_offsets = keywords->kw_offsets; + low = kw_offsets; + high = kw_offsets + (keywords->num_keywords - 1); + while (low <= high) + { + const uint16 *middle; + int difference; + + middle = low + (high - low) / 2; + difference = strcmp(kw_string + *middle, word); + if (difference == 0) + return middle - kw_offsets; + else if (difference < 0) + low = middle + 1; + else + high = middle - 1; + } + + return -1; + } diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c index 9b47b62..5c1732a 100644 *** a/src/fe_utils/string_utils.c --- b/src/fe_utils/string_utils.c *************** fmtId(const char *rawid) *** 104,114 **** * Note: ScanKeywordLookup() does case-insensitive comparison, but * that's fine, since we already know we have all-lower-case. */ ! const ScanKeyword *keyword = ScanKeywordLookup(rawid, ! ScanKeywords, ! NumScanKeywords); ! if (keyword != NULL && keyword->category != UNRESERVED_KEYWORD) need_quotes = true; } --- 104,112 ---- * Note: ScanKeywordLookup() does case-insensitive comparison, but * that's fine, since we already know we have all-lower-case. */ ! int kwnum = ScanKeywordLookup(rawid, &ScanKeywords); ! if (kwnum >= 0 && ScanKeywordCategories[kwnum] != UNRESERVED_KEYWORD) need_quotes = true; } diff --git a/src/include/common/keywords.h b/src/include/common/keywords.h index 8f22f32..fb18858 100644 *** a/src/include/common/keywords.h --- b/src/include/common/keywords.h *************** *** 1,7 **** /*------------------------------------------------------------------------- * * keywords.h ! * lexical token lookup for key words in PostgreSQL * * * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group --- 1,7 ---- /*------------------------------------------------------------------------- * * keywords.h ! * PostgreSQL's list of SQL keywords * * * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group *************** *** 14,44 **** #ifndef KEYWORDS_H #define KEYWORDS_H /* Keyword categories --- should match lists in gram.y */ #define UNRESERVED_KEYWORD 0 #define COL_NAME_KEYWORD 1 #define TYPE_FUNC_NAME_KEYWORD 2 #define RESERVED_KEYWORD 3 - - typedef struct ScanKeyword - { - const char *name; /* in lower case */ - int16 value; /* grammar's token code */ - int16 category; /* see codes above */ - } ScanKeyword; - #ifndef FRONTEND ! extern PGDLLIMPORT const ScanKeyword ScanKeywords[]; ! extern PGDLLIMPORT const int NumScanKeywords; #else ! extern const ScanKeyword ScanKeywords[]; ! 
extern const int NumScanKeywords; #endif - - extern const ScanKeyword *ScanKeywordLookup(const char *text, - const ScanKeyword *keywords, - int num_keywords); - #endif /* KEYWORDS_H */ --- 14,33 ---- #ifndef KEYWORDS_H #define KEYWORDS_H + #include "common/kwlookup.h" + /* Keyword categories --- should match lists in gram.y */ #define UNRESERVED_KEYWORD 0 #define COL_NAME_KEYWORD 1 #define TYPE_FUNC_NAME_KEYWORD 2 #define RESERVED_KEYWORD 3 #ifndef FRONTEND ! extern PGDLLIMPORT const ScanKeywordList ScanKeywords; ! extern PGDLLIMPORT const uint8 ScanKeywordCategories[]; #else ! extern const ScanKeywordList ScanKeywords; ! extern const uint8 ScanKeywordCategories[]; #endif #endif /* KEYWORDS_H */ diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h index ...3098df3 . *** a/src/include/common/kwlookup.h --- b/src/include/common/kwlookup.h *************** *** 0 **** --- 1,39 ---- + /*------------------------------------------------------------------------- + * + * kwlookup.h + * Key word lookup for PostgreSQL + * + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/common/kwlookup.h + * + *------------------------------------------------------------------------- + */ + #ifndef KWLOOKUP_H + #define KWLOOKUP_H + + /* + * This struct contains the data needed by ScanKeywordLookup to perform a + * search within a set of keywords. The contents are typically generated by + * src/tools/gen_keywordlist.pl from a header containing PG_KEYWORD macros. + */ + typedef struct ScanKeywordList + { + const char *kw_string; /* all keywords in order, separated by \0 */ + const uint16 *kw_offsets; /* offsets to the start of each keyword */ + int num_keywords; /* number of keywords */ + } ScanKeywordList; + + + extern int ScanKeywordLookup(const char *text, const ScanKeywordList *keywords); + + /* Code that wants to retrieve the text of the N'th keyword should use this. */ + static inline const char * + GetScanKeyword(int n, const ScanKeywordList *keywords) + { + return keywords->kw_string + keywords->kw_offsets[n]; + } + + #endif /* KWLOOKUP_H */ diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h index 0256d53..b8902d3 100644 *** a/src/include/parser/kwlist.h --- b/src/include/parser/kwlist.h *************** *** 2,8 **** * * kwlist.h * ! * The keyword list is kept in its own source file for possible use by * automatic tools. The exact representation of a keyword is determined * by the PG_KEYWORD macro, which is not defined in this file; it can * be defined by the caller for special purposes. --- 2,8 ---- * * kwlist.h * ! * The keyword lists are kept in their own source files for use by * automatic tools. The exact representation of a keyword is determined * by the PG_KEYWORD macro, which is not defined in this file; it can * be defined by the caller for special purposes. diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h index 009550f..91e1c83 100644 *** a/src/include/parser/scanner.h --- b/src/include/parser/scanner.h *************** typedef struct core_yy_extra_type *** 73,82 **** Size scanbuflen; /* ! * The keyword list to use. */ ! const ScanKeyword *keywords; ! int num_keywords; /* * Scanner settings to use. These are initialized from the corresponding --- 73,82 ---- Size scanbuflen; /* ! * The keyword list to use, and the associated grammar token codes. */ ! const ScanKeywordList *keywordlist; ! 
const uint16 *keyword_tokens; /* * Scanner settings to use. These are initialized from the corresponding *************** typedef struct core_yy_extra_type *** 116,126 **** typedef void *core_yyscan_t; /* Entry points in parser/scan.l */ extern core_yyscan_t scanner_init(const char *str, core_yy_extra_type *yyext, ! const ScanKeyword *keywords, ! int num_keywords); extern void scanner_finish(core_yyscan_t yyscanner); extern int core_yylex(core_YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner); --- 116,129 ---- typedef void *core_yyscan_t; + /* Constant data exported from parser/scan.l */ + extern PGDLLIMPORT const uint16 ScanKeywordTokens[]; + /* Entry points in parser/scan.l */ extern core_yyscan_t scanner_init(const char *str, core_yy_extra_type *yyext, ! const ScanKeywordList *keywordlist, ! const uint16 *keyword_tokens); extern void scanner_finish(core_yyscan_t yyscanner); extern int core_yylex(core_YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner); diff --git a/src/interfaces/ecpg/preproc/.gitignore b/src/interfaces/ecpg/preproc/.gitignore index 38ae2fe..958a826 100644 *** a/src/interfaces/ecpg/preproc/.gitignore --- b/src/interfaces/ecpg/preproc/.gitignore *************** *** 2,6 **** --- 2,8 ---- /preproc.c /preproc.h /pgc.c + /c_kwlist_d.h + /ecpg_kwlist_d.h /typename.c /ecpg diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile index 69ddd8e..9b145a1 100644 *** a/src/interfaces/ecpg/preproc/Makefile --- b/src/interfaces/ecpg/preproc/Makefile *************** OBJS= preproc.o pgc.o type.o ecpg.o outp *** 28,33 **** --- 28,35 ---- keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \ $(WIN32RES) + GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl + # Suppress parallel build to avoid a bug in GNU make 3.82 # (see comments in ../Makefile) ifeq ($(MAKE_VERSION),3.82) *************** preproc.y: ../../../backend/parser/gram. *** 53,61 **** $(PERL) $(srcdir)/parse.pl $(srcdir) < $< > $@ $(PERL) $(srcdir)/check_rules.pl $(srcdir) $< ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h ! distprep: preproc.y preproc.c preproc.h pgc.c install: all installdirs $(INSTALL_PROGRAM) ecpg$(X) '$(DESTDIR)$(bindir)' --- 55,73 ---- $(PERL) $(srcdir)/parse.pl $(srcdir) < $< > $@ $(PERL) $(srcdir)/check_rules.pl $(srcdir) $< + # generate keyword headers + c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST) + $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $< + + ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST) + $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $< + + # Force these dependencies to be known even without dependency info built: ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h + ecpg_keywords.o: ecpg_kwlist_d.h + c_keywords.o: c_kwlist_d.h ! distprep: preproc.y preproc.c preproc.h pgc.c c_kwlist_d.h ecpg_kwlist_d.h install: all installdirs $(INSTALL_PROGRAM) ecpg$(X) '$(DESTDIR)$(bindir)' *************** installdirs: *** 66,77 **** uninstall: rm -f '$(DESTDIR)$(bindir)/ecpg$(X)' clean distclean: rm -f *.o ecpg$(X) rm -f typename.c - # `make distclean' must not remove preproc.y, preproc.c, preproc.h, or pgc.c - # since we want to ship those files in the distribution for people with - # inadequate tools. Instead, `make maintainer-clean' will remove them. maintainer-clean: distclean ! 
rm -f preproc.y preproc.c preproc.h pgc.c --- 78,88 ---- uninstall: rm -f '$(DESTDIR)$(bindir)/ecpg$(X)' + # preproc.y, preproc.c, preproc.h, pgc.c, c_kwlist_d.h, and ecpg_kwlist_d.h + # are in the distribution tarball, so they are not cleaned here. clean distclean: rm -f *.o ecpg$(X) rm -f typename.c maintainer-clean: distclean ! rm -f preproc.y preproc.c preproc.h pgc.c c_kwlist_d.h ecpg_kwlist_d.h diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c index c367dbf..521992f 100644 *** a/src/interfaces/ecpg/preproc/c_keywords.c --- b/src/interfaces/ecpg/preproc/c_keywords.c *************** *** 14,85 **** #include "preproc_extern.h" #include "preproc.h" ! /* ! * List of (keyword-name, keyword-token-value) pairs. ! * ! * !!WARNING!!: This list must be sorted, because binary ! * search is used to locate entries. ! */ ! static const ScanKeyword ScanCKeywords[] = { ! /* name, value, category */ ! /* ! * category is not needed in ecpg, it is only here so we can share the ! * data structure with the backend ! */ ! {"VARCHAR", VARCHAR, 0}, ! {"auto", S_AUTO, 0}, ! {"bool", SQL_BOOL, 0}, ! {"char", CHAR_P, 0}, ! {"const", S_CONST, 0}, ! {"enum", ENUM_P, 0}, ! {"extern", S_EXTERN, 0}, ! {"float", FLOAT_P, 0}, ! {"hour", HOUR_P, 0}, ! {"int", INT_P, 0}, ! {"long", SQL_LONG, 0}, ! {"minute", MINUTE_P, 0}, ! {"month", MONTH_P, 0}, ! {"register", S_REGISTER, 0}, ! {"second", SECOND_P, 0}, ! {"short", SQL_SHORT, 0}, ! {"signed", SQL_SIGNED, 0}, ! {"static", S_STATIC, 0}, ! {"struct", SQL_STRUCT, 0}, ! {"to", TO, 0}, ! {"typedef", S_TYPEDEF, 0}, ! {"union", UNION, 0}, ! {"unsigned", SQL_UNSIGNED, 0}, ! {"varchar", VARCHAR, 0}, ! {"volatile", S_VOLATILE, 0}, ! {"year", YEAR_P, 0}, }; /* * Do a binary search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ ! const ScanKeyword * ScanCKeywordLookup(const char *text) { ! const ScanKeyword *low = &ScanCKeywords[0]; ! const ScanKeyword *high = &ScanCKeywords[lengthof(ScanCKeywords) - 1]; while (low <= high) { ! const ScanKeyword *middle; int difference; middle = low + (high - low) / 2; ! difference = strcmp(middle->name, text); if (difference == 0) ! return middle; else if (difference < 0) low = middle + 1; else high = middle - 1; } ! return NULL; } --- 14,67 ---- #include "preproc_extern.h" #include "preproc.h" ! /* ScanKeywordList lookup data for C keywords */ ! #include "c_kwlist_d.h" ! /* Token codes for C keywords */ ! #define PG_KEYWORD(kwname, value) value, ! ! static const uint16 ScanCKeywordTokens[] = { ! #include "c_kwlist.h" }; + #undef PG_KEYWORD + /* + * ScanCKeywordLookup - see if a given word is a keyword + * + * Returns the token value of the keyword, or -1 if no match. + * * Do a binary search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ ! int ScanCKeywordLookup(const char *text) { ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! ! kw_string = ScanCKeywords.kw_string; ! kw_offsets = ScanCKeywords.kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (ScanCKeywords.num_keywords - 1); while (low <= high) { ! const uint16 *middle; int difference; middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, text); if (difference == 0) ! return ScanCKeywordTokens[middle - kw_offsets]; else if (difference < 0) low = middle + 1; else high = middle - 1; } ! 
return -1; } diff --git a/src/interfaces/ecpg/preproc/c_kwlist.h b/src/interfaces/ecpg/preproc/c_kwlist.h index ...4545505 . *** a/src/interfaces/ecpg/preproc/c_kwlist.h --- b/src/interfaces/ecpg/preproc/c_kwlist.h *************** *** 0 **** --- 1,53 ---- + /*------------------------------------------------------------------------- + * + * c_kwlist.h + * + * The keyword lists are kept in their own source files for use by + * automatic tools. The exact representation of a keyword is determined + * by the PG_KEYWORD macro, which is not defined in this file; it can + * be defined by the caller for special purposes. + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/interfaces/ecpg/preproc/c_kwlist.h + * + *------------------------------------------------------------------------- + */ + + /* There is deliberately not an #ifndef C_KWLIST_H here. */ + + /* + * List of (keyword-name, keyword-token-value) pairs. + * + * !!WARNING!!: This list must be sorted by ASCII name, because binary + * search is used to locate entries. + */ + + /* name, value */ + PG_KEYWORD("VARCHAR", VARCHAR) + PG_KEYWORD("auto", S_AUTO) + PG_KEYWORD("bool", SQL_BOOL) + PG_KEYWORD("char", CHAR_P) + PG_KEYWORD("const", S_CONST) + PG_KEYWORD("enum", ENUM_P) + PG_KEYWORD("extern", S_EXTERN) + PG_KEYWORD("float", FLOAT_P) + PG_KEYWORD("hour", HOUR_P) + PG_KEYWORD("int", INT_P) + PG_KEYWORD("long", SQL_LONG) + PG_KEYWORD("minute", MINUTE_P) + PG_KEYWORD("month", MONTH_P) + PG_KEYWORD("register", S_REGISTER) + PG_KEYWORD("second", SECOND_P) + PG_KEYWORD("short", SQL_SHORT) + PG_KEYWORD("signed", SQL_SIGNED) + PG_KEYWORD("static", S_STATIC) + PG_KEYWORD("struct", SQL_STRUCT) + PG_KEYWORD("to", TO) + PG_KEYWORD("typedef", S_TYPEDEF) + PG_KEYWORD("union", UNION) + PG_KEYWORD("unsigned", SQL_UNSIGNED) + PG_KEYWORD("varchar", VARCHAR) + PG_KEYWORD("volatile", S_VOLATILE) + PG_KEYWORD("year", YEAR_P) diff --git a/src/interfaces/ecpg/preproc/ecpg_keywords.c b/src/interfaces/ecpg/preproc/ecpg_keywords.c index 37c97e1..4839c37 100644 *** a/src/interfaces/ecpg/preproc/ecpg_keywords.c --- b/src/interfaces/ecpg/preproc/ecpg_keywords.c *************** *** 16,97 **** #include "preproc_extern.h" #include "preproc.h" ! /* ! * List of (keyword-name, keyword-token-value) pairs. ! * ! * !!WARNING!!: This list must be sorted, because binary ! * search is used to locate entries. ! */ ! static const ScanKeyword ECPGScanKeywords[] = { ! /* name, value, category */ ! /* ! * category is not needed in ecpg, it is only here so we can share the ! * data structure with the backend ! */ ! {"allocate", SQL_ALLOCATE, 0}, ! {"autocommit", SQL_AUTOCOMMIT, 0}, ! {"bool", SQL_BOOL, 0}, ! {"break", SQL_BREAK, 0}, ! {"cardinality", SQL_CARDINALITY, 0}, ! {"connect", SQL_CONNECT, 0}, ! {"count", SQL_COUNT, 0}, ! {"datetime_interval_code", SQL_DATETIME_INTERVAL_CODE, 0}, ! {"datetime_interval_precision", SQL_DATETIME_INTERVAL_PRECISION, 0}, ! {"describe", SQL_DESCRIBE, 0}, ! {"descriptor", SQL_DESCRIPTOR, 0}, ! {"disconnect", SQL_DISCONNECT, 0}, ! {"found", SQL_FOUND, 0}, ! {"free", SQL_FREE, 0}, ! {"get", SQL_GET, 0}, ! {"go", SQL_GO, 0}, ! {"goto", SQL_GOTO, 0}, ! {"identified", SQL_IDENTIFIED, 0}, ! {"indicator", SQL_INDICATOR, 0}, ! {"key_member", SQL_KEY_MEMBER, 0}, ! {"length", SQL_LENGTH, 0}, ! {"long", SQL_LONG, 0}, ! {"nullable", SQL_NULLABLE, 0}, ! {"octet_length", SQL_OCTET_LENGTH, 0}, ! {"open", SQL_OPEN, 0}, ! {"output", SQL_OUTPUT, 0}, ! 
{"reference", SQL_REFERENCE, 0}, ! {"returned_length", SQL_RETURNED_LENGTH, 0}, ! {"returned_octet_length", SQL_RETURNED_OCTET_LENGTH, 0}, ! {"scale", SQL_SCALE, 0}, ! {"section", SQL_SECTION, 0}, ! {"short", SQL_SHORT, 0}, ! {"signed", SQL_SIGNED, 0}, ! {"sqlerror", SQL_SQLERROR, 0}, ! {"sqlprint", SQL_SQLPRINT, 0}, ! {"sqlwarning", SQL_SQLWARNING, 0}, ! {"stop", SQL_STOP, 0}, ! {"struct", SQL_STRUCT, 0}, ! {"unsigned", SQL_UNSIGNED, 0}, ! {"var", SQL_VAR, 0}, ! {"whenever", SQL_WHENEVER, 0}, }; /* * ScanECPGKeywordLookup - see if a given word is a keyword * ! * Returns a pointer to the ScanKeyword table entry, or NULL if no match. * Keywords are matched using the same case-folding rules as in the backend. */ ! const ScanKeyword * ScanECPGKeywordLookup(const char *text) { ! const ScanKeyword *res; /* First check SQL symbols defined by the backend. */ ! res = ScanKeywordLookup(text, SQLScanKeywords, NumSQLScanKeywords); ! if (res) ! return res; /* Try ECPG-specific keywords. */ ! res = ScanKeywordLookup(text, ECPGScanKeywords, lengthof(ECPGScanKeywords)); ! if (res) ! return res; ! return NULL; } --- 16,55 ---- #include "preproc_extern.h" #include "preproc.h" ! /* ScanKeywordList lookup data for ECPG keywords */ ! #include "ecpg_kwlist_d.h" ! /* Token codes for ECPG keywords */ ! #define PG_KEYWORD(kwname, value) value, ! ! static const uint16 ECPGScanKeywordTokens[] = { ! #include "ecpg_kwlist.h" }; + #undef PG_KEYWORD + + /* * ScanECPGKeywordLookup - see if a given word is a keyword * ! * Returns the token value of the keyword, or -1 if no match. ! * * Keywords are matched using the same case-folding rules as in the backend. */ ! int ScanECPGKeywordLookup(const char *text) { ! int kwnum; /* First check SQL symbols defined by the backend. */ ! kwnum = ScanKeywordLookup(text, &ScanKeywords); ! if (kwnum >= 0) ! return SQLScanKeywordTokens[kwnum]; /* Try ECPG-specific keywords. */ ! kwnum = ScanKeywordLookup(text, &ScanECPGKeywords); ! if (kwnum >= 0) ! return ECPGScanKeywordTokens[kwnum]; ! return -1; } diff --git a/src/interfaces/ecpg/preproc/ecpg_kwlist.h b/src/interfaces/ecpg/preproc/ecpg_kwlist.h index ...97ef254 . *** a/src/interfaces/ecpg/preproc/ecpg_kwlist.h --- b/src/interfaces/ecpg/preproc/ecpg_kwlist.h *************** *** 0 **** --- 1,68 ---- + /*------------------------------------------------------------------------- + * + * ecpg_kwlist.h + * + * The keyword lists are kept in their own source files for use by + * automatic tools. The exact representation of a keyword is determined + * by the PG_KEYWORD macro, which is not defined in this file; it can + * be defined by the caller for special purposes. + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/interfaces/ecpg/preproc/ecpg_kwlist.h + * + *------------------------------------------------------------------------- + */ + + /* There is deliberately not an #ifndef ECPG_KWLIST_H here. */ + + /* + * List of (keyword-name, keyword-token-value) pairs. + * + * !!WARNING!!: This list must be sorted by ASCII name, because binary + * search is used to locate entries. 
+ */ + + /* name, value */ + PG_KEYWORD("allocate", SQL_ALLOCATE) + PG_KEYWORD("autocommit", SQL_AUTOCOMMIT) + PG_KEYWORD("bool", SQL_BOOL) + PG_KEYWORD("break", SQL_BREAK) + PG_KEYWORD("cardinality", SQL_CARDINALITY) + PG_KEYWORD("connect", SQL_CONNECT) + PG_KEYWORD("count", SQL_COUNT) + PG_KEYWORD("datetime_interval_code", SQL_DATETIME_INTERVAL_CODE) + PG_KEYWORD("datetime_interval_precision", SQL_DATETIME_INTERVAL_PRECISION) + PG_KEYWORD("describe", SQL_DESCRIBE) + PG_KEYWORD("descriptor", SQL_DESCRIPTOR) + PG_KEYWORD("disconnect", SQL_DISCONNECT) + PG_KEYWORD("found", SQL_FOUND) + PG_KEYWORD("free", SQL_FREE) + PG_KEYWORD("get", SQL_GET) + PG_KEYWORD("go", SQL_GO) + PG_KEYWORD("goto", SQL_GOTO) + PG_KEYWORD("identified", SQL_IDENTIFIED) + PG_KEYWORD("indicator", SQL_INDICATOR) + PG_KEYWORD("key_member", SQL_KEY_MEMBER) + PG_KEYWORD("length", SQL_LENGTH) + PG_KEYWORD("long", SQL_LONG) + PG_KEYWORD("nullable", SQL_NULLABLE) + PG_KEYWORD("octet_length", SQL_OCTET_LENGTH) + PG_KEYWORD("open", SQL_OPEN) + PG_KEYWORD("output", SQL_OUTPUT) + PG_KEYWORD("reference", SQL_REFERENCE) + PG_KEYWORD("returned_length", SQL_RETURNED_LENGTH) + PG_KEYWORD("returned_octet_length", SQL_RETURNED_OCTET_LENGTH) + PG_KEYWORD("scale", SQL_SCALE) + PG_KEYWORD("section", SQL_SECTION) + PG_KEYWORD("short", SQL_SHORT) + PG_KEYWORD("signed", SQL_SIGNED) + PG_KEYWORD("sqlerror", SQL_SQLERROR) + PG_KEYWORD("sqlprint", SQL_SQLPRINT) + PG_KEYWORD("sqlwarning", SQL_SQLWARNING) + PG_KEYWORD("stop", SQL_STOP) + PG_KEYWORD("struct", SQL_STRUCT) + PG_KEYWORD("unsigned", SQL_UNSIGNED) + PG_KEYWORD("var", SQL_VAR) + PG_KEYWORD("whenever", SQL_WHENEVER) diff --git a/src/interfaces/ecpg/preproc/keywords.c b/src/interfaces/ecpg/preproc/keywords.c index 12409e9..0380409 100644 *** a/src/interfaces/ecpg/preproc/keywords.c --- b/src/interfaces/ecpg/preproc/keywords.c *************** *** 17,40 **** /* * This is much trickier than it looks. We are #include'ing kwlist.h ! * but the "value" numbers that go into the table are from preproc.h ! * not the backend's gram.h. Therefore this table will recognize all ! * keywords known to the backend, but will supply the token numbers used * by ecpg's grammar, which is what we need. The ecpg grammar must * define all the same token names the backend does, else we'll get * undefined-symbol failures in this compile. */ - #include "common/keywords.h" - #include "preproc_extern.h" #include "preproc.h" ! #define PG_KEYWORD(a,b,c) {a,b,c}, ! ! const ScanKeyword SQLScanKeywords[] = { #include "parser/kwlist.h" }; ! const int NumSQLScanKeywords = lengthof(SQLScanKeywords); --- 17,38 ---- /* * This is much trickier than it looks. We are #include'ing kwlist.h ! * but the token numbers that go into the table are from preproc.h ! * not the backend's gram.h. Therefore this token table will match ! * the ScanKeywords table supplied from common/keywords.c, including all ! * keywords known to the backend, but it will supply the token numbers used * by ecpg's grammar, which is what we need. The ecpg grammar must * define all the same token names the backend does, else we'll get * undefined-symbol failures in this compile. */ #include "preproc_extern.h" #include "preproc.h" + #define PG_KEYWORD(kwname, value, category) value, ! const uint16 SQLScanKeywordTokens[] = { #include "parser/kwlist.h" }; ! 
#undef PG_KEYWORD diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l index a60564c..3131f5f 100644 *** a/src/interfaces/ecpg/preproc/pgc.l --- b/src/interfaces/ecpg/preproc/pgc.l *************** cppline {space}*#([^i][A-Za-z]*|{if}|{ *** 920,938 **** } {identifier} { - const ScanKeyword *keyword; - if (!isdefine()) { /* Is it an SQL/ECPG keyword? */ ! keyword = ScanECPGKeywordLookup(yytext); ! if (keyword != NULL) ! return keyword->value; /* Is it a C keyword? */ ! keyword = ScanCKeywordLookup(yytext); ! if (keyword != NULL) ! return keyword->value; /* * None of the above. Return it as an identifier. --- 920,938 ---- } {identifier} { if (!isdefine()) { + int kwvalue; + /* Is it an SQL/ECPG keyword? */ ! kwvalue = ScanECPGKeywordLookup(yytext); ! if (kwvalue >= 0) ! return kwvalue; /* Is it a C keyword? */ ! kwvalue = ScanCKeywordLookup(yytext); ! if (kwvalue >= 0) ! return kwvalue; /* * None of the above. Return it as an identifier. *************** cppline {space}*#([^i][A-Za-z]*|{if}|{ *** 1010,1021 **** return CPP_LINE; } <C>{identifier} { - const ScanKeyword *keyword; - /* * Try to detect a function name: * look for identifiers at the global scope ! * keep the last identifier before the first '(' and '{' */ if (braces_open == 0 && parenths_open == 0) { if (current_function) --- 1010,1020 ---- return CPP_LINE; } <C>{identifier} { /* * Try to detect a function name: * look for identifiers at the global scope ! * keep the last identifier before the first '(' and '{' ! */ if (braces_open == 0 && parenths_open == 0) { if (current_function) *************** cppline {space}*#([^i][A-Za-z]*|{if}|{ *** 1026,1034 **** /* however, some defines have to be taken care of for compatibility */ if ((!INFORMIX_MODE || !isinformixdefine()) && !isdefine()) { ! keyword = ScanCKeywordLookup(yytext); ! if (keyword != NULL) ! return keyword->value; else { base_yylval.str = mm_strdup(yytext); --- 1025,1035 ---- /* however, some defines have to be taken care of for compatibility */ if ((!INFORMIX_MODE || !isinformixdefine()) && !isdefine()) { ! int kwvalue; ! ! kwvalue = ScanCKeywordLookup(yytext); ! if (kwvalue >= 0) ! return kwvalue; else { base_yylval.str = mm_strdup(yytext); diff --git a/src/interfaces/ecpg/preproc/preproc_extern.h b/src/interfaces/ecpg/preproc/preproc_extern.h index 13eda67..9746780 100644 *** a/src/interfaces/ecpg/preproc/preproc_extern.h --- b/src/interfaces/ecpg/preproc/preproc_extern.h *************** extern struct when when_error, *** 59,66 **** extern struct ECPGstruct_member *struct_member_list[STRUCT_DEPTH]; /* Globals from keywords.c */ ! extern const ScanKeyword SQLScanKeywords[]; ! extern const int NumSQLScanKeywords; /* functions */ --- 59,65 ---- extern struct ECPGstruct_member *struct_member_list[STRUCT_DEPTH]; /* Globals from keywords.c */ ! extern const uint16 SQLScanKeywordTokens[]; /* functions */ *************** extern void check_indicator(struct ECPGt *** 102,109 **** extern void remove_typedefs(int); extern void remove_variables(int); extern struct variable *new_variable(const char *, struct ECPGtype *, int); ! extern const ScanKeyword *ScanCKeywordLookup(const char *); ! extern const ScanKeyword *ScanECPGKeywordLookup(const char *text); extern void parser_init(void); extern int filtered_base_yylex(void); --- 101,108 ---- extern void remove_typedefs(int); extern void remove_variables(int); extern struct variable *new_variable(const char *, struct ECPGtype *, int); ! extern int ScanCKeywordLookup(const char *text); ! 
extern int ScanECPGKeywordLookup(const char *text); extern void parser_init(void); extern int filtered_base_yylex(void); diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore index ff6ac96..3ab9a22 100644 *** a/src/pl/plpgsql/src/.gitignore --- b/src/pl/plpgsql/src/.gitignore *************** *** 1,5 **** --- 1,7 ---- /pl_gram.c /pl_gram.h + /pl_reserved_kwlist_d.h + /pl_unreserved_kwlist_d.h /plerrcodes.h /log/ /results/ diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile index 25a5a9d..9dd4a74 100644 *** a/src/pl/plpgsql/src/Makefile --- b/src/pl/plpgsql/src/Makefile *************** REGRESS_OPTS = --dbname=$(PL_TESTDB) *** 29,34 **** --- 29,36 ---- REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \ plpgsql_cache plpgsql_transaction plpgsql_varprops + GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl + all: all-lib # Shared library stuff *************** uninstall-headers: *** 61,66 **** --- 63,69 ---- # Force these dependencies to be known even without dependency info built: pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h + pl_scanner.o: pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h # See notes in src/backend/parser/Makefile about the following two rules pl_gram.h: pl_gram.c *************** pl_gram.c: BISONFLAGS += -d *** 72,77 **** --- 75,87 ---- plerrcodes.h: $(top_srcdir)/src/backend/utils/errcodes.txt generate-plerrcodes.pl $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@ + # generate keyword headers for the scanner + pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST) + $(PERL) $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $< + + pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST) + $(PERL) $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $< + check: submake $(pg_regress_check) $(REGRESS_OPTS) $(REGRESS) *************** submake: *** 84,96 **** $(MAKE) -C $(top_builddir)/src/test/regress pg_regress$(X) ! distprep: pl_gram.h pl_gram.c plerrcodes.h ! # pl_gram.c, pl_gram.h and plerrcodes.h are in the distribution tarball, ! # so they are not cleaned here. clean distclean: clean-lib rm -f $(OBJS) rm -rf $(pg_regress_clean_files) maintainer-clean: distclean ! rm -f pl_gram.c pl_gram.h plerrcodes.h --- 94,107 ---- $(MAKE) -C $(top_builddir)/src/test/regress pg_regress$(X) ! distprep: pl_gram.h pl_gram.c plerrcodes.h pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h ! # pl_gram.c, pl_gram.h, plerrcodes.h, pl_reserved_kwlist_d.h, and ! # pl_unreserved_kwlist_d.h are in the distribution tarball, so they ! # are not cleaned here. clean distclean: clean-lib rm -f $(OBJS) rm -rf $(pg_regress_clean_files) maintainer-clean: distclean ! rm -f pl_gram.c pl_gram.h plerrcodes.h pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h diff --git a/src/pl/plpgsql/src/pl_reserved_kwlist.h b/src/pl/plpgsql/src/pl_reserved_kwlist.h index ...5c2e0c1 . *** a/src/pl/plpgsql/src/pl_reserved_kwlist.h --- b/src/pl/plpgsql/src/pl_reserved_kwlist.h *************** *** 0 **** --- 1,53 ---- + /*------------------------------------------------------------------------- + * + * pl_reserved_kwlist.h + * + * The keyword lists are kept in their own source files for use by + * automatic tools. The exact representation of a keyword is determined + * by the PG_KEYWORD macro, which is not defined in this file; it can + * be defined by the caller for special purposes. 
+ * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/pl/plpgsql/src/pl_reserved_kwlist.h + * + *------------------------------------------------------------------------- + */ + + /* There is deliberately not an #ifndef PL_RESERVED_KWLIST_H here. */ + + /* + * List of (keyword-name, keyword-token-value) pairs. + * + * Be careful not to put the same word in both lists. + * + * !!WARNING!!: This list must be sorted by ASCII name, because binary + * search is used to locate entries. + */ + + /* name, value */ + PG_KEYWORD("all", K_ALL) + PG_KEYWORD("begin", K_BEGIN) + PG_KEYWORD("by", K_BY) + PG_KEYWORD("case", K_CASE) + PG_KEYWORD("declare", K_DECLARE) + PG_KEYWORD("else", K_ELSE) + PG_KEYWORD("end", K_END) + PG_KEYWORD("execute", K_EXECUTE) + PG_KEYWORD("for", K_FOR) + PG_KEYWORD("foreach", K_FOREACH) + PG_KEYWORD("from", K_FROM) + PG_KEYWORD("if", K_IF) + PG_KEYWORD("in", K_IN) + PG_KEYWORD("into", K_INTO) + PG_KEYWORD("loop", K_LOOP) + PG_KEYWORD("not", K_NOT) + PG_KEYWORD("null", K_NULL) + PG_KEYWORD("or", K_OR) + PG_KEYWORD("strict", K_STRICT) + PG_KEYWORD("then", K_THEN) + PG_KEYWORD("to", K_TO) + PG_KEYWORD("using", K_USING) + PG_KEYWORD("when", K_WHEN) + PG_KEYWORD("while", K_WHILE) diff --git a/src/pl/plpgsql/src/pl_scanner.c b/src/pl/plpgsql/src/pl_scanner.c index 8340628..c260438 100644 *** a/src/pl/plpgsql/src/pl_scanner.c --- b/src/pl/plpgsql/src/pl_scanner.c *************** *** 22,37 **** #include "pl_gram.h" /* must be after parser/scanner.h */ - #define PG_KEYWORD(a,b,c) {a,b,c}, - - /* Klugy flag to tell scanner how to look up identifiers */ IdentifierLookup plpgsql_IdentifierLookup = IDENTIFIER_LOOKUP_NORMAL; /* * A word about keywords: * ! * We keep reserved and unreserved keywords in separate arrays. The * reserved keywords are passed to the core scanner, so they will be * recognized before (and instead of) any variable name. Unreserved words * are checked for separately, usually after determining that the identifier --- 22,36 ---- #include "pl_gram.h" /* must be after parser/scanner.h */ /* Klugy flag to tell scanner how to look up identifiers */ IdentifierLookup plpgsql_IdentifierLookup = IDENTIFIER_LOOKUP_NORMAL; /* * A word about keywords: * ! * We keep reserved and unreserved keywords in separate headers. Be careful ! * not to put the same word in both headers. Also be sure that pl_gram.y's ! * unreserved_keyword production agrees with the unreserved header. The * reserved keywords are passed to the core scanner, so they will be * recognized before (and instead of) any variable name. Unreserved words * are checked for separately, usually after determining that the identifier *************** IdentifierLookup plpgsql_IdentifierLooku *** 57,186 **** * BEGIN BY DECLARE EXECUTE FOREACH IF LOOP STRICT WHILE */ ! /* ! * Lists of keyword (name, token-value, category) entries. ! * ! * !!WARNING!!: These lists must be sorted by ASCII name, because binary ! * search is used to locate entries. ! * ! * Be careful not to put the same word in both lists. Also be sure that ! * pl_gram.y's unreserved_keyword production agrees with the second list. ! */ ! static const ScanKeyword reserved_keywords[] = { ! PG_KEYWORD("all", K_ALL, RESERVED_KEYWORD) ! PG_KEYWORD("begin", K_BEGIN, RESERVED_KEYWORD) ! PG_KEYWORD("by", K_BY, RESERVED_KEYWORD) ! PG_KEYWORD("case", K_CASE, RESERVED_KEYWORD) ! PG_KEYWORD("declare", K_DECLARE, RESERVED_KEYWORD) ! 
PG_KEYWORD("else", K_ELSE, RESERVED_KEYWORD) ! PG_KEYWORD("end", K_END, RESERVED_KEYWORD) ! PG_KEYWORD("execute", K_EXECUTE, RESERVED_KEYWORD) ! PG_KEYWORD("for", K_FOR, RESERVED_KEYWORD) ! PG_KEYWORD("foreach", K_FOREACH, RESERVED_KEYWORD) ! PG_KEYWORD("from", K_FROM, RESERVED_KEYWORD) ! PG_KEYWORD("if", K_IF, RESERVED_KEYWORD) ! PG_KEYWORD("in", K_IN, RESERVED_KEYWORD) ! PG_KEYWORD("into", K_INTO, RESERVED_KEYWORD) ! PG_KEYWORD("loop", K_LOOP, RESERVED_KEYWORD) ! PG_KEYWORD("not", K_NOT, RESERVED_KEYWORD) ! PG_KEYWORD("null", K_NULL, RESERVED_KEYWORD) ! PG_KEYWORD("or", K_OR, RESERVED_KEYWORD) ! PG_KEYWORD("strict", K_STRICT, RESERVED_KEYWORD) ! PG_KEYWORD("then", K_THEN, RESERVED_KEYWORD) ! PG_KEYWORD("to", K_TO, RESERVED_KEYWORD) ! PG_KEYWORD("using", K_USING, RESERVED_KEYWORD) ! PG_KEYWORD("when", K_WHEN, RESERVED_KEYWORD) ! PG_KEYWORD("while", K_WHILE, RESERVED_KEYWORD) ! }; ! static const int num_reserved_keywords = lengthof(reserved_keywords); ! static const ScanKeyword unreserved_keywords[] = { ! PG_KEYWORD("absolute", K_ABSOLUTE, UNRESERVED_KEYWORD) ! PG_KEYWORD("alias", K_ALIAS, UNRESERVED_KEYWORD) ! PG_KEYWORD("array", K_ARRAY, UNRESERVED_KEYWORD) ! PG_KEYWORD("assert", K_ASSERT, UNRESERVED_KEYWORD) ! PG_KEYWORD("backward", K_BACKWARD, UNRESERVED_KEYWORD) ! PG_KEYWORD("call", K_CALL, UNRESERVED_KEYWORD) ! PG_KEYWORD("close", K_CLOSE, UNRESERVED_KEYWORD) ! PG_KEYWORD("collate", K_COLLATE, UNRESERVED_KEYWORD) ! PG_KEYWORD("column", K_COLUMN, UNRESERVED_KEYWORD) ! PG_KEYWORD("column_name", K_COLUMN_NAME, UNRESERVED_KEYWORD) ! PG_KEYWORD("commit", K_COMMIT, UNRESERVED_KEYWORD) ! PG_KEYWORD("constant", K_CONSTANT, UNRESERVED_KEYWORD) ! PG_KEYWORD("constraint", K_CONSTRAINT, UNRESERVED_KEYWORD) ! PG_KEYWORD("constraint_name", K_CONSTRAINT_NAME, UNRESERVED_KEYWORD) ! PG_KEYWORD("continue", K_CONTINUE, UNRESERVED_KEYWORD) ! PG_KEYWORD("current", K_CURRENT, UNRESERVED_KEYWORD) ! PG_KEYWORD("cursor", K_CURSOR, UNRESERVED_KEYWORD) ! PG_KEYWORD("datatype", K_DATATYPE, UNRESERVED_KEYWORD) ! PG_KEYWORD("debug", K_DEBUG, UNRESERVED_KEYWORD) ! PG_KEYWORD("default", K_DEFAULT, UNRESERVED_KEYWORD) ! PG_KEYWORD("detail", K_DETAIL, UNRESERVED_KEYWORD) ! PG_KEYWORD("diagnostics", K_DIAGNOSTICS, UNRESERVED_KEYWORD) ! PG_KEYWORD("do", K_DO, UNRESERVED_KEYWORD) ! PG_KEYWORD("dump", K_DUMP, UNRESERVED_KEYWORD) ! PG_KEYWORD("elseif", K_ELSIF, UNRESERVED_KEYWORD) ! PG_KEYWORD("elsif", K_ELSIF, UNRESERVED_KEYWORD) ! PG_KEYWORD("errcode", K_ERRCODE, UNRESERVED_KEYWORD) ! PG_KEYWORD("error", K_ERROR, UNRESERVED_KEYWORD) ! PG_KEYWORD("exception", K_EXCEPTION, UNRESERVED_KEYWORD) ! PG_KEYWORD("exit", K_EXIT, UNRESERVED_KEYWORD) ! PG_KEYWORD("fetch", K_FETCH, UNRESERVED_KEYWORD) ! PG_KEYWORD("first", K_FIRST, UNRESERVED_KEYWORD) ! PG_KEYWORD("forward", K_FORWARD, UNRESERVED_KEYWORD) ! PG_KEYWORD("get", K_GET, UNRESERVED_KEYWORD) ! PG_KEYWORD("hint", K_HINT, UNRESERVED_KEYWORD) ! PG_KEYWORD("import", K_IMPORT, UNRESERVED_KEYWORD) ! PG_KEYWORD("info", K_INFO, UNRESERVED_KEYWORD) ! PG_KEYWORD("insert", K_INSERT, UNRESERVED_KEYWORD) ! PG_KEYWORD("is", K_IS, UNRESERVED_KEYWORD) ! PG_KEYWORD("last", K_LAST, UNRESERVED_KEYWORD) ! PG_KEYWORD("log", K_LOG, UNRESERVED_KEYWORD) ! PG_KEYWORD("message", K_MESSAGE, UNRESERVED_KEYWORD) ! PG_KEYWORD("message_text", K_MESSAGE_TEXT, UNRESERVED_KEYWORD) ! PG_KEYWORD("move", K_MOVE, UNRESERVED_KEYWORD) ! PG_KEYWORD("next", K_NEXT, UNRESERVED_KEYWORD) ! PG_KEYWORD("no", K_NO, UNRESERVED_KEYWORD) ! PG_KEYWORD("notice", K_NOTICE, UNRESERVED_KEYWORD) ! 
PG_KEYWORD("open", K_OPEN, UNRESERVED_KEYWORD) ! PG_KEYWORD("option", K_OPTION, UNRESERVED_KEYWORD) ! PG_KEYWORD("perform", K_PERFORM, UNRESERVED_KEYWORD) ! PG_KEYWORD("pg_context", K_PG_CONTEXT, UNRESERVED_KEYWORD) ! PG_KEYWORD("pg_datatype_name", K_PG_DATATYPE_NAME, UNRESERVED_KEYWORD) ! PG_KEYWORD("pg_exception_context", K_PG_EXCEPTION_CONTEXT, UNRESERVED_KEYWORD) ! PG_KEYWORD("pg_exception_detail", K_PG_EXCEPTION_DETAIL, UNRESERVED_KEYWORD) ! PG_KEYWORD("pg_exception_hint", K_PG_EXCEPTION_HINT, UNRESERVED_KEYWORD) ! PG_KEYWORD("print_strict_params", K_PRINT_STRICT_PARAMS, UNRESERVED_KEYWORD) ! PG_KEYWORD("prior", K_PRIOR, UNRESERVED_KEYWORD) ! PG_KEYWORD("query", K_QUERY, UNRESERVED_KEYWORD) ! PG_KEYWORD("raise", K_RAISE, UNRESERVED_KEYWORD) ! PG_KEYWORD("relative", K_RELATIVE, UNRESERVED_KEYWORD) ! PG_KEYWORD("reset", K_RESET, UNRESERVED_KEYWORD) ! PG_KEYWORD("return", K_RETURN, UNRESERVED_KEYWORD) ! PG_KEYWORD("returned_sqlstate", K_RETURNED_SQLSTATE, UNRESERVED_KEYWORD) ! PG_KEYWORD("reverse", K_REVERSE, UNRESERVED_KEYWORD) ! PG_KEYWORD("rollback", K_ROLLBACK, UNRESERVED_KEYWORD) ! PG_KEYWORD("row_count", K_ROW_COUNT, UNRESERVED_KEYWORD) ! PG_KEYWORD("rowtype", K_ROWTYPE, UNRESERVED_KEYWORD) ! PG_KEYWORD("schema", K_SCHEMA, UNRESERVED_KEYWORD) ! PG_KEYWORD("schema_name", K_SCHEMA_NAME, UNRESERVED_KEYWORD) ! PG_KEYWORD("scroll", K_SCROLL, UNRESERVED_KEYWORD) ! PG_KEYWORD("set", K_SET, UNRESERVED_KEYWORD) ! PG_KEYWORD("slice", K_SLICE, UNRESERVED_KEYWORD) ! PG_KEYWORD("sqlstate", K_SQLSTATE, UNRESERVED_KEYWORD) ! PG_KEYWORD("stacked", K_STACKED, UNRESERVED_KEYWORD) ! PG_KEYWORD("table", K_TABLE, UNRESERVED_KEYWORD) ! PG_KEYWORD("table_name", K_TABLE_NAME, UNRESERVED_KEYWORD) ! PG_KEYWORD("type", K_TYPE, UNRESERVED_KEYWORD) ! PG_KEYWORD("use_column", K_USE_COLUMN, UNRESERVED_KEYWORD) ! PG_KEYWORD("use_variable", K_USE_VARIABLE, UNRESERVED_KEYWORD) ! PG_KEYWORD("variable_conflict", K_VARIABLE_CONFLICT, UNRESERVED_KEYWORD) ! PG_KEYWORD("warning", K_WARNING, UNRESERVED_KEYWORD) }; ! static const int num_unreserved_keywords = lengthof(unreserved_keywords); /* * This macro must recognize all tokens that can immediately precede a --- 56,77 ---- * BEGIN BY DECLARE EXECUTE FOREACH IF LOOP STRICT WHILE */ ! /* ScanKeywordList lookup data for PL/pgSQL keywords */ ! #include "pl_reserved_kwlist_d.h" ! #include "pl_unreserved_kwlist_d.h" ! /* Token codes for PL/pgSQL keywords */ ! #define PG_KEYWORD(kwname, value) value, ! static const uint16 ReservedPLKeywordTokens[] = { ! #include "pl_reserved_kwlist.h" ! }; ! static const uint16 UnreservedPLKeywordTokens[] = { ! #include "pl_unreserved_kwlist.h" }; ! #undef PG_KEYWORD /* * This macro must recognize all tokens that can immediately precede a *************** plpgsql_yylex(void) *** 256,262 **** { int tok1; TokenAuxData aux1; ! const ScanKeyword *kw; tok1 = internal_yylex(&aux1); if (tok1 == IDENT || tok1 == PARAM) --- 147,153 ---- { int tok1; TokenAuxData aux1; ! int kwnum; tok1 = internal_yylex(&aux1); if (tok1 == IDENT || tok1 == PARAM) *************** plpgsql_yylex(void) *** 333,344 **** &aux1.lval.word)) tok1 = T_DATUM; else if (!aux1.lval.word.quoted && ! (kw = ScanKeywordLookup(aux1.lval.word.ident, ! unreserved_keywords, ! num_unreserved_keywords))) { ! aux1.lval.keyword = kw->name; ! tok1 = kw->value; } else tok1 = T_WORD; --- 224,235 ---- &aux1.lval.word)) tok1 = T_DATUM; else if (!aux1.lval.word.quoted && ! (kwnum = ScanKeywordLookup(aux1.lval.word.ident, ! &UnreservedPLKeywords)) >= 0) { ! aux1.lval.keyword = GetScanKeyword(kwnum, ! 
&UnreservedPLKeywords); ! tok1 = UnreservedPLKeywordTokens[kwnum]; } else tok1 = T_WORD; *************** plpgsql_yylex(void) *** 375,386 **** &aux1.lval.word)) tok1 = T_DATUM; else if (!aux1.lval.word.quoted && ! (kw = ScanKeywordLookup(aux1.lval.word.ident, ! unreserved_keywords, ! num_unreserved_keywords))) { ! aux1.lval.keyword = kw->name; ! tok1 = kw->value; } else tok1 = T_WORD; --- 266,277 ---- &aux1.lval.word)) tok1 = T_DATUM; else if (!aux1.lval.word.quoted && ! (kwnum = ScanKeywordLookup(aux1.lval.word.ident, ! &UnreservedPLKeywords)) >= 0) { ! aux1.lval.keyword = GetScanKeyword(kwnum, ! &UnreservedPLKeywords); ! tok1 = UnreservedPLKeywordTokens[kwnum]; } else tok1 = T_WORD; *************** plpgsql_token_is_unreserved_keyword(int *** 497,505 **** { int i; ! for (i = 0; i < num_unreserved_keywords; i++) { ! if (unreserved_keywords[i].value == token) return true; } return false; --- 388,396 ---- { int i; ! for (i = 0; i < lengthof(UnreservedPLKeywordTokens); i++) { ! if (UnreservedPLKeywordTokens[i] == token) return true; } return false; *************** plpgsql_scanner_init(const char *str) *** 696,702 **** { /* Start up the core scanner */ yyscanner = scanner_init(str, &core_yy, ! reserved_keywords, num_reserved_keywords); /* * scanorig points to the original string, which unlike the scanner's --- 587,593 ---- { /* Start up the core scanner */ yyscanner = scanner_init(str, &core_yy, ! &ReservedPLKeywords, ReservedPLKeywordTokens); /* * scanorig points to the original string, which unlike the scanner's diff --git a/src/pl/plpgsql/src/pl_unreserved_kwlist.h b/src/pl/plpgsql/src/pl_unreserved_kwlist.h index ...ef2aea0 . *** a/src/pl/plpgsql/src/pl_unreserved_kwlist.h --- b/src/pl/plpgsql/src/pl_unreserved_kwlist.h *************** *** 0 **** --- 1,111 ---- + /*------------------------------------------------------------------------- + * + * pl_unreserved_kwlist.h + * + * The keyword lists are kept in their own source files for use by + * automatic tools. The exact representation of a keyword is determined + * by the PG_KEYWORD macro, which is not defined in this file; it can + * be defined by the caller for special purposes. + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/pl/plpgsql/src/pl_unreserved_kwlist.h + * + *------------------------------------------------------------------------- + */ + + /* There is deliberately not an #ifndef PL_UNRESERVED_KWLIST_H here. */ + + /* + * List of (keyword-name, keyword-token-value) pairs. + * + * Be careful not to put the same word in both lists. Also be sure that + * pl_gram.y's unreserved_keyword production agrees with this list. + * + * !!WARNING!!: This list must be sorted by ASCII name, because binary + * search is used to locate entries. 
+ */ + + /* name, value */ + PG_KEYWORD("absolute", K_ABSOLUTE) + PG_KEYWORD("alias", K_ALIAS) + PG_KEYWORD("array", K_ARRAY) + PG_KEYWORD("assert", K_ASSERT) + PG_KEYWORD("backward", K_BACKWARD) + PG_KEYWORD("call", K_CALL) + PG_KEYWORD("close", K_CLOSE) + PG_KEYWORD("collate", K_COLLATE) + PG_KEYWORD("column", K_COLUMN) + PG_KEYWORD("column_name", K_COLUMN_NAME) + PG_KEYWORD("commit", K_COMMIT) + PG_KEYWORD("constant", K_CONSTANT) + PG_KEYWORD("constraint", K_CONSTRAINT) + PG_KEYWORD("constraint_name", K_CONSTRAINT_NAME) + PG_KEYWORD("continue", K_CONTINUE) + PG_KEYWORD("current", K_CURRENT) + PG_KEYWORD("cursor", K_CURSOR) + PG_KEYWORD("datatype", K_DATATYPE) + PG_KEYWORD("debug", K_DEBUG) + PG_KEYWORD("default", K_DEFAULT) + PG_KEYWORD("detail", K_DETAIL) + PG_KEYWORD("diagnostics", K_DIAGNOSTICS) + PG_KEYWORD("do", K_DO) + PG_KEYWORD("dump", K_DUMP) + PG_KEYWORD("elseif", K_ELSIF) + PG_KEYWORD("elsif", K_ELSIF) + PG_KEYWORD("errcode", K_ERRCODE) + PG_KEYWORD("error", K_ERROR) + PG_KEYWORD("exception", K_EXCEPTION) + PG_KEYWORD("exit", K_EXIT) + PG_KEYWORD("fetch", K_FETCH) + PG_KEYWORD("first", K_FIRST) + PG_KEYWORD("forward", K_FORWARD) + PG_KEYWORD("get", K_GET) + PG_KEYWORD("hint", K_HINT) + PG_KEYWORD("import", K_IMPORT) + PG_KEYWORD("info", K_INFO) + PG_KEYWORD("insert", K_INSERT) + PG_KEYWORD("is", K_IS) + PG_KEYWORD("last", K_LAST) + PG_KEYWORD("log", K_LOG) + PG_KEYWORD("message", K_MESSAGE) + PG_KEYWORD("message_text", K_MESSAGE_TEXT) + PG_KEYWORD("move", K_MOVE) + PG_KEYWORD("next", K_NEXT) + PG_KEYWORD("no", K_NO) + PG_KEYWORD("notice", K_NOTICE) + PG_KEYWORD("open", K_OPEN) + PG_KEYWORD("option", K_OPTION) + PG_KEYWORD("perform", K_PERFORM) + PG_KEYWORD("pg_context", K_PG_CONTEXT) + PG_KEYWORD("pg_datatype_name", K_PG_DATATYPE_NAME) + PG_KEYWORD("pg_exception_context", K_PG_EXCEPTION_CONTEXT) + PG_KEYWORD("pg_exception_detail", K_PG_EXCEPTION_DETAIL) + PG_KEYWORD("pg_exception_hint", K_PG_EXCEPTION_HINT) + PG_KEYWORD("print_strict_params", K_PRINT_STRICT_PARAMS) + PG_KEYWORD("prior", K_PRIOR) + PG_KEYWORD("query", K_QUERY) + PG_KEYWORD("raise", K_RAISE) + PG_KEYWORD("relative", K_RELATIVE) + PG_KEYWORD("reset", K_RESET) + PG_KEYWORD("return", K_RETURN) + PG_KEYWORD("returned_sqlstate", K_RETURNED_SQLSTATE) + PG_KEYWORD("reverse", K_REVERSE) + PG_KEYWORD("rollback", K_ROLLBACK) + PG_KEYWORD("row_count", K_ROW_COUNT) + PG_KEYWORD("rowtype", K_ROWTYPE) + PG_KEYWORD("schema", K_SCHEMA) + PG_KEYWORD("schema_name", K_SCHEMA_NAME) + PG_KEYWORD("scroll", K_SCROLL) + PG_KEYWORD("set", K_SET) + PG_KEYWORD("slice", K_SLICE) + PG_KEYWORD("sqlstate", K_SQLSTATE) + PG_KEYWORD("stacked", K_STACKED) + PG_KEYWORD("table", K_TABLE) + PG_KEYWORD("table_name", K_TABLE_NAME) + PG_KEYWORD("type", K_TYPE) + PG_KEYWORD("use_column", K_USE_COLUMN) + PG_KEYWORD("use_variable", K_USE_VARIABLE) + PG_KEYWORD("variable_conflict", K_VARIABLE_CONFLICT) + PG_KEYWORD("warning", K_WARNING) diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl index ...eb5ed65 . *** a/src/tools/gen_keywordlist.pl --- b/src/tools/gen_keywordlist.pl *************** *** 0 **** --- 1,148 ---- + #---------------------------------------------------------------------- + # + # gen_keywordlist.pl + # Perl script that transforms a list of keywords into a ScanKeywordList + # data structure that can be passed to ScanKeywordLookup(). + # + # The input is a C header file containing a series of macro calls + # PG_KEYWORD("keyword", ...) + # Lines not starting with PG_KEYWORD are ignored. 
The keywords are + # implicitly numbered 0..N-1 in order of appearance in the header file. + # Currently, the keywords are required to appear in ASCII order. + # + # The output is a C header file that defines a "const ScanKeywordList" + # variable named according to the -v switch ("ScanKeywords" by default). + # The variable is marked "static" unless the -e switch is given. + # + # + # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + # Portions Copyright (c) 1994, Regents of the University of California + # + # src/tools/gen_keywordlist.pl + # + #---------------------------------------------------------------------- + + use strict; + use warnings; + use Getopt::Long; + + my $output_path = ''; + my $extern = 0; + my $varname = 'ScanKeywords'; + + GetOptions( + 'output:s' => \$output_path, + 'extern' => \$extern, + 'varname:s' => \$varname) || usage(); + + my $kw_input_file = shift @ARGV || die "No input file.\n"; + + # Make sure output_path ends in a slash if needed. + if ($output_path ne '' && substr($output_path, -1) ne '/') + { + $output_path .= '/'; + } + + $kw_input_file =~ /(\w+)\.h$/ || die "Input file must be named something.h.\n"; + my $base_filename = $1 . '_d'; + my $kw_def_file = $output_path . $base_filename . '.h'; + + open(my $kif, '<', $kw_input_file) || die "$kw_input_file: $!\n"; + open(my $kwdef, '>', $kw_def_file) || die "$kw_def_file: $!\n"; + + # Opening boilerplate for keyword definition header. + printf $kwdef <<EOM, $base_filename, uc $base_filename, uc $base_filename; + /*------------------------------------------------------------------------- + * + * %s.h + * List of keywords represented as a ScanKeywordList. + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * NOTES + * ****************************** + * *** DO NOT EDIT THIS FILE! *** + * ****************************** + * + * It has been GENERATED by src/tools/gen_keywordlist.pl + * + *------------------------------------------------------------------------- + */ + + #ifndef %s_H + #define %s_H + + #include "common/kwlookup.h" + + EOM + + # Parse input file for keyword names. + my @keywords; + while (<$kif>) + { + if (/^PG_KEYWORD\("(\w+)"/) + { + push @keywords, $1; + } + } + + # Error out if the keyword names are not in ASCII order. + for my $i (0..$#keywords - 1) + { + die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n| + if ($keywords[$i] cmp $keywords[$i + 1]) >= 0; + } + + # Emit the string containing all the keywords. + + printf $kwdef qq|static const char %s_kw_string[] =\n\t"|, $varname; + print $kwdef join qq|\\0"\n\t"|, @keywords; + print $kwdef qq|";\n\n|; + + # Emit an array of numerical offsets which will be used to index into the + # keyword string. + + printf $kwdef "static const uint16 %s_kw_offsets[] = {\n", $varname; + + my $offset = 0; + foreach my $name (@keywords) + { + print $kwdef "\t$offset,\n"; + + # Calculate the cumulative offset of the next keyword, + # taking into account the null terminator. + $offset += length($name) + 1; + } + + print $kwdef "};\n\n"; + + # Emit a macro defining the number of keywords. + + printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords; + + # Emit the struct that wraps all this lookup info into one variable. 
+ + print $kwdef "static " if !$extern; + printf $kwdef "const ScanKeywordList %s = {\n", $varname; + printf $kwdef qq|\t%s_kw_string,\n|, $varname; + printf $kwdef qq|\t%s_kw_offsets,\n|, $varname; + printf $kwdef qq|\t%s_NUM_KEYWORDS\n|, uc $varname; + print $kwdef "};\n\n"; + + printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename; + + + sub usage + { + die <<EOM; + Usage: gen_keywordlist.pl [--output/-o <path>] [--varname/-v <varname>] [--extern/-e] input_file + --output Output directory (default '.') + --varname Name for ScanKeywordList variable (default 'ScanKeywords') + --extern Allow the ScanKeywordList variable to be globally visible + + gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList. + The output filename is derived from the input file by inserting _d, + for example kwlist_d.h is produced from kwlist.h. + EOM + } diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm index eb2346b..937bf18 100644 *** a/src/tools/msvc/Solution.pm --- b/src/tools/msvc/Solution.pm *************** sub GenerateFiles *** 410,415 **** --- 410,451 ---- } if (IsNewer( + 'src/common/kwlist_d.h', + 'src/include/parser/kwlist.h')) + { + print "Generating kwlist_d.h...\n"; + system('perl src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h'); + } + + if (IsNewer( + 'src/pl/plpgsql/src/pl_reserved_kwlist_d.h', + 'src/pl/plpgsql/src/pl_reserved_kwlist.h') + || IsNewer( + 'src/pl/plpgsql/src/pl_unreserved_kwlist_d.h', + 'src/pl/plpgsql/src/pl_unreserved_kwlist.h')) + { + print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n"; + chdir('src/pl/plpgsql/src'); + system('perl ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h'); + system('perl ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h'); + chdir('../../../..'); + } + + if (IsNewer( + 'src/interfaces/ecpg/preproc/c_kwlist_d.h', + 'src/interfaces/ecpg/preproc/c_kwlist.h') + || IsNewer( + 'src/interfaces/ecpg/preproc/ecpg_kwlist_d.h', + 'src/interfaces/ecpg/preproc/ecpg_kwlist.h')) + { + print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n"; + chdir('src/interfaces/ecpg/preproc'); + system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h'); + system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h'); + chdir('../../../..'); + } + + if (IsNewer( 'src/interfaces/ecpg/preproc/preproc.y', 'src/backend/parser/gram.y')) { diff --git a/src/tools/msvc/clean.bat b/src/tools/msvc/clean.bat index 7a23a2b..069d6eb 100755 *** a/src/tools/msvc/clean.bat --- b/src/tools/msvc/clean.bat *************** if %DIST%==1 if exist src\pl\tcl\pltcler *** 64,69 **** --- 64,74 ---- if %DIST%==1 if exist src\backend\utils\sort\qsort_tuple.c del /q src\backend\utils\sort\qsort_tuple.c if %DIST%==1 if exist src\bin\psql\sql_help.c del /q src\bin\psql\sql_help.c if %DIST%==1 if exist src\bin\psql\sql_help.h del /q src\bin\psql\sql_help.h + if %DIST%==1 if exist src\common\kwlist_d.h del /q src\common\kwlist_d.h + if %DIST%==1 if exist src\pl\plpgsql\src\pl_reserved_kwlist_d.h del /q src\pl\plpgsql\src\pl_reserved_kwlist_d.h + if %DIST%==1 if exist src\pl\plpgsql\src\pl_unreserved_kwlist_d.h del /q src\pl\plpgsql\src\pl_unreserved_kwlist_d.h + if %DIST%==1 if exist src\interfaces\ecpg\preproc\c_kwlist_d.h del /q src\interfaces\ecpg\preproc\c_kwlist_d.h + if %DIST%==1 if exist src\interfaces\ecpg\preproc\ecpg_kwlist_d.h del /q 
src\interfaces\ecpg\preproc\ecpg_kwlist_d.h if %DIST%==1 if exist src\interfaces\ecpg\preproc\preproc.y del /q src\interfaces\ecpg\preproc\preproc.y if %DIST%==1 if exist src\backend\catalog\postgres.bki del /q src\backend\catalog\postgres.bki if %DIST%==1 if exist src\backend\catalog\postgres.description del /q src\backend\catalog\postgres.description
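[Illustrative aside, not part of the patch: to make the generated data layout concrete, here is a minimal self-contained C sketch of the offset-into-giant-string scheme. The three-keyword list, the demo_* names, and main() are invented for illustration; in the patch itself the string and offsets array are emitted by src/tools/gen_keywordlist.pl, and the real search is ScanKeywordLookup() in src/common/kwlookup.c, which also downcases the input first. This sketch is case-sensitive, closer to ecpg's ScanCKeywordLookup().]

/*
 * Sketch of the offset-into-giant-string keyword table (illustration only).
 * All keywords live in one \0-separated string constant; a uint16 array
 * holds the offset of each keyword; binary search walks the offsets.
 * Build and run: cc kwdemo.c -o kwdemo && ./kwdemo
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef struct DemoKeywordList
{
	const char *kw_string;		/* all keywords in order, separated by \0 */
	const uint16_t *kw_offsets; /* start offset of each keyword */
	int			num_keywords;	/* number of keywords */
} DemoKeywordList;

/* What a generated kwlist_d.h boils down to, for {"begin","end","if"} */
static const char demo_kw_string[] =
	"begin\0"
	"end\0"
	"if";
static const uint16_t demo_kw_offsets[] = {0, 6, 10};
static const DemoKeywordList DemoKeywords = {
	demo_kw_string, demo_kw_offsets, 3
};

/* Return zero-based keyword number, or -1 if no match (case-sensitive) */
static int
demo_keyword_lookup(const char *text, const DemoKeywordList *keywords)
{
	const uint16_t *low = keywords->kw_offsets;
	const uint16_t *high = low + keywords->num_keywords - 1;

	while (low <= high)
	{
		const uint16_t *middle = low + (high - low) / 2;
		int			difference = strcmp(keywords->kw_string + *middle, text);

		if (difference == 0)
			return (int) (middle - keywords->kw_offsets);
		else if (difference < 0)
			low = middle + 1;
		else
			high = middle - 1;
	}
	return -1;
}

int
main(void)
{
	printf("end -> %d\n", demo_keyword_lookup("end", &DemoKeywords));	/* 1 */
	printf("foo -> %d\n", demo_keyword_lookup("foo", &DemoKeywords));	/* -1 */
	return 0;
}

The payoff of this layout shows even at this scale: the whole table is two read-only objects with no pointer array to relocate at load time, the keyword text is contiguous in search order, and callers map the returned keyword number into whatever parallel array of token codes or categories they need.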
I wrote: > I spent some time hacking on this today, and I think it's committable > now, but I'm putting it back up in case anyone wants to have another > look (and also so the cfbot can check it on Windows). ... and indeed, the cfbot doesn't like it. Here's v8, with the missing addition to Mkvcbuild.pm. regards, tom lane diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index e8ef966..9131991 100644 *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *************** fill_in_constant_lengths(pgssJumbleState *** 3075,3082 **** /* initialize the flex scanner --- should match raw_parser() */ yyscanner = scanner_init(query, &yyextra, ! ScanKeywords, ! NumScanKeywords); /* we don't want to re-emit any escape string warnings */ yyextra.escape_string_warning = false; --- 3075,3082 ---- /* initialize the flex scanner --- should match raw_parser() */ yyscanner = scanner_init(query, &yyextra, ! &ScanKeywords, ! ScanKeywordTokens); /* we don't want to re-emit any escape string warnings */ yyextra.escape_string_warning = false; diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c index 7e9b122..4c0c258 100644 *** a/src/backend/parser/parser.c --- b/src/backend/parser/parser.c *************** raw_parser(const char *str) *** 41,47 **** /* initialize the flex scanner */ yyscanner = scanner_init(str, &yyextra.core_yy_extra, ! ScanKeywords, NumScanKeywords); /* base_yylex() only needs this much initialization */ yyextra.have_lookahead = false; --- 41,47 ---- /* initialize the flex scanner */ yyscanner = scanner_init(str, &yyextra.core_yy_extra, ! &ScanKeywords, ScanKeywordTokens); /* base_yylex() only needs this much initialization */ yyextra.have_lookahead = false; diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l index fbeb86f..e1cae85 100644 *** a/src/backend/parser/scan.l --- b/src/backend/parser/scan.l *************** bool escape_string_warning = true; *** 67,72 **** --- 67,87 ---- bool standard_conforming_strings = true; /* + * Constant data exported from this file. This array maps from the + * zero-based keyword numbers returned by ScanKeywordLookup to the + * Bison token numbers needed by gram.y. This is exported because + * callers need to pass it to scanner_init, if they are using the + * standard keyword list ScanKeywords. + */ + #define PG_KEYWORD(kwname, value, category) value, + + const uint16 ScanKeywordTokens[] = { + #include "parser/kwlist.h" + }; + + #undef PG_KEYWORD + + /* * Set the type of YYSTYPE. */ #define YYSTYPE core_YYSTYPE *************** other . *** 504,521 **** * We will pass this along as a normal character string, * but preceded with an internally-generated "NCHAR". */ ! const ScanKeyword *keyword; SET_YYLLOC(); yyless(1); /* eat only 'n' this time */ ! keyword = ScanKeywordLookup("nchar", ! yyextra->keywords, ! yyextra->num_keywords); ! if (keyword != NULL) { ! yylval->keyword = keyword->name; ! return keyword->value; } else { --- 519,536 ---- * We will pass this along as a normal character string, * but preceded with an internally-generated "NCHAR". */ ! int kwnum; SET_YYLLOC(); yyless(1); /* eat only 'n' this time */ ! kwnum = ScanKeywordLookup("nchar", ! yyextra->keywordlist); ! if (kwnum >= 0) { ! yylval->keyword = GetScanKeyword(kwnum, ! yyextra->keywordlist); ! return yyextra->keyword_tokens[kwnum]; } else { *************** other . *** 1021,1039 **** {identifier} { ! 
const ScanKeyword *keyword; char *ident; SET_YYLLOC(); /* Is it a keyword? */ ! keyword = ScanKeywordLookup(yytext, ! yyextra->keywords, ! yyextra->num_keywords); ! if (keyword != NULL) { ! yylval->keyword = keyword->name; ! return keyword->value; } /* --- 1036,1054 ---- {identifier} { ! int kwnum; char *ident; SET_YYLLOC(); /* Is it a keyword? */ ! kwnum = ScanKeywordLookup(yytext, ! yyextra->keywordlist); ! if (kwnum >= 0) { ! yylval->keyword = GetScanKeyword(kwnum, ! yyextra->keywordlist); ! return yyextra->keyword_tokens[kwnum]; } /* *************** scanner_yyerror(const char *message, cor *** 1142,1149 **** core_yyscan_t scanner_init(const char *str, core_yy_extra_type *yyext, ! const ScanKeyword *keywords, ! int num_keywords) { Size slen = strlen(str); yyscan_t scanner; --- 1157,1164 ---- core_yyscan_t scanner_init(const char *str, core_yy_extra_type *yyext, ! const ScanKeywordList *keywordlist, ! const uint16 *keyword_tokens) { Size slen = strlen(str); yyscan_t scanner; *************** scanner_init(const char *str, *** 1153,1160 **** core_yyset_extra(yyext, scanner); ! yyext->keywords = keywords; ! yyext->num_keywords = num_keywords; yyext->backslash_quote = backslash_quote; yyext->escape_string_warning = escape_string_warning; --- 1168,1175 ---- core_yyset_extra(yyext, scanner); ! yyext->keywordlist = keywordlist; ! yyext->keyword_tokens = keyword_tokens; yyext->backslash_quote = backslash_quote; yyext->escape_string_warning = escape_string_warning; diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c index 7b69b82..746b7d2 100644 *** a/src/backend/utils/adt/misc.c --- b/src/backend/utils/adt/misc.c *************** pg_get_keywords(PG_FUNCTION_ARGS) *** 417,431 **** funcctx = SRF_PERCALL_SETUP(); ! if (funcctx->call_cntr < NumScanKeywords) { char *values[3]; HeapTuple tuple; /* cast-away-const is ugly but alternatives aren't much better */ ! values[0] = unconstify(char *, ScanKeywords[funcctx->call_cntr].name); ! switch (ScanKeywords[funcctx->call_cntr].category) { case UNRESERVED_KEYWORD: values[1] = "U"; --- 417,433 ---- funcctx = SRF_PERCALL_SETUP(); ! if (funcctx->call_cntr < ScanKeywords.num_keywords) { char *values[3]; HeapTuple tuple; /* cast-away-const is ugly but alternatives aren't much better */ ! values[0] = unconstify(char *, ! GetScanKeyword(funcctx->call_cntr, ! &ScanKeywords)); ! switch (ScanKeywordCategories[funcctx->call_cntr]) { case UNRESERVED_KEYWORD: values[1] = "U"; diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c index 368eacf..77811f6 100644 *** a/src/backend/utils/adt/ruleutils.c --- b/src/backend/utils/adt/ruleutils.c *************** quote_identifier(const char *ident) *** 10601,10611 **** * Note: ScanKeywordLookup() does case-insensitive comparison, but * that's fine, since we already know we have all-lower-case. */ ! const ScanKeyword *keyword = ScanKeywordLookup(ident, ! ScanKeywords, ! NumScanKeywords); ! if (keyword != NULL && keyword->category != UNRESERVED_KEYWORD) safe = false; } --- 10601,10609 ---- * Note: ScanKeywordLookup() does case-insensitive comparison, but * that's fine, since we already know we have all-lower-case. */ ! int kwnum = ScanKeywordLookup(ident, &ScanKeywords); ! if (kwnum >= 0 && ScanKeywordCategories[kwnum] != UNRESERVED_KEYWORD) safe = false; } diff --git a/src/common/.gitignore b/src/common/.gitignore index ...ffa3284 . 
*** a/src/common/.gitignore --- b/src/common/.gitignore *************** *** 0 **** --- 1 ---- + /kwlist_d.h diff --git a/src/common/Makefile b/src/common/Makefile index ec8139f..317b071 100644 *** a/src/common/Makefile --- b/src/common/Makefile *************** override CPPFLAGS += -DVAL_LDFLAGS_EX="\ *** 41,51 **** override CPPFLAGS += -DVAL_LDFLAGS_SL="\"$(LDFLAGS_SL)\"" override CPPFLAGS += -DVAL_LIBS="\"$(LIBS)\"" ! override CPPFLAGS := -DFRONTEND $(CPPFLAGS) LIBS += $(PTHREAD_LIBS) OBJS_COMMON = base64.o config_info.o controldata_utils.o exec.o file_perm.o \ ! ip.o keywords.o link-canary.o md5.o pg_lzcompress.o \ pgfnames.o psprintf.o relpath.o \ rmtree.o saslprep.o scram-common.o string.o unicode_norm.o \ username.o wait_error.o --- 41,51 ---- override CPPFLAGS += -DVAL_LDFLAGS_SL="\"$(LDFLAGS_SL)\"" override CPPFLAGS += -DVAL_LIBS="\"$(LIBS)\"" ! override CPPFLAGS := -DFRONTEND -I. -I$(top_srcdir)/src/common $(CPPFLAGS) LIBS += $(PTHREAD_LIBS) OBJS_COMMON = base64.o config_info.o controldata_utils.o exec.o file_perm.o \ ! ip.o keywords.o kwlookup.o link-canary.o md5.o pg_lzcompress.o \ pgfnames.o psprintf.o relpath.o \ rmtree.o saslprep.o scram-common.o string.o unicode_norm.o \ username.o wait_error.o *************** OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o) *** 65,70 **** --- 65,72 ---- all: libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a + distprep: kwlist_d.h + # libpgcommon is needed by some contrib install: all installdirs $(INSTALL_STLIB) libpgcommon.a '$(DESTDIR)$(libdir)/libpgcommon.a' *************** libpgcommon_srv.a: $(OBJS_SRV) *** 115,130 **** %_srv.o: %.c %.o $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ ! # Dependencies of keywords.o need to be managed explicitly to make sure ! # that you don't get broken parsing code, even in a non-enable-depend build. ! # Note that gram.h isn't required for the frontend versions of keywords.o. ! $(top_builddir)/src/include/parser/gram.h: $(top_srcdir)/src/backend/parser/gram.y ! $(MAKE) -C $(top_builddir)/src/backend $(top_builddir)/src/include/parser/gram.h ! keywords.o: $(top_srcdir)/src/include/parser/kwlist.h ! keywords_shlib.o: $(top_srcdir)/src/include/parser/kwlist.h ! keywords_srv.o: $(top_builddir)/src/include/parser/gram.h $(top_srcdir)/src/include/parser/kwlist.h ! clean distclean maintainer-clean: rm -f libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a rm -f $(OBJS_FRONTEND) $(OBJS_SHLIB) $(OBJS_SRV) --- 117,134 ---- %_srv.o: %.c %.o $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ ! # generate SQL keyword lookup table to be included into keywords*.o. ! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(top_srcdir)/src/tools/gen_keywordlist.pl ! $(PERL) $(top_srcdir)/src/tools/gen_keywordlist.pl --extern $< ! # Dependencies of keywords*.o need to be managed explicitly to make sure ! # that you don't get broken parsing code, even in a non-enable-depend build. ! keywords.o keywords_shlib.o keywords_srv.o: kwlist_d.h ! # kwlist_d.h is in the distribution tarball, so it is not cleaned here. ! clean distclean: rm -f libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a rm -f $(OBJS_FRONTEND) $(OBJS_SHLIB) $(OBJS_SRV) + + maintainer-clean: distclean + rm -f kwlist_d.h diff --git a/src/common/keywords.c b/src/common/keywords.c index 6f99090..103166c 100644 *** a/src/common/keywords.c --- b/src/common/keywords.c *************** *** 1,7 **** /*------------------------------------------------------------------------- * * keywords.c ! 
* lexical token lookup for key words in PostgreSQL * * * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group --- 1,7 ---- /*------------------------------------------------------------------------- * * keywords.c ! * PostgreSQL's list of SQL keywords * * * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group *************** *** 19,114 **** #include "postgres_fe.h" #endif ! #ifndef FRONTEND ! ! #include "parser/gramparse.h" ! ! #define PG_KEYWORD(a,b,c) {a,b,c}, - #else ! #include "common/keywords.h" ! /* ! * We don't need the token number for frontend uses, so leave it out to avoid ! * requiring backend headers that won't compile cleanly here. ! */ ! #define PG_KEYWORD(a,b,c) {a,0,c}, ! #endif /* FRONTEND */ ! const ScanKeyword ScanKeywords[] = { #include "parser/kwlist.h" }; ! const int NumScanKeywords = lengthof(ScanKeywords); ! ! ! /* ! * ScanKeywordLookup - see if a given word is a keyword ! * ! * The table to be searched is passed explicitly, so that this can be used ! * to search keyword lists other than the standard list appearing above. ! * ! * Returns a pointer to the ScanKeyword table entry, or NULL if no match. ! * ! * The match is done case-insensitively. Note that we deliberately use a ! * dumbed-down case conversion that will only translate 'A'-'Z' into 'a'-'z', ! * even if we are in a locale where tolower() would produce more or different ! * translations. This is to conform to the SQL99 spec, which says that ! * keywords are to be matched in this way even though non-keyword identifiers ! * receive a different case-normalization mapping. ! */ ! const ScanKeyword * ! ScanKeywordLookup(const char *text, ! const ScanKeyword *keywords, ! int num_keywords) ! { ! int len, ! i; ! char word[NAMEDATALEN]; ! const ScanKeyword *low; ! const ScanKeyword *high; ! ! len = strlen(text); ! /* We assume all keywords are shorter than NAMEDATALEN. */ ! if (len >= NAMEDATALEN) ! return NULL; ! ! /* ! * Apply an ASCII-only downcasing. We must not use tolower() since it may ! * produce the wrong translation in some locales (eg, Turkish). ! */ ! for (i = 0; i < len; i++) ! { ! char ch = text[i]; ! ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! word[i] = ch; ! } ! word[len] = '\0'; ! ! /* ! * Now do a binary search using plain strcmp() comparison. ! */ ! low = keywords; ! high = keywords + (num_keywords - 1); ! while (low <= high) ! { ! const ScanKeyword *middle; ! int difference; ! ! middle = low + (high - low) / 2; ! difference = strcmp(middle->name, word); ! if (difference == 0) ! return middle; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; ! } ! ! return NULL; ! } --- 19,37 ---- #include "postgres_fe.h" #endif ! #include "common/keywords.h" ! /* ScanKeywordList lookup data for SQL keywords */ ! #include "kwlist_d.h" ! /* Keyword categories for SQL keywords */ + #define PG_KEYWORD(kwname, value, category) category, ! const uint8 ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS] = { #include "parser/kwlist.h" }; ! #undef PG_KEYWORD diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index ...db62623 . 
*** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *************** *** 0 **** --- 1,91 ---- + /*------------------------------------------------------------------------- + * + * kwlookup.c + * Key word lookup for PostgreSQL + * + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * src/common/kwlookup.c + * + *------------------------------------------------------------------------- + */ + #include "c.h" + + #include "common/kwlookup.h" + + + /* + * ScanKeywordLookup - see if a given word is a keyword + * + * The list of keywords to be matched against is passed as a ScanKeywordList. + * + * Returns the keyword number (0..N-1) of the keyword, or -1 if no match. + * Callers typically use the keyword number to index into information + * arrays, but that is no concern of this code. + * + * The match is done case-insensitively. Note that we deliberately use a + * dumbed-down case conversion that will only translate 'A'-'Z' into 'a'-'z', + * even if we are in a locale where tolower() would produce more or different + * translations. This is to conform to the SQL99 spec, which says that + * keywords are to be matched in this way even though non-keyword identifiers + * receive a different case-normalization mapping. + */ + int + ScanKeywordLookup(const char *text, + const ScanKeywordList *keywords) + { + int len, + i; + char word[NAMEDATALEN]; + const char *kw_string; + const uint16 *kw_offsets; + const uint16 *low; + const uint16 *high; + + len = strlen(text); + /* We assume all keywords are shorter than NAMEDATALEN. */ + if (len >= NAMEDATALEN) + return -1; + + /* + * Apply an ASCII-only downcasing. We must not use tolower() since it may + * produce the wrong translation in some locales (eg, Turkish). + */ + for (i = 0; i < len; i++) + { + char ch = text[i]; + + if (ch >= 'A' && ch <= 'Z') + ch += 'a' - 'A'; + word[i] = ch; + } + word[len] = '\0'; + + /* + * Now do a binary search using plain strcmp() comparison. + */ + kw_string = keywords->kw_string; + kw_offsets = keywords->kw_offsets; + low = kw_offsets; + high = kw_offsets + (keywords->num_keywords - 1); + while (low <= high) + { + const uint16 *middle; + int difference; + + middle = low + (high - low) / 2; + difference = strcmp(kw_string + *middle, word); + if (difference == 0) + return middle - kw_offsets; + else if (difference < 0) + low = middle + 1; + else + high = middle - 1; + } + + return -1; + } diff --git a/src/fe_utils/string_utils.c b/src/fe_utils/string_utils.c index 9b47b62..5c1732a 100644 *** a/src/fe_utils/string_utils.c --- b/src/fe_utils/string_utils.c *************** fmtId(const char *rawid) *** 104,114 **** * Note: ScanKeywordLookup() does case-insensitive comparison, but * that's fine, since we already know we have all-lower-case. */ ! const ScanKeyword *keyword = ScanKeywordLookup(rawid, ! ScanKeywords, ! NumScanKeywords); ! if (keyword != NULL && keyword->category != UNRESERVED_KEYWORD) need_quotes = true; } --- 104,112 ---- * Note: ScanKeywordLookup() does case-insensitive comparison, but * that's fine, since we already know we have all-lower-case. */ ! int kwnum = ScanKeywordLookup(rawid, &ScanKeywords); ! 
if (kwnum >= 0 && ScanKeywordCategories[kwnum] != UNRESERVED_KEYWORD) need_quotes = true; } diff --git a/src/include/common/keywords.h b/src/include/common/keywords.h index 8f22f32..fb18858 100644 *** a/src/include/common/keywords.h --- b/src/include/common/keywords.h *************** *** 1,7 **** /*------------------------------------------------------------------------- * * keywords.h ! * lexical token lookup for key words in PostgreSQL * * * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group --- 1,7 ---- /*------------------------------------------------------------------------- * * keywords.h ! * PostgreSQL's list of SQL keywords * * * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group *************** *** 14,44 **** #ifndef KEYWORDS_H #define KEYWORDS_H /* Keyword categories --- should match lists in gram.y */ #define UNRESERVED_KEYWORD 0 #define COL_NAME_KEYWORD 1 #define TYPE_FUNC_NAME_KEYWORD 2 #define RESERVED_KEYWORD 3 - - typedef struct ScanKeyword - { - const char *name; /* in lower case */ - int16 value; /* grammar's token code */ - int16 category; /* see codes above */ - } ScanKeyword; - #ifndef FRONTEND ! extern PGDLLIMPORT const ScanKeyword ScanKeywords[]; ! extern PGDLLIMPORT const int NumScanKeywords; #else ! extern const ScanKeyword ScanKeywords[]; ! extern const int NumScanKeywords; #endif - - extern const ScanKeyword *ScanKeywordLookup(const char *text, - const ScanKeyword *keywords, - int num_keywords); - #endif /* KEYWORDS_H */ --- 14,33 ---- #ifndef KEYWORDS_H #define KEYWORDS_H + #include "common/kwlookup.h" + /* Keyword categories --- should match lists in gram.y */ #define UNRESERVED_KEYWORD 0 #define COL_NAME_KEYWORD 1 #define TYPE_FUNC_NAME_KEYWORD 2 #define RESERVED_KEYWORD 3 #ifndef FRONTEND ! extern PGDLLIMPORT const ScanKeywordList ScanKeywords; ! extern PGDLLIMPORT const uint8 ScanKeywordCategories[]; #else ! extern const ScanKeywordList ScanKeywords; ! extern const uint8 ScanKeywordCategories[]; #endif #endif /* KEYWORDS_H */ diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h index ...3098df3 . *** a/src/include/common/kwlookup.h --- b/src/include/common/kwlookup.h *************** *** 0 **** --- 1,39 ---- + /*------------------------------------------------------------------------- + * + * kwlookup.h + * Key word lookup for PostgreSQL + * + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/common/kwlookup.h + * + *------------------------------------------------------------------------- + */ + #ifndef KWLOOKUP_H + #define KWLOOKUP_H + + /* + * This struct contains the data needed by ScanKeywordLookup to perform a + * search within a set of keywords. The contents are typically generated by + * src/tools/gen_keywordlist.pl from a header containing PG_KEYWORD macros. + */ + typedef struct ScanKeywordList + { + const char *kw_string; /* all keywords in order, separated by \0 */ + const uint16 *kw_offsets; /* offsets to the start of each keyword */ + int num_keywords; /* number of keywords */ + } ScanKeywordList; + + + extern int ScanKeywordLookup(const char *text, const ScanKeywordList *keywords); + + /* Code that wants to retrieve the text of the N'th keyword should use this. 
*/ + static inline const char * + GetScanKeyword(int n, const ScanKeywordList *keywords) + { + return keywords->kw_string + keywords->kw_offsets[n]; + } + + #endif /* KWLOOKUP_H */ diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h index 0256d53..b8902d3 100644 *** a/src/include/parser/kwlist.h --- b/src/include/parser/kwlist.h *************** *** 2,8 **** * * kwlist.h * ! * The keyword list is kept in its own source file for possible use by * automatic tools. The exact representation of a keyword is determined * by the PG_KEYWORD macro, which is not defined in this file; it can * be defined by the caller for special purposes. --- 2,8 ---- * * kwlist.h * ! * The keyword lists are kept in their own source files for use by * automatic tools. The exact representation of a keyword is determined * by the PG_KEYWORD macro, which is not defined in this file; it can * be defined by the caller for special purposes. diff --git a/src/include/parser/scanner.h b/src/include/parser/scanner.h index 009550f..91e1c83 100644 *** a/src/include/parser/scanner.h --- b/src/include/parser/scanner.h *************** typedef struct core_yy_extra_type *** 73,82 **** Size scanbuflen; /* ! * The keyword list to use. */ ! const ScanKeyword *keywords; ! int num_keywords; /* * Scanner settings to use. These are initialized from the corresponding --- 73,82 ---- Size scanbuflen; /* ! * The keyword list to use, and the associated grammar token codes. */ ! const ScanKeywordList *keywordlist; ! const uint16 *keyword_tokens; /* * Scanner settings to use. These are initialized from the corresponding *************** typedef struct core_yy_extra_type *** 116,126 **** typedef void *core_yyscan_t; /* Entry points in parser/scan.l */ extern core_yyscan_t scanner_init(const char *str, core_yy_extra_type *yyext, ! const ScanKeyword *keywords, ! int num_keywords); extern void scanner_finish(core_yyscan_t yyscanner); extern int core_yylex(core_YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner); --- 116,129 ---- typedef void *core_yyscan_t; + /* Constant data exported from parser/scan.l */ + extern PGDLLIMPORT const uint16 ScanKeywordTokens[]; + /* Entry points in parser/scan.l */ extern core_yyscan_t scanner_init(const char *str, core_yy_extra_type *yyext, ! const ScanKeywordList *keywordlist, ! const uint16 *keyword_tokens); extern void scanner_finish(core_yyscan_t yyscanner); extern int core_yylex(core_YYSTYPE *lvalp, YYLTYPE *llocp, core_yyscan_t yyscanner); diff --git a/src/interfaces/ecpg/preproc/.gitignore b/src/interfaces/ecpg/preproc/.gitignore index 38ae2fe..958a826 100644 *** a/src/interfaces/ecpg/preproc/.gitignore --- b/src/interfaces/ecpg/preproc/.gitignore *************** *** 2,6 **** --- 2,8 ---- /preproc.c /preproc.h /pgc.c + /c_kwlist_d.h + /ecpg_kwlist_d.h /typename.c /ecpg diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile index 69ddd8e..9b145a1 100644 *** a/src/interfaces/ecpg/preproc/Makefile --- b/src/interfaces/ecpg/preproc/Makefile *************** OBJS= preproc.o pgc.o type.o ecpg.o outp *** 28,33 **** --- 28,35 ---- keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \ $(WIN32RES) + GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl + # Suppress parallel build to avoid a bug in GNU make 3.82 # (see comments in ../Makefile) ifeq ($(MAKE_VERSION),3.82) *************** preproc.y: ../../../backend/parser/gram. 
*** 53,61 **** $(PERL) $(srcdir)/parse.pl $(srcdir) < $< > $@ $(PERL) $(srcdir)/check_rules.pl $(srcdir) $< ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h ! distprep: preproc.y preproc.c preproc.h pgc.c install: all installdirs $(INSTALL_PROGRAM) ecpg$(X) '$(DESTDIR)$(bindir)' --- 55,73 ---- $(PERL) $(srcdir)/parse.pl $(srcdir) < $< > $@ $(PERL) $(srcdir)/check_rules.pl $(srcdir) $< + # generate keyword headers + c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST) + $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $< + + ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST) + $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $< + + # Force these dependencies to be known even without dependency info built: ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h + ecpg_keywords.o: ecpg_kwlist_d.h + c_keywords.o: c_kwlist_d.h ! distprep: preproc.y preproc.c preproc.h pgc.c c_kwlist_d.h ecpg_kwlist_d.h install: all installdirs $(INSTALL_PROGRAM) ecpg$(X) '$(DESTDIR)$(bindir)' *************** installdirs: *** 66,77 **** uninstall: rm -f '$(DESTDIR)$(bindir)/ecpg$(X)' clean distclean: rm -f *.o ecpg$(X) rm -f typename.c - # `make distclean' must not remove preproc.y, preproc.c, preproc.h, or pgc.c - # since we want to ship those files in the distribution for people with - # inadequate tools. Instead, `make maintainer-clean' will remove them. maintainer-clean: distclean ! rm -f preproc.y preproc.c preproc.h pgc.c --- 78,88 ---- uninstall: rm -f '$(DESTDIR)$(bindir)/ecpg$(X)' + # preproc.y, preproc.c, preproc.h, pgc.c, c_kwlist_d.h, and ecpg_kwlist_d.h + # are in the distribution tarball, so they are not cleaned here. clean distclean: rm -f *.o ecpg$(X) rm -f typename.c maintainer-clean: distclean ! rm -f preproc.y preproc.c preproc.h pgc.c c_kwlist_d.h ecpg_kwlist_d.h diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c index c367dbf..521992f 100644 *** a/src/interfaces/ecpg/preproc/c_keywords.c --- b/src/interfaces/ecpg/preproc/c_keywords.c *************** *** 14,85 **** #include "preproc_extern.h" #include "preproc.h" ! /* ! * List of (keyword-name, keyword-token-value) pairs. ! * ! * !!WARNING!!: This list must be sorted, because binary ! * search is used to locate entries. ! */ ! static const ScanKeyword ScanCKeywords[] = { ! /* name, value, category */ ! /* ! * category is not needed in ecpg, it is only here so we can share the ! * data structure with the backend ! */ ! {"VARCHAR", VARCHAR, 0}, ! {"auto", S_AUTO, 0}, ! {"bool", SQL_BOOL, 0}, ! {"char", CHAR_P, 0}, ! {"const", S_CONST, 0}, ! {"enum", ENUM_P, 0}, ! {"extern", S_EXTERN, 0}, ! {"float", FLOAT_P, 0}, ! {"hour", HOUR_P, 0}, ! {"int", INT_P, 0}, ! {"long", SQL_LONG, 0}, ! {"minute", MINUTE_P, 0}, ! {"month", MONTH_P, 0}, ! {"register", S_REGISTER, 0}, ! {"second", SECOND_P, 0}, ! {"short", SQL_SHORT, 0}, ! {"signed", SQL_SIGNED, 0}, ! {"static", S_STATIC, 0}, ! {"struct", SQL_STRUCT, 0}, ! {"to", TO, 0}, ! {"typedef", S_TYPEDEF, 0}, ! {"union", UNION, 0}, ! {"unsigned", SQL_UNSIGNED, 0}, ! {"varchar", VARCHAR, 0}, ! {"volatile", S_VOLATILE, 0}, ! {"year", YEAR_P, 0}, }; /* * Do a binary search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ ! const ScanKeyword * ScanCKeywordLookup(const char *text) { ! const ScanKeyword *low = &ScanCKeywords[0]; ! const ScanKeyword *high = &ScanCKeywords[lengthof(ScanCKeywords) - 1]; while (low <= high) { ! 
const ScanKeyword *middle; int difference; middle = low + (high - low) / 2; ! difference = strcmp(middle->name, text); if (difference == 0) ! return middle; else if (difference < 0) low = middle + 1; else high = middle - 1; } ! return NULL; } --- 14,67 ---- #include "preproc_extern.h" #include "preproc.h" ! /* ScanKeywordList lookup data for C keywords */ ! #include "c_kwlist_d.h" ! /* Token codes for C keywords */ ! #define PG_KEYWORD(kwname, value) value, ! ! static const uint16 ScanCKeywordTokens[] = { ! #include "c_kwlist.h" }; + #undef PG_KEYWORD + /* + * ScanCKeywordLookup - see if a given word is a keyword + * + * Returns the token value of the keyword, or -1 if no match. + * * Do a binary search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ ! int ScanCKeywordLookup(const char *text) { ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! ! kw_string = ScanCKeywords.kw_string; ! kw_offsets = ScanCKeywords.kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (ScanCKeywords.num_keywords - 1); while (low <= high) { ! const uint16 *middle; int difference; middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, text); if (difference == 0) ! return ScanCKeywordTokens[middle - kw_offsets]; else if (difference < 0) low = middle + 1; else high = middle - 1; } ! return -1; } diff --git a/src/interfaces/ecpg/preproc/c_kwlist.h b/src/interfaces/ecpg/preproc/c_kwlist.h index ...4545505 . *** a/src/interfaces/ecpg/preproc/c_kwlist.h --- b/src/interfaces/ecpg/preproc/c_kwlist.h *************** *** 0 **** --- 1,53 ---- + /*------------------------------------------------------------------------- + * + * c_kwlist.h + * + * The keyword lists are kept in their own source files for use by + * automatic tools. The exact representation of a keyword is determined + * by the PG_KEYWORD macro, which is not defined in this file; it can + * be defined by the caller for special purposes. + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/interfaces/ecpg/preproc/c_kwlist.h + * + *------------------------------------------------------------------------- + */ + + /* There is deliberately not an #ifndef C_KWLIST_H here. */ + + /* + * List of (keyword-name, keyword-token-value) pairs. + * + * !!WARNING!!: This list must be sorted by ASCII name, because binary + * search is used to locate entries. 
+ */ + + /* name, value */ + PG_KEYWORD("VARCHAR", VARCHAR) + PG_KEYWORD("auto", S_AUTO) + PG_KEYWORD("bool", SQL_BOOL) + PG_KEYWORD("char", CHAR_P) + PG_KEYWORD("const", S_CONST) + PG_KEYWORD("enum", ENUM_P) + PG_KEYWORD("extern", S_EXTERN) + PG_KEYWORD("float", FLOAT_P) + PG_KEYWORD("hour", HOUR_P) + PG_KEYWORD("int", INT_P) + PG_KEYWORD("long", SQL_LONG) + PG_KEYWORD("minute", MINUTE_P) + PG_KEYWORD("month", MONTH_P) + PG_KEYWORD("register", S_REGISTER) + PG_KEYWORD("second", SECOND_P) + PG_KEYWORD("short", SQL_SHORT) + PG_KEYWORD("signed", SQL_SIGNED) + PG_KEYWORD("static", S_STATIC) + PG_KEYWORD("struct", SQL_STRUCT) + PG_KEYWORD("to", TO) + PG_KEYWORD("typedef", S_TYPEDEF) + PG_KEYWORD("union", UNION) + PG_KEYWORD("unsigned", SQL_UNSIGNED) + PG_KEYWORD("varchar", VARCHAR) + PG_KEYWORD("volatile", S_VOLATILE) + PG_KEYWORD("year", YEAR_P) diff --git a/src/interfaces/ecpg/preproc/ecpg_keywords.c b/src/interfaces/ecpg/preproc/ecpg_keywords.c index 37c97e1..4839c37 100644 *** a/src/interfaces/ecpg/preproc/ecpg_keywords.c --- b/src/interfaces/ecpg/preproc/ecpg_keywords.c *************** *** 16,97 **** #include "preproc_extern.h" #include "preproc.h" ! /* ! * List of (keyword-name, keyword-token-value) pairs. ! * ! * !!WARNING!!: This list must be sorted, because binary ! * search is used to locate entries. ! */ ! static const ScanKeyword ECPGScanKeywords[] = { ! /* name, value, category */ ! /* ! * category is not needed in ecpg, it is only here so we can share the ! * data structure with the backend ! */ ! {"allocate", SQL_ALLOCATE, 0}, ! {"autocommit", SQL_AUTOCOMMIT, 0}, ! {"bool", SQL_BOOL, 0}, ! {"break", SQL_BREAK, 0}, ! {"cardinality", SQL_CARDINALITY, 0}, ! {"connect", SQL_CONNECT, 0}, ! {"count", SQL_COUNT, 0}, ! {"datetime_interval_code", SQL_DATETIME_INTERVAL_CODE, 0}, ! {"datetime_interval_precision", SQL_DATETIME_INTERVAL_PRECISION, 0}, ! {"describe", SQL_DESCRIBE, 0}, ! {"descriptor", SQL_DESCRIPTOR, 0}, ! {"disconnect", SQL_DISCONNECT, 0}, ! {"found", SQL_FOUND, 0}, ! {"free", SQL_FREE, 0}, ! {"get", SQL_GET, 0}, ! {"go", SQL_GO, 0}, ! {"goto", SQL_GOTO, 0}, ! {"identified", SQL_IDENTIFIED, 0}, ! {"indicator", SQL_INDICATOR, 0}, ! {"key_member", SQL_KEY_MEMBER, 0}, ! {"length", SQL_LENGTH, 0}, ! {"long", SQL_LONG, 0}, ! {"nullable", SQL_NULLABLE, 0}, ! {"octet_length", SQL_OCTET_LENGTH, 0}, ! {"open", SQL_OPEN, 0}, ! {"output", SQL_OUTPUT, 0}, ! {"reference", SQL_REFERENCE, 0}, ! {"returned_length", SQL_RETURNED_LENGTH, 0}, ! {"returned_octet_length", SQL_RETURNED_OCTET_LENGTH, 0}, ! {"scale", SQL_SCALE, 0}, ! {"section", SQL_SECTION, 0}, ! {"short", SQL_SHORT, 0}, ! {"signed", SQL_SIGNED, 0}, ! {"sqlerror", SQL_SQLERROR, 0}, ! {"sqlprint", SQL_SQLPRINT, 0}, ! {"sqlwarning", SQL_SQLWARNING, 0}, ! {"stop", SQL_STOP, 0}, ! {"struct", SQL_STRUCT, 0}, ! {"unsigned", SQL_UNSIGNED, 0}, ! {"var", SQL_VAR, 0}, ! {"whenever", SQL_WHENEVER, 0}, }; /* * ScanECPGKeywordLookup - see if a given word is a keyword * ! * Returns a pointer to the ScanKeyword table entry, or NULL if no match. * Keywords are matched using the same case-folding rules as in the backend. */ ! const ScanKeyword * ScanECPGKeywordLookup(const char *text) { ! const ScanKeyword *res; /* First check SQL symbols defined by the backend. */ ! res = ScanKeywordLookup(text, SQLScanKeywords, NumSQLScanKeywords); ! if (res) ! return res; /* Try ECPG-specific keywords. */ ! res = ScanKeywordLookup(text, ECPGScanKeywords, lengthof(ECPGScanKeywords)); ! if (res) ! return res; ! 
return NULL; } --- 16,55 ---- #include "preproc_extern.h" #include "preproc.h" ! /* ScanKeywordList lookup data for ECPG keywords */ ! #include "ecpg_kwlist_d.h" ! /* Token codes for ECPG keywords */ ! #define PG_KEYWORD(kwname, value) value, ! ! static const uint16 ECPGScanKeywordTokens[] = { ! #include "ecpg_kwlist.h" }; + #undef PG_KEYWORD + + /* * ScanECPGKeywordLookup - see if a given word is a keyword * ! * Returns the token value of the keyword, or -1 if no match. ! * * Keywords are matched using the same case-folding rules as in the backend. */ ! int ScanECPGKeywordLookup(const char *text) { ! int kwnum; /* First check SQL symbols defined by the backend. */ ! kwnum = ScanKeywordLookup(text, &ScanKeywords); ! if (kwnum >= 0) ! return SQLScanKeywordTokens[kwnum]; /* Try ECPG-specific keywords. */ ! kwnum = ScanKeywordLookup(text, &ScanECPGKeywords); ! if (kwnum >= 0) ! return ECPGScanKeywordTokens[kwnum]; ! return -1; } diff --git a/src/interfaces/ecpg/preproc/ecpg_kwlist.h b/src/interfaces/ecpg/preproc/ecpg_kwlist.h index ...97ef254 . *** a/src/interfaces/ecpg/preproc/ecpg_kwlist.h --- b/src/interfaces/ecpg/preproc/ecpg_kwlist.h *************** *** 0 **** --- 1,68 ---- + /*------------------------------------------------------------------------- + * + * ecpg_kwlist.h + * + * The keyword lists are kept in their own source files for use by + * automatic tools. The exact representation of a keyword is determined + * by the PG_KEYWORD macro, which is not defined in this file; it can + * be defined by the caller for special purposes. + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/interfaces/ecpg/preproc/ecpg_kwlist.h + * + *------------------------------------------------------------------------- + */ + + /* There is deliberately not an #ifndef ECPG_KWLIST_H here. */ + + /* + * List of (keyword-name, keyword-token-value) pairs. + * + * !!WARNING!!: This list must be sorted by ASCII name, because binary + * search is used to locate entries. 
+ */ + + /* name, value */ + PG_KEYWORD("allocate", SQL_ALLOCATE) + PG_KEYWORD("autocommit", SQL_AUTOCOMMIT) + PG_KEYWORD("bool", SQL_BOOL) + PG_KEYWORD("break", SQL_BREAK) + PG_KEYWORD("cardinality", SQL_CARDINALITY) + PG_KEYWORD("connect", SQL_CONNECT) + PG_KEYWORD("count", SQL_COUNT) + PG_KEYWORD("datetime_interval_code", SQL_DATETIME_INTERVAL_CODE) + PG_KEYWORD("datetime_interval_precision", SQL_DATETIME_INTERVAL_PRECISION) + PG_KEYWORD("describe", SQL_DESCRIBE) + PG_KEYWORD("descriptor", SQL_DESCRIPTOR) + PG_KEYWORD("disconnect", SQL_DISCONNECT) + PG_KEYWORD("found", SQL_FOUND) + PG_KEYWORD("free", SQL_FREE) + PG_KEYWORD("get", SQL_GET) + PG_KEYWORD("go", SQL_GO) + PG_KEYWORD("goto", SQL_GOTO) + PG_KEYWORD("identified", SQL_IDENTIFIED) + PG_KEYWORD("indicator", SQL_INDICATOR) + PG_KEYWORD("key_member", SQL_KEY_MEMBER) + PG_KEYWORD("length", SQL_LENGTH) + PG_KEYWORD("long", SQL_LONG) + PG_KEYWORD("nullable", SQL_NULLABLE) + PG_KEYWORD("octet_length", SQL_OCTET_LENGTH) + PG_KEYWORD("open", SQL_OPEN) + PG_KEYWORD("output", SQL_OUTPUT) + PG_KEYWORD("reference", SQL_REFERENCE) + PG_KEYWORD("returned_length", SQL_RETURNED_LENGTH) + PG_KEYWORD("returned_octet_length", SQL_RETURNED_OCTET_LENGTH) + PG_KEYWORD("scale", SQL_SCALE) + PG_KEYWORD("section", SQL_SECTION) + PG_KEYWORD("short", SQL_SHORT) + PG_KEYWORD("signed", SQL_SIGNED) + PG_KEYWORD("sqlerror", SQL_SQLERROR) + PG_KEYWORD("sqlprint", SQL_SQLPRINT) + PG_KEYWORD("sqlwarning", SQL_SQLWARNING) + PG_KEYWORD("stop", SQL_STOP) + PG_KEYWORD("struct", SQL_STRUCT) + PG_KEYWORD("unsigned", SQL_UNSIGNED) + PG_KEYWORD("var", SQL_VAR) + PG_KEYWORD("whenever", SQL_WHENEVER) diff --git a/src/interfaces/ecpg/preproc/keywords.c b/src/interfaces/ecpg/preproc/keywords.c index 12409e9..0380409 100644 *** a/src/interfaces/ecpg/preproc/keywords.c --- b/src/interfaces/ecpg/preproc/keywords.c *************** *** 17,40 **** /* * This is much trickier than it looks. We are #include'ing kwlist.h ! * but the "value" numbers that go into the table are from preproc.h ! * not the backend's gram.h. Therefore this table will recognize all ! * keywords known to the backend, but will supply the token numbers used * by ecpg's grammar, which is what we need. The ecpg grammar must * define all the same token names the backend does, else we'll get * undefined-symbol failures in this compile. */ - #include "common/keywords.h" - #include "preproc_extern.h" #include "preproc.h" ! #define PG_KEYWORD(a,b,c) {a,b,c}, ! ! const ScanKeyword SQLScanKeywords[] = { #include "parser/kwlist.h" }; ! const int NumSQLScanKeywords = lengthof(SQLScanKeywords); --- 17,38 ---- /* * This is much trickier than it looks. We are #include'ing kwlist.h ! * but the token numbers that go into the table are from preproc.h ! * not the backend's gram.h. Therefore this token table will match ! * the ScanKeywords table supplied from common/keywords.c, including all ! * keywords known to the backend, but it will supply the token numbers used * by ecpg's grammar, which is what we need. The ecpg grammar must * define all the same token names the backend does, else we'll get * undefined-symbol failures in this compile. */ #include "preproc_extern.h" #include "preproc.h" + #define PG_KEYWORD(kwname, value, category) value, ! const uint16 SQLScanKeywordTokens[] = { #include "parser/kwlist.h" }; ! 
#undef PG_KEYWORD diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l index a60564c..3131f5f 100644 *** a/src/interfaces/ecpg/preproc/pgc.l --- b/src/interfaces/ecpg/preproc/pgc.l *************** cppline {space}*#([^i][A-Za-z]*|{if}|{ *** 920,938 **** } {identifier} { - const ScanKeyword *keyword; - if (!isdefine()) { /* Is it an SQL/ECPG keyword? */ ! keyword = ScanECPGKeywordLookup(yytext); ! if (keyword != NULL) ! return keyword->value; /* Is it a C keyword? */ ! keyword = ScanCKeywordLookup(yytext); ! if (keyword != NULL) ! return keyword->value; /* * None of the above. Return it as an identifier. --- 920,938 ---- } {identifier} { if (!isdefine()) { + int kwvalue; + /* Is it an SQL/ECPG keyword? */ ! kwvalue = ScanECPGKeywordLookup(yytext); ! if (kwvalue >= 0) ! return kwvalue; /* Is it a C keyword? */ ! kwvalue = ScanCKeywordLookup(yytext); ! if (kwvalue >= 0) ! return kwvalue; /* * None of the above. Return it as an identifier. *************** cppline {space}*#([^i][A-Za-z]*|{if}|{ *** 1010,1021 **** return CPP_LINE; } <C>{identifier} { - const ScanKeyword *keyword; - /* * Try to detect a function name: * look for identifiers at the global scope ! * keep the last identifier before the first '(' and '{' */ if (braces_open == 0 && parenths_open == 0) { if (current_function) --- 1010,1020 ---- return CPP_LINE; } <C>{identifier} { /* * Try to detect a function name: * look for identifiers at the global scope ! * keep the last identifier before the first '(' and '{' ! */ if (braces_open == 0 && parenths_open == 0) { if (current_function) *************** cppline {space}*#([^i][A-Za-z]*|{if}|{ *** 1026,1034 **** /* however, some defines have to be taken care of for compatibility */ if ((!INFORMIX_MODE || !isinformixdefine()) && !isdefine()) { ! keyword = ScanCKeywordLookup(yytext); ! if (keyword != NULL) ! return keyword->value; else { base_yylval.str = mm_strdup(yytext); --- 1025,1035 ---- /* however, some defines have to be taken care of for compatibility */ if ((!INFORMIX_MODE || !isinformixdefine()) && !isdefine()) { ! int kwvalue; ! ! kwvalue = ScanCKeywordLookup(yytext); ! if (kwvalue >= 0) ! return kwvalue; else { base_yylval.str = mm_strdup(yytext); diff --git a/src/interfaces/ecpg/preproc/preproc_extern.h b/src/interfaces/ecpg/preproc/preproc_extern.h index 13eda67..9746780 100644 *** a/src/interfaces/ecpg/preproc/preproc_extern.h --- b/src/interfaces/ecpg/preproc/preproc_extern.h *************** extern struct when when_error, *** 59,66 **** extern struct ECPGstruct_member *struct_member_list[STRUCT_DEPTH]; /* Globals from keywords.c */ ! extern const ScanKeyword SQLScanKeywords[]; ! extern const int NumSQLScanKeywords; /* functions */ --- 59,65 ---- extern struct ECPGstruct_member *struct_member_list[STRUCT_DEPTH]; /* Globals from keywords.c */ ! extern const uint16 SQLScanKeywordTokens[]; /* functions */ *************** extern void check_indicator(struct ECPGt *** 102,109 **** extern void remove_typedefs(int); extern void remove_variables(int); extern struct variable *new_variable(const char *, struct ECPGtype *, int); ! extern const ScanKeyword *ScanCKeywordLookup(const char *); ! extern const ScanKeyword *ScanECPGKeywordLookup(const char *text); extern void parser_init(void); extern int filtered_base_yylex(void); --- 101,108 ---- extern void remove_typedefs(int); extern void remove_variables(int); extern struct variable *new_variable(const char *, struct ECPGtype *, int); ! extern int ScanCKeywordLookup(const char *text); ! 
extern int ScanECPGKeywordLookup(const char *text); extern void parser_init(void); extern int filtered_base_yylex(void); diff --git a/src/pl/plpgsql/src/.gitignore b/src/pl/plpgsql/src/.gitignore index ff6ac96..3ab9a22 100644 *** a/src/pl/plpgsql/src/.gitignore --- b/src/pl/plpgsql/src/.gitignore *************** *** 1,5 **** --- 1,7 ---- /pl_gram.c /pl_gram.h + /pl_reserved_kwlist_d.h + /pl_unreserved_kwlist_d.h /plerrcodes.h /log/ /results/ diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile index 25a5a9d..9dd4a74 100644 *** a/src/pl/plpgsql/src/Makefile --- b/src/pl/plpgsql/src/Makefile *************** REGRESS_OPTS = --dbname=$(PL_TESTDB) *** 29,34 **** --- 29,36 ---- REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \ plpgsql_cache plpgsql_transaction plpgsql_varprops + GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl + all: all-lib # Shared library stuff *************** uninstall-headers: *** 61,66 **** --- 63,69 ---- # Force these dependencies to be known even without dependency info built: pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o: plpgsql.h pl_gram.h plerrcodes.h + pl_scanner.o: pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h # See notes in src/backend/parser/Makefile about the following two rules pl_gram.h: pl_gram.c *************** pl_gram.c: BISONFLAGS += -d *** 72,77 **** --- 75,87 ---- plerrcodes.h: $(top_srcdir)/src/backend/utils/errcodes.txt generate-plerrcodes.pl $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@ + # generate keyword headers for the scanner + pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST) + $(PERL) $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $< + + pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST) + $(PERL) $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $< + check: submake $(pg_regress_check) $(REGRESS_OPTS) $(REGRESS) *************** submake: *** 84,96 **** $(MAKE) -C $(top_builddir)/src/test/regress pg_regress$(X) ! distprep: pl_gram.h pl_gram.c plerrcodes.h ! # pl_gram.c, pl_gram.h and plerrcodes.h are in the distribution tarball, ! # so they are not cleaned here. clean distclean: clean-lib rm -f $(OBJS) rm -rf $(pg_regress_clean_files) maintainer-clean: distclean ! rm -f pl_gram.c pl_gram.h plerrcodes.h --- 94,107 ---- $(MAKE) -C $(top_builddir)/src/test/regress pg_regress$(X) ! distprep: pl_gram.h pl_gram.c plerrcodes.h pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h ! # pl_gram.c, pl_gram.h, plerrcodes.h, pl_reserved_kwlist_d.h, and ! # pl_unreserved_kwlist_d.h are in the distribution tarball, so they ! # are not cleaned here. clean distclean: clean-lib rm -f $(OBJS) rm -rf $(pg_regress_clean_files) maintainer-clean: distclean ! rm -f pl_gram.c pl_gram.h plerrcodes.h pl_reserved_kwlist_d.h pl_unreserved_kwlist_d.h diff --git a/src/pl/plpgsql/src/pl_reserved_kwlist.h b/src/pl/plpgsql/src/pl_reserved_kwlist.h index ...5c2e0c1 . *** a/src/pl/plpgsql/src/pl_reserved_kwlist.h --- b/src/pl/plpgsql/src/pl_reserved_kwlist.h *************** *** 0 **** --- 1,53 ---- + /*------------------------------------------------------------------------- + * + * pl_reserved_kwlist.h + * + * The keyword lists are kept in their own source files for use by + * automatic tools. The exact representation of a keyword is determined + * by the PG_KEYWORD macro, which is not defined in this file; it can + * be defined by the caller for special purposes. 
+ * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/pl/plpgsql/src/pl_reserved_kwlist.h + * + *------------------------------------------------------------------------- + */ + + /* There is deliberately not an #ifndef PL_RESERVED_KWLIST_H here. */ + + /* + * List of (keyword-name, keyword-token-value) pairs. + * + * Be careful not to put the same word in both lists. + * + * !!WARNING!!: This list must be sorted by ASCII name, because binary + * search is used to locate entries. + */ + + /* name, value */ + PG_KEYWORD("all", K_ALL) + PG_KEYWORD("begin", K_BEGIN) + PG_KEYWORD("by", K_BY) + PG_KEYWORD("case", K_CASE) + PG_KEYWORD("declare", K_DECLARE) + PG_KEYWORD("else", K_ELSE) + PG_KEYWORD("end", K_END) + PG_KEYWORD("execute", K_EXECUTE) + PG_KEYWORD("for", K_FOR) + PG_KEYWORD("foreach", K_FOREACH) + PG_KEYWORD("from", K_FROM) + PG_KEYWORD("if", K_IF) + PG_KEYWORD("in", K_IN) + PG_KEYWORD("into", K_INTO) + PG_KEYWORD("loop", K_LOOP) + PG_KEYWORD("not", K_NOT) + PG_KEYWORD("null", K_NULL) + PG_KEYWORD("or", K_OR) + PG_KEYWORD("strict", K_STRICT) + PG_KEYWORD("then", K_THEN) + PG_KEYWORD("to", K_TO) + PG_KEYWORD("using", K_USING) + PG_KEYWORD("when", K_WHEN) + PG_KEYWORD("while", K_WHILE) diff --git a/src/pl/plpgsql/src/pl_scanner.c b/src/pl/plpgsql/src/pl_scanner.c index 8340628..c260438 100644 *** a/src/pl/plpgsql/src/pl_scanner.c --- b/src/pl/plpgsql/src/pl_scanner.c *************** *** 22,37 **** #include "pl_gram.h" /* must be after parser/scanner.h */ - #define PG_KEYWORD(a,b,c) {a,b,c}, - - /* Klugy flag to tell scanner how to look up identifiers */ IdentifierLookup plpgsql_IdentifierLookup = IDENTIFIER_LOOKUP_NORMAL; /* * A word about keywords: * ! * We keep reserved and unreserved keywords in separate arrays. The * reserved keywords are passed to the core scanner, so they will be * recognized before (and instead of) any variable name. Unreserved words * are checked for separately, usually after determining that the identifier --- 22,36 ---- #include "pl_gram.h" /* must be after parser/scanner.h */ /* Klugy flag to tell scanner how to look up identifiers */ IdentifierLookup plpgsql_IdentifierLookup = IDENTIFIER_LOOKUP_NORMAL; /* * A word about keywords: * ! * We keep reserved and unreserved keywords in separate headers. Be careful ! * not to put the same word in both headers. Also be sure that pl_gram.y's ! * unreserved_keyword production agrees with the unreserved header. The * reserved keywords are passed to the core scanner, so they will be * recognized before (and instead of) any variable name. Unreserved words * are checked for separately, usually after determining that the identifier *************** IdentifierLookup plpgsql_IdentifierLooku *** 57,186 **** * BEGIN BY DECLARE EXECUTE FOREACH IF LOOP STRICT WHILE */ ! /* ! * Lists of keyword (name, token-value, category) entries. ! * ! * !!WARNING!!: These lists must be sorted by ASCII name, because binary ! * search is used to locate entries. ! * ! * Be careful not to put the same word in both lists. Also be sure that ! * pl_gram.y's unreserved_keyword production agrees with the second list. ! */ ! static const ScanKeyword reserved_keywords[] = { ! PG_KEYWORD("all", K_ALL, RESERVED_KEYWORD) ! PG_KEYWORD("begin", K_BEGIN, RESERVED_KEYWORD) ! PG_KEYWORD("by", K_BY, RESERVED_KEYWORD) ! PG_KEYWORD("case", K_CASE, RESERVED_KEYWORD) ! PG_KEYWORD("declare", K_DECLARE, RESERVED_KEYWORD) ! 
PG_KEYWORD("else", K_ELSE, RESERVED_KEYWORD) ! PG_KEYWORD("end", K_END, RESERVED_KEYWORD) ! PG_KEYWORD("execute", K_EXECUTE, RESERVED_KEYWORD) ! PG_KEYWORD("for", K_FOR, RESERVED_KEYWORD) ! PG_KEYWORD("foreach", K_FOREACH, RESERVED_KEYWORD) ! PG_KEYWORD("from", K_FROM, RESERVED_KEYWORD) ! PG_KEYWORD("if", K_IF, RESERVED_KEYWORD) ! PG_KEYWORD("in", K_IN, RESERVED_KEYWORD) ! PG_KEYWORD("into", K_INTO, RESERVED_KEYWORD) ! PG_KEYWORD("loop", K_LOOP, RESERVED_KEYWORD) ! PG_KEYWORD("not", K_NOT, RESERVED_KEYWORD) ! PG_KEYWORD("null", K_NULL, RESERVED_KEYWORD) ! PG_KEYWORD("or", K_OR, RESERVED_KEYWORD) ! PG_KEYWORD("strict", K_STRICT, RESERVED_KEYWORD) ! PG_KEYWORD("then", K_THEN, RESERVED_KEYWORD) ! PG_KEYWORD("to", K_TO, RESERVED_KEYWORD) ! PG_KEYWORD("using", K_USING, RESERVED_KEYWORD) ! PG_KEYWORD("when", K_WHEN, RESERVED_KEYWORD) ! PG_KEYWORD("while", K_WHILE, RESERVED_KEYWORD) ! }; ! static const int num_reserved_keywords = lengthof(reserved_keywords); ! static const ScanKeyword unreserved_keywords[] = { ! PG_KEYWORD("absolute", K_ABSOLUTE, UNRESERVED_KEYWORD) ! PG_KEYWORD("alias", K_ALIAS, UNRESERVED_KEYWORD) ! PG_KEYWORD("array", K_ARRAY, UNRESERVED_KEYWORD) ! PG_KEYWORD("assert", K_ASSERT, UNRESERVED_KEYWORD) ! PG_KEYWORD("backward", K_BACKWARD, UNRESERVED_KEYWORD) ! PG_KEYWORD("call", K_CALL, UNRESERVED_KEYWORD) ! PG_KEYWORD("close", K_CLOSE, UNRESERVED_KEYWORD) ! PG_KEYWORD("collate", K_COLLATE, UNRESERVED_KEYWORD) ! PG_KEYWORD("column", K_COLUMN, UNRESERVED_KEYWORD) ! PG_KEYWORD("column_name", K_COLUMN_NAME, UNRESERVED_KEYWORD) ! PG_KEYWORD("commit", K_COMMIT, UNRESERVED_KEYWORD) ! PG_KEYWORD("constant", K_CONSTANT, UNRESERVED_KEYWORD) ! PG_KEYWORD("constraint", K_CONSTRAINT, UNRESERVED_KEYWORD) ! PG_KEYWORD("constraint_name", K_CONSTRAINT_NAME, UNRESERVED_KEYWORD) ! PG_KEYWORD("continue", K_CONTINUE, UNRESERVED_KEYWORD) ! PG_KEYWORD("current", K_CURRENT, UNRESERVED_KEYWORD) ! PG_KEYWORD("cursor", K_CURSOR, UNRESERVED_KEYWORD) ! PG_KEYWORD("datatype", K_DATATYPE, UNRESERVED_KEYWORD) ! PG_KEYWORD("debug", K_DEBUG, UNRESERVED_KEYWORD) ! PG_KEYWORD("default", K_DEFAULT, UNRESERVED_KEYWORD) ! PG_KEYWORD("detail", K_DETAIL, UNRESERVED_KEYWORD) ! PG_KEYWORD("diagnostics", K_DIAGNOSTICS, UNRESERVED_KEYWORD) ! PG_KEYWORD("do", K_DO, UNRESERVED_KEYWORD) ! PG_KEYWORD("dump", K_DUMP, UNRESERVED_KEYWORD) ! PG_KEYWORD("elseif", K_ELSIF, UNRESERVED_KEYWORD) ! PG_KEYWORD("elsif", K_ELSIF, UNRESERVED_KEYWORD) ! PG_KEYWORD("errcode", K_ERRCODE, UNRESERVED_KEYWORD) ! PG_KEYWORD("error", K_ERROR, UNRESERVED_KEYWORD) ! PG_KEYWORD("exception", K_EXCEPTION, UNRESERVED_KEYWORD) ! PG_KEYWORD("exit", K_EXIT, UNRESERVED_KEYWORD) ! PG_KEYWORD("fetch", K_FETCH, UNRESERVED_KEYWORD) ! PG_KEYWORD("first", K_FIRST, UNRESERVED_KEYWORD) ! PG_KEYWORD("forward", K_FORWARD, UNRESERVED_KEYWORD) ! PG_KEYWORD("get", K_GET, UNRESERVED_KEYWORD) ! PG_KEYWORD("hint", K_HINT, UNRESERVED_KEYWORD) ! PG_KEYWORD("import", K_IMPORT, UNRESERVED_KEYWORD) ! PG_KEYWORD("info", K_INFO, UNRESERVED_KEYWORD) ! PG_KEYWORD("insert", K_INSERT, UNRESERVED_KEYWORD) ! PG_KEYWORD("is", K_IS, UNRESERVED_KEYWORD) ! PG_KEYWORD("last", K_LAST, UNRESERVED_KEYWORD) ! PG_KEYWORD("log", K_LOG, UNRESERVED_KEYWORD) ! PG_KEYWORD("message", K_MESSAGE, UNRESERVED_KEYWORD) ! PG_KEYWORD("message_text", K_MESSAGE_TEXT, UNRESERVED_KEYWORD) ! PG_KEYWORD("move", K_MOVE, UNRESERVED_KEYWORD) ! PG_KEYWORD("next", K_NEXT, UNRESERVED_KEYWORD) ! PG_KEYWORD("no", K_NO, UNRESERVED_KEYWORD) ! PG_KEYWORD("notice", K_NOTICE, UNRESERVED_KEYWORD) ! 
PG_KEYWORD("open", K_OPEN, UNRESERVED_KEYWORD) ! PG_KEYWORD("option", K_OPTION, UNRESERVED_KEYWORD) ! PG_KEYWORD("perform", K_PERFORM, UNRESERVED_KEYWORD) ! PG_KEYWORD("pg_context", K_PG_CONTEXT, UNRESERVED_KEYWORD) ! PG_KEYWORD("pg_datatype_name", K_PG_DATATYPE_NAME, UNRESERVED_KEYWORD) ! PG_KEYWORD("pg_exception_context", K_PG_EXCEPTION_CONTEXT, UNRESERVED_KEYWORD) ! PG_KEYWORD("pg_exception_detail", K_PG_EXCEPTION_DETAIL, UNRESERVED_KEYWORD) ! PG_KEYWORD("pg_exception_hint", K_PG_EXCEPTION_HINT, UNRESERVED_KEYWORD) ! PG_KEYWORD("print_strict_params", K_PRINT_STRICT_PARAMS, UNRESERVED_KEYWORD) ! PG_KEYWORD("prior", K_PRIOR, UNRESERVED_KEYWORD) ! PG_KEYWORD("query", K_QUERY, UNRESERVED_KEYWORD) ! PG_KEYWORD("raise", K_RAISE, UNRESERVED_KEYWORD) ! PG_KEYWORD("relative", K_RELATIVE, UNRESERVED_KEYWORD) ! PG_KEYWORD("reset", K_RESET, UNRESERVED_KEYWORD) ! PG_KEYWORD("return", K_RETURN, UNRESERVED_KEYWORD) ! PG_KEYWORD("returned_sqlstate", K_RETURNED_SQLSTATE, UNRESERVED_KEYWORD) ! PG_KEYWORD("reverse", K_REVERSE, UNRESERVED_KEYWORD) ! PG_KEYWORD("rollback", K_ROLLBACK, UNRESERVED_KEYWORD) ! PG_KEYWORD("row_count", K_ROW_COUNT, UNRESERVED_KEYWORD) ! PG_KEYWORD("rowtype", K_ROWTYPE, UNRESERVED_KEYWORD) ! PG_KEYWORD("schema", K_SCHEMA, UNRESERVED_KEYWORD) ! PG_KEYWORD("schema_name", K_SCHEMA_NAME, UNRESERVED_KEYWORD) ! PG_KEYWORD("scroll", K_SCROLL, UNRESERVED_KEYWORD) ! PG_KEYWORD("set", K_SET, UNRESERVED_KEYWORD) ! PG_KEYWORD("slice", K_SLICE, UNRESERVED_KEYWORD) ! PG_KEYWORD("sqlstate", K_SQLSTATE, UNRESERVED_KEYWORD) ! PG_KEYWORD("stacked", K_STACKED, UNRESERVED_KEYWORD) ! PG_KEYWORD("table", K_TABLE, UNRESERVED_KEYWORD) ! PG_KEYWORD("table_name", K_TABLE_NAME, UNRESERVED_KEYWORD) ! PG_KEYWORD("type", K_TYPE, UNRESERVED_KEYWORD) ! PG_KEYWORD("use_column", K_USE_COLUMN, UNRESERVED_KEYWORD) ! PG_KEYWORD("use_variable", K_USE_VARIABLE, UNRESERVED_KEYWORD) ! PG_KEYWORD("variable_conflict", K_VARIABLE_CONFLICT, UNRESERVED_KEYWORD) ! PG_KEYWORD("warning", K_WARNING, UNRESERVED_KEYWORD) }; ! static const int num_unreserved_keywords = lengthof(unreserved_keywords); /* * This macro must recognize all tokens that can immediately precede a --- 56,77 ---- * BEGIN BY DECLARE EXECUTE FOREACH IF LOOP STRICT WHILE */ ! /* ScanKeywordList lookup data for PL/pgSQL keywords */ ! #include "pl_reserved_kwlist_d.h" ! #include "pl_unreserved_kwlist_d.h" ! /* Token codes for PL/pgSQL keywords */ ! #define PG_KEYWORD(kwname, value) value, ! static const uint16 ReservedPLKeywordTokens[] = { ! #include "pl_reserved_kwlist.h" ! }; ! static const uint16 UnreservedPLKeywordTokens[] = { ! #include "pl_unreserved_kwlist.h" }; ! #undef PG_KEYWORD /* * This macro must recognize all tokens that can immediately precede a *************** plpgsql_yylex(void) *** 256,262 **** { int tok1; TokenAuxData aux1; ! const ScanKeyword *kw; tok1 = internal_yylex(&aux1); if (tok1 == IDENT || tok1 == PARAM) --- 147,153 ---- { int tok1; TokenAuxData aux1; ! int kwnum; tok1 = internal_yylex(&aux1); if (tok1 == IDENT || tok1 == PARAM) *************** plpgsql_yylex(void) *** 333,344 **** &aux1.lval.word)) tok1 = T_DATUM; else if (!aux1.lval.word.quoted && ! (kw = ScanKeywordLookup(aux1.lval.word.ident, ! unreserved_keywords, ! num_unreserved_keywords))) { ! aux1.lval.keyword = kw->name; ! tok1 = kw->value; } else tok1 = T_WORD; --- 224,235 ---- &aux1.lval.word)) tok1 = T_DATUM; else if (!aux1.lval.word.quoted && ! (kwnum = ScanKeywordLookup(aux1.lval.word.ident, ! &UnreservedPLKeywords)) >= 0) { ! aux1.lval.keyword = GetScanKeyword(kwnum, ! 
&UnreservedPLKeywords); ! tok1 = UnreservedPLKeywordTokens[kwnum]; } else tok1 = T_WORD; *************** plpgsql_yylex(void) *** 375,386 **** &aux1.lval.word)) tok1 = T_DATUM; else if (!aux1.lval.word.quoted && ! (kw = ScanKeywordLookup(aux1.lval.word.ident, ! unreserved_keywords, ! num_unreserved_keywords))) { ! aux1.lval.keyword = kw->name; ! tok1 = kw->value; } else tok1 = T_WORD; --- 266,277 ---- &aux1.lval.word)) tok1 = T_DATUM; else if (!aux1.lval.word.quoted && ! (kwnum = ScanKeywordLookup(aux1.lval.word.ident, ! &UnreservedPLKeywords)) >= 0) { ! aux1.lval.keyword = GetScanKeyword(kwnum, ! &UnreservedPLKeywords); ! tok1 = UnreservedPLKeywordTokens[kwnum]; } else tok1 = T_WORD; *************** plpgsql_token_is_unreserved_keyword(int *** 497,505 **** { int i; ! for (i = 0; i < num_unreserved_keywords; i++) { ! if (unreserved_keywords[i].value == token) return true; } return false; --- 388,396 ---- { int i; ! for (i = 0; i < lengthof(UnreservedPLKeywordTokens); i++) { ! if (UnreservedPLKeywordTokens[i] == token) return true; } return false; *************** plpgsql_scanner_init(const char *str) *** 696,702 **** { /* Start up the core scanner */ yyscanner = scanner_init(str, &core_yy, ! reserved_keywords, num_reserved_keywords); /* * scanorig points to the original string, which unlike the scanner's --- 587,593 ---- { /* Start up the core scanner */ yyscanner = scanner_init(str, &core_yy, ! &ReservedPLKeywords, ReservedPLKeywordTokens); /* * scanorig points to the original string, which unlike the scanner's diff --git a/src/pl/plpgsql/src/pl_unreserved_kwlist.h b/src/pl/plpgsql/src/pl_unreserved_kwlist.h index ...ef2aea0 . *** a/src/pl/plpgsql/src/pl_unreserved_kwlist.h --- b/src/pl/plpgsql/src/pl_unreserved_kwlist.h *************** *** 0 **** --- 1,111 ---- + /*------------------------------------------------------------------------- + * + * pl_unreserved_kwlist.h + * + * The keyword lists are kept in their own source files for use by + * automatic tools. The exact representation of a keyword is determined + * by the PG_KEYWORD macro, which is not defined in this file; it can + * be defined by the caller for special purposes. + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/pl/plpgsql/src/pl_unreserved_kwlist.h + * + *------------------------------------------------------------------------- + */ + + /* There is deliberately not an #ifndef PL_UNRESERVED_KWLIST_H here. */ + + /* + * List of (keyword-name, keyword-token-value) pairs. + * + * Be careful not to put the same word in both lists. Also be sure that + * pl_gram.y's unreserved_keyword production agrees with this list. + * + * !!WARNING!!: This list must be sorted by ASCII name, because binary + * search is used to locate entries. 
+ */ + + /* name, value */ + PG_KEYWORD("absolute", K_ABSOLUTE) + PG_KEYWORD("alias", K_ALIAS) + PG_KEYWORD("array", K_ARRAY) + PG_KEYWORD("assert", K_ASSERT) + PG_KEYWORD("backward", K_BACKWARD) + PG_KEYWORD("call", K_CALL) + PG_KEYWORD("close", K_CLOSE) + PG_KEYWORD("collate", K_COLLATE) + PG_KEYWORD("column", K_COLUMN) + PG_KEYWORD("column_name", K_COLUMN_NAME) + PG_KEYWORD("commit", K_COMMIT) + PG_KEYWORD("constant", K_CONSTANT) + PG_KEYWORD("constraint", K_CONSTRAINT) + PG_KEYWORD("constraint_name", K_CONSTRAINT_NAME) + PG_KEYWORD("continue", K_CONTINUE) + PG_KEYWORD("current", K_CURRENT) + PG_KEYWORD("cursor", K_CURSOR) + PG_KEYWORD("datatype", K_DATATYPE) + PG_KEYWORD("debug", K_DEBUG) + PG_KEYWORD("default", K_DEFAULT) + PG_KEYWORD("detail", K_DETAIL) + PG_KEYWORD("diagnostics", K_DIAGNOSTICS) + PG_KEYWORD("do", K_DO) + PG_KEYWORD("dump", K_DUMP) + PG_KEYWORD("elseif", K_ELSIF) + PG_KEYWORD("elsif", K_ELSIF) + PG_KEYWORD("errcode", K_ERRCODE) + PG_KEYWORD("error", K_ERROR) + PG_KEYWORD("exception", K_EXCEPTION) + PG_KEYWORD("exit", K_EXIT) + PG_KEYWORD("fetch", K_FETCH) + PG_KEYWORD("first", K_FIRST) + PG_KEYWORD("forward", K_FORWARD) + PG_KEYWORD("get", K_GET) + PG_KEYWORD("hint", K_HINT) + PG_KEYWORD("import", K_IMPORT) + PG_KEYWORD("info", K_INFO) + PG_KEYWORD("insert", K_INSERT) + PG_KEYWORD("is", K_IS) + PG_KEYWORD("last", K_LAST) + PG_KEYWORD("log", K_LOG) + PG_KEYWORD("message", K_MESSAGE) + PG_KEYWORD("message_text", K_MESSAGE_TEXT) + PG_KEYWORD("move", K_MOVE) + PG_KEYWORD("next", K_NEXT) + PG_KEYWORD("no", K_NO) + PG_KEYWORD("notice", K_NOTICE) + PG_KEYWORD("open", K_OPEN) + PG_KEYWORD("option", K_OPTION) + PG_KEYWORD("perform", K_PERFORM) + PG_KEYWORD("pg_context", K_PG_CONTEXT) + PG_KEYWORD("pg_datatype_name", K_PG_DATATYPE_NAME) + PG_KEYWORD("pg_exception_context", K_PG_EXCEPTION_CONTEXT) + PG_KEYWORD("pg_exception_detail", K_PG_EXCEPTION_DETAIL) + PG_KEYWORD("pg_exception_hint", K_PG_EXCEPTION_HINT) + PG_KEYWORD("print_strict_params", K_PRINT_STRICT_PARAMS) + PG_KEYWORD("prior", K_PRIOR) + PG_KEYWORD("query", K_QUERY) + PG_KEYWORD("raise", K_RAISE) + PG_KEYWORD("relative", K_RELATIVE) + PG_KEYWORD("reset", K_RESET) + PG_KEYWORD("return", K_RETURN) + PG_KEYWORD("returned_sqlstate", K_RETURNED_SQLSTATE) + PG_KEYWORD("reverse", K_REVERSE) + PG_KEYWORD("rollback", K_ROLLBACK) + PG_KEYWORD("row_count", K_ROW_COUNT) + PG_KEYWORD("rowtype", K_ROWTYPE) + PG_KEYWORD("schema", K_SCHEMA) + PG_KEYWORD("schema_name", K_SCHEMA_NAME) + PG_KEYWORD("scroll", K_SCROLL) + PG_KEYWORD("set", K_SET) + PG_KEYWORD("slice", K_SLICE) + PG_KEYWORD("sqlstate", K_SQLSTATE) + PG_KEYWORD("stacked", K_STACKED) + PG_KEYWORD("table", K_TABLE) + PG_KEYWORD("table_name", K_TABLE_NAME) + PG_KEYWORD("type", K_TYPE) + PG_KEYWORD("use_column", K_USE_COLUMN) + PG_KEYWORD("use_variable", K_USE_VARIABLE) + PG_KEYWORD("variable_conflict", K_VARIABLE_CONFLICT) + PG_KEYWORD("warning", K_WARNING) diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl index ...eb5ed65 . *** a/src/tools/gen_keywordlist.pl --- b/src/tools/gen_keywordlist.pl *************** *** 0 **** --- 1,148 ---- + #---------------------------------------------------------------------- + # + # gen_keywordlist.pl + # Perl script that transforms a list of keywords into a ScanKeywordList + # data structure that can be passed to ScanKeywordLookup(). + # + # The input is a C header file containing a series of macro calls + # PG_KEYWORD("keyword", ...) + # Lines not starting with PG_KEYWORD are ignored. 
The keywords are + # implicitly numbered 0..N-1 in order of appearance in the header file. + # Currently, the keywords are required to appear in ASCII order. + # + # The output is a C header file that defines a "const ScanKeywordList" + # variable named according to the -v switch ("ScanKeywords" by default). + # The variable is marked "static" unless the -e switch is given. + # + # + # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + # Portions Copyright (c) 1994, Regents of the University of California + # + # src/tools/gen_keywordlist.pl + # + #---------------------------------------------------------------------- + + use strict; + use warnings; + use Getopt::Long; + + my $output_path = ''; + my $extern = 0; + my $varname = 'ScanKeywords'; + + GetOptions( + 'output:s' => \$output_path, + 'extern' => \$extern, + 'varname:s' => \$varname) || usage(); + + my $kw_input_file = shift @ARGV || die "No input file.\n"; + + # Make sure output_path ends in a slash if needed. + if ($output_path ne '' && substr($output_path, -1) ne '/') + { + $output_path .= '/'; + } + + $kw_input_file =~ /(\w+)\.h$/ || die "Input file must be named something.h.\n"; + my $base_filename = $1 . '_d'; + my $kw_def_file = $output_path . $base_filename . '.h'; + + open(my $kif, '<', $kw_input_file) || die "$kw_input_file: $!\n"; + open(my $kwdef, '>', $kw_def_file) || die "$kw_def_file: $!\n"; + + # Opening boilerplate for keyword definition header. + printf $kwdef <<EOM, $base_filename, uc $base_filename, uc $base_filename; + /*------------------------------------------------------------------------- + * + * %s.h + * List of keywords represented as a ScanKeywordList. + * + * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * NOTES + * ****************************** + * *** DO NOT EDIT THIS FILE! *** + * ****************************** + * + * It has been GENERATED by src/tools/gen_keywordlist.pl + * + *------------------------------------------------------------------------- + */ + + #ifndef %s_H + #define %s_H + + #include "common/kwlookup.h" + + EOM + + # Parse input file for keyword names. + my @keywords; + while (<$kif>) + { + if (/^PG_KEYWORD\("(\w+)"/) + { + push @keywords, $1; + } + } + + # Error out if the keyword names are not in ASCII order. + for my $i (0..$#keywords - 1) + { + die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n| + if ($keywords[$i] cmp $keywords[$i + 1]) >= 0; + } + + # Emit the string containing all the keywords. + + printf $kwdef qq|static const char %s_kw_string[] =\n\t"|, $varname; + print $kwdef join qq|\\0"\n\t"|, @keywords; + print $kwdef qq|";\n\n|; + + # Emit an array of numerical offsets which will be used to index into the + # keyword string. + + printf $kwdef "static const uint16 %s_kw_offsets[] = {\n", $varname; + + my $offset = 0; + foreach my $name (@keywords) + { + print $kwdef "\t$offset,\n"; + + # Calculate the cumulative offset of the next keyword, + # taking into account the null terminator. + $offset += length($name) + 1; + } + + print $kwdef "};\n\n"; + + # Emit a macro defining the number of keywords. + + printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords; + + # Emit the struct that wraps all this lookup info into one variable. 
+ + print $kwdef "static " if !$extern; + printf $kwdef "const ScanKeywordList %s = {\n", $varname; + printf $kwdef qq|\t%s_kw_string,\n|, $varname; + printf $kwdef qq|\t%s_kw_offsets,\n|, $varname; + printf $kwdef qq|\t%s_NUM_KEYWORDS\n|, uc $varname; + print $kwdef "};\n\n"; + + printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename; + + + sub usage + { + die <<EOM; + Usage: gen_keywordlist.pl [--output/-o <path>] [--varname/-v <varname>] [--extern/-e] input_file + --output Output directory (default '.') + --varname Name for ScanKeywordList variable (default 'ScanKeywords') + --extern Allow the ScanKeywordList variable to be globally visible + + gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList. + The output filename is derived from the input file by inserting _d, + for example kwlist_d.h is produced from kwlist.h. + EOM + } diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm index 2921d19..56192f1 100644 *** a/src/tools/msvc/Mkvcbuild.pm --- b/src/tools/msvc/Mkvcbuild.pm *************** sub mkvcbuild *** 118,124 **** our @pgcommonallfiles = qw( base64.c config_info.c controldata_utils.c exec.c file_perm.c ip.c ! keywords.c link-canary.c md5.c pg_lzcompress.c pgfnames.c psprintf.c relpath.c rmtree.c saslprep.c scram-common.c string.c unicode_norm.c username.c wait_error.c); --- 118,124 ---- our @pgcommonallfiles = qw( base64.c config_info.c controldata_utils.c exec.c file_perm.c ip.c ! keywords.c kwlookup.c link-canary.c md5.c pg_lzcompress.c pgfnames.c psprintf.c relpath.c rmtree.c saslprep.c scram-common.c string.c unicode_norm.c username.c wait_error.c); diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm index eb2346b..937bf18 100644 *** a/src/tools/msvc/Solution.pm --- b/src/tools/msvc/Solution.pm *************** sub GenerateFiles *** 410,415 **** --- 410,451 ---- } if (IsNewer( + 'src/common/kwlist_d.h', + 'src/include/parser/kwlist.h')) + { + print "Generating kwlist_d.h...\n"; + system('perl src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h'); + } + + if (IsNewer( + 'src/pl/plpgsql/src/pl_reserved_kwlist_d.h', + 'src/pl/plpgsql/src/pl_reserved_kwlist.h') + || IsNewer( + 'src/pl/plpgsql/src/pl_unreserved_kwlist_d.h', + 'src/pl/plpgsql/src/pl_unreserved_kwlist.h')) + { + print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n"; + chdir('src/pl/plpgsql/src'); + system('perl ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h'); + system('perl ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h'); + chdir('../../../..'); + } + + if (IsNewer( + 'src/interfaces/ecpg/preproc/c_kwlist_d.h', + 'src/interfaces/ecpg/preproc/c_kwlist.h') + || IsNewer( + 'src/interfaces/ecpg/preproc/ecpg_kwlist_d.h', + 'src/interfaces/ecpg/preproc/ecpg_kwlist.h')) + { + print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n"; + chdir('src/interfaces/ecpg/preproc'); + system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h'); + system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h'); + chdir('../../../..'); + } + + if (IsNewer( 'src/interfaces/ecpg/preproc/preproc.y', 'src/backend/parser/gram.y')) { diff --git a/src/tools/msvc/clean.bat b/src/tools/msvc/clean.bat index 7a23a2b..069d6eb 100755 *** a/src/tools/msvc/clean.bat --- b/src/tools/msvc/clean.bat *************** if %DIST%==1 if exist src\pl\tcl\pltcler *** 64,69 **** --- 64,74 ---- if %DIST%==1 if exist 
src\backend\utils\sort\qsort_tuple.c del /q src\backend\utils\sort\qsort_tuple.c if %DIST%==1 if exist src\bin\psql\sql_help.c del /q src\bin\psql\sql_help.c if %DIST%==1 if exist src\bin\psql\sql_help.h del /q src\bin\psql\sql_help.h + if %DIST%==1 if exist src\common\kwlist_d.h del /q src\common\kwlist_d.h + if %DIST%==1 if exist src\pl\plpgsql\src\pl_reserved_kwlist_d.h del /q src\pl\plpgsql\src\pl_reserved_kwlist_d.h + if %DIST%==1 if exist src\pl\plpgsql\src\pl_unreserved_kwlist_d.h del /q src\pl\plpgsql\src\pl_unreserved_kwlist_d.h + if %DIST%==1 if exist src\interfaces\ecpg\preproc\c_kwlist_d.h del /q src\interfaces\ecpg\preproc\c_kwlist_d.h + if %DIST%==1 if exist src\interfaces\ecpg\preproc\ecpg_kwlist_d.h del /q src\interfaces\ecpg\preproc\ecpg_kwlist_d.h if %DIST%==1 if exist src\interfaces\ecpg\preproc\preproc.y del /q src\interfaces\ecpg\preproc\preproc.y if %DIST%==1 if exist src\backend\catalog\postgres.bki del /q src\backend\catalog\postgres.bki if %DIST%==1 if exist src\backend\catalog\postgres.description del /q src\backend\catalog\postgres.description
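To make the generator's output concrete: for a hypothetical three-keyword
input list, gen_keywordlist.pl as written above would emit roughly the
following header (the "DemoKeywords" variable name and the keywords are
invented here purely for illustration; boilerplate comments are omitted):

	#ifndef DEMO_KWLIST_D_H
	#define DEMO_KWLIST_D_H

	#include "common/kwlookup.h"

	static const char DemoKeywords_kw_string[] =
		"abort\0"
		"begin\0"
		"commit";

	static const uint16 DemoKeywords_kw_offsets[] = {
		0,
		6,
		12,
	};

	#define DEMOKEYWORDS_NUM_KEYWORDS 3

	static const ScanKeywordList DemoKeywords = {
		DemoKeywords_kw_string,
		DemoKeywords_kw_offsets,
		DEMOKEYWORDS_NUM_KEYWORDS
	};

	#endif							/* DEMO_KWLIST_D_H */

With such a header, ScanKeywordLookup("begin", &DemoKeywords) returns 1,
and GetScanKeyword(1, &DemoKeywords) recovers the keyword text from the
giant string.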
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: David Rowley
On Sat, 5 Jan 2019 at 09:20, John Naylor <jcnaylor@gmail.com> wrote:
> On 1/3/19, Joerg Sonnenberger <joerg@bec.de> wrote:
>> Hello John,
>> I was pointed at your patch on IRC and decided to look into adding my
>> own pieces. What I can provide you is a fast perfect hash function
>> generator. I've attached a sample hash function based on the current
>> main keyword list. hash() essentially gives you the number of the only
>> possible match; a final strcmp/memcmp is still necessary to verify that
>> it is an actual keyword, though. The |0x20 can be dropped if all cases
>> have pre-lower-cased the input already. This would replace the binary
>> search in the lookup functions. Returning offsets directly would be
>> easy as well. That allows writing a single string where each entry is
>> prefixed with a type mask, the token id, the length of the keyword and
>> the actual keyword text. Does that sound useful to you?
>
> Judging by previous responses, there is still interest in using
> perfect hash functions, so thanks for this. I'm not knowledgeable
> enough to judge its implementation, so I'll leave that for others.

Well, I'm quite impressed by the resulting hash function. The resulting
hash value can be used directly to index the existing 440-element
ScanKeywords[] array (far better than the 1815-element array Andrew got
from gperf).

If we also stored the length of each keyword in that array, we could
compare lengths right after hashing. If the lengths match, a memcmp()
confirms the match; that should be a little cheaper than strcmp(), and
on 64-bit machines we should be able to store the length for free. If
the lengths differ, the word is not a keyword. It may also save some
cycles to determine the input word's length at the same time as lowering
it. The keyword length could easily be captured by changing the
PG_KEYWORD macro to become:

#define PG_KEYWORD(a,b,c) {a,0,c,sizeof(a)-1},

after, of course, adding a new field to the ScanKeyword struct.

What I'm most interested in is how long it took to generate the hash
function in hash2.c?

--
David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
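As a concrete sketch of the hash-then-verify lookup described above, with
the keyword length stored alongside each entry: the entry layout and all
names here are invented for illustration, and hash() merely stands in for
Joerg's generated function.

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	typedef struct ScanKeywordHashEntry
	{
		const char *name;			/* keyword text, lower case */
		uint16_t	token;			/* grammar token code */
		uint16_t	length;			/* strlen(name), from sizeof("kw") - 1 */
	} ScanKeywordHashEntry;

	extern unsigned int hash(const void *key, size_t keylen);	/* generated */
	extern const ScanKeywordHashEntry KeywordTable[];	/* e.g. 440 entries */

	static int
	KeywordLookupSketch(const char *word, size_t len)
	{
		/* the perfect hash nominates exactly one candidate entry */
		const ScanKeywordHashEntry *kw = &KeywordTable[hash(word, len)];

		/* reject on length first; then memcmp() is cheaper than strcmp() */
		if (kw->length == len && memcmp(kw->name, word, len) == 0)
			return kw->token;
		return -1;					/* not a keyword */
	}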
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Joerg Sonnenberger
On Mon, Jan 07, 2019 at 03:11:55AM +1300, David Rowley wrote:
> What I'm most interested in is how long it took to generate the hash
> function in hash2.c?

It's within the noise floor of time(1) on my laptop, e.g. ~1ms.

Joerg
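For context on what such a generated function looks like: a perfect hash
of this kind typically combines two simple string hashes through a
precomputed table, in the spirit of the g[881] table visible in the patch
fragment further below. The multipliers, moduli, and names in this sketch
are illustrative only, not taken from hash2.c.

	#include <stddef.h>
	#include <stdint.h>

	/* table emitted by the generator; 881 entries in the sample below */
	extern const uint16_t g[881];

	static unsigned int
	perfect_hash(const unsigned char *key, size_t keylen)
	{
		uint32_t	a = 0,
					b = 0;

		for (size_t i = 0; i < keylen; i++)
		{
			/* |0x20 folds ASCII upper case to lower case on the fly */
			unsigned char ch = key[i] | 0x20;

			a = a * 31 + ch;	/* multipliers chosen by the generator */
			b = b * 127 + ch;
		}
		/* the two g[] values sum to the unique candidate keyword number */
		return (g[a % 881] + g[b % 881]) % 440;
	}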
Joerg Sonnenberger <joerg@bec.de> writes:
> On Mon, Jan 07, 2019 at 03:11:55AM +1300, David Rowley wrote:
>> What I'm most interested in is how long it took to generate the hash
>> function in hash2.c?

> It's within the noise floor of time(1) on my laptop, e.g. ~1ms.

I decided to do some simple performance measurements to see if we're
actually getting any useful results here.  I set up a test case that
just runs raw_parser in a tight loop:

	while (count-- > 0)
	{
		List	   *parsetree_list;
		MemoryContext oldcontext;

		oldcontext = MemoryContextSwitchTo(mycontext);
		parsetree_list = pg_parse_query(query_string);
		MemoryContextSwitchTo(oldcontext);
		MemoryContextReset(mycontext);
		CHECK_FOR_INTERRUPTS();
	}

and exercised it on the contents of information_schema.sql.  I think
that's a reasonably representative test case, considering that we're
only examining the speed of the flex+bison stages.  (Since it's mostly
DDL, including parse analysis might be a bit unlike normal workloads,
but for raw parsing it should be fine.)

On my workstation, in a non-cassert build, HEAD requires about 4700 ms
for 1000 iterations (with maybe 1% cross-run variation).  Applying the
v8 patch I posted yesterday, the time drops to ~4520 ms, or about a 4%
savings.  So not a lot, but it's quite repeatable, and it shows that
reducing the cache footprint of keyword lookup is worth something.

I then tried plastering in Joerg's example hash function, as in the
attached delta patch on top of v8.  This is *not* a usable patch; it
breaks plpgsql and ecpg, because ScanKeywordLookup no longer works for
non-core keyword sets.  But that doesn't matter for the
information_schema test case, and on that I find the runtime drops to
~3820 ms, or 19% better than HEAD.  So this is definitely an idea worth
pursuing.

Some additional thoughts:

* It's too bad that the hash function doesn't have a return convention
that allows distinguishing "couldn't possibly match any keyword" from
"might match keyword 0".  I imagine a lot of the zero entries in its
hashtable could be interpreted as the former, so at the cost of perhaps
a couple more if-tests we could skip work at the caller.  As this patch
stands, we could only skip the strcmp(), so it might not be worth the
trouble --- but if we use Joerg's |0x20 hack then we could hash before
downcasing, allowing us to skip the downcasing step if the word couldn't
possibly be a keyword.  Likely that would be worth the trouble.

* We should extend the ScanKeywordList representation to include a
field holding the longest keyword length in the table, which
gen_keywordlist.pl would have no trouble providing.  Then we could skip
downcasing and/or hashing for any word longer than that, replacing the
current NAMEDATALEN test, and thereby putting a tight bound on the cost
of downcasing and/or hashing.

* If we do hash first, then we could replace the downcasing loop and
strcmp with an integrated loop that downcases and compares one character
at a time, removing the need for the NAMEDATALEN-sized buffer variable.
(See the sketch following the patch excerpt below.)

* I think it matters to the speed of the hashing loop that the magic
multipliers are hard-coded.  (Examining the assembly code shows that, at
least on my x86_64 hardware, gcc prefers to use shift-and-add sequences
here instead of multiply instructions.)  So we probably can't have
inlined hashing code --- I imagine the hash generator needs the
flexibility to pick different values of those multipliers.
I envision making this work by having gen_keywordlist.pl emit a function definition for the hash step and include a function pointer to it in ScanKeywordList. That extra function call will make things fractionally slower than what I have here, but I don't think it will change the conclusions any. This approach would also greatly alleviate the concern I had yesterday about ecpg's c_keywords.c having a second copy of the hashing code; what it would have is its own generated function, which isn't much of a problem. * Given that the generator's runtime is negligible when coded in C, I suspect that we might be able to tolerate the speed hit from translating it to Perl, and frankly I'd much rather do that than cope with the consequences of including C code in our build process. I'm eager to see Joerg's code, but I think in the meantime I'm going to go look up NetBSD's nbperf to get an idea of how painful it might be to do in Perl. (Now, bearing in mind that I'm not exactly fluent in Perl, there are probably other people around here who could produce a better translation ...) regards, tom lane diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index db62623..8c55f40 100644 *** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *************** *** 17,22 **** --- 17,149 ---- #include "common/kwlookup.h" + static uint32 + hash(const void *key, size_t keylen) + { + static const uint16 g[881] = { + 0x015b, 0x0000, 0x0070, 0x01b2, 0x0000, 0x0078, 0x0020, 0x0000, + 0x0000, 0x0193, 0x0000, 0x0000, 0x0000, 0x01ac, 0x0000, 0x0122, + 0x00b9, 0x0176, 0x013b, 0x0000, 0x0000, 0x0000, 0x0150, 0x0000, + 0x0000, 0x0000, 0x008b, 0x00ea, 0x00b3, 0x0197, 0x0000, 0x0118, + 0x012d, 0x0102, 0x0000, 0x0091, 0x0061, 0x008c, 0x0000, 0x0000, + 0x0144, 0x01b4, 0x0000, 0x0000, 0x01a8, 0x019e, 0x0000, 0x00da, + 0x0000, 0x0000, 0x0122, 0x0176, 0x00f3, 0x016a, 0x00f4, 0x00c0, + 0x0111, 0x0000, 0x0103, 0x0028, 0x001a, 0x0180, 0x0000, 0x0000, + 0x005f, 0x0000, 0x00d9, 0x0000, 0x016d, 0x0000, 0x0170, 0x0007, + 0x016f, 0x0000, 0x014e, 0x0098, 0x00a8, 0x004b, 0x0000, 0x0056, + 0x0000, 0x0121, 0x0012, 0x0102, 0x0192, 0x0000, 0x00f2, 0x0066, + 0x0000, 0x003a, 0x0000, 0x0000, 0x0144, 0x0000, 0x0000, 0x0133, + 0x0067, 0x0169, 0x0000, 0x0000, 0x0152, 0x0122, 0x0000, 0x0058, + 0x0135, 0x0045, 0x0193, 0x00d2, 0x007e, 0x0000, 0x00ae, 0x012c, + 0x0000, 0x0000, 0x0000, 0x0000, 0x0124, 0x0000, 0x0046, 0x0018, + 0x0000, 0x00ba, 0x00d1, 0x004a, 0x0000, 0x0000, 0x0000, 0x0000, + 0x0000, 0x001f, 0x0000, 0x0101, 0x0000, 0x0000, 0x0000, 0x01b5, + 0x016e, 0x0173, 0x008a, 0x0000, 0x0173, 0x000b, 0x0000, 0x00d5, + 0x005e, 0x0000, 0x00ac, 0x0000, 0x0000, 0x0000, 0x01a1, 0x0000, + 0x0000, 0x0127, 0x0000, 0x005e, 0x0000, 0x016f, 0x0000, 0x012b, + 0x01a4, 0x01b4, 0x0000, 0x0000, 0x003a, 0x0000, 0x0000, 0x00f5, + 0x00b1, 0x0003, 0x0123, 0x001b, 0x0000, 0x004f, 0x0000, 0x0000, + 0x0000, 0x0000, 0x0000, 0x007a, 0x0000, 0x0000, 0x0000, 0x0000, + 0x00c2, 0x00a2, 0x00b9, 0x0000, 0x00cb, 0x0000, 0x00d2, 0x0000, + 0x0197, 0x0121, 0x0000, 0x00d6, 0x0107, 0x0000, 0x0000, 0x0000, + 0x0000, 0x0000, 0x0000, 0x0165, 0x00df, 0x0121, 0x0000, 0x0000, + 0x0000, 0x0000, 0x0000, 0x019b, 0x0000, 0x01ad, 0x0000, 0x014f, + 0x018d, 0x0000, 0x015f, 0x0168, 0x0000, 0x0199, 0x0000, 0x0000, + 0x0000, 0x00a1, 0x0000, 0x0000, 0x0109, 0x0000, 0x0000, 0x01a6, + 0x0097, 0x0000, 0x0018, 0x0000, 0x00d1, 0x0000, 0x0000, 0x0000, + 0x0187, 0x0018, 0x0000, 0x00aa, 0x0000, 0x0000, 0x0000, 0x0000, + 0x0136, 0x0063, 0x00b8, 0x0000, 0x0067, 0x0114, 0x0000, 0x0000, + 0x0151, 0x0000, 
0x0000, 0x0000, 0x00bf, 0x0000, 0x0000, 0x0000, + 0x01b4, 0x00d4, 0x0000, 0x0006, 0x017e, 0x0167, 0x003a, 0x017f, + 0x0183, 0x00c9, 0x01a2, 0x0000, 0x0000, 0x0153, 0x00ce, 0x0000, + 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0051, 0x0000, 0x0086, + 0x0000, 0x0083, 0x0137, 0x0000, 0x0000, 0x0050, 0x0000, 0x00d7, + 0x0000, 0x0000, 0x0000, 0x0129, 0x00f1, 0x0000, 0x009b, 0x01a7, + 0x0000, 0x00b4, 0x0000, 0x00e0, 0x0046, 0x0025, 0x0000, 0x0000, + 0x0000, 0x0144, 0x0000, 0x01a5, 0x0044, 0x0096, 0x0078, 0x0166, + 0x0000, 0x0000, 0x0000, 0x0143, 0x0000, 0x00b8, 0x0000, 0x009e, + 0x0000, 0x008c, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x00fe, + 0x0000, 0x0000, 0x0037, 0x0057, 0x0000, 0x00c3, 0x0000, 0x0000, + 0x0000, 0x00bf, 0x014b, 0x0069, 0x00ce, 0x0000, 0x019d, 0x007f, + 0x0186, 0x0000, 0x0119, 0x0015, 0x0000, 0x000e, 0x0113, 0x0139, + 0x008e, 0x01ab, 0x0000, 0x005c, 0x0000, 0x0095, 0x0000, 0x019d, + 0x0000, 0x0195, 0x0036, 0x0000, 0x0000, 0x00e0, 0x0146, 0x0000, + 0x0033, 0x0000, 0x0000, 0x0035, 0x0000, 0x0000, 0x0000, 0x0000, + 0x00d2, 0x0000, 0x0000, 0x0000, 0x0000, 0x004c, 0x00f0, 0x0000, + 0x0119, 0x00bd, 0x0000, 0x0000, 0x0031, 0x0117, 0x00b4, 0x0000, + 0x00f8, 0x0000, 0x0055, 0x0000, 0x0170, 0x0000, 0x0000, 0x0000, + 0x00e4, 0x00b5, 0x01b5, 0x0024, 0x0000, 0x01a5, 0x0000, 0x0000, + 0x0000, 0x0000, 0x0151, 0x0000, 0x00cc, 0x0000, 0x0000, 0x0150, + 0x00f3, 0x0071, 0x00d0, 0x0085, 0x0140, 0x0000, 0x00ae, 0x0000, + 0x00c4, 0x01a8, 0x0000, 0x0091, 0x0180, 0x0057, 0x0072, 0x0000, + 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x002a, 0x0000, + 0x0112, 0x003d, 0x017f, 0x0088, 0x0000, 0x0158, 0x0046, 0x0101, + 0x0000, 0x0000, 0x00ea, 0x0000, 0x0000, 0x00b2, 0x0149, 0x0000, + 0x007c, 0x0107, 0x0161, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, + 0x0000, 0x0169, 0x0000, 0x0118, 0x0091, 0x0043, 0x0064, 0x0000, + 0x0000, 0x0194, 0x0000, 0x0000, 0x00e9, 0x0000, 0x0000, 0x0000, + 0x005e, 0x0000, 0x0029, 0x0000, 0x0000, 0x0000, 0x003c, 0x0000, + 0x0000, 0x008b, 0x0000, 0x0000, 0x00fd, 0x002d, 0x0184, 0x0000, + 0x0000, 0x016a, 0x006f, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, + 0x0000, 0x0000, 0x0000, 0x0000, 0x01a0, 0x0000, 0x003b, 0x0000, + 0x006d, 0x0000, 0x016a, 0x0000, 0x01b2, 0x00c9, 0x0094, 0x0181, + 0x018b, 0x0000, 0x0199, 0x00c7, 0x017e, 0x0000, 0x0000, 0x0160, + 0x0000, 0x0000, 0x0175, 0x0000, 0x0000, 0x006a, 0x008d, 0x0000, + 0x00ed, 0x0000, 0x00b7, 0x0000, 0x0000, 0x0107, 0x00f9, 0x0000, + 0x0173, 0x0137, 0x0000, 0x0185, 0x0114, 0x006c, 0x0000, 0x0000, + 0x00f4, 0x0189, 0x0000, 0x0000, 0x0102, 0x0000, 0x00d5, 0x0000, + 0x0000, 0x015a, 0x0000, 0x00de, 0x0000, 0x0000, 0x0000, 0x0000, + 0x0000, 0x0000, 0x0000, 0x00f0, 0x00e1, 0x0000, 0x0000, 0x01b6, + 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0197, 0x0000, 0x0154, + 0x004a, 0x018d, 0x0000, 0x00c3, 0x0000, 0x0000, 0x0000, 0x0000, + 0x00c1, 0x0189, 0x001c, 0x0000, 0x00a6, 0x0000, 0x0000, 0x00a4, + 0x0000, 0x0000, 0x0000, 0x00ed, 0x0000, 0x0173, 0x0169, 0x00d2, + 0x0117, 0x0000, 0x009b, 0x0000, 0x014e, 0x0000, 0x00ac, 0x0000, + 0x008e, 0x0121, 0x0104, 0x0179, 0x0000, 0x01a5, 0x0103, 0x0000, + 0x001b, 0x0000, 0x0000, 0x01a8, 0x00ba, 0x010c, 0x0000, 0x0000, + 0x010e, 0x00ab, 0x0062, 0x0000, 0x0000, 0x0154, 0x0122, 0x013f, + 0x015f, 0x0000, 0x00f5, 0x01a9, 0x017b, 0x01a4, 0x0000, 0x0000, + 0x0040, 0x0004, 0x019b, 0x0000, 0x00e3, 0x010d, 0x015b, 0x0000, + 0x0104, 0x0000, 0x0000, 0x00b2, 0x00e8, 0x0000, 0x0000, 0x0000, + 0x0065, 0x0062, 0x007a, 0x0000, 0x0065, 0x008d, 0x0000, 0x0085, + 0x0000, 0x0000, 0x0000, 0x00af, 0x0104, 0x01ab, 0x0040, 0x00a8, + 
0x0000, 0x0000, 0x0000, 0x0159, 0x0000, 0x0000, 0x0000, 0x0000, + 0x0000, 0x0189, 0x017b, 0x0077, 0x0000, 0x00ea, 0x00d7, 0x007f, + 0x0000, 0x00ae, 0x0047, 0x0000, 0x0163, 0x0000, 0x0000, 0x0157, + 0x0178, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x014d, 0x0009, + 0x0000, 0x0000, 0x00b6, 0x0000, 0x0000, 0x0000, 0x0000, 0x0192, + 0x01b1, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, + 0x0053, 0x012b, 0x00f6, 0x0096, 0x0000, 0x0141, 0x0000, 0x0000, + 0x0026, 0x0044, 0x00ce, 0x0061, 0x0199, 0x0000, 0x016b, 0x0156, + 0x011d, 0x0000, 0x0038, 0x008c, 0x00c8, 0x0000, 0x0000, 0x0002, + 0x0000, 0x01a1, 0x0000, 0x001e, 0x0000, 0x0000, 0x00bc, 0x00ab, + 0x0000, 0x0183, 0x0085, 0x0000, 0x0000, 0x010c, 0x0000, 0x01a5, + 0x0120, 0x0000, 0x0000, 0x0000, 0x0000, 0x0135, 0x0079, 0x0000, + 0x01ae, 0x0028, 0x0000, 0x0000, 0x014a, 0x0000, 0x00dd, 0x0000, + 0x0000, 0x00d8, 0x00de, 0x0075, 0x0000, 0x0000, 0x0021, 0x0099, + 0x0000, 0x00f7, 0x0000, 0x0000, 0x0000, 0x0046, 0x0000, 0x0010, + 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0116, + 0x0000, 0x0000, 0x0000, 0x0000, 0x01a8, 0x0000, 0x0000, 0x0000, + 0x004c, 0x0000, 0x00b7, 0x0000, 0x013f, 0x003c, 0x0000, 0x0000, + 0x006d, 0x007f, 0x0181, 0x0000, 0x0013, 0x0000, 0x0180, 0x0000, + 0x0000, 0x0000, 0x0000, 0x0000, 0x0136, 0x000d, 0x0000, 0x0000, + 0x0000, 0x0082, 0x0000, 0x0000, 0x00cf, 0x00c3, 0x0000, 0x0000, + 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0127, + 0x0000, 0x0013, 0x019c, 0x0000, 0x0024, 0x00bd, 0x017e, 0x0000, + 0x00b8, 0x002e, 0x012c, 0x0007, 0x0000, 0x0000, 0x00f7, 0x0000, + 0x0048, + }; + + const unsigned char *k = key; + uint32_t a, b; + + a = b = 0x0U; + while (keylen--) { + a = a * 31 + (k[keylen]|0x20); + b = b * 37 + (k[keylen]|0x20); + } + return (g[a % 881] + g[b % 881]) % 440; + } /* * ScanKeywordLookup - see if a given word is a keyword *************** ScanKeywordLookup(const char *text, *** 41,50 **** int len, i; char word[NAMEDATALEN]; ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; len = strlen(text); /* We assume all keywords are shorter than NAMEDATALEN. */ --- 168,175 ---- int len, i; char word[NAMEDATALEN]; ! const char *kw; ! uint32 h; len = strlen(text); /* We assume all keywords are shorter than NAMEDATALEN. */ *************** ScanKeywordLookup(const char *text, *** 66,91 **** word[len] = '\0'; /* ! * Now do a binary search using plain strcmp() comparison. */ ! kw_string = keywords->kw_string; ! kw_offsets = keywords->kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (keywords->num_keywords - 1); ! while (low <= high) ! { ! const uint16 *middle; ! int difference; ! ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, word); ! if (difference == 0) ! return middle - kw_offsets; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; ! } ! return -1; } --- 191,201 ---- word[len] = '\0'; /* ! * Now do a hash search using plain strcmp() comparison. */ ! h = hash(word, len); ! kw = keywords->kw_string + keywords->kw_offsets[h]; ! if (strcmp(word, kw) == 0) ! return h; return -1; }
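A standalone illustration of the |0x20 trick discussed above (an editorial sketch, not from any posted patch; the file name is invented). OR'ing 0x20 into an ASCII byte acts as a one-instruction downcasing for hashing purposes: 'A'..'Z' map onto 'a'..'z', digits and '$' already have the 0x20 bit set and are unchanged, and '_' does change (to 0x7F) but it changes the same way at keyword-generation time and at lookup time, so the hashes still agree:

    /* illustrate_or20.c -- sketch only, not part of any posted patch */
    #include <stdio.h>

    int
    main(void)
    {
        const char *samples = "AZaz09$_";

        /* Show the |0x20 mapping for representative identifier bytes. */
        for (const char *p = samples; *p; p++)
            printf("'%c' (0x%02X) -> 0x%02X\n",
                   *p, (unsigned char) *p, (unsigned char) (*p | 0x20));
        return 0;
    }

Digits and '$' being fixed points of this mapping is what makes it safe to hash before downcasing.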
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Joerg Sonnenberger
On Sun, Jan 06, 2019 at 02:29:05PM -0500, Tom Lane wrote: > * It's too bad that the hash function doesn't have a return convention > that allows distinguishing "couldn't possibly match any keyword" from > "might match keyword 0". I imagine a lot of the zero entries in its > hashtable could be interpreted as the former, so at the cost of perhaps > a couple more if-tests we could skip work at the caller. As this patch > stands, we could only skip the strcmp() so it might not be worth the > trouble --- but if we use Joerg's |0x20 hack then we could hash before > downcasing, allowing us to skip the downcasing step if the word couldn't > possibly be a keyword. Likely that would be worth the trouble. The hash function itself doesn't have enough data in it to know whether it's a match or not. A strcmp or memcmp at the end will always be necessary if you don't know already that it is a keyword. > * We should extend the ScanKeywordList representation to include a > field holding the longest keyword length in the table, which > gen_keywordlist.pl would have no trouble providing. Then we could > skip downcasing and/or hashing for any word longer than that, replacing > the current NAMEDATALEN test, and thereby putting a tight bound on > the cost of downcasing and/or hashing. Correct, possibly even have an array for each class of keywords. > * If we do hash first, then we could replace the downcasing loop and > strcmp with an integrated loop that downcases and compares one > character at a time, removing the need for the NAMEDATALEN-sized > buffer variable. This is also an option. Assuming that the character set for keywords doesn't change (letter or _), one 64-bit bit test per input character would ensure that the |0x20 hack gives the correct result for comparing as well. Any other input would be an instant mismatch. If digits are valid keyword characters, it would be two tests. > * I think it matters to the speed of the hashing loop that the > magic multipliers are hard-coded. (Examining the assembly code > shows that, at least on my x86_64 hardware, gcc prefers to use > shift-and-add sequences here instead of multiply instructions.) Right, that's one of the reasons for choosing them. The other is that it gives decent mixing for ASCII-only input. At the moment, they are hard-coded. > So we probably can't have inlined hashing code --- I imagine the > hash generator needs the flexibility to pick different values of > those multipliers. Right now, only the initial values are randomized. Picking a different set of hash functions is possible, but something that should be done only if there is an actual need. That was what I meant by saying stronger mixing might be necessary for "annoying" keyword additions. > I envision making this work by having > gen_keywordlist.pl emit a function definition for the hash step and > include a function pointer to it in ScanKeywordList. That extra > function call will make things fractionally slower than what I have > here, but I don't think it will change the conclusions any. This > approach would also greatly alleviate the concern I had yesterday > about ecpg's c_keywords.c having a second copy of the hashing code; > what it would have is its own generated function, which isn't much > of a problem. There are two ways of dealing with it: (1) Have one big hash table with all the various keywords and a class mask stored. If there is enough overlap between the keyword tables, it can significantly reduce the amount of space needed.
In terms of code complexity, it adds one class check at the end, i.e. a bitmap test. (2) Build independent hash tables for each input class. A bit simpler to manage, but can result in a bit of wasted space. From the generator side, it doesn't matter which choice is taken. > * Given that the generator's runtime is negligible when coded in C, > I suspect that we might be able to tolerate the speed hit from translating > it to Perl, and frankly I'd much rather do that than cope with the > consequences of including C code in our build process. I'm just not fluent enough in Perl to be much help for that, but I can sit down and write a trivial Python version of it :) There are a couple of changes that are useful to have in this context, e.g. the ability to directly provide the offsets in the result table to allow dropping the index -> offset table completely. Joerg
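To spell out the single-bit-test idea Joerg describes above, here is a minimal hand-written sketch (not Joerg's actual code; all names are invented). It assumes the keyword character set is exactly [a-z_]: after c |= 0x20, every legal keyword character lands in 0x40..0x7F, so one 64-bit mask indexed by c - 0x40 classifies it, and digits would indeed need the second test he mentions, since they fall below 0x40:

    /* kw_char_test.c -- hypothetical sketch of the one-bit-test idea */
    #include <stdint.h>
    #include <stdio.h>

    /* Bits set for 'a'..'z' and for 0x7F, which is '_' | 0x20. */
    static const uint64_t kw_char_mask =
        ((((uint64_t) 1 << 26) - 1) << ('a' - 0x40)) |
        ((uint64_t) 1 << (0x7F - 0x40));

    static int
    is_kw_char(unsigned char c)
    {
        c |= 0x20;                  /* case-fold first, as in the hash */
        if (c < 0x40 || c > 0x7F)   /* digits, '$', non-ASCII: mismatch */
            return 0;
        return (kw_char_mask >> (c - 0x40)) & 1;
    }

    int
    main(void)
    {
        const char *samples = "Az_9$";

        for (const char *p = samples; *p; p++)
            printf("'%c' -> %d\n", *p, is_kw_char((unsigned char) *p));
        return 0;
    }

Any character that fails this test can be reported as an instant non-keyword, before hashing or downcasing the rest of the word.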
Joerg Sonnenberger <joerg@bec.de> writes: > On Sun, Jan 06, 2019 at 02:29:05PM -0500, Tom Lane wrote: >> So we probably can't have inlined hashing code --- I imagine the >> hash generator needs the flexibility to pick different values of >> those multipliers. > Right now, only the initial values are randomized. Picking a different > set of hash functions is possible, but something that should be done only > if there is an actual need. That was what I meant by saying stronger mixing > might be necessary for "annoying" keyword additions. Hmm. I'm still leaning towards using generated, out-of-line hash functions though, because then we could have a generator switch indicating whether to apply the |0x20 case coercion or not. (I realize that we could blow off that consideration and use a case-insensitive hash function all the time, but it seems cleaner to me not to make assumptions about how variable the hash function parameters will need to be.) > There are two ways of dealing with it: > (1) Have one big hash table with all the various keywords and a class > mask stored. If there is enough overlap between the keyword tables, it > can significantly reduce the amount of space needed. In terms of code > complexity, it adds one class check at the end, i.e. a bitmap test. No, this would be a bad idea IMO, because it makes the core, plpgsql, and ecpg keyword sets all interdependent. If you add a keyword to any one of those and forget to rebuild the other components, you've got trouble. Maybe we could make that reliable, but I don't think it's worth fooling with for hypothetical benefit. Also, it'd make the net space usage more, not less, because each of those executables/shlibs would contain copies of all the keywords for the other ones' needs. regards, tom lane
Joerg Sonnenberger <joerg@bec.de> writes: > On Sun, Jan 06, 2019 at 02:29:05PM -0500, Tom Lane wrote: >> * We should extend the ScanKeywordList representation to include a >> field holding the longest keyword length in the table, which >> gen_keywordlist.pl would have no trouble providing. Then we could >> skip downcasing and/or hashing for any word longer than that, replacing >> the current NAMEDATALEN test, and thereby putting a tight bound on >> the cost of downcasing and/or hashing. > Correct, possibly even have an array for each class of keywords. I added that change to v8 and noted a further small improvement in my test case. That probably says something about the prevalence of long identifiers in information_schema.sql ;-), but anyway we can figure it's not a net loss. I've pushed that version (v8 + max_kw_len); if the buildfarm doesn't fall over, we can move on with looking at hashing. I took a quick look through the NetBSD nbperf sources at http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/nbperf/ and I concur with your judgment that we could manage translating that into Perl, especially if we only implement the parts we need. I'm curious what further changes you've made locally, and what parameters you were using. regards, tom lane
On 1/6/19, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I've pushed that version (v8 + max_kw_len); if the buildfarm doesn't > fall over, we can move on with looking at hashing. Thank you. The API adjustment looks good, and I'm glad that splitting out the aux info led to even further cleanups. -John Naylor
I wrote: > I took a quick look through the NetBSD nbperf sources at > http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/nbperf/ > and I concur with your judgment that we could manage translating > that into Perl, especially if we only implement the parts we need. Here's an implementation of that, using the hash functions you showed upthread. The speed of the Perl script seems to be pretty acceptable; less than 100ms to handle the main SQL keyword list, on my machine. Yeah, the C version might be less than 1ms, but I don't think that we need to put up with non-Perl build tooling for that. Using the same test case as before (parsing information_schema.sql), I get runtimes around 3560 ms, a shade better than my jury-rigged prototype. Probably there's a lot to be criticized about the Perl style below; anybody feel a need to rewrite it? regards, tom lane diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index d72842e..445b99d 100644 *** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *************** *** 35,94 **** * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *text, const ScanKeywordList *keywords) { ! int len, ! i; ! char word[NAMEDATALEN]; ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! ! len = strlen(text); if (len > keywords->max_kw_len) ! return -1; /* too long to be any keyword */ ! ! /* We assume all keywords are shorter than NAMEDATALEN. */ ! Assert(len < NAMEDATALEN); /* ! * Apply an ASCII-only downcasing. We must not use tolower() since it may ! * produce the wrong translation in some locales (eg, Turkish). */ ! for (i = 0; i < len; i++) ! { ! char ch = text[i]; ! ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! word[i] = ch; ! } ! word[len] = '\0'; /* ! * Now do a binary search using plain strcmp() comparison. */ ! kw_string = keywords->kw_string; ! kw_offsets = keywords->kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (keywords->num_keywords - 1); ! while (low <= high) { ! const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, word); ! if (difference == 0) ! return middle - kw_offsets; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; } ! return -1; } --- 35,81 ---- * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *str, const ScanKeywordList *keywords) { ! size_t len; ! uint32 h; ! const char *kw; + /* + * Reject immediately if too long to be any keyword. This saves useless + * hashing and downcasing work on long strings. + */ + len = strlen(str); if (len > keywords->max_kw_len) ! return -1; /* ! * Compute the hash function. We assume it was generated to produce ! * case-insensitive results. Since it's a perfect hash, we need only ! * match to the specific keyword it identifies. */ ! h = keywords->hash(str, len); /* ! * Compare character-by-character to see if we have a match, applying an ! * ASCII-only downcasing to the input characters. We must not use ! * tolower() since it may produce the wrong translation in some locales ! * (eg, Turkish). */ ! kw = GetScanKeyword(h, keywords); ! while (*str != '\0') { ! char ch = *str++; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! if (ch != *kw++) ! return -1; } + if (*kw != '\0') + return -1; ! /* Success! */ ! 
return h; } diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h index 39efb35..a6609ee 100644 *** a/src/include/common/kwlookup.h --- b/src/include/common/kwlookup.h *************** *** 14,19 **** --- 14,22 ---- #ifndef KWLOOKUP_H #define KWLOOKUP_H + /* Hash function used by ScanKeywordLookup */ + typedef uint32 (*ScanKeywordHashFunc) (const void *key, size_t keylen); + /* * This struct contains the data needed by ScanKeywordLookup to perform a * search within a set of keywords. The contents are typically generated by *************** typedef struct ScanKeywordList *** 23,28 **** --- 26,32 ---- { const char *kw_string; /* all keywords in order, separated by \0 */ const uint16 *kw_offsets; /* offsets to the start of each keyword */ + ScanKeywordHashFunc hash; /* perfect hash function for keywords */ int num_keywords; /* number of keywords */ int max_kw_len; /* length of longest keyword */ } ScanKeywordList; diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile index b5b74a3..abfe3cc 100644 *** a/src/interfaces/ecpg/preproc/Makefile --- b/src/interfaces/ecpg/preproc/Makefile *************** preproc.y: ../../../backend/parser/gram. *** 57,63 **** # generate keyword headers c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $< ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST) $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $< --- 57,63 ---- # generate keyword headers c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords --case $< ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST) $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $< diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c index 38ddf6f..3493915 100644 *** a/src/interfaces/ecpg/preproc/c_keywords.c --- b/src/interfaces/ecpg/preproc/c_keywords.c *************** *** 9,16 **** */ #include "postgres_fe.h" - #include <ctype.h> - #include "preproc_extern.h" #include "preproc.h" --- 9,14 ---- *************** static const uint16 ScanCKeywordTokens[] *** 32,70 **** * * Returns the token value of the keyword, or -1 if no match. * ! * Do a binary search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ int ! ScanCKeywordLookup(const char *text) { ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! if (strlen(text) > ScanCKeywords.max_kw_len) ! return -1; /* too long to be any keyword */ ! kw_string = ScanCKeywords.kw_string; ! kw_offsets = ScanCKeywords.kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (ScanCKeywords.num_keywords - 1); ! while (low <= high) ! { ! const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, text); ! if (difference == 0) ! return ScanCKeywordTokens[middle - kw_offsets]; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; ! } return -1; } --- 30,63 ---- * * Returns the token value of the keyword, or -1 if no match. * ! * Do a hash search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ int ! ScanCKeywordLookup(const char *str) { ! size_t len; ! uint32 h; ! const char *kw; ! /* ! * Reject immediately if too long to be any keyword. This saves useless ! * hashing work on long strings. ! */ ! len = strlen(str); ! if (len > ScanCKeywords.max_kw_len) ! 
return -1; ! /* ! * Compute the hash function. Since it's a perfect hash, we need only ! * match to the specific keyword it identifies. ! */ ! h = ScanCKeywords_hash_func(str, len); ! kw = GetScanKeyword(h, &ScanCKeywords); ! if (strcmp(kw, str) == 0) ! return ScanCKeywordTokens[h]; return -1; } diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl index d764aff..241ba51 100644 *** a/src/tools/gen_keywordlist.pl --- b/src/tools/gen_keywordlist.pl *************** *** 14,19 **** --- 14,25 ---- # variable named according to the -v switch ("ScanKeywords" by default). # The variable is marked "static" unless the -e switch is given. # + # ScanKeywordList uses hash-based lookup, so this script also selects + # a minimal perfect hash function for the keyword set, and emits a + # static hash function that is referenced in the ScanKeywordList struct. + # The hash function is case-insensitive unless --case is specified. + # Note that case insensitivity assumes all-ASCII keywords! + # # # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group # Portions Copyright (c) 1994, Regents of the University of California *************** use Getopt::Long; *** 28,39 **** my $output_path = ''; my $extern = 0; my $varname = 'ScanKeywords'; GetOptions( ! 'output:s' => \$output_path, ! 'extern' => \$extern, ! 'varname:s' => \$varname) || usage(); my $kw_input_file = shift @ARGV || die "No input file.\n"; --- 34,47 ---- my $output_path = ''; my $extern = 0; + my $case_sensitive = 0; my $varname = 'ScanKeywords'; GetOptions( ! 'output:s' => \$output_path, ! 'extern' => \$extern, ! 'case-sensitive' => \$case_sensitive, ! 'varname:s' => \$varname) || usage(); my $kw_input_file = shift @ARGV || die "No input file.\n"; *************** while (<$kif>) *** 87,93 **** --- 95,116 ---- } } + # When being case-insensitive, insist that the input be all-lower-case. + if (!$case_sensitive) + { + foreach my $kw (@keywords) + { + die qq|The keyword "$kw" is not lower-case in $kw_input_file\n| + if ($kw ne lc $kw); + } + } + # Error out if the keyword names are not in ASCII order. + # + # While this isn't really necessary with hash-based lookup, it's still + # helpful because it provides a cheap way to reject duplicate keywords. + # Also, insisting on sorted order ensures that code that scans the keyword + # table linearly will see the keywords in a canonical order. for my $i (0..$#keywords - 1) { die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n| *************** print $kwdef "};\n\n"; *** 128,139 **** --- 151,167 ---- printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords; + # Emit the definition of the hash function. + + construct_hash_function(); + # Emit the struct that wraps all this lookup info into one variable. 
print $kwdef "static " if !$extern; printf $kwdef "const ScanKeywordList %s = {\n", $varname; printf $kwdef qq|\t%s_kw_string,\n|, $varname; printf $kwdef qq|\t%s_kw_offsets,\n|, $varname; + printf $kwdef qq|\t%s_hash_func,\n|, $varname; printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname; printf $kwdef qq|\t%d\n|, $max_len; print $kwdef "};\n\n"; *************** print $kwdef "};\n\n"; *** 141,146 **** --- 169,433 ---- printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename; + # This code constructs a minimal perfect hash function for the given + # keyword set, using an algorithm described in + # "An optimal algorithm for generating minimal perfect hash functions" + # by Czech, Havas and Majewski in Information Processing Letters, + # 43(5):256-264, October 1992. + # This implementation is loosely based on NetBSD's "nbperf", + # which was written by Joerg Sonnenberger. + + # At runtime, we'll compute two simple hash functions of the input word, + # and use them to index into a mapping table. The hash functions are just + # multiply-and-add in uint32 arithmetic, with different multipliers but + # the same initial seed. + + # Calculate a hash function as the run-time code will do. + # If we are making a case-insensitive hash function, we implement that + # by OR'ing 0x20 into each byte of the key. This correctly transforms + # upper-case ASCII into lower-case ASCII, while not changing digits or + # dollar signs. It might reduce our ability to discriminate other + # characters, but not by very much, and typically keywords wouldn't + # contain other characters anyway. + sub calc_hash + { + my ($keyword, $mult, $seed) = @_; + + my $result = $seed; + for my $c (split //, $keyword) + { + my $cn = ord($c); + $cn |= 0x20 if !$case_sensitive; + $result = ($result * $mult + $cn) % 4294967296; + } + return $result; + } + + # Attempt to construct a mapping table for the minimal perfect hash function. + # Returns a nonempty integer array if successful, else an empty array. + sub construct_hash_table + { + # Parameters of the hash functions are passed in. + my ($hash_mult1, $hash_mult2, $hash_seed) = @_; + + # This algorithm is based on a graph whose edges correspond to the + # keywords and whose vertices correspond to entries of the mapping table. + # A keyword edge links the two vertices whose indexes are the outputs of + # the two hash functions for that keyword. For K keywords, the mapping + # table must have at least 2*K+1 entries, guaranteeing that there's at + # least one unused entry. + my $nedges = scalar @keywords; # number of edges + my $nverts = 2 * $nedges + 1; # number of vertices + + # Initialize the array of edges. + my @E = (); + foreach my $kw (@keywords) + { + # Calculate hashes for this keyword. + # The hashes are immediately reduced modulo the mapping table size. + my $hash1 = calc_hash($kw, $hash_mult1, $hash_seed) % $nverts; + my $hash2 = calc_hash($kw, $hash_mult2, $hash_seed) % $nverts; + + # If the two hashes are the same for any keyword, we have to fail + # since this edge would itself form a cycle in the graph. + return () if $hash1 == $hash2; + + # Add the edge for this keyword, initially populating just + # its "left" and "right" fields, which are the hash values. + push @E, + { + left => $hash1, + right => $hash2, + l_prev => undef, + l_next => undef, + r_prev => undef, + r_next => undef + }; + } + + # Initialize the array of vertices, marking them all "unused". 
+ my @V = (); + for (my $i = 0; $i < $nverts; $i++) + { + push @V, { l_edge => undef, r_edge => undef }; + } + + # Attach each edge to a chain of the edges using its vertices. + # At completion, each vertex's l_edge field, if not undef, + # points to one edge having that vertex as left end, and the + # remaining such edges are chained through l_next/l_prev links. + # Likewise for right ends with r_edge and r_next/r_prev. + for (my $i = 0; $i < $nedges; $i++) + { + my $v = $E[$i]{left}; + $E[ $V[$v]{l_edge} ]{l_prev} = $i if (defined $V[$v]{l_edge}); + $E[$i]{l_next} = $V[$v]{l_edge}; + $V[$v]{l_edge} = $i; + + $v = $E[$i]{right}; + $E[ $V[$v]{r_edge} ]{r_prev} = $i if (defined $V[$v]{r_edge}); + $E[$i]{r_next} = $V[$v]{r_edge}; + $V[$v]{r_edge} = $i; + } + + # Now we attempt to prove the graph acyclic. + # A cycle-free graph is either empty or has some vertex of degree 1. + # Removing the edge attached to that vertex doesn't change this property, + # so doing that repeatedly will reduce the size of the graph. + # If the graph is empty at the end of the process, it was acyclic. + # We track the order of edge removal so that the next phase can process + # them in reverse order of removal. + my @output_order = (); + + for (my $i = 0; $i < $nverts; $i++) + { + my $v = $i; + # If vertex v is of degree 1 (i.e. exactly 1 edge connects to it), + # remove that edge, and then consider the edge's other vertex to see + # if it is now of degree 1. The inner loop repeats until reaching a + # vertex not of degree 1. + for (;;) + { + # If it's of degree 0, stop. + last if (!defined $V[$v]{l_edge} && !defined $V[$v]{r_edge}); + # Can't be degree 1 if both sides have edges. + last if (defined $V[$v]{l_edge} && defined $V[$v]{r_edge}); + # Check relevant side. + my ($e, $v2); + if (defined $V[$v]{l_edge}) + { + $e = $V[$v]{l_edge}; + # If there's more entries in chain, v is not of degree 1. + last if (defined $E[$e]{l_next}); + # OK, unlink e from v ... + $V[$v]{l_edge} = undef; + # ... and from its right-side vertex. + $v2 = $E[$e]{right}; + if (defined $E[$e]{r_prev}) + { + $E[ $E[$e]{r_prev} ]{r_next} = $E[$e]{r_next}; + } + else + { + $V[$v2]{r_edge} = $E[$e]{r_next}; + } + if (defined $E[$e]{r_next}) + { + $E[ $E[$e]{r_next} ]{r_prev} = $E[$e]{r_prev}; + } + } + else + { + $e = $V[$v]{r_edge}; + # If there's more entries in chain, v is not of degree 1. + last if (defined $E[$e]{r_next}); + # OK, unlink e from v ... + $V[$v]{r_edge} = undef; + # ... and from its left-side vertex. + $v2 = $E[$e]{left}; + if (defined $E[$e]{l_prev}) + { + $E[ $E[$e]{l_prev} ]{l_next} = $E[$e]{l_next}; + } + else + { + $V[$v2]{l_edge} = $E[$e]{l_next}; + } + if (defined $E[$e]{l_next}) + { + $E[ $E[$e]{l_next} ]{l_prev} = $E[$e]{l_prev}; + } + } + + # Push e onto the front of the output-order list. + unshift @output_order, $e; + # Consider v2 on next iteration of inner loop. + $v = $v2; + } + } + + # We succeeded only if all edges were removed from the graph. + return () if (scalar(@output_order) != $nedges); + + # OK, build the hash table of size $nverts. + my @hashtab = (0) x $nverts; + # We need a "visited" flag array in this step, too. + my @visited = (0) x $nverts; + + # The idea is that for any keyword, the sum of the hash table entries for + # its first and second hash values, reduced mod $nedges, is the desired + # output (i.e., the keyword number). By assigning hash table values in + # the selected edge order, we can guarantee that that's true. 
+ foreach my $e (@output_order) + { + my $l = $E[$e]{left}; + my $r = $E[$e]{right}; + if (!$visited[$l]) + { + $hashtab[$l] = ($nedges + $e - $hashtab[$r]) % $nedges; + } + else + { + die "oops, doubly used hashtab entry" if $visited[$r]; + $hashtab[$r] = ($nedges + $e - $hashtab[$l]) % $nedges; + } + $visited[$l] = 1; + $visited[$r] = 1; + } + + return @hashtab; + } + + sub construct_hash_function + { + # Try different hash function parameters until we find a set that works + # for these keywords. In principle we might need to change multipliers, + # but these two multipliers are chosen to be cheap to calculate via + # shift-and-add, so don't change them except at great need. + my $hash_mult1 = 31; + my $hash_mult2 = 37; + + # We just try successive hash seed values until we find one that works. + # (Commonly, random seeds are tried, but we want reproducible results + # from this program so we don't do that.) + my $hash_seed; + my @hashtab; + for ($hash_seed = 0;; $hash_seed++) + { + @hashtab = construct_hash_table($hash_mult1, $hash_mult2, $hash_seed); + last if @hashtab; + } + + # OK, emit the hash function definition including the hash table. + printf $kwdef "static uint32\n"; + printf $kwdef "%s_hash_func(const void *key, size_t keylen)\n{\n", + $varname; + printf $kwdef "\tstatic const uint16 h[%d] = {\n", scalar(@hashtab); + for (my $i = 0; $i < scalar(@hashtab); $i++) + { + printf $kwdef "%s0x%04x,%s", + ($i % 8 == 0 ? "\t\t" : " "), + $hashtab[$i], + ($i % 8 == 7 ? "\n" : ""); + } + printf $kwdef "\n" if (scalar(@hashtab) % 8 != 0); + printf $kwdef "\t};\n\n"; + printf $kwdef "\tconst unsigned char *k = key;\n"; + printf $kwdef "\tuint32\t\ta = %d;\n", $hash_seed; + printf $kwdef "\tuint32\t\tb = %d;\n\n", $hash_seed; + printf $kwdef "\twhile (keylen--)\n\t{\n"; + printf $kwdef "\t\tunsigned char c = *k++"; + printf $kwdef " | 0x20" if !$case_sensitive; + printf $kwdef ";\n\n"; + printf $kwdef "\t\ta = a * %d + c;\n", $hash_mult1; + printf $kwdef "\t\tb = b * %d + c;\n", $hash_mult2; + printf $kwdef "\t}\n"; + printf $kwdef "\treturn (h[a %% %d] + h[b %% %d]) %% %d;\n", + scalar(@hashtab), scalar(@hashtab), scalar(@keywords); + printf $kwdef "}\n\n"; + } + + sub usage { die <<EOM; *************** Usage: gen_keywordlist.pl [--output/-o < *** 148,153 **** --- 435,441 ---- --output Output directory (default '.') --varname Name for ScanKeywordList variable (default 'ScanKeywords') --extern Allow the ScanKeywordList variable to be globally visible + --case Keyword matching is to be case-sensitive gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList. The output filename is derived from the input file by inserting _d, diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm index 937bf18..ed603fc 100644 *** a/src/tools/msvc/Solution.pm --- b/src/tools/msvc/Solution.pm *************** sub GenerateFiles *** 440,446 **** { print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n"; chdir('src/interfaces/ecpg/preproc'); ! system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h'); system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h'); chdir('../../../..'); } --- 440,446 ---- { print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n"; chdir('src/interfaces/ecpg/preproc'); ! system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords --case c_kwlist.h'); system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h'); chdir('../../../..'); }
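To make the generator's printf output easier to picture, here is a hand-worked miniature (derived by following the script's algorithm by hand, not produced by the script itself) for the toy keyword set {"and", "not", "or"}. With multipliers 31/37, seed 0 collides on "and", seed 1 yields an acyclic graph, and the resulting 7-entry mapping table is as shown inline; the emitted function and its caller look roughly like this:

    /* toy_kwhash.c -- hand-worked miniature of the generated lookup code;
     * all constants below were derived manually for {"and", "not", "or"}
     * and nothing here is real script output. */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    static const char *const kw_list[] = {"and", "not", "or"};

    /* Mirrors the shape of the emitted <varname>_hash_func(). */
    static uint32_t
    toy_hash_func(const void *key, size_t keylen)
    {
        static const uint16_t h[7] = {0, 2, 0, 1, 0, 0, 0};
        const unsigned char *k = key;
        uint32_t    a = 1;          /* hash seed (seed 0 fails here) */
        uint32_t    b = 1;

        while (keylen--)
        {
            unsigned char c = *k++ | 0x20;  /* case-insensitive variant */

            a = a * 31 + c;         /* strength-reduces to shift-and-add */
            b = b * 37 + c;
        }
        return (h[a % 7] + h[b % 7]) % 3;
    }

    /* Caller-side check, as in the patched ScanKeywordLookup(); inputs
     * are assumed already lower-case to keep the sketch short. */
    static int
    toy_lookup(const char *word)
    {
        uint32_t    n = toy_hash_func(word, strlen(word));

        return strcmp(kw_list[n], word) == 0 ? (int) n : -1;
    }

    int
    main(void)
    {
        printf("%d %d %d %d\n",
               toy_lookup("and"), toy_lookup("not"),
               toy_lookup("or"), toy_lookup("xyzzy"));  /* 0 1 2 -1 */
        return 0;
    }

Note how the final strcmp() is what rejects non-keywords: the hash alone always lands somewhere in range under this first convention.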
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Andres Freund
Hi, On 2019-01-07 16:11:04 -0500, Tom Lane wrote: > I wrote: > > I took a quick look through the NetBSD nbperf sources at > > http://cvsweb.netbsd.org/bsdweb.cgi/src/usr.bin/nbperf/ > > and I concur with your judgment that we could manage translating > > that into Perl, especially if we only implement the parts we need. > > Here's an implementation of that, using the hash functions you showed > upthread. The speed of the Perl script seems to be pretty acceptable; > less than 100ms to handle the main SQL keyword list, on my machine. > Yeah, the C version might be less than 1ms, but I don't think that > we need to put up with non-Perl build tooling for that. > > Using the same test case as before (parsing information_schema.sql), > I get runtimes around 3560 ms, a shade better than my jury-rigged > prototype. > > Probably there's a lot to be criticized about the Perl style below; > anybody feel a need to rewrite it? Hm, shouldn't we extract the perfect hash generation into a perl module or such? It seems that there's plenty of other possible uses for it. Greetings, Andres Freund
I wrote: > Probably there's a lot to be criticized about the Perl style below; > anybody feel a need to rewrite it? Here's a somewhat better version. I realized that I was being too slavishly tied to the data structures used in the C version; in Perl it's easier to manage the lists of edges as hashes. I can't see any need to distinguish left and right edge sets, either, so this just has one such hash per vertex. Also, it seems to me that we *can* make intelligent use of unused hashtable entries to exit early on many non-keyword inputs. The reason the existing code fails to do so is that it computes the sums and differences of hashtable entries in unsigned modulo arithmetic; but if we make the hashtable entries signed, we can set them up as exact differences and drop the final modulo operation in the hash function. Then, any out-of-range sum must indicate an input that is not a keyword (because it is touching a pair of hashtable entries that didn't go together) and we can exit early from the caller. This in turn lets us mark unused hashtable entries with large values to ensure that sums involving them will be out of range. A weak spot in that argument is that it's not entirely clear how large the differences can get --- with an unlucky series of collisions, maybe they could get large enough to overflow int16? I don't think that's likely for the size of problem this script is going to encounter, so I just put in an error check for it. But it could do with closer analysis before deciding that this is a general-purpose solution. regards, tom lane diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index d72842e..9dc1fee 100644 *** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *************** *** 35,94 **** * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *text, const ScanKeywordList *keywords) { ! int len, ! i; ! char word[NAMEDATALEN]; ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! ! len = strlen(text); if (len > keywords->max_kw_len) ! return -1; /* too long to be any keyword */ ! ! /* We assume all keywords are shorter than NAMEDATALEN. */ ! Assert(len < NAMEDATALEN); /* ! * Apply an ASCII-only downcasing. We must not use tolower() since it may ! * produce the wrong translation in some locales (eg, Turkish). */ ! for (i = 0; i < len; i++) ! { ! char ch = text[i]; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! word[i] = ch; ! } ! word[len] = '\0'; /* ! * Now do a binary search using plain strcmp() comparison. */ ! kw_string = keywords->kw_string; ! kw_offsets = keywords->kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (keywords->num_keywords - 1); ! while (low <= high) { ! const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, word); ! if (difference == 0) ! return middle - kw_offsets; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; } ! return -1; } --- 35,89 ---- * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *str, const ScanKeywordList *keywords) { ! size_t len; ! int h; ! const char *kw; + /* + * Reject immediately if too long to be any keyword. This saves useless + * hashing and downcasing work on long strings. + */ + len = strlen(str); if (len > keywords->max_kw_len) ! return -1; /* ! * Compute the hash function. We assume it was generated to produce ! * case-insensitive results. Since it's a perfect hash, we need only ! 
* match to the specific keyword it identifies. */ ! h = keywords->hash(str, len); ! /* ! * An out-of-range result implies no match. (This can happen for ! * non-keyword inputs because the hash function will sum two unrelated ! * hashtable entries.) ! */ ! if (h < 0 || h >= keywords->num_keywords) ! return -1; /* ! * Compare character-by-character to see if we have a match, applying an ! * ASCII-only downcasing to the input characters. We must not use ! * tolower() since it may produce the wrong translation in some locales ! * (eg, Turkish). */ ! kw = GetScanKeyword(h, keywords); ! while (*str != '\0') { ! char ch = *str++; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! if (ch != *kw++) ! return -1; } + if (*kw != '\0') + return -1; ! /* Success! */ ! return h; } diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h index 39efb35..dbff367 100644 *** a/src/include/common/kwlookup.h --- b/src/include/common/kwlookup.h *************** *** 14,19 **** --- 14,22 ---- #ifndef KWLOOKUP_H #define KWLOOKUP_H + /* Hash function used by ScanKeywordLookup */ + typedef int (*ScanKeywordHashFunc) (const void *key, size_t keylen); + /* * This struct contains the data needed by ScanKeywordLookup to perform a * search within a set of keywords. The contents are typically generated by *************** typedef struct ScanKeywordList *** 23,28 **** --- 26,32 ---- { const char *kw_string; /* all keywords in order, separated by \0 */ const uint16 *kw_offsets; /* offsets to the start of each keyword */ + ScanKeywordHashFunc hash; /* perfect hash function for keywords */ int num_keywords; /* number of keywords */ int max_kw_len; /* length of longest keyword */ } ScanKeywordList; diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile index b5b74a3..abfe3cc 100644 *** a/src/interfaces/ecpg/preproc/Makefile --- b/src/interfaces/ecpg/preproc/Makefile *************** preproc.y: ../../../backend/parser/gram. *** 57,63 **** # generate keyword headers c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $< ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST) $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $< --- 57,63 ---- # generate keyword headers c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords --case $< ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST) $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $< diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c index 38ddf6f..387298b 100644 *** a/src/interfaces/ecpg/preproc/c_keywords.c --- b/src/interfaces/ecpg/preproc/c_keywords.c *************** *** 9,16 **** */ #include "postgres_fe.h" - #include <ctype.h> - #include "preproc_extern.h" #include "preproc.h" --- 9,14 ---- *************** static const uint16 ScanCKeywordTokens[] *** 32,70 **** * * Returns the token value of the keyword, or -1 if no match. * ! * Do a binary search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ int ! ScanCKeywordLookup(const char *text) { ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! if (strlen(text) > ScanCKeywords.max_kw_len) ! return -1; /* too long to be any keyword */ ! kw_string = ScanCKeywords.kw_string; ! kw_offsets = ScanCKeywords.kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (ScanCKeywords.num_keywords - 1); ! while (low <= high) ! { ! 
const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, text); ! if (difference == 0) ! return ScanCKeywordTokens[middle - kw_offsets]; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; ! } return -1; } --- 30,71 ---- * * Returns the token value of the keyword, or -1 if no match. * ! * Do a hash search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ int ! ScanCKeywordLookup(const char *str) { ! size_t len; ! int h; ! const char *kw; ! /* ! * Reject immediately if too long to be any keyword. This saves useless ! * hashing work on long strings. ! */ ! len = strlen(str); ! if (len > ScanCKeywords.max_kw_len) ! return -1; ! /* ! * Compute the hash function. Since it's a perfect hash, we need only ! * match to the specific keyword it identifies. ! */ ! h = ScanCKeywords_hash_func(str, len); ! /* ! * An out-of-range result implies no match. (This can happen for ! * non-keyword inputs because the hash function will sum two unrelated ! * hashtable entries.) ! */ ! if (h < 0 || h >= ScanCKeywords.num_keywords) ! return -1; ! kw = GetScanKeyword(h, &ScanCKeywords); ! ! if (strcmp(kw, str) == 0) ! return ScanCKeywordTokens[h]; return -1; } diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl index d764aff..f15afc1 100644 *** a/src/tools/gen_keywordlist.pl --- b/src/tools/gen_keywordlist.pl *************** *** 14,19 **** --- 14,25 ---- # variable named according to the -v switch ("ScanKeywords" by default). # The variable is marked "static" unless the -e switch is given. # + # ScanKeywordList uses hash-based lookup, so this script also selects + # a minimal perfect hash function for the keyword set, and emits a + # static hash function that is referenced in the ScanKeywordList struct. + # The hash function is case-insensitive unless --case is specified. + # Note that case insensitivity assumes all-ASCII keywords! + # # # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group # Portions Copyright (c) 1994, Regents of the University of California *************** use Getopt::Long; *** 28,39 **** my $output_path = ''; my $extern = 0; my $varname = 'ScanKeywords'; GetOptions( ! 'output:s' => \$output_path, ! 'extern' => \$extern, ! 'varname:s' => \$varname) || usage(); my $kw_input_file = shift @ARGV || die "No input file.\n"; --- 34,47 ---- my $output_path = ''; my $extern = 0; + my $case_sensitive = 0; my $varname = 'ScanKeywords'; GetOptions( ! 'output:s' => \$output_path, ! 'extern' => \$extern, ! 'case-sensitive' => \$case_sensitive, ! 'varname:s' => \$varname) || usage(); my $kw_input_file = shift @ARGV || die "No input file.\n"; *************** while (<$kif>) *** 87,93 **** --- 95,116 ---- } } + # When being case-insensitive, insist that the input be all-lower-case. + if (!$case_sensitive) + { + foreach my $kw (@keywords) + { + die qq|The keyword "$kw" is not lower-case in $kw_input_file\n| + if ($kw ne lc $kw); + } + } + # Error out if the keyword names are not in ASCII order. + # + # While this isn't really necessary with hash-based lookup, it's still + # helpful because it provides a cheap way to reject duplicate keywords. + # Also, insisting on sorted order ensures that code that scans the keyword + # table linearly will see the keywords in a canonical order. 
for my $i (0..$#keywords - 1) { die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n| *************** print $kwdef "};\n\n"; *** 128,139 **** --- 151,167 ---- printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords; + # Emit the definition of the hash function. + + construct_hash_function(); + # Emit the struct that wraps all this lookup info into one variable. print $kwdef "static " if !$extern; printf $kwdef "const ScanKeywordList %s = {\n", $varname; printf $kwdef qq|\t%s_kw_string,\n|, $varname; printf $kwdef qq|\t%s_kw_offsets,\n|, $varname; + printf $kwdef qq|\t%s_hash_func,\n|, $varname; printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname; printf $kwdef qq|\t%d\n|, $max_len; print $kwdef "};\n\n"; *************** print $kwdef "};\n\n"; *** 141,146 **** --- 169,392 ---- printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename; + # This code constructs a minimal perfect hash function for the given + # keyword set, using an algorithm described in + # "An optimal algorithm for generating minimal perfect hash functions" + # by Czech, Havas and Majewski in Information Processing Letters, + # 43(5):256-264, October 1992. + # This implementation is loosely based on NetBSD's "nbperf", + # which was written by Joerg Sonnenberger. + + # At runtime, we'll compute two simple hash functions of the input word, + # and use them to index into a mapping table. The hash functions are just + # multiply-and-add in uint32 arithmetic, with different multipliers but + # the same initial seed. + + # Calculate a hash function as the run-time code will do. + # If we are making a case-insensitive hash function, we implement that + # by OR'ing 0x20 into each byte of the key. This correctly transforms + # upper-case ASCII into lower-case ASCII, while not changing digits or + # dollar signs. (It does change '_', else we could just skip adjusting + # $cn here at all, for typical keyword strings.) + sub calc_hash + { + my ($keyword, $mult, $seed) = @_; + + my $result = $seed; + for my $c (split //, $keyword) + { + my $cn = ord($c); + $cn |= 0x20 if !$case_sensitive; + $result = ($result * $mult + $cn) % 4294967296; + } + return $result; + } + + # Attempt to construct a mapping table for the minimal perfect hash function. + # Returns a nonempty integer array if successful, else an empty array. + sub construct_hash_table + { + # Parameters of the hash functions are passed in. + my ($hash_mult1, $hash_mult2, $hash_seed) = @_; + + # This algorithm is based on a graph whose edges correspond to the + # keywords and whose vertices correspond to entries of the mapping table. + # A keyword edge links the two vertices whose indexes are the outputs of + # the two hash functions for that keyword. For K keywords, the mapping + # table must have at least 2*K+1 entries, guaranteeing that there's at + # least one unused entry. + my $nedges = scalar @keywords; # number of edges + my $nverts = 2 * $nedges + 1; # number of vertices + + # Initialize the array of edges. + my @E = (); + foreach my $kw (@keywords) + { + # Calculate hashes for this keyword. + # The hashes are immediately reduced modulo the mapping table size. + my $hash1 = calc_hash($kw, $hash_mult1, $hash_seed) % $nverts; + my $hash2 = calc_hash($kw, $hash_mult2, $hash_seed) % $nverts; + + # If the two hashes are the same for any keyword, we have to fail + # since this edge would itself form a cycle in the graph. + return () if $hash1 == $hash2; + + # Add the edge for this keyword. 
+ push @E, { left => $hash1, right => $hash2 }; + } + + # Initialize the array of vertices, giving them all empty lists + # of associated edges. (The lists will be hashes of edge numbers.) + my @V = (); + for (my $v = 0; $v < $nverts; $v++) + { + push @V, { edges => {} }; + } + + # Insert each edge in the lists of edges using its vertices. + for (my $e = 0; $e < $nedges; $e++) + { + my $v = $E[$e]{left}; + $V[$v]{edges}->{$e} = 1; + + $v = $E[$e]{right}; + $V[$v]{edges}->{$e} = 1; + } + + # Now we attempt to prove the graph acyclic. + # A cycle-free graph is either empty or has some vertex of degree 1. + # Removing the edge attached to that vertex doesn't change this property, + # so doing that repeatedly will reduce the size of the graph. + # If the graph is empty at the end of the process, it was acyclic. + # We track the order of edge removal so that the next phase can process + # them in reverse order of removal. + my @output_order = (); + + # Consider each vertex as a possible starting point for edge-removal. + for (my $startv = 0; $startv < $nverts; $startv++) + { + my $v = $startv; + + # If vertex v is of degree 1 (i.e. exactly 1 edge connects to it), + # remove that edge, and then consider the edge's other vertex to see + # if it is now of degree 1. The inner loop repeats until reaching a + # vertex not of degree 1. + while (scalar(keys(%{ $V[$v]{edges} })) == 1) + { + # Unlink its only edge. + my $e = (keys(%{ $V[$v]{edges} }))[0]; + delete($V[$v]{edges}->{$e}); + + # Unlink the edge from its other vertex, too. + my $v2 = $E[$e]{left}; + $v2 = $E[$e]{right} if ($v2 == $v); + delete($V[$v2]{edges}->{$e}); + + # Push e onto the front of the output-order list. + unshift @output_order, $e; + + # Consider v2 on next iteration of inner loop. + $v = $v2; + } + } + + # We succeeded only if all edges were removed from the graph. + return () if (scalar(@output_order) != $nedges); + + # OK, build the hash table of size $nverts. + my @hashtab = (0) x $nverts; + # We need a "visited" flag array in this step, too. + my @visited = (0) x $nverts; + + # The idea is that for any keyword, the sum of the hash table entries + # for its first and second hash values is the desired output (i.e., the + # keyword number). By assigning hash table values in the selected edge + # order, we can guarantee that that's true. + foreach my $e (@output_order) + { + my $l = $E[$e]{left}; + my $r = $E[$e]{right}; + if (!$visited[$l]) + { + # $hashtab[$r] might be zero, or some previously assigned value. + $hashtab[$l] = $e - $hashtab[$r]; + } + else + { + die "oops, doubly used hashtab entry" if $visited[$r]; + # $hashtab[$l] might be zero, or some previously assigned value. + $hashtab[$r] = $e - $hashtab[$l]; + } + # Now freeze both of these hashtab entries. + $visited[$l] = 1; + $visited[$r] = 1; + } + + # Check that the results fit in int16. (With very large keyword sets, we + # might need to allow wider hashtable entries; but that day is far away.) + # Then set any unused hash table entries to 0x7FFF. For reasonable + # keyword counts, that will ensure that any hash sum involving such an + # entry will be out-of-range, allowing the caller to exit early. + for (my $v = 0; $v < $nverts; $v++) + { + die "oops, hashtab entry overflow" + if $hashtab[$v] < -32767 || $hashtab[$v] > 32767; + $hashtab[$v] = 0x7FFF if !$visited[$v]; + } + + return @hashtab; + } + + sub construct_hash_function + { + # Try different hash function parameters until we find a set that works + # for these keywords. 
In principle we might need to change multipliers, + # but these two multipliers are chosen to be cheap to calculate via + # shift-and-add, so don't change them except at great need. + my $hash_mult1 = 31; + my $hash_mult2 = 37; + + # We just try successive hash seed values until we find one that works. + # (Commonly, random seeds are tried, but we want reproducible results + # from this program so we don't do that.) + my $hash_seed; + my @hashtab; + for ($hash_seed = 0;; $hash_seed++) + { + @hashtab = construct_hash_table($hash_mult1, $hash_mult2, $hash_seed); + last if @hashtab; + } + my $nhash = scalar(@hashtab); + + # OK, emit the hash function definition including the hash table. + printf $kwdef "static int\n"; + printf $kwdef "%s_hash_func(const void *key, size_t keylen)\n{\n", + $varname; + printf $kwdef "\tstatic const int16 h[%d] = {\n", $nhash; + for (my $i = 0; $i < $nhash; $i++) + { + printf $kwdef "%s%6d,%s", + ($i % 8 == 0 ? "\t\t" : " "), + $hashtab[$i], + ($i % 8 == 7 ? "\n" : ""); + } + printf $kwdef "\n" if ($nhash % 8 != 0); + printf $kwdef "\t};\n\n"; + printf $kwdef "\tconst unsigned char *k = key;\n"; + printf $kwdef "\tuint32\t\ta = %d;\n", $hash_seed; + printf $kwdef "\tuint32\t\tb = %d;\n\n", $hash_seed; + printf $kwdef "\twhile (keylen--)\n\t{\n"; + printf $kwdef "\t\tunsigned char c = *k++"; + printf $kwdef " | 0x20" if !$case_sensitive; + printf $kwdef ";\n\n"; + printf $kwdef "\t\ta = a * %d + c;\n", $hash_mult1; + printf $kwdef "\t\tb = b * %d + c;\n", $hash_mult2; + printf $kwdef "\t}\n"; + printf $kwdef "\treturn h[a %% %d] + h[b %% %d];\n", $nhash, $nhash; + printf $kwdef "}\n\n"; + } + + sub usage { die <<EOM; *************** Usage: gen_keywordlist.pl [--output/-o < *** 148,153 **** --- 394,400 ---- --output Output directory (default '.') --varname Name for ScanKeywordList variable (default 'ScanKeywords') --extern Allow the ScanKeywordList variable to be globally visible + --case Keyword matching is to be case-sensitive gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList. The output filename is derived from the input file by inserting _d, diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm index 937bf18..ed603fc 100644 *** a/src/tools/msvc/Solution.pm --- b/src/tools/msvc/Solution.pm *************** sub GenerateFiles *** 440,446 **** { print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n"; chdir('src/interfaces/ecpg/preproc'); ! system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h'); system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h'); chdir('../../../..'); } --- 440,446 ---- { print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n"; chdir('src/interfaces/ecpg/preproc'); ! system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords --case c_kwlist.h'); system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h'); chdir('../../../..'); }
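To make the runtime arithmetic concrete, here is a minimal standalone sketch of the two-hash lookup that the generated C code performs, using the multipliers from the patch; the three-keyword set and seed 0 are hypothetical, and the OR-0x20 step is the ASCII-only case fold described in the calc_hash() comment above:

    use strict;
    use warnings;

    # Hypothetical three-keyword set; the real script reads this from a
    # kwlist header.  Mapping-table size is 2*K+1, as in the patch.
    my @keywords = qw(abort add all);
    my $nverts   = 2 * @keywords + 1;

    # Same multiply-and-add scheme as calc_hash() in the patch, with the
    # OR-0x20 case fold applied to each byte.
    sub calc_hash
    {
        my ($key, $mult, $seed) = @_;
        my $h = $seed;
        $h = ($h * $mult + (ord($_) | 0x20)) % 4294967296 for split //, $key;
        return $h;
    }

    foreach my $kw (@keywords)
    {
        my $h1 = calc_hash($kw, 31, 0) % $nverts;
        my $h2 = calc_hash($kw, 37, 0) % $nverts;
        printf "%-6s -> h[%d] + h[%d]\n", $kw, $h1, $h2;
    }

For a real keyword set, construct_hash_function() keeps bumping the seed until construct_hash_table() succeeds, so the seed baked into the emitted code is simply the first workable one. Note also that, as the patch comments say, a match still has to be verified against the actual keyword string, since non-keyword inputs can hash to in-range values.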
Andres Freund <andres@anarazel.de> writes: > Hm, shouldn't we extract the perfect hash generation into a Perl module > or such? It seems there are plenty of other possible uses for it. Such as? But in any case, that sounds like a task for someone with more sense of Perl style than I have. regards, tom lane
Hi, On 2019-01-07 19:37:51 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Hm, shouldn't we extract the perfect hash generation into a Perl module > > or such? It seems there are plenty of other possible uses for it. > > Such as? Builtin functions for one, which we'd swatted down last time round due to gperf's deficiencies. But I think there's plenty more potential, e.g. it'd make sense from a performance POV to use a perfect hash function for locks on builtin objects (the hashtable for lookups therein shows up prominently in a fair number of profiles, and they are a large percentage of the acquisitions). I'm certain there's plenty more; I've not thought too much about it. > But in any case, that sounds like a task for someone with > more sense of Perl style than I have. John, any chance you could help out with that... :) Greetings, Andres Freund
On 1/7/19 7:52 PM, Andres Freund wrote: > Hi, > > On 2019-01-07 19:37:51 -0500, Tom Lane wrote: >> Andres Freund <andres@anarazel.de> writes: >>> Hm, shouldn't we extract the perfect hash generation into a Perl module >>> or such? It seems there are plenty of other possible uses for it. >> Such as? > Builtin functions for one, which we'd swatted down last time round due > to gperf's deficiencies. But I think there's plenty more potential, > e.g. it'd make sense from a performance POV to use a perfect hash > function for locks on builtin objects (the hashtable for lookups therein > shows up prominently in a fair number of profiles, and they are a large > percentage of the acquisitions). I'm certain there's plenty more; I've > not thought too much about it. Yeah, this is pretty neat. >> But in any case, that sounds like a task for someone with >> more sense of Perl style than I have. > John, any chance you could help out with that... :) If he doesn't, I will. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote: > On 1/7/19 7:52 PM, Andres Freund wrote: > > Builtin functions for one, which we'd swatted down last time round due > > to gperf's deficiencies. Do you mean the fmgr table? > >> But in any case, that sounds like a task for someone with > >> more sense of Perl style than I have. > > John, any chance you could help out with that... :) > > If he doesn't, I will. I'll take a crack at separating it into a module. I'll wait a bit in case there are any stylistic suggestions on the patch as it stands.
Hi, On 2019-01-08 13:41:16 -0500, John Naylor wrote: > On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan > <andrew.dunstan@2ndquadrant.com> wrote: > > On 1/7/19 7:52 PM, Andres Freund wrote: > > > Builtin functions for one, which we'd swatted down last time round due > > > to gperf's deficiencies. > > Do you mean the fmgr table? Not the entire fmgr table, but just the builtin oid index, generated by the following section: # Create the fmgr_builtins table, collect data for fmgr_builtin_oid_index print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n"; my %bmap; $bmap{'t'} = 'true'; $bmap{'f'} = 'false'; my @fmgr_builtin_oid_index; my $fmgr_count = 0; foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr) { print $tfh " { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }"; $fmgr_builtin_oid_index[ $s->{oid} ] = $fmgr_count++; if ($fmgr_count <= $#fmgr) { print $tfh ",\n"; } else { print $tfh "\n"; } } print $tfh "};\n"; print $tfh qq| const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)); |; The generated fmgr_builtin_oid_index is pretty sparse, and a denser hashtable might, e.g., be more efficient from a cache perspective. Greetings, Andres Freund
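To see the sparseness claim in miniature, here is a small sketch in the style of the loop above; the three OIDs are made up, but the effect is the same as in the real table, where only a fraction of the slots below FirstGenbkiObjectId is occupied:

    use strict;
    use warnings;

    # Three hypothetical builtin OIDs; the real list has thousands,
    # spread over the whole OID range below FirstGenbkiObjectId.
    my @fmgr = ({ oid => 31 }, { oid => 33 }, { oid => 77 });

    my @oid_index;
    my $count = 0;
    $oid_index[ $_->{oid} ] = $count++ for @fmgr;

    # Every OID with no builtin function leaves an undef hole in the array.
    my $holes = grep { !defined } @oid_index;
    printf "%d of %d slots unused\n", $holes, scalar @oid_index;    # 75 of 78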
John Naylor <john.naylor@2ndquadrant.com> writes: > On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan > <andrew.dunstan@2ndquadrant.com> wrote: >> If he doesn't I will. > I'll take a crack at separating into a module. I'll wait a bit in > case there are any stylistic suggestions on the patch as it stands. I had a go at that myself. I'm sure there's plenty to criticize in the result, but at least it passes make check-world ;-) I resolved the worry I had last night about the range of table values by putting in logic to check the range and choose a suitable table element type. There are a couple of existing calls where we manage to fit the hashtable elements into int8 that way; of course, by definition that doesn't save a whole lot of space since such tables couldn't have many elements, but it seems cleaner anyway. regards, tom lane diff --git a/src/common/Makefile b/src/common/Makefile index 317b071..d0c2b97 100644 *** a/src/common/Makefile --- b/src/common/Makefile *************** OBJS_FRONTEND = $(OBJS_COMMON) fe_memuti *** 63,68 **** --- 63,73 ---- OBJS_SHLIB = $(OBJS_FRONTEND:%.o=%_shlib.o) OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o) + # where to find gen_keywordlist.pl and subsidiary files + TOOLSDIR = $(top_srcdir)/src/tools + GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl + GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm + all: libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a distprep: kwlist_d.h *************** libpgcommon_srv.a: $(OBJS_SRV) *** 118,125 **** $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ # generate SQL keyword lookup table to be included into keywords*.o. ! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(top_srcdir)/src/tools/gen_keywordlist.pl ! $(PERL) $(top_srcdir)/src/tools/gen_keywordlist.pl --extern $< # Dependencies of keywords*.o need to be managed explicitly to make sure # that you don't get broken parsing code, even in a non-enable-depend build. --- 123,130 ---- $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ # generate SQL keyword lookup table to be included into keywords*.o. ! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --extern $< # Dependencies of keywords*.o need to be managed explicitly to make sure # that you don't get broken parsing code, even in a non-enable-depend build. diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index d72842e..9dc1fee 100644 *** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *************** *** 35,94 **** * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *text, const ScanKeywordList *keywords) { ! int len, ! i; ! char word[NAMEDATALEN]; ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! ! len = strlen(text); if (len > keywords->max_kw_len) ! return -1; /* too long to be any keyword */ ! ! /* We assume all keywords are shorter than NAMEDATALEN. */ ! Assert(len < NAMEDATALEN); /* ! * Apply an ASCII-only downcasing. We must not use tolower() since it may ! * produce the wrong translation in some locales (eg, Turkish). */ ! for (i = 0; i < len; i++) ! { ! char ch = text[i]; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! word[i] = ch; ! } ! word[len] = '\0'; /* ! * Now do a binary search using plain strcmp() comparison. */ ! kw_string = keywords->kw_string; ! kw_offsets = keywords->kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (keywords->num_keywords - 1); ! 
while (low <= high) { ! const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, word); ! if (difference == 0) ! return middle - kw_offsets; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; } ! return -1; } --- 35,89 ---- * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *str, const ScanKeywordList *keywords) { ! size_t len; ! int h; ! const char *kw; + /* + * Reject immediately if too long to be any keyword. This saves useless + * hashing and downcasing work on long strings. + */ + len = strlen(str); if (len > keywords->max_kw_len) ! return -1; /* ! * Compute the hash function. We assume it was generated to produce ! * case-insensitive results. Since it's a perfect hash, we need only ! * match to the specific keyword it identifies. */ ! h = keywords->hash(str, len); ! /* ! * An out-of-range result implies no match. (This can happen for ! * non-keyword inputs because the hash function will sum two unrelated ! * hashtable entries.) ! */ ! if (h < 0 || h >= keywords->num_keywords) ! return -1; /* ! * Compare character-by-character to see if we have a match, applying an ! * ASCII-only downcasing to the input characters. We must not use ! * tolower() since it may produce the wrong translation in some locales ! * (eg, Turkish). */ ! kw = GetScanKeyword(h, keywords); ! while (*str != '\0') { ! char ch = *str++; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! if (ch != *kw++) ! return -1; } + if (*kw != '\0') + return -1; ! /* Success! */ ! return h; } diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h index 39efb35..dbff367 100644 *** a/src/include/common/kwlookup.h --- b/src/include/common/kwlookup.h *************** *** 14,19 **** --- 14,22 ---- #ifndef KWLOOKUP_H #define KWLOOKUP_H + /* Hash function used by ScanKeywordLookup */ + typedef int (*ScanKeywordHashFunc) (const void *key, size_t keylen); + /* * This struct contains the data needed by ScanKeywordLookup to perform a * search within a set of keywords. The contents are typically generated by *************** typedef struct ScanKeywordList *** 23,28 **** --- 26,32 ---- { const char *kw_string; /* all keywords in order, separated by \0 */ const uint16 *kw_offsets; /* offsets to the start of each keyword */ + ScanKeywordHashFunc hash; /* perfect hash function for keywords */ int num_keywords; /* number of keywords */ int max_kw_len; /* length of longest keyword */ } ScanKeywordList; diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile index b5b74a3..6c02f97 100644 *** a/src/interfaces/ecpg/preproc/Makefile --- b/src/interfaces/ecpg/preproc/Makefile *************** OBJS= preproc.o pgc.o type.o ecpg.o outp *** 28,34 **** keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \ $(WIN32RES) ! GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl # Suppress parallel build to avoid a bug in GNU make 3.82 # (see comments in ../Makefile) --- 28,37 ---- keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \ $(WIN32RES) ! # where to find gen_keywordlist.pl and subsidiary files ! TOOLSDIR = $(top_srcdir)/src/tools ! GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl ! GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm # Suppress parallel build to avoid a bug in GNU make 3.82 # (see comments in ../Makefile) *************** preproc.y: ../../../backend/parser/gram. 
*** 56,66 **** $(PERL) $(srcdir)/check_rules.pl $(srcdir) $< # generate keyword headers ! c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $< ! ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $< # Force these dependencies to be known even without dependency info built: ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h --- 59,69 ---- $(PERL) $(srcdir)/check_rules.pl $(srcdir) $< # generate keyword headers ! c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --varname ScanCKeywords --case $< ! ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $< # Force these dependencies to be known even without dependency info built: ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c index 38ddf6f..387298b 100644 *** a/src/interfaces/ecpg/preproc/c_keywords.c --- b/src/interfaces/ecpg/preproc/c_keywords.c *************** *** 9,16 **** */ #include "postgres_fe.h" - #include <ctype.h> - #include "preproc_extern.h" #include "preproc.h" --- 9,14 ---- *************** static const uint16 ScanCKeywordTokens[] *** 32,70 **** * * Returns the token value of the keyword, or -1 if no match. * ! * Do a binary search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ int ! ScanCKeywordLookup(const char *text) { ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! if (strlen(text) > ScanCKeywords.max_kw_len) ! return -1; /* too long to be any keyword */ ! kw_string = ScanCKeywords.kw_string; ! kw_offsets = ScanCKeywords.kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (ScanCKeywords.num_keywords - 1); ! while (low <= high) ! { ! const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, text); ! if (difference == 0) ! return ScanCKeywordTokens[middle - kw_offsets]; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; ! } return -1; } --- 30,71 ---- * * Returns the token value of the keyword, or -1 if no match. * ! * Do a hash search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ int ! ScanCKeywordLookup(const char *str) { ! size_t len; ! int h; ! const char *kw; ! /* ! * Reject immediately if too long to be any keyword. This saves useless ! * hashing work on long strings. ! */ ! len = strlen(str); ! if (len > ScanCKeywords.max_kw_len) ! return -1; ! /* ! * Compute the hash function. Since it's a perfect hash, we need only ! * match to the specific keyword it identifies. ! */ ! h = ScanCKeywords_hash_func(str, len); ! /* ! * An out-of-range result implies no match. (This can happen for ! * non-keyword inputs because the hash function will sum two unrelated ! * hashtable entries.) ! */ ! if (h < 0 || h >= ScanCKeywords.num_keywords) ! return -1; ! kw = GetScanKeyword(h, &ScanCKeywords); ! ! if (strcmp(kw, str) == 0) ! 
return ScanCKeywordTokens[h]; return -1; } diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile index 9dd4a74..8a0f294 100644 *** a/src/pl/plpgsql/src/Makefile --- b/src/pl/plpgsql/src/Makefile *************** REGRESS_OPTS = --dbname=$(PL_TESTDB) *** 29,35 **** REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \ plpgsql_cache plpgsql_transaction plpgsql_varprops ! GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl all: all-lib --- 29,38 ---- REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \ plpgsql_cache plpgsql_transaction plpgsql_varprops ! # where to find gen_keywordlist.pl and subsidiary files ! TOOLSDIR = $(top_srcdir)/src/tools ! GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl ! GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm all: all-lib *************** plerrcodes.h: $(top_srcdir)/src/backend/ *** 76,86 **** $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@ # generate keyword headers for the scanner ! pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $< ! pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $< check: submake --- 79,89 ---- $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@ # generate keyword headers for the scanner ! pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $< ! pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $< check: submake diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm index ...34d55cf . *** a/src/tools/PerfectHash.pm --- b/src/tools/PerfectHash.pm *************** *** 0 **** --- 1,336 ---- + #---------------------------------------------------------------------- + # + # PerfectHash.pm + # Perl module that constructs minimal perfect hash functions + # + # This code constructs a minimal perfect hash function for the given + # set of keys, using an algorithm described in + # "An optimal algorithm for generating minimal perfect hash functions" + # by Czech, Havas and Majewski in Information Processing Letters, + # 43(5):256-264, October 1992. + # This implementation is loosely based on NetBSD's "nbperf", + # which was written by Joerg Sonnenberger. + # + # The resulting hash function is perfect in the sense that if the presented + # key is one of the original set, it will return the key's index in the set + # (in range 0..N-1). However, the caller must still verify the match, + # as false positives are possible. Also, the hash function may return + # values that are out of range (negative, or >= N). This indicates that + # the presented key is definitely not in the set. + # + # + # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + # Portions Copyright (c) 1994, Regents of the University of California + # + # src/tools/PerfectHash.pm + # + #---------------------------------------------------------------------- + + package PerfectHash; + + use strict; + use warnings; + use Exporter 'import'; + + our @EXPORT_OK = qw( + generate_hash_function + ); + + + # At runtime, we'll compute two simple hash functions of the input key, + # and use them to index into a mapping table. The hash functions are just + # multiply-and-add in uint32 arithmetic, with different multipliers but + # the same initial seed. 
All the complexity in this module is concerned + # with selecting hash parameters that will work and building the mapping + # table. + + # We support making case-insensitive hash functions, though this only + # works for a strict-ASCII interpretation of case insensitivity, + # ie, A-Z maps onto a-z and nothing else. + my $case_insensitive = 0; + + + # + # Construct a C function implementing a perfect hash for the given keys. + # The C function definition is returned as a string. + # + # The keys can be any set of Perl strings; it is caller's responsibility + # that there not be any duplicates. (Note that the "strings" can be + # binary data, but endianness is the caller's problem.) + # + # The name to use for the function is caller-specified, but its signature + # is always "int f(const void *key, size_t keylen)". The caller may + # prepend "static " to the result string if it wants a static function. + # + # If $ci is true, the function is case-insensitive, for the limited idea + # of case-insensitivity explained above. + # + sub generate_hash_function + { + my ($keys_ref, $funcname, $ci) = @_; + + # It's not worth passing this around as a parameter; just use a global. + $case_insensitive = $ci; + + # Try different hash function parameters until we find a set that works + # for these keys. In principle we might need to change multipliers, + # but these two multipliers are chosen to be cheap to calculate via + # shift-and-add, so don't change them except at great need. + my $hash_mult1 = 31; + my $hash_mult2 = 37; + + # We just try successive hash seed values until we find one that works. + # (Commonly, random seeds are tried, but we want reproducible results + # from this program so we don't do that.) + my $hash_seed; + my @subresult; + for ($hash_seed = 0;; $hash_seed++) + { + @subresult = + _construct_hash_table($keys_ref, $hash_mult1, $hash_mult2, + $hash_seed); + last if @subresult; + } + + # Extract info from the function result array. + my $elemtype = $subresult[0]; + my @hashtab = @{ $subresult[1] }; + my $nhash = scalar(@hashtab); + + # OK, construct the hash function definition including the hash table. + my $f = ''; + $f .= sprintf "int\n"; + $f .= sprintf "%s(const void *key, size_t keylen)\n{\n", $funcname; + $f .= sprintf "\tstatic const %s h[%d] = {\n", $elemtype, $nhash; + for (my $i = 0; $i < $nhash; $i++) + { + $f .= sprintf "%s%6d,%s", + ($i % 8 == 0 ? "\t\t" : " "), + $hashtab[$i], + ($i % 8 == 7 ? "\n" : ""); + } + $f .= sprintf "\n" if ($nhash % 8 != 0); + $f .= sprintf "\t};\n\n"; + $f .= sprintf "\tconst unsigned char *k = key;\n"; + $f .= sprintf "\tuint32\t\ta = %d;\n", $hash_seed; + $f .= sprintf "\tuint32\t\tb = %d;\n\n", $hash_seed; + $f .= sprintf "\twhile (keylen--)\n\t{\n"; + $f .= sprintf "\t\tunsigned char c = *k++"; + $f .= sprintf " | 0x20" if $case_insensitive; # see comment below + $f .= sprintf ";\n\n"; + $f .= sprintf "\t\ta = a * %d + c;\n", $hash_mult1; + $f .= sprintf "\t\tb = b * %d + c;\n", $hash_mult2; + $f .= sprintf "\t}\n"; + $f .= sprintf "\treturn h[a %% %d] + h[b %% %d];\n", $nhash, $nhash; + $f .= sprintf "}\n"; + + return $f; + } + + + # Calculate a hash function as the run-time code will do. + # + # If we are making a case-insensitive hash function, we implement that + # by OR'ing 0x20 into each byte of the key. This correctly transforms + # upper-case ASCII into lower-case ASCII, while not changing digits or + # dollar signs. (It does change '_', else we could just skip adjusting + # $cn here at all, for typical keyword strings.) 
+ sub _calc_hash + { + my ($key, $mult, $seed) = @_; + + my $result = $seed; + for my $c (split //, $key) + { + my $cn = ord($c); + $cn |= 0x20 if $case_insensitive; + $result = ($result * $mult + $cn) % 4294967296; + } + return $result; + } + + + # Attempt to construct a mapping table for a minimal perfect hash function + # for the given keys, using the specified hash parameters. + # + # Returns an array containing the mapping table element type name as the + # first element, and a ref to an array of the table values as the second. + # + # Returns an empty array on failure; then caller should choose different + # hash parameter(s) and try again. + sub _construct_hash_table + { + my ($keys_ref, $hash_mult1, $hash_mult2, $hash_seed) = @_; + my @keys = @{$keys_ref}; + + # This algorithm is based on a graph whose edges correspond to the + # keys and whose vertices correspond to entries of the mapping table. + # A key edge links the two vertices whose indexes are the outputs of + # the two hash functions for that key. For K keys, the mapping + # table must have at least 2*K+1 entries, guaranteeing that there's at + # least one unused entry. (In principle, larger mapping tables make it + # easier to find a workable hash and increase the number of inputs that + # can be rejected due to touching unused hashtable entries. In practice, + # neither effect seems strong enough to justify using a larger table.) + my $nedges = scalar @keys; # number of edges + my $nverts = 2 * $nedges + 1; # number of vertices + + # Initialize the array of edges. + my @E = (); + foreach my $kw (@keys) + { + # Calculate hashes for this key. + # The hashes are immediately reduced modulo the mapping table size. + my $hash1 = _calc_hash($kw, $hash_mult1, $hash_seed) % $nverts; + my $hash2 = _calc_hash($kw, $hash_mult2, $hash_seed) % $nverts; + + # If the two hashes are the same for any key, we have to fail + # since this edge would itself form a cycle in the graph. + return () if $hash1 == $hash2; + + # Add the edge for this key. + push @E, { left => $hash1, right => $hash2 }; + } + + # Initialize the array of vertices, giving them all empty lists + # of associated edges. (The lists will be hashes of edge numbers.) + my @V = (); + for (my $v = 0; $v < $nverts; $v++) + { + push @V, { edges => {} }; + } + + # Insert each edge in the lists of edges using its vertices. + for (my $e = 0; $e < $nedges; $e++) + { + my $v = $E[$e]{left}; + $V[$v]{edges}->{$e} = 1; + + $v = $E[$e]{right}; + $V[$v]{edges}->{$e} = 1; + } + + # Now we attempt to prove the graph acyclic. + # A cycle-free graph is either empty or has some vertex of degree 1. + # Removing the edge attached to that vertex doesn't change this property, + # so doing that repeatedly will reduce the size of the graph. + # If the graph is empty at the end of the process, it was acyclic. + # We track the order of edge removal so that the next phase can process + # them in reverse order of removal. + my @output_order = (); + + # Consider each vertex as a possible starting point for edge-removal. + for (my $startv = 0; $startv < $nverts; $startv++) + { + my $v = $startv; + + # If vertex v is of degree 1 (i.e. exactly 1 edge connects to it), + # remove that edge, and then consider the edge's other vertex to see + # if it is now of degree 1. The inner loop repeats until reaching a + # vertex not of degree 1. + while (scalar(keys(%{ $V[$v]{edges} })) == 1) + { + # Unlink its only edge. 
+ my $e = (keys(%{ $V[$v]{edges} }))[0]; + delete($V[$v]{edges}->{$e}); + + # Unlink the edge from its other vertex, too. + my $v2 = $E[$e]{left}; + $v2 = $E[$e]{right} if ($v2 == $v); + delete($V[$v2]{edges}->{$e}); + + # Push e onto the front of the output-order list. + unshift @output_order, $e; + + # Consider v2 on next iteration of inner loop. + $v = $v2; + } + } + + # We succeeded only if all edges were removed from the graph. + return () if (scalar(@output_order) != $nedges); + + # OK, build the hash table of size $nverts. + my @hashtab = (0) x $nverts; + # We need a "visited" flag array in this step, too. + my @visited = (0) x $nverts; + + # The idea is that for any key, the sum of the hash table entries + # for its first and second hash values is the desired output (i.e., the + # key number). By assigning hash table values in the selected edge + # order, we can guarantee that that's true. + foreach my $e (@output_order) + { + my $l = $E[$e]{left}; + my $r = $E[$e]{right}; + if (!$visited[$l]) + { + # $hashtab[$r] might be zero, or some previously assigned value. + $hashtab[$l] = $e - $hashtab[$r]; + } + else + { + die "oops, doubly used hashtab entry" if $visited[$r]; + # $hashtab[$l] might be zero, or some previously assigned value. + $hashtab[$r] = $e - $hashtab[$l]; + } + # Now freeze both of these hashtab entries. + $visited[$l] = 1; + $visited[$r] = 1; + } + + # Detect range of values needed in hash table. + my $hmin = $nedges; + my $hmax = 0; + for (my $v = 0; $v < $nverts; $v++) + { + $hmin = $hashtab[$v] if $hashtab[$v] < $hmin; + $hmax = $hashtab[$v] if $hashtab[$v] > $hmax; + } + + # Choose width of hashtable entries. In addition to the actual values, + # we need to be able to store a flag for unused entries, and we wish to + # have the property that adding any other entry value to the flag gives + # an out-of-range result (>= $nedges). + my $elemtype; + my $unused_flag; + + if ( $hmin >= -0x7F + && $hmax <= 0x7F + && $hmin + 0x7F >= $nedges) + { + # int8 will work + $elemtype = 'int8'; + $unused_flag = 0x7F; + } + elsif ($hmin >= -0x7FFF + && $hmax <= 0x7FFF + && $hmin + 0x7FFF >= $nedges) + { + # int16 will work + $elemtype = 'int16'; + $unused_flag = 0x7FFF; + } + elsif ($hmin >= -0x7FFFFFFF + && $hmax <= 0x7FFFFFFF + && $hmin + 0x3FFFFFFF >= $nedges) + { + # int32 will work + $elemtype = 'int32'; + $unused_flag = 0x3FFFFFFF; + } + else + { + die "hash table values too wide"; + } + + # Set any unvisited hashtable entries to $unused_flag. + for (my $v = 0; $v < $nverts; $v++) + { + $hashtab[$v] = $unused_flag if !$visited[$v]; + } + + return ($elemtype, \@hashtab); + } + + 1; diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl index d764aff..e912c3e 100644 *** a/src/tools/gen_keywordlist.pl --- b/src/tools/gen_keywordlist.pl *************** *** 14,19 **** --- 14,25 ---- # variable named according to the -v switch ("ScanKeywords" by default). # The variable is marked "static" unless the -e switch is given. # + # ScanKeywordList uses hash-based lookup, so this script also selects + # a minimal perfect hash function for the keyword set, and emits a + # static hash function that is referenced in the ScanKeywordList struct. + # The hash function is case-insensitive unless --case is specified. + # Note that case insensitivity assumes all-ASCII keywords! 
+ # # # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group # Portions Copyright (c) 1994, Regents of the University of California *************** *** 25,39 **** use strict; use warnings; use Getopt::Long; my $output_path = ''; my $extern = 0; my $varname = 'ScanKeywords'; GetOptions( ! 'output:s' => \$output_path, ! 'extern' => \$extern, ! 'varname:s' => \$varname) || usage(); my $kw_input_file = shift @ARGV || die "No input file.\n"; --- 31,48 ---- use strict; use warnings; use Getopt::Long; + use PerfectHash; my $output_path = ''; my $extern = 0; + my $case_sensitive = 0; my $varname = 'ScanKeywords'; GetOptions( ! 'output:s' => \$output_path, ! 'extern' => \$extern, ! 'case-sensitive' => \$case_sensitive, ! 'varname:s' => \$varname) || usage(); my $kw_input_file = shift @ARGV || die "No input file.\n"; *************** while (<$kif>) *** 87,93 **** --- 96,117 ---- } } + # When being case-insensitive, insist that the input be all-lower-case. + if (!$case_sensitive) + { + foreach my $kw (@keywords) + { + die qq|The keyword "$kw" is not lower-case in $kw_input_file\n| + if ($kw ne lc $kw); + } + } + # Error out if the keyword names are not in ASCII order. + # + # While this isn't really necessary with hash-based lookup, it's still + # helpful because it provides a cheap way to reject duplicate keywords. + # Also, insisting on sorted order ensures that code that scans the keyword + # table linearly will see the keywords in a canonical order. for my $i (0..$#keywords - 1) { die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n| *************** print $kwdef "};\n\n"; *** 128,142 **** printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords; # Emit the struct that wraps all this lookup info into one variable. ! print $kwdef "static " if !$extern; printf $kwdef "const ScanKeywordList %s = {\n", $varname; printf $kwdef qq|\t%s_kw_string,\n|, $varname; printf $kwdef qq|\t%s_kw_offsets,\n|, $varname; printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname; printf $kwdef qq|\t%d\n|, $max_len; ! print $kwdef "};\n\n"; printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename; --- 152,176 ---- printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords; + # Emit the definition of the hash function. + + my $funcname = $varname . "_hash_func"; + + my $f = PerfectHash::generate_hash_function(\@keywords, + $funcname, !$case_sensitive); + + printf $kwdef qq|static %s\n|, $f; + # Emit the struct that wraps all this lookup info into one variable. ! printf $kwdef "static " if !$extern; printf $kwdef "const ScanKeywordList %s = {\n", $varname; printf $kwdef qq|\t%s_kw_string,\n|, $varname; printf $kwdef qq|\t%s_kw_offsets,\n|, $varname; + printf $kwdef qq|\t%s,\n|, $funcname; printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname; printf $kwdef qq|\t%d\n|, $max_len; ! printf $kwdef "};\n\n"; printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename; *************** Usage: gen_keywordlist.pl [--output/-o < *** 148,153 **** --- 182,188 ---- --output Output directory (default '.') --varname Name for ScanKeywordList variable (default 'ScanKeywords') --extern Allow the ScanKeywordList variable to be globally visible + --case Keyword matching is to be case-sensitive gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList. 
The output filename is derived from the input file by inserting _d, diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm index 937bf18..8f54e45 100644 *** a/src/tools/msvc/Solution.pm --- b/src/tools/msvc/Solution.pm *************** sub GenerateFiles *** 414,420 **** 'src/include/parser/kwlist.h')) { print "Generating kwlist_d.h...\n"; ! system('perl src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h'); } if (IsNewer( --- 414,420 ---- 'src/include/parser/kwlist.h')) { print "Generating kwlist_d.h...\n"; ! system('perl -I src/tools src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h'); } if (IsNewer( *************** sub GenerateFiles *** 426,433 **** { print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n"; chdir('src/pl/plpgsql/src'); ! system('perl ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h'); ! system('perl ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h'); chdir('../../../..'); } --- 426,433 ---- { print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n"; chdir('src/pl/plpgsql/src'); ! system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h'); ! system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h'); chdir('../../../..'); } *************** sub GenerateFiles *** 440,447 **** { print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n"; chdir('src/interfaces/ecpg/preproc'); ! system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h'); ! system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h'); chdir('../../../..'); } --- 440,447 ---- { print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n"; chdir('src/interfaces/ecpg/preproc'); ! system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ScanCKeywords --case c_kwlist.h'); ! system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h'); chdir('../../../..'); }
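The element-type selection near the end of _construct_hash_table() can be illustrated in isolation. The numbers below are hypothetical; the point is that the chosen width must hold both the table's value range and an unused-entry flag whose sum with any legal entry lands at or beyond the number of keys:

    use strict;
    use warnings;

    # Hypothetical table statistics: most negative entry, most positive
    # entry, and number of keys.
    my ($hmin, $hmax, $nedges) = (-50, 90, 100);

    # int8 fails here: although the values fit, -50 + 0x7F = 77 < 100, so
    # an unused-entry flag of 0x7F could not force every sum out of range.
    my $elemtype =
        ($hmin >= -0x7F   && $hmax <= 0x7F   && $hmin + 0x7F >= $nedges)   ? 'int8'
      : ($hmin >= -0x7FFF && $hmax <= 0x7FFF && $hmin + 0x7FFF >= $nedges) ? 'int16'
      :                                                                      'int32';

    print "$elemtype\n";    # -> int16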
Andres Freund <andres@anarazel.de> writes: > On 2019-01-08 13:41:16 -0500, John Naylor wrote: >> Do you mean the fmgr table? > Not the entire fmgr table, but just the builtin oid index, generated by > the following section: > ... > The generated fmgr_builtin_oid_index is pretty sparse, and a more dense > hashtable might e.g. more efficient from a cache perspective. I experimented with this, but TBH I think it's a dead loss. We currently have 2768 built-in functions, so the perfect hash table requires 5537 int16 entries, which is not *that* much less than the 10000 entries that are in fmgr_builtin_oid_index presently. When you consider the extra cycles needed to do the hashing, and the fact that you have to touch (usually) two cache lines not one in the lookup table, it's hard to see how this could net out as a win performance-wise. Also, I fail to understand why fmgr_builtin_oid_index has 10000 entries anyway. We could easily have fmgrtab.c expose the last actually assigned builtin function OID (presently 6121) and make the index array only that big, which just about eliminates the space advantage completely. BTW, I found out while trying this that Joerg's fear of the hash multipliers being too simplistic is valid: the perfect hash generator failed until I changed them. I picked a larger value that should be just as easy to use for shift-and-add purposes. regards, tom lane diff --git a/src/backend/utils/Gen_fmgrtab.pl b/src/backend/utils/Gen_fmgrtab.pl index cafe408..9fceb60 100644 *** a/src/backend/utils/Gen_fmgrtab.pl --- b/src/backend/utils/Gen_fmgrtab.pl *************** use Catalog; *** 18,23 **** --- 18,24 ---- use strict; use warnings; + use PerfectHash; # Collect arguments my @input_files; *************** foreach my $s (sort { $a->{oid} <=> $b-> *** 219,237 **** print $pfh "extern Datum $s->{prosrc}(PG_FUNCTION_ARGS);\n"; } ! # Create the fmgr_builtins table, collect data for fmgr_builtin_oid_index print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n"; my %bmap; $bmap{'t'} = 'true'; $bmap{'f'} = 'false'; ! my @fmgr_builtin_oid_index; my $fmgr_count = 0; foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr) { print $tfh " { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }"; ! $fmgr_builtin_oid_index[ $s->{oid} ] = $fmgr_count++; if ($fmgr_count <= $#fmgr) { --- 220,244 ---- print $pfh "extern Datum $s->{prosrc}(PG_FUNCTION_ARGS);\n"; } ! # Create the fmgr_builtins table, collect data for hash function print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n"; my %bmap; $bmap{'t'} = 'true'; $bmap{'f'} = 'false'; ! my @fmgr_builtin_oids; ! my $prev_oid = 0; my $fmgr_count = 0; foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr) { print $tfh " { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }"; ! die "duplicate OIDs" if $s->{oid} <= $prev_oid; ! $prev_oid = $s->{oid}; ! ! push @fmgr_builtin_oids, pack("n", $s->{oid}); ! ! $fmgr_count++; if ($fmgr_count <= $#fmgr) { *************** print $tfh "};\n"; *** 246,283 **** print $tfh qq| const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)); - |; - - # Create fmgr_builtins_oid_index table. - # - # Note that the array has to be filled up to FirstGenbkiObjectId, - # as we can't rely on zero initialization as 0 is a valid mapping. - print $tfh qq| - const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId] = { |; - for (my $i = 0; $i < $FirstGenbkiObjectId; $i++) - { - my $oid = $fmgr_builtin_oid_index[$i]; ! 
# fmgr_builtin_oid_index is sparse, map nonexistant functions to ! # InvalidOidBuiltinMapping ! if (not defined $oid) ! { ! $oid = 'InvalidOidBuiltinMapping'; ! } ! if ($i + 1 == $FirstGenbkiObjectId) ! { ! print $tfh " $oid\n"; ! } ! else ! { ! print $tfh " $oid,\n"; ! } ! } ! print $tfh "};\n"; # And add the file footers. --- 253,267 ---- print $tfh qq| const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)); |; ! # Create perfect hash function for searching fmgr_builtin by OID. ! print $tfh PerfectHash::generate_hash_function(\@fmgr_builtin_oids, ! "fmgr_builtin_oid_hash", ! 0); # And add the file footers. diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c index b41649f..ad93032 100644 *** a/src/backend/utils/fmgr/fmgr.c --- b/src/backend/utils/fmgr/fmgr.c *************** extern Datum fmgr_security_definer(PG_FU *** 72,92 **** static const FmgrBuiltin * fmgr_isbuiltin(Oid id) { ! uint16 index; /* fast lookup only possible if original oid still assigned */ if (id >= FirstGenbkiObjectId) return NULL; /* ! * Lookup function data. If there's a miss in that range it's likely a ! * nonexistant function, returning NULL here will trigger an ERROR later. */ ! index = fmgr_builtin_oid_index[id]; ! if (index == InvalidOidBuiltinMapping) return NULL; ! return &fmgr_builtins[index]; } /* --- 72,103 ---- static const FmgrBuiltin * fmgr_isbuiltin(Oid id) { ! const FmgrBuiltin *result; ! uint16 hashkey; ! int index; /* fast lookup only possible if original oid still assigned */ if (id >= FirstGenbkiObjectId) return NULL; /* ! * Lookup function data. The hash key for this is the low-order 16 bits ! * of the OID, in network byte order. */ ! hashkey = htons(id); ! index = fmgr_builtin_oid_hash(&hashkey, sizeof(hashkey)); ! ! /* Out-of-range hash result means definitely no match */ ! if (index < 0 || index >= fmgr_nbuiltins) return NULL; ! result = &fmgr_builtins[index]; ! ! /* We have to verify the match, though */ ! if (id != result->foid) ! return NULL; ! ! return result; } /* diff --git a/src/include/utils/fmgrtab.h b/src/include/utils/fmgrtab.h index a778f88..f27aff5 100644 *** a/src/include/utils/fmgrtab.h --- b/src/include/utils/fmgrtab.h *************** extern const FmgrBuiltin fmgr_builtins[] *** 36,46 **** extern const int fmgr_nbuiltins; /* number of entries in table */ ! /* ! * Mapping from a builtin function's oid to the index in the fmgr_builtins ! * array. ! */ ! #define InvalidOidBuiltinMapping PG_UINT16_MAX ! extern const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId]; #endif /* FMGRTAB_H */ --- 36,41 ---- extern const int fmgr_nbuiltins; /* number of entries in table */ ! extern int fmgr_builtin_oid_hash(const void *key, size_t keylen); #endif /* FMGRTAB_H */ diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm index 34d55cf..862357b 100644 *** a/src/tools/PerfectHash.pm --- b/src/tools/PerfectHash.pm *************** sub generate_hash_function *** 77,83 **** # but these two multipliers are chosen to be cheap to calculate via # shift-and-add, so don't change them except at great need. my $hash_mult1 = 31; ! my $hash_mult2 = 37; # We just try successive hash seed values until we find one that works. # (Commonly, random seeds are tried, but we want reproducible results --- 77,83 ---- # but these two multipliers are chosen to be cheap to calculate via # shift-and-add, so don't change them except at great need. my $hash_mult1 = 31; ! my $hash_mult2 = 1029; # We just try successive hash seed values until we find one that works. 
# (Commonly, random seeds are tried, but we want reproducible results
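One byte-order detail in this fmgr experiment is worth spelling out: the Perl generator hashes the key bytes produced by pack("n", $oid), while the C lookup hashes the bytes of htons(id); both yield the OID's low-order 16 bits in big-endian order, so the two sides hash identical byte sequences. A minimal sketch (the OID value is arbitrary):

    use strict;
    use warnings;

    my $oid = 2768;                # hypothetical builtin OID (0x0AD0)
    my $key = pack("n", $oid);     # two bytes, network (big-endian) order

    # The C side builds the same key as "uint16 hashkey = htons(id)", so
    # generator and runtime agree regardless of host endianness.
    printf "key bytes: %02x %02x\n", unpack("C2", $key);    # -> 0a d0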
On Tue, Jan 8, 2019 at 3:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I'll take a crack at separating it into a module. I'll wait a bit in > > case there are any stylistic suggestions on the patch as it stands. > > I had a go at that myself. I'm sure there's plenty to criticize in > the result, but at least it passes make check-world ;-) Just a couple of comments about the module: -If you qualify the function with its module name as you did (PerfectHash::generate_hash_function), you don't have to export the function into the caller's namespace, so you can skip the @EXPORT_OK setting. Most of our modules don't export. -There is a bit of a cognitive clash between $case_sensitive in gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each make sense in their own file, but might it be worth using one or the other? -As for the graph algorithm, I'd have to play with it to understand how it works. In the committed keyword patch, I noticed that in common/keywords.c, the array length is defined with ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS] but other keyword arrays just have ...[]. Is there a reason for the difference?
John Naylor <john.naylor@2ndquadrant.com> writes: > Just a couple of comments about the module: > -If you qualify the function with its module name as you did > (PerfectHash::generate_hash_function), you don't have to export the > function into the caller's namespace, so you can skip the @EXPORT_OK > setting. Most of our modules don't export. OK by me. I was more concerned about hiding the stuff that isn't supposed to be exported. > -There is a bit of a cognitive clash between $case_sensitive in > gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each > make sense in their own file, but might it be worth using one or the > other? Yeah, dunno. It seems to make sense for the command-line-level default of gen_keywordlist.pl to be "case insensitive", since most users want that. But that surely shouldn't be the default in PerfectHash.pm, and I'm not very sure how to reconcile the discrepancy. > In the committed keyword patch, I noticed that in common/keywords.c, > the array length is defined with > ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS] > but other keyword arrays just have ...[]. Is there a reason for the difference? The length macro was readily available there so I used it. AFAIR that wasn't true elsewhere, though I might've missed something. It's pretty much just belt-and-suspenders coding anyway, since all those arrays are machine-generated ... regards, tom lane
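For reference, a minimal usage sketch of the module as it stands at this point in the thread, with the positional case-insensitivity argument; the keyword list, function name, and library path are hypothetical:

    use lib 'src/tools';    # assumes the patch's location for PerfectHash.pm
    use PerfectHash;

    # Third argument is the case-insensitivity flag, per the module's
    # current positional signature.
    my @keywords = qw(abort add all);
    my $f = PerfectHash::generate_hash_function(\@keywords, 'demo_hash_func', 1);
    print $f;               # emits the C function definition as a string

(A later message in this thread changes this signature to keyword-style parameters.)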
John Naylor <john.naylor@2ndquadrant.com> writes: > -As for the graph algorithm, I'd have to play with it to understand > how it works. I improved the comment about how come the hash table entry assignment works. One thing I'm not clear about myself is # A cycle-free graph is either empty or has some vertex of degree 1. That sounds like a standard graph theory result, but I'm not familiar with a proof for it. regards, tom lane #---------------------------------------------------------------------- # # PerfectHash.pm # Perl module that constructs minimal perfect hash functions # # This code constructs a minimal perfect hash function for the given # set of keys, using an algorithm described in # "An optimal algorithm for generating minimal perfect hash functions" # by Czech, Havas and Majewski in Information Processing Letters, # 43(5):256-264, October 1992. # This implementation is loosely based on NetBSD's "nbperf", # which was written by Joerg Sonnenberger. # # The resulting hash function is perfect in the sense that if the presented # key is one of the original set, it will return the key's index in the set # (in range 0..N-1). However, the caller must still verify the match, # as false positives are possible. Also, the hash function may return # values that are out of range (negative, or >= N). This indicates that # the presented key is definitely not in the set. # # # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group # Portions Copyright (c) 1994, Regents of the University of California # # src/tools/PerfectHash.pm # #---------------------------------------------------------------------- package PerfectHash; use strict; use warnings; # At runtime, we'll compute two simple hash functions of the input key, # and use them to index into a mapping table. The hash functions are just # multiply-and-add in uint32 arithmetic, with different multipliers but # the same initial seed. All the complexity in this module is concerned # with selecting hash parameters that will work and building the mapping # table. # We support making case-insensitive hash functions, though this only # works for a strict-ASCII interpretation of case insensitivity, # ie, A-Z maps onto a-z and nothing else. my $case_insensitive = 0; # # Construct a C function implementing a perfect hash for the given keys. # The C function definition is returned as a string. # # The keys can be any set of Perl strings; it is caller's responsibility # that there not be any duplicates. (Note that the "strings" can be # binary data, but endianness is the caller's problem.) # # The name to use for the function is caller-specified, but its signature # is always "int f(const void *key, size_t keylen)". The caller may # prepend "static " to the result string if it wants a static function. # # If $ci is true, the function is case-insensitive, for the limited idea # of case-insensitivity explained above. # sub generate_hash_function { my ($keys_ref, $funcname, $ci) = @_; # It's not worth passing this around as a parameter; just use a global. $case_insensitive = $ci; # Try different hash function parameters until we find a set that works # for these keys. In principle we might need to change multipliers, # but these two multipliers are chosen to be primes that are cheap to # calculate via shift-and-add, so don't change them without care. my $hash_mult1 = 31; my $hash_mult2 = 2053; # We just try successive hash seed values until we find one that works. 
    # (Commonly, random seeds are tried, but we want reproducible results
    # from this program so we don't do that.)
    my $hash_seed;
    my @subresult;
    for ($hash_seed = 0; $hash_seed < 1000; $hash_seed++)
    {
        @subresult = _construct_hash_table($keys_ref, $hash_mult1,
                                           $hash_mult2, $hash_seed);
        last if @subresult;
    }

    # Choke if we didn't succeed in a reasonable number of tries.
    die "failed to generate perfect hash" if !@subresult;

    # Extract info from the function result array.
    my $elemtype = $subresult[0];
    my @hashtab  = @{ $subresult[1] };
    my $nhash    = scalar(@hashtab);

    # OK, construct the hash function definition including the hash table.
    my $f = '';

    $f .= sprintf "int\n";
    $f .= sprintf "%s(const void *key, size_t keylen)\n{\n", $funcname;
    $f .= sprintf "\tstatic const %s h[%d] = {\n", $elemtype, $nhash;
    for (my $i = 0; $i < $nhash; $i++)
    {
        $f .= sprintf "%s%6d,%s",
          ($i % 8 == 0 ? "\t\t" : " "),
          $hashtab[$i],
          ($i % 8 == 7 ? "\n" : "");
    }
    $f .= sprintf "\n" if ($nhash % 8 != 0);
    $f .= sprintf "\t};\n\n";
    $f .= sprintf "\tconst unsigned char *k = key;\n";
    $f .= sprintf "\tuint32\t\ta = %d;\n", $hash_seed;
    $f .= sprintf "\tuint32\t\tb = %d;\n\n", $hash_seed;
    $f .= sprintf "\twhile (keylen--)\n\t{\n";
    $f .= sprintf "\t\tunsigned char c = *k++";
    $f .= sprintf " | 0x20" if $case_insensitive;    # see comment below
    $f .= sprintf ";\n\n";
    $f .= sprintf "\t\ta = a * %d + c;\n", $hash_mult1;
    $f .= sprintf "\t\tb = b * %d + c;\n", $hash_mult2;
    $f .= sprintf "\t}\n";
    $f .= sprintf "\treturn h[a %% %d] + h[b %% %d];\n", $nhash, $nhash;
    $f .= sprintf "}\n";

    return $f;
}


# Calculate a hash function as the run-time code will do.
#
# If we are making a case-insensitive hash function, we implement that
# by OR'ing 0x20 into each byte of the key.  This correctly transforms
# upper-case ASCII into lower-case ASCII, while not changing digits or
# dollar signs.  (It does change '_', else we could just skip adjusting
# $cn here at all, for typical keyword strings.)
sub _calc_hash
{
    my ($key, $mult, $seed) = @_;

    my $result = $seed;
    for my $c (split //, $key)
    {
        my $cn = ord($c);
        $cn |= 0x20 if $case_insensitive;
        $result = ($result * $mult + $cn) % 4294967296;
    }
    return $result;
}


# Attempt to construct a mapping table for a minimal perfect hash function
# for the given keys, using the specified hash parameters.
#
# Returns an array containing the mapping table element type name as the
# first element, and a ref to an array of the table values as the second.
#
# Returns an empty array on failure; then caller should choose different
# hash parameter(s) and try again.
sub _construct_hash_table
{
    my ($keys_ref, $hash_mult1, $hash_mult2, $hash_seed) = @_;
    my @keys = @{$keys_ref};

    # This algorithm is based on a graph whose edges correspond to the
    # keys and whose vertices correspond to entries of the mapping table.
    # A key's edge links the two vertices whose indexes are the outputs of
    # the two hash functions for that key.  For K keys, the mapping
    # table must have at least 2*K+1 entries, guaranteeing that there's at
    # least one unused entry.  (In principle, larger mapping tables make it
    # easier to find a workable hash and increase the number of inputs that
    # can be rejected due to touching unused hashtable entries.  In practice,
    # neither effect seems strong enough to justify using a larger table.)
    my $nedges = scalar @keys;       # number of edges
    my $nverts = 2 * $nedges + 1;    # number of vertices

    # Initialize the array of edges.
    my @E = ();
    foreach my $kw (@keys)
    {
        # Calculate hashes for this key.
        # The hashes are immediately reduced modulo the mapping table size.
        my $hash1 = _calc_hash($kw, $hash_mult1, $hash_seed) % $nverts;
        my $hash2 = _calc_hash($kw, $hash_mult2, $hash_seed) % $nverts;

        # If the two hashes are the same for any key, we have to fail
        # since this edge would itself form a cycle in the graph.
        return () if $hash1 == $hash2;

        # Add the edge for this key.
        push @E, { left => $hash1, right => $hash2 };
    }

    # Initialize the array of vertices, giving them all empty lists
    # of associated edges.  (The lists will be hashes of edge numbers.)
    my @V = ();
    for (my $v = 0; $v < $nverts; $v++)
    {
        push @V, { edges => {} };
    }

    # Insert each edge in the lists of edges using its vertices.
    for (my $e = 0; $e < $nedges; $e++)
    {
        my $v = $E[$e]{left};
        $V[$v]{edges}->{$e} = 1;

        $v = $E[$e]{right};
        $V[$v]{edges}->{$e} = 1;
    }

    # Now we attempt to prove the graph acyclic.
    # A cycle-free graph is either empty or has some vertex of degree 1.
    # Removing the edge attached to that vertex doesn't change this property,
    # so doing that repeatedly will reduce the size of the graph.
    # If the graph is empty at the end of the process, it was acyclic.
    # We track the order of edge removal so that the next phase can process
    # them in reverse order of removal.
    my @output_order = ();

    # Consider each vertex as a possible starting point for edge-removal.
    for (my $startv = 0; $startv < $nverts; $startv++)
    {
        my $v = $startv;

        # If vertex v is of degree 1 (i.e. exactly 1 edge connects to it),
        # remove that edge, and then consider the edge's other vertex to see
        # if it is now of degree 1.  The inner loop repeats until reaching a
        # vertex not of degree 1.
        while (scalar(keys(%{ $V[$v]{edges} })) == 1)
        {
            # Unlink its only edge.
            my $e = (keys(%{ $V[$v]{edges} }))[0];
            delete($V[$v]{edges}->{$e});

            # Unlink the edge from its other vertex, too.
            my $v2 = $E[$e]{left};
            $v2 = $E[$e]{right} if ($v2 == $v);
            delete($V[$v2]{edges}->{$e});

            # Push e onto the front of the output-order list.
            unshift @output_order, $e;

            # Consider v2 on next iteration of inner loop.
            $v = $v2;
        }
    }

    # We succeeded only if all edges were removed from the graph.
    return () if (scalar(@output_order) != $nedges);

    # OK, build the hash table of size $nverts.
    my @hashtab = (0) x $nverts;
    # We need a "visited" flag array in this step, too.
    my @visited = (0) x $nverts;

    # The goal is that for any key, the sum of the hash table entries for
    # its first and second hash values is the desired output (i.e., the key
    # number).  By assigning hash table values in the selected edge order,
    # we can guarantee that that's true.  This works because the edge first
    # removed from the graph (and hence last to be visited here) must have
    # at least one vertex it shared with no other edge; hence it will have at
    # least one vertex (hashtable entry) still unvisited when we reach it here,
    # and we can assign that unvisited entry a value that makes the sum come
    # out as we wish.  By induction, the same holds for all the other edges.
    foreach my $e (@output_order)
    {
        my $l = $E[$e]{left};
        my $r = $E[$e]{right};
        if (!$visited[$l])
        {
            # $hashtab[$r] might be zero, or some previously assigned value.
            $hashtab[$l] = $e - $hashtab[$r];
        }
        else
        {
            die "oops, doubly used hashtab entry" if $visited[$r];
            # $hashtab[$l] might be zero, or some previously assigned value.
            $hashtab[$r] = $e - $hashtab[$l];
        }
        # Now freeze both of these hashtab entries.
        $visited[$l] = 1;
        $visited[$r] = 1;
    }

    # Detect range of values needed in hash table.
    my $hmin = $nedges;
    my $hmax = 0;
    for (my $v = 0; $v < $nverts; $v++)
    {
        $hmin = $hashtab[$v] if $hashtab[$v] < $hmin;
        $hmax = $hashtab[$v] if $hashtab[$v] > $hmax;
    }

    # Choose width of hashtable entries.  In addition to the actual values,
    # we need to be able to store a flag for unused entries, and we wish to
    # have the property that adding any other entry value to the flag gives
    # an out-of-range result (>= $nedges).
    my $elemtype;
    my $unused_flag;

    if (   $hmin >= -0x7F
        && $hmax <= 0x7F
        && $hmin + 0x7F >= $nedges)
    {
        # int8 will work
        $elemtype = 'int8';
        $unused_flag = 0x7F;
    }
    elsif ($hmin >= -0x7FFF
        && $hmax <= 0x7FFF
        && $hmin + 0x7FFF >= $nedges)
    {
        # int16 will work
        $elemtype = 'int16';
        $unused_flag = 0x7FFF;
    }
    elsif ($hmin >= -0x7FFFFFFF
        && $hmax <= 0x7FFFFFFF
        && $hmin + 0x3FFFFFFF >= $nedges)
    {
        # int32 will work
        $elemtype = 'int32';
        $unused_flag = 0x3FFFFFFF;
    }
    else
    {
        die "hash table values too wide";
    }

    # Set any unvisited hashtable entries to $unused_flag.
    for (my $v = 0; $v < $nverts; $v++)
    {
        $hashtab[$v] = $unused_flag if !$visited[$v];
    }

    return ($elemtype, \@hashtab);
}

1;
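To make the module's output concrete, the function it emits for a small keyword set would look roughly as follows. This is a sketch only: the function name, multipliers, seed, mapping-table size, and table contents are placeholders rather than a real solution computed for any particular key set, and uint32/int8 are PostgreSQL's c.h typedefs rather than standard C types.

int
DemoKeywords_hash_func(const void *key, size_t keylen)
{
	/* Placeholder mapping table; real contents come from _construct_hash_table. */
	static const int8 h[9] = {
		     0,      3,      0,      1,      0,      2,      0,      0,
		     0,
	};

	const unsigned char *k = key;
	uint32		a = 0;
	uint32		b = 0;

	while (keylen--)
	{
		unsigned char c = *k++ | 0x20;	/* ASCII case-folding */

		a = a * 31 + c;
		b = b * 127 + c;
	}
	return h[a % 9] + h[b % 9];
}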
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Joerg Sonnenberger
On Tue, Jan 08, 2019 at 05:53:25PM -0500, Tom Lane wrote: > John Naylor <john.naylor@2ndquadrant.com> writes: > > -As for the graph algorithm, I'd have to play with it to understand > > how it works. > > I improved the comment about how come the hash table entry assignment > works. One thing I'm not clear about myself is > > # A cycle-free graph is either empty or has some vertex of degree 1. > > That sounds like a standard graph theory result, but I'm not familiar > with a proof for it. Let's assume the graph is acyclic and non-empty, and that all vertexes have degree > 1. Pick any vertex and construct a path starting from it. It is connected to at least one other vertex, so follow that edge. The next vertex must be connected to at least one more vertex, and it can't go back to the starting point (since that would be a cycle). Each subsequent vertex must likewise have another connection, and it can never go back to any already-visited vertex. Continue until you run out of vertexes: a contradiction, since the graph is finite. Joerg
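For the record, Joerg's argument can be written in the usual extremal form; the following is a sketch in my own notation, not text from the thread:

\textbf{Claim.} A finite acyclic graph with at least one edge contains a vertex of degree exactly 1.

\textbf{Proof sketch.} Suppose instead that every vertex incident to an edge has degree $\geq 2$. Pick such a vertex $v_0$ and build a walk $v_0, v_1, v_2, \ldots$, at each step leaving the current vertex by an edge different from the one just used to arrive (possible because every degree is $\geq 2$). If the walk ever revisited a vertex, the edges traversed between the two visits would form a cycle, contradicting acyclicity; so all the $v_i$ are distinct. But a finite graph cannot contain infinitely many distinct vertices, a contradiction. Hence some vertex incident to an edge has degree exactly 1. $\square$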
I wrote: > John Naylor <john.naylor@2ndquadrant.com> writes: >> -There is a bit of a cognitive clash between $case_sensitive in >> gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each >> make sense in their own file, but might it be worth using one or the >> other? > Yeah, dunno. It seems to make sense for the command-line-level default of > gen_keywordlist.pl to be "case insensitive", since most users want that. > But that surely shouldn't be the default in PerfectHash.pm, and I'm not > very sure how to reconcile the discrepancy. Working on the fmgr-oid-lookup idea gave me the thought that PerfectHash.pm ought to support fixed-length keys. Rather than start adding random parameters to the function, I borrowed an idea from PostgresNode.pm and made the options be keyword-style parameters. Now the impedance mismatch about case sensitivity is handled with my $f = PerfectHash::generate_hash_function(\@keywords, $funcname, case_insensitive => !$case_sensitive); which is at least a little clearer than before, though I'm not sure if it entirely solves the problem. Also, in view of finding that the original multiplier choices failed on the fmgr oid problem, I spent a little effort making the code able to try more combinations of hash multipliers and seeds. It'd be nice to have some theory rather than just heuristics about what will work, though ... Barring objections or further review, I plan to push this soon. regards, tom lane diff --git a/src/common/Makefile b/src/common/Makefile index 317b071..d0c2b97 100644 *** a/src/common/Makefile --- b/src/common/Makefile *************** OBJS_FRONTEND = $(OBJS_COMMON) fe_memuti *** 63,68 **** --- 63,73 ---- OBJS_SHLIB = $(OBJS_FRONTEND:%.o=%_shlib.o) OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o) + # where to find gen_keywordlist.pl and subsidiary files + TOOLSDIR = $(top_srcdir)/src/tools + GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl + GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm + all: libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a distprep: kwlist_d.h *************** libpgcommon_srv.a: $(OBJS_SRV) *** 118,125 **** $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ # generate SQL keyword lookup table to be included into keywords*.o. ! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(top_srcdir)/src/tools/gen_keywordlist.pl ! $(PERL) $(top_srcdir)/src/tools/gen_keywordlist.pl --extern $< # Dependencies of keywords*.o need to be managed explicitly to make sure # that you don't get broken parsing code, even in a non-enable-depend build. --- 123,130 ---- $(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@ # generate SQL keyword lookup table to be included into keywords*.o. ! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --extern $< # Dependencies of keywords*.o need to be managed explicitly to make sure # that you don't get broken parsing code, even in a non-enable-depend build. diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c index d72842e..6545480 100644 *** a/src/common/kwlookup.c --- b/src/common/kwlookup.c *************** *** 35,94 **** * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *text, const ScanKeywordList *keywords) { ! int len, ! i; ! char word[NAMEDATALEN]; ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! ! len = strlen(text); if (len > keywords->max_kw_len) ! 
return -1; /* too long to be any keyword */ ! ! /* We assume all keywords are shorter than NAMEDATALEN. */ ! Assert(len < NAMEDATALEN); /* ! * Apply an ASCII-only downcasing. We must not use tolower() since it may ! * produce the wrong translation in some locales (eg, Turkish). */ ! for (i = 0; i < len; i++) ! { ! char ch = text[i]; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! word[i] = ch; ! } ! word[len] = '\0'; /* ! * Now do a binary search using plain strcmp() comparison. */ ! kw_string = keywords->kw_string; ! kw_offsets = keywords->kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (keywords->num_keywords - 1); ! while (low <= high) { ! const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, word); ! if (difference == 0) ! return middle - kw_offsets; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; } ! return -1; } --- 35,85 ---- * receive a different case-normalization mapping. */ int ! ScanKeywordLookup(const char *str, const ScanKeywordList *keywords) { ! size_t len; ! int h; ! const char *kw; + /* + * Reject immediately if too long to be any keyword. This saves useless + * hashing and downcasing work on long strings. + */ + len = strlen(str); if (len > keywords->max_kw_len) ! return -1; /* ! * Compute the hash function. We assume it was generated to produce ! * case-insensitive results. Since it's a perfect hash, we need only ! * match to the specific keyword it identifies. */ ! h = keywords->hash(str, len); ! /* An out-of-range result implies no match */ ! if (h < 0 || h >= keywords->num_keywords) ! return -1; /* ! * Compare character-by-character to see if we have a match, applying an ! * ASCII-only downcasing to the input characters. We must not use ! * tolower() since it may produce the wrong translation in some locales ! * (eg, Turkish). */ ! kw = GetScanKeyword(h, keywords); ! while (*str != '\0') { ! char ch = *str++; ! if (ch >= 'A' && ch <= 'Z') ! ch += 'a' - 'A'; ! if (ch != *kw++) ! return -1; } + if (*kw != '\0') + return -1; ! /* Success! */ ! return h; } diff --git a/src/include/common/kwlookup.h b/src/include/common/kwlookup.h index 39efb35..dbff367 100644 *** a/src/include/common/kwlookup.h --- b/src/include/common/kwlookup.h *************** *** 14,19 **** --- 14,22 ---- #ifndef KWLOOKUP_H #define KWLOOKUP_H + /* Hash function used by ScanKeywordLookup */ + typedef int (*ScanKeywordHashFunc) (const void *key, size_t keylen); + /* * This struct contains the data needed by ScanKeywordLookup to perform a * search within a set of keywords. The contents are typically generated by *************** typedef struct ScanKeywordList *** 23,28 **** --- 26,32 ---- { const char *kw_string; /* all keywords in order, separated by \0 */ const uint16 *kw_offsets; /* offsets to the start of each keyword */ + ScanKeywordHashFunc hash; /* perfect hash function for keywords */ int num_keywords; /* number of keywords */ int max_kw_len; /* length of longest keyword */ } ScanKeywordList; diff --git a/src/interfaces/ecpg/preproc/Makefile b/src/interfaces/ecpg/preproc/Makefile index b5b74a3..6c02f97 100644 *** a/src/interfaces/ecpg/preproc/Makefile --- b/src/interfaces/ecpg/preproc/Makefile *************** OBJS= preproc.o pgc.o type.o ecpg.o outp *** 28,34 **** keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \ $(WIN32RES) ! 
GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl # Suppress parallel build to avoid a bug in GNU make 3.82 # (see comments in ../Makefile) --- 28,37 ---- keywords.o c_keywords.o ecpg_keywords.o typename.o descriptor.o variable.o \ $(WIN32RES) ! # where to find gen_keywordlist.pl and subsidiary files ! TOOLSDIR = $(top_srcdir)/src/tools ! GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl ! GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm # Suppress parallel build to avoid a bug in GNU make 3.82 # (see comments in ../Makefile) *************** preproc.y: ../../../backend/parser/gram. *** 56,66 **** $(PERL) $(srcdir)/check_rules.pl $(srcdir) $< # generate keyword headers ! c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname ScanCKeywords $< ! ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $< # Force these dependencies to be known even without dependency info built: ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h --- 59,69 ---- $(PERL) $(srcdir)/check_rules.pl $(srcdir) $< # generate keyword headers ! c_kwlist_d.h: c_kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --varname ScanCKeywords --case $< ! ecpg_kwlist_d.h: ecpg_kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --varname ScanECPGKeywords $< # Force these dependencies to be known even without dependency info built: ecpg_keywords.o c_keywords.o keywords.o preproc.o pgc.o parser.o: preproc.h diff --git a/src/interfaces/ecpg/preproc/c_keywords.c b/src/interfaces/ecpg/preproc/c_keywords.c index 38ddf6f..80aa7d5 100644 *** a/src/interfaces/ecpg/preproc/c_keywords.c --- b/src/interfaces/ecpg/preproc/c_keywords.c *************** *** 9,16 **** */ #include "postgres_fe.h" - #include <ctype.h> - #include "preproc_extern.h" #include "preproc.h" --- 9,14 ---- *************** static const uint16 ScanCKeywordTokens[] *** 32,70 **** * * Returns the token value of the keyword, or -1 if no match. * ! * Do a binary search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ int ! ScanCKeywordLookup(const char *text) { ! const char *kw_string; ! const uint16 *kw_offsets; ! const uint16 *low; ! const uint16 *high; ! if (strlen(text) > ScanCKeywords.max_kw_len) ! return -1; /* too long to be any keyword */ ! kw_string = ScanCKeywords.kw_string; ! kw_offsets = ScanCKeywords.kw_offsets; ! low = kw_offsets; ! high = kw_offsets + (ScanCKeywords.num_keywords - 1); ! while (low <= high) ! { ! const uint16 *middle; ! int difference; ! middle = low + (high - low) / 2; ! difference = strcmp(kw_string + *middle, text); ! if (difference == 0) ! return ScanCKeywordTokens[middle - kw_offsets]; ! else if (difference < 0) ! low = middle + 1; ! else ! high = middle - 1; ! } return -1; } --- 30,67 ---- * * Returns the token value of the keyword, or -1 if no match. * ! * Do a hash search using plain strcmp() comparison. This is much like * ScanKeywordLookup(), except we want case-sensitive matching. */ int ! ScanCKeywordLookup(const char *str) { ! size_t len; ! int h; ! const char *kw; ! /* ! * Reject immediately if too long to be any keyword. This saves useless ! * hashing work on long strings. ! */ ! len = strlen(str); ! if (len > ScanCKeywords.max_kw_len) ! return -1; ! /* ! * Compute the hash function. Since it's a perfect hash, we need only ! * match to the specific keyword it identifies. ! */ ! 
h = ScanCKeywords_hash_func(str, len); ! /* An out-of-range result implies no match */ ! if (h < 0 || h >= ScanCKeywords.num_keywords) ! return -1; ! kw = GetScanKeyword(h, &ScanCKeywords); ! ! if (strcmp(kw, str) == 0) ! return ScanCKeywordTokens[h]; return -1; } diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile index f5958d1..cc1c261 100644 *** a/src/pl/plpgsql/src/Makefile --- b/src/pl/plpgsql/src/Makefile *************** REGRESS_OPTS = --dbname=$(PL_TESTDB) *** 29,35 **** REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \ plpgsql_cache plpgsql_transaction plpgsql_trigger plpgsql_varprops ! GEN_KEYWORDLIST = $(top_srcdir)/src/tools/gen_keywordlist.pl all: all-lib --- 29,38 ---- REGRESS = plpgsql_call plpgsql_control plpgsql_domain plpgsql_record \ plpgsql_cache plpgsql_transaction plpgsql_trigger plpgsql_varprops ! # where to find gen_keywordlist.pl and subsidiary files ! TOOLSDIR = $(top_srcdir)/src/tools ! GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl ! GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm all: all-lib *************** plerrcodes.h: $(top_srcdir)/src/backend/ *** 76,86 **** $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@ # generate keyword headers for the scanner ! pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $< ! pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST) ! $(PERL) $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $< check: submake --- 79,89 ---- $(PERL) $(srcdir)/generate-plerrcodes.pl $< > $@ # generate keyword headers for the scanner ! pl_reserved_kwlist_d.h: pl_reserved_kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --varname ReservedPLKeywords $< ! pl_unreserved_kwlist_d.h: pl_unreserved_kwlist.h $(GEN_KEYWORDLIST_DEPS) ! $(GEN_KEYWORDLIST) --varname UnreservedPLKeywords $< check: submake diff --git a/src/tools/PerfectHash.pm b/src/tools/PerfectHash.pm index ...12223fa . *** a/src/tools/PerfectHash.pm --- b/src/tools/PerfectHash.pm *************** *** 0 **** --- 1,375 ---- + #---------------------------------------------------------------------- + # + # PerfectHash.pm + # Perl module that constructs minimal perfect hash functions + # + # This code constructs a minimal perfect hash function for the given + # set of keys, using an algorithm described in + # "An optimal algorithm for generating minimal perfect hash functions" + # by Czech, Havas and Majewski in Information Processing Letters, + # 43(5):256-264, October 1992. + # This implementation is loosely based on NetBSD's "nbperf", + # which was written by Joerg Sonnenberger. + # + # The resulting hash function is perfect in the sense that if the presented + # key is one of the original set, it will return the key's index in the set + # (in range 0..N-1). However, the caller must still verify the match, + # as false positives are possible. Also, the hash function may return + # values that are out of range (negative or >= N), due to summing unrelated + # hashtable entries. This indicates that the presented key is definitely + # not in the set. 
+ # + # + # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group + # Portions Copyright (c) 1994, Regents of the University of California + # + # src/tools/PerfectHash.pm + # + #---------------------------------------------------------------------- + + package PerfectHash; + + use strict; + use warnings; + + + # At runtime, we'll compute two simple hash functions of the input key, + # and use them to index into a mapping table. The hash functions are just + # multiply-and-add in uint32 arithmetic, with different multipliers and + # initial seeds. All the complexity in this module is concerned with + # selecting hash parameters that will work and building the mapping table. + + # We support making case-insensitive hash functions, though this only + # works for a strict-ASCII interpretation of case insensitivity, + # ie, A-Z maps onto a-z and nothing else. + my $case_insensitive = 0; + + + # + # Construct a C function implementing a perfect hash for the given keys. + # The C function definition is returned as a string. + # + # The keys should be passed as an array reference. They can be any set + # of Perl strings; it is caller's responsibility that there not be any + # duplicates. (Note that the "strings" can be binary data, but hashing + # e.g. OIDs has endianness hazards that callers must overcome.) + # + # The name to use for the function is specified as the second argument. + # It will be a global function by default, but the caller may prepend + # "static " to the result string if it wants a static function. + # + # Additional options can be specified as keyword-style arguments: + # + # case_insensitive => bool + # If specified as true, the hash function is case-insensitive, for the + # limited idea of case-insensitivity explained above. + # + # fixed_key_length => N + # If specified, all keys are assumed to have length N bytes, and the + # hash function signature will be just "int f(const void *key)" + # rather than "int f(const void *key, size_t keylen)". + # + sub generate_hash_function + { + my ($keys_ref, $funcname, %options) = @_; + + # It's not worth passing this around as a parameter; just use a global. + $case_insensitive = $options{case_insensitive} || 0; + + # Try different hash function parameters until we find a set that works + # for these keys. The multipliers are chosen to be primes that are cheap + # to calculate via shift-and-add, so don't change them without care. + # (Commonly, random seeds are tried, but we want reproducible results + # from this program so we don't do that.) + my $hash_mult1 = 31; + my $hash_mult2; + my $hash_seed1; + my $hash_seed2; + my @subresult; + FIND_PARAMS: + foreach (127, 257, 521, 1033, 2053) + { + $hash_mult2 = $_; # "foreach $hash_mult2" doesn't work + for ($hash_seed1 = 0; $hash_seed1 < 10; $hash_seed1++) + { + for ($hash_seed2 = 0; $hash_seed2 < 10; $hash_seed2++) + { + @subresult = _construct_hash_table( + $keys_ref, $hash_mult1, $hash_mult2, + $hash_seed1, $hash_seed2); + last FIND_PARAMS if @subresult; + } + } + } + + # Choke if we couldn't find a workable set of parameters. + die "failed to generate perfect hash" if !@subresult; + + # Extract info from _construct_hash_table's result array. + my $elemtype = $subresult[0]; + my @hashtab = @{ $subresult[1] }; + my $nhash = scalar(@hashtab); + + # OK, construct the hash function definition including the hash table. 
+ my $f = ''; + $f .= sprintf "int\n"; + if (defined $options{fixed_key_length}) + { + $f .= sprintf "%s(const void *key)\n{\n", $funcname; + } + else + { + $f .= sprintf "%s(const void *key, size_t keylen)\n{\n", $funcname; + } + $f .= sprintf "\tstatic const %s h[%d] = {\n", $elemtype, $nhash; + for (my $i = 0; $i < $nhash; $i++) + { + $f .= sprintf "%s%6d,%s", + ($i % 8 == 0 ? "\t\t" : " "), + $hashtab[$i], + ($i % 8 == 7 ? "\n" : ""); + } + $f .= sprintf "\n" if ($nhash % 8 != 0); + $f .= sprintf "\t};\n\n"; + $f .= sprintf "\tconst unsigned char *k = key;\n"; + $f .= sprintf "\tsize_t\t\tkeylen = %d;\n", $options{fixed_key_length} + if (defined $options{fixed_key_length}); + $f .= sprintf "\tuint32\t\ta = %d;\n", $hash_seed1; + $f .= sprintf "\tuint32\t\tb = %d;\n\n", $hash_seed2; + $f .= sprintf "\twhile (keylen--)\n\t{\n"; + $f .= sprintf "\t\tunsigned char c = *k++"; + $f .= sprintf " | 0x20" if $case_insensitive; # see comment below + $f .= sprintf ";\n\n"; + $f .= sprintf "\t\ta = a * %d + c;\n", $hash_mult1; + $f .= sprintf "\t\tb = b * %d + c;\n", $hash_mult2; + $f .= sprintf "\t}\n"; + $f .= sprintf "\treturn h[a %% %d] + h[b %% %d];\n", $nhash, $nhash; + $f .= sprintf "}\n"; + + return $f; + } + + + # Calculate a hash function as the run-time code will do. + # + # If we are making a case-insensitive hash function, we implement that + # by OR'ing 0x20 into each byte of the key. This correctly transforms + # upper-case ASCII into lower-case ASCII, while not changing digits or + # dollar signs. (It does change '_', else we could just skip adjusting + # $cn here at all, for typical keyword strings.) + sub _calc_hash + { + my ($key, $mult, $seed) = @_; + + my $result = $seed; + for my $c (split //, $key) + { + my $cn = ord($c); + $cn |= 0x20 if $case_insensitive; + $result = ($result * $mult + $cn) % 4294967296; + } + return $result; + } + + + # Attempt to construct a mapping table for a minimal perfect hash function + # for the given keys, using the specified hash parameters. + # + # Returns an array containing the mapping table element type name as the + # first element, and a ref to an array of the table values as the second. + # + # Returns an empty array on failure; then caller should choose different + # hash parameter(s) and try again. + sub _construct_hash_table + { + my ($keys_ref, $hash_mult1, $hash_mult2, $hash_seed1, $hash_seed2) = @_; + my @keys = @{$keys_ref}; + + # This algorithm is based on a graph whose edges correspond to the + # keys and whose vertices correspond to entries of the mapping table. + # A key's edge links the two vertices whose indexes are the outputs of + # the two hash functions for that key. For K keys, the mapping + # table must have at least 2*K+1 entries, guaranteeing that there's at + # least one unused entry. (In principle, larger mapping tables make it + # easier to find a workable hash and increase the number of inputs that + # can be rejected due to touching unused hashtable entries. In practice, + # neither effect seems strong enough to justify using a larger table.) + my $nedges = scalar @keys; # number of edges + my $nverts = 2 * $nedges + 1; # number of vertices + + # However, it would be very bad if $nverts were exactly equal to either + # $hash_mult1 or $hash_mult2: effectively, that hash function would be + # sensitive to only the last byte of each key. Cases where $nverts is a + # multiple of either multiplier likewise lose information. 
(But $nverts + # can't actually divide them, if they've been intelligently chosen as + # primes.) We can avoid such problems by adjusting the table size. + while ($nverts % $hash_mult1 == 0 + || $nverts % $hash_mult2 == 0) + { + $nverts++; + } + + # Initialize the array of edges. + my @E = (); + foreach my $kw (@keys) + { + # Calculate hashes for this key. + # The hashes are immediately reduced modulo the mapping table size. + my $hash1 = _calc_hash($kw, $hash_mult1, $hash_seed1) % $nverts; + my $hash2 = _calc_hash($kw, $hash_mult2, $hash_seed2) % $nverts; + + # If the two hashes are the same for any key, we have to fail + # since this edge would itself form a cycle in the graph. + return () if $hash1 == $hash2; + + # Add the edge for this key. + push @E, { left => $hash1, right => $hash2 }; + } + + # Initialize the array of vertices, giving them all empty lists + # of associated edges. (The lists will be hashes of edge numbers.) + my @V = (); + for (my $v = 0; $v < $nverts; $v++) + { + push @V, { edges => {} }; + } + + # Insert each edge in the lists of edges using its vertices. + for (my $e = 0; $e < $nedges; $e++) + { + my $v = $E[$e]{left}; + $V[$v]{edges}->{$e} = 1; + + $v = $E[$e]{right}; + $V[$v]{edges}->{$e} = 1; + } + + # Now we attempt to prove the graph acyclic. + # A cycle-free graph is either empty or has some vertex of degree 1. + # Removing the edge attached to that vertex doesn't change this property, + # so doing that repeatedly will reduce the size of the graph. + # If the graph is empty at the end of the process, it was acyclic. + # We track the order of edge removal so that the next phase can process + # them in reverse order of removal. + my @output_order = (); + + # Consider each vertex as a possible starting point for edge-removal. + for (my $startv = 0; $startv < $nverts; $startv++) + { + my $v = $startv; + + # If vertex v is of degree 1 (i.e. exactly 1 edge connects to it), + # remove that edge, and then consider the edge's other vertex to see + # if it is now of degree 1. The inner loop repeats until reaching a + # vertex not of degree 1. + while (scalar(keys(%{ $V[$v]{edges} })) == 1) + { + # Unlink its only edge. + my $e = (keys(%{ $V[$v]{edges} }))[0]; + delete($V[$v]{edges}->{$e}); + + # Unlink the edge from its other vertex, too. + my $v2 = $E[$e]{left}; + $v2 = $E[$e]{right} if ($v2 == $v); + delete($V[$v2]{edges}->{$e}); + + # Push e onto the front of the output-order list. + unshift @output_order, $e; + + # Consider v2 on next iteration of inner loop. + $v = $v2; + } + } + + # We succeeded only if all edges were removed from the graph. + return () if (scalar(@output_order) != $nedges); + + # OK, build the hash table of size $nverts. + my @hashtab = (0) x $nverts; + # We need a "visited" flag array in this step, too. + my @visited = (0) x $nverts; + + # The goal is that for any key, the sum of the hash table entries for + # its first and second hash values is the desired output (i.e., the key + # number). By assigning hash table values in the selected edge order, + # we can guarantee that that's true. This works because the edge first + # removed from the graph (and hence last to be visited here) must have + # at least one vertex it shared with no other edge; hence it will have at + # least one vertex (hashtable entry) still unvisited when we reach it here, + # and we can assign that unvisited entry a value that makes the sum come + # out as we wish. By induction, the same holds for all the other edges. 
+ foreach my $e (@output_order) + { + my $l = $E[$e]{left}; + my $r = $E[$e]{right}; + if (!$visited[$l]) + { + # $hashtab[$r] might be zero, or some previously assigned value. + $hashtab[$l] = $e - $hashtab[$r]; + } + else + { + die "oops, doubly used hashtab entry" if $visited[$r]; + # $hashtab[$l] might be zero, or some previously assigned value. + $hashtab[$r] = $e - $hashtab[$l]; + } + # Now freeze both of these hashtab entries. + $visited[$l] = 1; + $visited[$r] = 1; + } + + # Detect range of values needed in hash table. + my $hmin = $nedges; + my $hmax = 0; + for (my $v = 0; $v < $nverts; $v++) + { + $hmin = $hashtab[$v] if $hashtab[$v] < $hmin; + $hmax = $hashtab[$v] if $hashtab[$v] > $hmax; + } + + # Choose width of hashtable entries. In addition to the actual values, + # we need to be able to store a flag for unused entries, and we wish to + # have the property that adding any other entry value to the flag gives + # an out-of-range result (>= $nedges). + my $elemtype; + my $unused_flag; + + if ( $hmin >= -0x7F + && $hmax <= 0x7F + && $hmin + 0x7F >= $nedges) + { + # int8 will work + $elemtype = 'int8'; + $unused_flag = 0x7F; + } + elsif ($hmin >= -0x7FFF + && $hmax <= 0x7FFF + && $hmin + 0x7FFF >= $nedges) + { + # int16 will work + $elemtype = 'int16'; + $unused_flag = 0x7FFF; + } + elsif ($hmin >= -0x7FFFFFFF + && $hmax <= 0x7FFFFFFF + && $hmin + 0x3FFFFFFF >= $nedges) + { + # int32 will work + $elemtype = 'int32'; + $unused_flag = 0x3FFFFFFF; + } + else + { + die "hash table values too wide"; + } + + # Set any unvisited hashtable entries to $unused_flag. + for (my $v = 0; $v < $nverts; $v++) + { + $hashtab[$v] = $unused_flag if !$visited[$v]; + } + + return ($elemtype, \@hashtab); + } + + 1; diff --git a/src/tools/gen_keywordlist.pl b/src/tools/gen_keywordlist.pl index d764aff..2744e1d 100644 *** a/src/tools/gen_keywordlist.pl --- b/src/tools/gen_keywordlist.pl *************** *** 14,19 **** --- 14,25 ---- # variable named according to the -v switch ("ScanKeywords" by default). # The variable is marked "static" unless the -e switch is given. # + # ScanKeywordList uses hash-based lookup, so this script also selects + # a minimal perfect hash function for the keyword set, and emits a + # static hash function that is referenced in the ScanKeywordList struct. + # The hash function is case-insensitive unless --case is specified. + # Note that case insensitivity assumes all-ASCII keywords! + # # # Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group # Portions Copyright (c) 1994, Regents of the University of California *************** *** 25,39 **** use strict; use warnings; use Getopt::Long; my $output_path = ''; my $extern = 0; my $varname = 'ScanKeywords'; GetOptions( ! 'output:s' => \$output_path, ! 'extern' => \$extern, ! 'varname:s' => \$varname) || usage(); my $kw_input_file = shift @ARGV || die "No input file.\n"; --- 31,48 ---- use strict; use warnings; use Getopt::Long; + use PerfectHash; my $output_path = ''; my $extern = 0; + my $case_sensitive = 0; my $varname = 'ScanKeywords'; GetOptions( ! 'output:s' => \$output_path, ! 'extern' => \$extern, ! 'case-sensitive' => \$case_sensitive, ! 'varname:s' => \$varname) || usage(); my $kw_input_file = shift @ARGV || die "No input file.\n"; *************** while (<$kif>) *** 87,93 **** --- 96,117 ---- } } + # When being case-insensitive, insist that the input be all-lower-case. 
+ if (!$case_sensitive) + { + foreach my $kw (@keywords) + { + die qq|The keyword "$kw" is not lower-case in $kw_input_file\n| + if ($kw ne lc $kw); + } + } + # Error out if the keyword names are not in ASCII order. + # + # While this isn't really necessary with hash-based lookup, it's still + # helpful because it provides a cheap way to reject duplicate keywords. + # Also, insisting on sorted order ensures that code that scans the keyword + # table linearly will see the keywords in a canonical order. for my $i (0..$#keywords - 1) { die qq|The keyword "$keywords[$i + 1]" is out of order in $kw_input_file\n| *************** print $kwdef "};\n\n"; *** 128,142 **** printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords; # Emit the struct that wraps all this lookup info into one variable. ! print $kwdef "static " if !$extern; printf $kwdef "const ScanKeywordList %s = {\n", $varname; printf $kwdef qq|\t%s_kw_string,\n|, $varname; printf $kwdef qq|\t%s_kw_offsets,\n|, $varname; printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname; printf $kwdef qq|\t%d\n|, $max_len; ! print $kwdef "};\n\n"; printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename; --- 152,176 ---- printf $kwdef "#define %s_NUM_KEYWORDS %d\n\n", uc $varname, scalar @keywords; + # Emit the definition of the hash function. + + my $funcname = $varname . "_hash_func"; + + my $f = PerfectHash::generate_hash_function(\@keywords, $funcname, + case_insensitive => !$case_sensitive); + + printf $kwdef qq|static %s\n|, $f; + # Emit the struct that wraps all this lookup info into one variable. ! printf $kwdef "static " if !$extern; printf $kwdef "const ScanKeywordList %s = {\n", $varname; printf $kwdef qq|\t%s_kw_string,\n|, $varname; printf $kwdef qq|\t%s_kw_offsets,\n|, $varname; + printf $kwdef qq|\t%s,\n|, $funcname; printf $kwdef qq|\t%s_NUM_KEYWORDS,\n|, uc $varname; printf $kwdef qq|\t%d\n|, $max_len; ! printf $kwdef "};\n\n"; printf $kwdef "#endif\t\t\t\t\t\t\t/* %s_H */\n", uc $base_filename; *************** Usage: gen_keywordlist.pl [--output/-o < *** 148,153 **** --- 182,188 ---- --output Output directory (default '.') --varname Name for ScanKeywordList variable (default 'ScanKeywords') --extern Allow the ScanKeywordList variable to be globally visible + --case Keyword matching is to be case-sensitive gen_keywordlist.pl transforms a list of keywords into a ScanKeywordList. The output filename is derived from the input file by inserting _d, diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm index 937bf18..8f54e45 100644 *** a/src/tools/msvc/Solution.pm --- b/src/tools/msvc/Solution.pm *************** sub GenerateFiles *** 414,420 **** 'src/include/parser/kwlist.h')) { print "Generating kwlist_d.h...\n"; ! system('perl src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h'); } if (IsNewer( --- 414,420 ---- 'src/include/parser/kwlist.h')) { print "Generating kwlist_d.h...\n"; ! system('perl -I src/tools src/tools/gen_keywordlist.pl --extern -o src/common src/include/parser/kwlist.h'); } if (IsNewer( *************** sub GenerateFiles *** 426,433 **** { print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n"; chdir('src/pl/plpgsql/src'); ! system('perl ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h'); ! 
system('perl ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h'); chdir('../../../..'); } --- 426,433 ---- { print "Generating pl_reserved_kwlist_d.h and pl_unreserved_kwlist_d.h...\n"; chdir('src/pl/plpgsql/src'); ! system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ReservedPLKeywords pl_reserved_kwlist.h'); ! system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname UnreservedPLKeywords pl_unreserved_kwlist.h'); chdir('../../../..'); } *************** sub GenerateFiles *** 440,447 **** { print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n"; chdir('src/interfaces/ecpg/preproc'); ! system('perl ../../../tools/gen_keywordlist.pl --varname ScanCKeywords c_kwlist.h'); ! system('perl ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h'); chdir('../../../..'); } --- 440,447 ---- { print "Generating c_kwlist_d.h and ecpg_kwlist_d.h...\n"; chdir('src/interfaces/ecpg/preproc'); ! system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ScanCKeywords --case c_kwlist.h'); ! system('perl -I ../../../tools ../../../tools/gen_keywordlist.pl --varname ScanECPGKeywords ecpg_kwlist.h'); chdir('../../../..'); }
I wrote: > Also, I fail to understand why fmgr_builtin_oid_index has 10000 entries > anyway. We could easily have fmgrtab.c expose the last actually assigned > builtin function OID (presently 6121) and make the index array only > that big, which just about eliminates the space advantage completely. Concretely, like the attached. We could make the index table still smaller if we wanted to reassign a couple dozen high-numbered functions down to lower OIDs, but I dunno if it's worth the trouble. It certainly isn't from a performance standpoint, because those unused entry ranges will never be touched in normal usage; but it'd make the server executable a couple KB smaller. regards, tom lane diff --git a/src/backend/utils/Gen_fmgrtab.pl b/src/backend/utils/Gen_fmgrtab.pl index cafe408..f970940 100644 *** a/src/backend/utils/Gen_fmgrtab.pl --- b/src/backend/utils/Gen_fmgrtab.pl *************** foreach my $datfile (@input_files) *** 80,90 **** $catalog_data{$catname} = Catalog::ParseData($datfile, $schema, 0); } - # Fetch some values for later. - my $FirstGenbkiObjectId = - Catalog::FindDefinedSymbol('access/transam.h', $include_path, - 'FirstGenbkiObjectId'); - # Collect certain fields from pg_proc.dat. my @fmgr = (); --- 80,85 ---- *************** my %bmap; *** 225,230 **** --- 220,226 ---- $bmap{'t'} = 'true'; $bmap{'f'} = 'false'; my @fmgr_builtin_oid_index; + my $last_builtin_oid = 0; my $fmgr_count = 0; foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr) { *************** foreach my $s (sort { $a->{oid} <=> $b-> *** 232,237 **** --- 228,234 ---- " { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }"; $fmgr_builtin_oid_index[ $s->{oid} ] = $fmgr_count++; + $last_builtin_oid = $s->{oid}; if ($fmgr_count <= $#fmgr) { *************** foreach my $s (sort { $a->{oid} <=> $b-> *** 244,274 **** } print $tfh "};\n"; ! print $tfh qq| const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)); ! |; # Create fmgr_builtins_oid_index table. ! # ! # Note that the array has to be filled up to FirstGenbkiObjectId, ! # as we can't rely on zero initialization as 0 is a valid mapping. ! print $tfh qq| ! const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId] = { ! |; ! for (my $i = 0; $i < $FirstGenbkiObjectId; $i++) { my $oid = $fmgr_builtin_oid_index[$i]; ! # fmgr_builtin_oid_index is sparse, map nonexistant functions to # InvalidOidBuiltinMapping if (not defined $oid) { $oid = 'InvalidOidBuiltinMapping'; } ! if ($i + 1 == $FirstGenbkiObjectId) { print $tfh " $oid\n"; } --- 241,270 ---- } print $tfh "};\n"; ! printf $tfh qq| const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin)); ! ! const Oid fmgr_last_builtin_oid = %u; ! |, $last_builtin_oid; # Create fmgr_builtins_oid_index table. ! printf $tfh qq| ! const uint16 fmgr_builtin_oid_index[%u] = { ! |, $last_builtin_oid + 1; ! for (my $i = 0; $i <= $last_builtin_oid; $i++) { my $oid = $fmgr_builtin_oid_index[$i]; ! # fmgr_builtin_oid_index is sparse, map nonexistent functions to # InvalidOidBuiltinMapping if (not defined $oid) { $oid = 'InvalidOidBuiltinMapping'; } ! if ($i == $last_builtin_oid) { print $tfh " $oid\n"; } diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c index b41649f..506eeef 100644 *** a/src/backend/utils/fmgr/fmgr.c --- b/src/backend/utils/fmgr/fmgr.c *************** fmgr_isbuiltin(Oid id) *** 75,86 **** uint16 index; /* fast lookup only possible if original oid still assigned */ ! 
if (id >= FirstGenbkiObjectId) return NULL; /* * Lookup function data. If there's a miss in that range it's likely a ! * nonexistant function, returning NULL here will trigger an ERROR later. */ index = fmgr_builtin_oid_index[id]; if (index == InvalidOidBuiltinMapping) --- 75,86 ---- uint16 index; /* fast lookup only possible if original oid still assigned */ ! if (id > fmgr_last_builtin_oid) return NULL; /* * Lookup function data. If there's a miss in that range it's likely a ! * nonexistent function, returning NULL here will trigger an ERROR later. */ index = fmgr_builtin_oid_index[id]; if (index == InvalidOidBuiltinMapping) diff --git a/src/include/utils/fmgrtab.h b/src/include/utils/fmgrtab.h index a778f88..e981f34 100644 *** a/src/include/utils/fmgrtab.h --- b/src/include/utils/fmgrtab.h *************** extern const FmgrBuiltin fmgr_builtins[] *** 36,46 **** extern const int fmgr_nbuiltins; /* number of entries in table */ /* ! * Mapping from a builtin function's oid to the index in the fmgr_builtins ! * array. */ #define InvalidOidBuiltinMapping PG_UINT16_MAX ! extern const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId]; #endif /* FMGRTAB_H */ --- 36,48 ---- extern const int fmgr_nbuiltins; /* number of entries in table */ + extern const Oid fmgr_last_builtin_oid; /* highest function OID in table */ + /* ! * Mapping from a builtin function's OID to its index in the fmgr_builtins ! * array. This is indexed from 0 through fmgr_last_builtin_oid. */ #define InvalidOidBuiltinMapping PG_UINT16_MAX ! extern const uint16 fmgr_builtin_oid_index[]; #endif /* FMGRTAB_H */
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Alvaro Herrera
On 2019-Jan-09, Tom Lane wrote: > We could make the index table still smaller if we wanted to reassign > a couple dozen high-numbered functions down to lower OIDs, but I dunno > if it's worth the trouble. It certainly isn't from a performance > standpoint, because those unused entry ranges will never be touched > in normal usage; but it'd make the server executable a couple KB smaller. Or two couples KB smaller, if we abandoned the idea that pg_proc OIDs must not collide with those in any other catalog, and we renumbered all functions to start at OID 1 or so. duplicate_oids would complain about that, though, I suppose ... and nobody who has ever hardcoded a function OID would love this idea much. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Andres Freund
Hi, On 2019-01-09 14:44:24 -0500, Tom Lane wrote: > I wrote: > > Also, I fail to understand why fmgr_builtin_oid_index has 10000 entries > > anyway. We could easily have fmgrtab.c expose the last actually assigned > > builtin function OID (presently 6121) and make the index array only > > that big, which just about eliminates the space advantage completely. > > Concretely, like the attached. Seems like a good improvement. > We could make the index table still smaller if we wanted to reassign > a couple dozen high-numbered functions down to lower OIDs, but I dunno > if it's worth the trouble. It certainly isn't from a performance > standpoint, because those unused entry ranges will never be touched > in normal usage; but it'd make the server executable a couple KB smaller. Probably indeed not worth it. I'm not 100% convinced on the performance POV, but in contrast to the earlier binary search, either approach is fast enough that it's probably hard to measure any difference. > diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c index b41649f..506eeef 100644 > --- a/src/backend/utils/fmgr/fmgr.c > +++ b/src/backend/utils/fmgr/fmgr.c > @@ -75,12 +75,12 @@ fmgr_isbuiltin(Oid id) > uint16 index; > > /* fast lookup only possible if original oid still assigned */ > - if (id >= FirstGenbkiObjectId) > + if (id > fmgr_last_builtin_oid) > return NULL; An extern reference here will make the code a bit less efficient, but it's probably not worth generating a header with a define for it instead... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-01-09 14:44:24 -0500, Tom Lane wrote: >> /* fast lookup only possible if original oid still assigned */ >> - if (id >= FirstGenbkiObjectId) >> + if (id > fmgr_last_builtin_oid) >> return NULL; > An extern reference here will make the code a bit less efficient, but > it's probably not worth generating a header with a define for it > instead... Yeah, also that would be significantly more fragile, in that it'd be hard to be sure where that OID had propagated to when rebuilding. We haven't chosen to make fmgr_nbuiltins a #define either, and I think it's best to treat this the same way. regards, tom lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2019-Jan-09, Tom Lane wrote: >> We could make the index table still smaller if we wanted to reassign >> a couple dozen high-numbered functions down to lower OIDs, but I dunno >> if it's worth the trouble. It certainly isn't from a performance >> standpoint, because those unused entry ranges will never be touched >> in normal usage; but it'd make the server executable a couple KB smaller. > Or two couples KB smaller, if we abandoned the idea that pg_proc OIDs > must not collide with those in any other catalog, and we renumbered all > functions to start at OID 1 or so. duplicate_oids would complain about > that, though, I suppose ... and nobody who has ever hardcoded a function > OID would love this idea much. I think that'd be a nonstarter for commonly-used functions. I'm guessing that pg_replication_origin_create() and so on, which are the immediate problem, haven't been around long enough or get used often enough for someone to have hard-coded their OIDs. But I could be wrong. (Speaking of which, I've been wondering for awhile if libpq ought not obtain the OIDs of lo_create and friends by #including fmgroids.h instead of doing a runtime query on every connection. If we did that, we'd be forever giving up the option to renumber them ... but do you really want to bet that nobody else has done this already in some other client code?) regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Andres Freund
Hi, On 2019-01-09 15:03:35 -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > On 2019-Jan-09, Tom Lane wrote: > >> We could make the index table still smaller if we wanted to reassign > >> a couple dozen high-numbered functions down to lower OIDs, but I dunno > >> if it's worth the trouble. It certainly isn't from a performance > >> standpoint, because those unused entry ranges will never be touched > >> in normal usage; but it'd make the server executable a couple KB smaller. > > > Or two couples KB smaller, if we abandoned the idea that pg_proc OIDs > > must not collide with those in any other catalog, and we renumbered all > > functions to start at OID 1 or so. duplicate_oids would complain about > > that, though, I suppose ... and nobody who has ever hardcoded a function > > OID would love this idea much. > > I think that'd be a nonstarter for commonly-used functions. I'm guessing > that pg_replication_origin_create() and so on, which are the immediate > problem, haven't been around long enough or get used often enough for > someone to have hard-coded their OIDs. But I could be wrong. I don't think it's likely that it'd be useful to hardcode them, and therefore hope that nobody would do so. I personally feel limited sympathy to people hardcoding oids across major versions. The benefits of making pg easier to maintain and more efficient seem higher than allowing for that. > (Speaking of which, I've been wondering for awhile if libpq ought not > obtain the OIDs of lo_create and friends by #including fmgroids.h > instead of doing a runtime query on every connection. If we did that, > we'd be forever giving up the option to renumber them ... but do you > really want to bet that nobody else has done this already in some > other client code?) I'm not enthusiastic about that. I kinda hope we're going to evolve that interface further, which'd make it version dependent anyway (we don't require all of them right now...). And it's not that expensive to query their oids once. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-01-09 15:03:35 -0500, Tom Lane wrote: >> (Speaking of which, I've been wondering for awhile if libpq ought not >> obtain the OIDs of lo_create and friends by #including fmgroids.h >> instead of doing a runtime query on every connection. If we did that, >> we'd be forever giving up the option to renumber them ... but do you >> really want to bet that nobody else has done this already in some >> other client code?) > I'm not enthusiastic about that. I kinda hope we're going to evolve that > interface further, which'd make it version dependent anyway (we don't > require all of them right now...). And it's not that expensive to query > their oids once. Version dependency doesn't seem like much of an argument: we'd just teach libpq to pay attention to the server version, which it knows anyway (and uses for other purposes already). But this is a bit off-topic for this thread, perhaps. regards, tom lane
Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)
From: Joerg Sonnenberger
On Wed, Jan 09, 2019 at 02:04:15PM -0500, Tom Lane wrote: > Also, in view of finding that the original multiplier choices failed > on the fmgr oid problem, I spent a little effort making the code > able to try more combinations of hash multipliers and seeds. It'd > be nice to have some theory rather than just heuristics about what > will work, though ... The theory is that the code needs two families of pair-wise independent hash functions to give the O(1) number of tries. For practical purposes, the Jenkins hash, or any cryptographic hash function, easily qualifies; the downside is that those hash functions are very expensive. A multiplicative hash, especially one using the high bits of the result (e.g. the top 32 bits of a 32x32->64 multiplication), satisfies the requirements, but for string input it would need a pair of constants per word. So the choice of a hash function family for a performance-sensitive part is a heuristic in the sense of trying to get away with as simple a function as possible. Joerg
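For illustration, the kind of multiplicative hash Joerg describes might be sketched like this in C; the technique is as he states it, but the code itself is not from the thread, and any particular multiplier would be chosen per family member:

#include <stdint.h>

/*
 * One member of a multiplicative hash family: multiply a 32-bit word by a
 * per-member constant into 64 bits and keep the high half, so that every
 * input bit can influence the result.  Choosing a different (odd) "mult"
 * selects a different member of the family.
 */
static inline uint32_t
mult_hash32(uint32_t x, uint32_t mult)
{
	return (uint32_t) (((uint64_t) x * mult) >> 32);
}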
On Wed, Jan 9, 2019 at 2:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: > > John Naylor <john.naylor@2ndquadrant.com> writes: > >> -There is a bit of a cognitive clash between $case_sensitive in > >> gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each > >> make sense in their own file, but might it be worth using one or the > >> other? > Working on the fmgr-oid-lookup idea gave me the thought that > PerfectHash.pm ought to support fixed-length keys. Rather than start > adding random parameters to the function, I borrowed an idea from > PostgresNode.pm and made the options be keyword-style parameters. Now > the impedance mismatch about case sensitivity is handled with > > my $f = PerfectHash::generate_hash_function(\@keywords, $funcname, > case_insensitive => !$case_sensitive); > > which is at least a little clearer than before, though I'm not sure > if it entirely solves the problem. It's a bit clearer, but thinking about this some more, it makes sense for gen_keywordlist.pl to use $case_insensitive, because right now every instance of the var is "!$case_sensitive". In the attached (on top of v4), I change the command line option to --citext, and add the ability to negate it within the option, as '--no-citext'. It's kind of a double negative for the C-keywords invocation, but we can have the option for both cases, so we don't need to worry about what the default is (which is case_insensitive=1). -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 8, 2019 at 5:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > John Naylor <john.naylor@2ndquadrant.com> writes: > > In the committed keyword patch, I noticed that in common/keywords.c, > > the array length is defined with > > ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS] > > but other keyword arrays just have ...[]. Is there a reason for the difference? > > The length macro was readily available there so I used it. AFAIR > that wasn't true elsewhere, though I might've missed something. > It's pretty much just belt-and-suspenders coding anyway, since all > those arrays are machine generated ... I tried using the available num_keywords macro in plpgsql and it worked fine, but it makes the lines really long. Alternatively, as in the attached, we could remove the single use of the core macro and maybe add comments to the generated magic numbers. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 9, 2019 at 2:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > [patch to shrink oid index] It would help maintain its newfound svelteness if we warned when a higher OID was assigned, as in the attached. I used 6200 as a soft limit, but that could be anything similar. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
John Naylor <john.naylor@2ndquadrant.com> writes: > On Wed, Jan 9, 2019 at 2:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> [patch to shrink oid index] > It would help maintaining its newfound sveltness if we warned if a > higher oid was assigned, as in the attached. I used 6200 as a soft > limit, but that could be anything similiar. I think the reason we have this issue is that people tend to use high OIDs during development of a patch, so that their elbows won't be joggled by unrelated changes. Then sometimes they forget to renumber them down before committing. A warning like this would lead to lots of noise during the development stage, which nobody would thank us for. If we could find a way to notice this only when we were about to commit, it'd be good .. but I don't have an idea about a nice way to do that. (No, I don't want a commit hook on gitmaster; that's warning too late, which is not better.) regards, tom lane
John Naylor <john.naylor@2ndquadrant.com> writes: > On Wed, Jan 9, 2019 at 2:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Now the impedance mismatch about case sensitivity is handled with >> my $f = PerfectHash::generate_hash_function(\@keywords, $funcname, >> case_insensitive => !$case_sensitive); >> which is at least a little clearer than before, though I'm not sure >> if it entirely solves the problem. > It's a bit clearer, but thinking about this some more, it makes sense > for gen_keywordlist.pl to use $case_insensitive, because right now > every instance of the var is "!$case_sensitive". In the attached (on > top of v4), I change the command line option to --citext, and add the > ability to negate it within the option, as '--no-citext'. It's kind of > a double negative for the C-keywords invocation, but we can have the > option for both cases, so we don't need to worry about what the > default is (which is case_insensitive=1). Ah, I didn't realize that Getopt allows having a boolean option defaulting to "on". That makes it more practical to do something here. I'm not in love with "citext" as the option name, though ... that has little to recommend it except brevity, which is not a property we really need here. We could go with "[no-]case-insensitive", perhaps. Or "[no-]case-fold", which is at least a little shorter and less double-negative-y. regards, tom lane
John Naylor <john.naylor@2ndquadrant.com> writes: > On Tue, Jan 8, 2019 at 5:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The length macro was readily available there so I used it. AFAIR >> that wasn't true elsewhere, though I might've missed something. >> It's pretty much just belt-and-suspenders coding anyway, since all >> those arrays are machine generated ... > I tried using the available num_keywords macro in plpgsql and it > worked fine, but it makes the lines really long. Alternatively, as in > the attached, we could remove the single use of the core macro and > maybe add comments to the generated magic numbers. Meh, I'm not excited about removing the option just because there's only one use of it now. There might be more-compelling uses later. regards, tom lane
On Wed, Jan 9, 2019 at 5:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > really need here. We could go with "[no-]case-insensitive", perhaps. > Or "[no-]case-fold", which is at least a little shorter and less > double-negative-y. I'd be in favor of --[no-]case-fold. On Tue, Jan 8, 2019 at 5:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I improved the comment about how come the hash table entry assignment > works. I've gone over the algorithm in more detail and I don't see any nicer way to write it. This comment in PerfectHash.pm: (It does change '_', else we could just skip adjusting # $cn here at all, for typical keyword strings.) ...seems a bit out of place in the module, because of its reference to keywords, of interest right now to its only caller. Maybe a bit of context here. (I also don't quite understand why we could hypothetically skip the adjustment.) Lastly, the keyword headers still have a dire warning about ASCII order and binary search. Those could be softened to match the comment in gen_keywordlist.pl. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
John Naylor <john.naylor@2ndquadrant.com> writes: > On Wed, Jan 9, 2019 at 5:33 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> really need here. We could go with "[no-]case-insensitive", perhaps. >> Or "[no-]case-fold", which is at least a little shorter and less >> double-negative-y. > I'd be in favor of --[no-]case-fold. Yeah, I like that better too; I've been having to stop and think every time as to which direction is which with the [in]sensitive terminology. I'll make it "case-fold" throughout. > This comment in PerfectHash.pm: > (It does change '_', else we could just skip adjusting > # $cn here at all, for typical keyword strings.) > ...seems a bit out of place in the module, because of its reference to > keywords, of interest right now to its only caller. Maybe a bit of > context here. (I also don't quite understand why we could > hypothetically skip the adjustment.) Were it not for the underscore case, we could plausibly assume that the supplied keywords are already all-lower-case and don't need any further folding. But I agree that this comment is probably more confusing than helpful; it's easier just to see that the code is applying the same transform as the runtime lookup will do. > Lastly, the keyword headers still have a dire warning about ASCII > order and binary search. Those could be softened to match the comment > in gen_keywordlist.pl. Agreed, will do. Thanks for reviewing! regards, tom lane
From: Joel Jacobson <joel@trustly.com>
Many thanks for working on this; it's amazing work, and it's really nice that you made it a separate reusable Perl module.
The generated hash functions read one character at a time. I've seen a performance trick in other hash functions [1]
that instead reads multiple bytes per iteration
and handles the remaining bytes after the loop.
I've done some testing, and it looks like a ~30% speed-up of the generated
ScanKeywords_hash_func() function would be possible.
If you think this approach is promising, I would be happy to prepare a patch for it,
but I wanted to check first that this idea hasn't already been considered and ruled out
for some technical reason I've failed to see. Is there one?
For this to work you would need to use larger constants for $hash_mult1 and $hash_mult2, though.
I've successfully used these values:

$hash_mult1: 0x2c1b3c6d
$hash_mult2: 0x297a2d39, 0x85ebca6b, 0xc2b2ae35, 0x7feb352d, 0x846ca68b

(In decimal, 0x2c1b3c6d is the 739982445 in the generated code below, and 0x85ebca6b is the 2246822507.)
Here is the idea:
Generated C code:

	for (; keylen >= 4; keylen -= 4, k += 4)
	{
		uint32_t v;

		memcpy(&v, k, 4);
		v |= 0x20202020;
		a = a * 739982445 + v;
		b = b * 2246822507 + v;
	}

	uint32_t v = 0;

	switch (keylen)
	{
		case 3:
			memcpy(&v, k, 3);
			v |= 0x202020;
			break;
		case 2:
			memcpy(&v, k, 2);
			v |= 0x2020;
			break;
		case 1:
			memcpy(&v, k, 1);
			v |= 0x20;
			break;
	}

	a = a * 739982445 + v;
	b = b * 2246822507 + v;
	return h[a % 883] + h[b % 883];
(Reading 8 bytes at a time instead would perhaps be a win, since some keywords are quite long.)
Perl code:

	sub _calc_hash
	{
		my ($key, $mult, $seed) = @_;

		my $result = $seed;
		my $i      = 0;
		my $keylen = length($key);

		for (; $keylen >= 4; $keylen -= 4, $i += 4)
		{
			my $cn = (ord(substr($key, $i + 0, 1)) << 0)
			       | (ord(substr($key, $i + 1, 1)) << 8)
			       | (ord(substr($key, $i + 2, 1)) << 16)
			       | (ord(substr($key, $i + 3, 1)) << 24);
			$cn |= 0x20202020 if $case_fold;
			$result = ($result * $mult + $cn) % 4294967296;
		}

		my $cn = 0;
		if ($keylen == 3)
		{
			$cn = (ord(substr($key, $i + 0, 1)) << 0)
			    | (ord(substr($key, $i + 1, 1)) << 8)
			    | (ord(substr($key, $i + 2, 1)) << 16);
			$cn |= 0x202020 if $case_fold;
		}
		elsif ($keylen == 2)
		{
			$cn = (ord(substr($key, $i + 0, 1)) << 0)
			    | (ord(substr($key, $i + 1, 1)) << 8);
			$cn |= 0x2020 if $case_fold;
		}
		elsif ($keylen == 1)
		{
			$cn = (ord(substr($key, $i + 0, 1)) << 0);
			$cn |= 0x20 if $case_fold;
		}
		$result = ($result * $mult + $cn) % 4294967296;
		return $result;
	}

[1] https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h#L29
Joel Jacobson <joel@trustly.com> writes: > I've seen a performance trick in other hash functions [1] > to instead read multiple bytes in each iteration, > and then handle the remaining bytes after the loop. > [1] https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h#L29 I can't get very excited about this, seeing that we're only going to be hashing short strings. I don't really believe your 30% number for short strings; and even if I did, there's no evidence that the hash functions are worth any further optimization in terms of our overall performance. Also, as best I can tell, the approach you propose would result in an endianness dependence, meaning we'd have to have separate lookup tables for BE and LE machines. That's not a dealbreaker perhaps, but it is certainly another point on the "it's not worth it" side of the argument. regards, tom lane
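(To make the endianness point concrete, a small illustration of my own: the same four bytes load as different 32-bit values depending on byte order, so lookup tables generated for one byte order would not match hashes computed on the other:)

	# "V" unpacks a 32-bit little-endian value, "N" a big-endian one.
	my $le = unpack('V', 'abcd');
	my $be = unpack('N', 'abcd');
	printf "LE: 0x%08X  BE: 0x%08X\n", $le, $be;
	# prints: LE: 0x64636261  BE: 0x61626364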
From: Joel Jacobson <joel@trustly.com>
On Wed, Mar 20, 2019 at 9:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joel Jacobson <joel@trustly.com> writes:
>> I've seen a performance trick in other hash functions [1]
>> to instead read multiple bytes in each iteration,
>> and then handle the remaining bytes after the loop.
>> [1] https://github.com/wangyi-fudan/wyhash/blob/master/wyhash.h#L29
> I can't get very excited about this, seeing that we're only going to
> be hashing short strings. I don't really believe your 30% number
> for short strings; and even if I did, there's no evidence that the
> hash functions are worth any further optimization in terms of our
> overall performance.
I went ahead and tested this approach anyway, since I need this algorithm in a completely different project.
The benchmark below shows stats for up to three keywords per length, compiled with -O2:
$ c++ -O2 -std=c++14 -o bench_perfect_hash bench_perfect_hash.cc
$ ./bench_perfect_hash
keyword            length   char-a-time (ns)   word-a-time (ns)   diff (%)
as                      2               3.30               2.62        -21
at                      2               3.54               2.66        -25
by                      2               3.30               2.59        -22
add                     3               4.01               3.15        -21
all                     3               4.04               3.11        -23
and                     3               3.84               3.11        -19
also                    4               4.50               3.17        -30
both                    4               4.49               3.06        -32
call                    4               4.95               3.42        -31
abort                   5               6.09               4.02        -34
admin                   5               5.26               3.65        -31
after                   5               5.18               3.76        -27
access                  6               5.97               3.91        -34
action                  6               5.86               3.89        -34
always                  6               6.10               3.77        -38
analyse                 7               6.67               4.64        -30
analyze                 7               7.09               4.87        -31
between                 7               7.02               4.66        -34
absolute                8               7.49               3.82        -49
backward                8               7.13               3.88        -46
cascaded                8               7.23               4.17        -42
aggregate               9               8.04               4.49        -44
assertion               9               7.98               4.52        -43
attribute               9               8.03               4.44        -45
assignment             10               8.58               4.67        -46
asymmetric             10               9.07               4.57        -50
checkpoint             10               9.15               4.53        -51
constraints            11               9.58               5.14        -46
insensitive            11               9.62               5.30        -45
publication            11              10.30               5.60        -46
concurrently           12              10.36               4.81        -54
current_date           12              11.17               5.48        -51
current_role           12              11.15               5.10        -54
authorization          13              11.87               5.50        -54
configuration          13              11.50               5.51        -52
xmlattributes          13              11.72               5.66        -52
current_schema         14              12.17               5.58        -54
localtimestamp         14              11.78               5.46        -54
characteristics        15              12.77               5.97        -53
current_catalog        15              12.65               5.87        -54
current_timestamp      17              14.19               6.12        -57
> Also, as best I can tell, the approach you propose would result
> in an endianness dependence, meaning we'd have to have separate
> lookup tables for BE and LE machines. That's not a dealbreaker
> perhaps, but it is certainly another point on the "it's not worth it"
> side of the argument.
I can see how the same problem has been worked around in e.g. pg_crc32c.h:
#ifdef WORDS_BIGENDIAN
#define FIN_CRC32C(crc) ((crc) = pg_bswap32(crc) ^ 0xFFFFFFFF)
#else
#define FIN_CRC32C(crc) ((crc) ^= 0xFFFFFFFF)
#endif
So I used the same trick in PerfectHash.pm:
$f .= sprintf "#ifdef WORDS_BIGENDIAN\n";
$f .= sprintf "\t\tc4 = pg_bswap32(c4);\n";
$f .= sprintf "#endif\n";
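(Filling in the surrounding emission for illustration; this is only a sketch, with names such as $hash_mult1, $case_fold, and the generated variable c4 assumed from the fragments above, and c4's declaration assumed to be emitted elsewhere:)

	my $f = '';
	my ($hash_mult1, $hash_mult2) = (0x2c1b3c6d, 0x297a2d39);
	my $case_fold = 1;

	# Emit the body of the word-at-a-time loop, normalizing byte order
	# so the same lookup tables work on BE and LE machines.
	$f .= "\t\tmemcpy(&c4, k, 4);\n";
	$f .= "#ifdef WORDS_BIGENDIAN\n";
	$f .= "\t\tc4 = pg_bswap32(c4);\n";
	$f .= "#endif\n";
	$f .= "\t\tc4 |= 0x20202020;\n" if $case_fold;
	$f .= sprintf "\t\ta = a * %d + c4;\n", $hash_mult1;
	$f .= sprintf "\t\tb = b * %d + c4;\n", $hash_mult2;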
I've also tried to measure the overall effect by hacking postgres.c:
+	struct timespec start, stop;
+
+	clock_gettime(CLOCK_REALTIME, &start);
+	for (int i = 0; i < 100000; i++)
+	{
+		List	   *parsetree_list2;
+		MemoryContext oldcontext2;
+
+		oldcontext2 = MemoryContextSwitchTo(MessageContext);
+		parsetree_list2 = pg_parse_query(query_string);
+		MemoryContextSwitchTo(oldcontext2);
+//		MemoryContextReset(MessageContext);
+		CHECK_FOR_INTERRUPTS();
+	}
+	clock_gettime(CLOCK_REALTIME, &stop);
+	printf("Bench: %f\n",
+		   (stop.tv_sec - start.tv_sec) +
+		   (double) (stop.tv_nsec - start.tv_nsec) / 1000000000L);
I measured the time for a big query found here: https://wiki.postgresql.org/wiki/Index_Maintenance
I might be doing something wrong, but it looks like the overall effect is a ~3% improvement.