Thread: unused_oids script is broken with bsd sed
Hi. BSD sed in macOS doesn't understand word boundary operator "\b". So after 372728b0d49 unused_oids doesn’t match all oids in new *.dat files marking all of them as unused. It is possible to write more portable regexps, e.g. change "\b" to something like "^.*{*.*", but it seems easier for feature use to just rewrite unused_oids in perl to match duplicate_oids. Also add in-place complain about duplicates instead of running uniq through oids array. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 04/25/2018 06:22 AM, Stas Kelvich wrote: > Hi. > > BSD sed in macOS doesn't understand word boundary operator "\b". So after > 372728b0d49 unused_oids doesn’t match all oids in new *.dat files marking > all of them as unused. > > It is possible to write more portable regexps, e.g. change "\b" to > something like "^.*{*.*", but it seems easier for feature use to just > rewrite unused_oids in perl to match duplicate_oids. Also add in-place > complain about duplicates instead of running uniq through oids array. > +many for rewriting in perl. Do you want to have a go at that? If not I will. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > +many for rewriting in perl. Do you want to have a go at that? If not I > will. +1 ... I was mildly astonished that this didn't already have to happen as part of the bootstrap data conversion effort. It's certainly not hard to imagine future extensions to the .dat file format that would break this script, and duplicate_oids too. I think we should rewrite both of them to use the Catalog.pm infrastructure. regards, tom lane
> On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> +many for rewriting in perl. Do you want to have a go at that? If not I >> will. > > +1 ... I was mildly astonished that this didn't already have to happen > as part of the bootstrap data conversion effort. It's certainly not > hard to imagine future extensions to the .dat file format that would > break this script, and duplicate_oids too. I think we should rewrite > both of them to use the Catalog.pm infrastructure. > > regards, tom lane Hm, I attached patch in first message, but seems that my mail client again messed with attachment. However archive caught it: https://www.postgresql.org/message-id/attachment/60920/0001-Rewrite-unused_oids-in-perl.patch -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
> On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think we should rewrite > both of them to use the Catalog.pm infrastructure. Okay, seems reasonable. I'll put shared code in Catalog.pm and update patch. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 4/25/18, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: >> On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think we should rewrite >> both of them to use the Catalog.pm infrastructure. > > Okay, seems reasonable. I'll put shared code in Catalog.pm and > update patch. I don't think you need any new code in Catalog.pm, I believe the suggestion was just to use that module as a stable interface to the data. Looking at your patch, I'll mention that we have an idiom for extracting #define'd OID symbols, e.g.: my $FirstBootstrapObjectId = Catalog::FindDefinedSymbol( 'access/transam.h', \@include_path, 'FirstBootstrapObjectId'); This is preferred over using awk, which would have its own portability issues (Windows for starters). While I'm thinking out loud, it might be worthwhile to patch genbki.pl for the duplicate test, since they're run at the same time anyway (see catalog/Makefile), and we've already read all the data. -John Naylor
On Wed, Apr 25, 2018 at 09:55:54PM +0700, John Naylor wrote: > On 4/25/18, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > >> On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I think we should rewrite > >> both of them to use the Catalog.pm infrastructure. > > > > Okay, seems reasonable. I'll put shared code in Catalog.pm and > > update patch. > > I don't think you need any new code in Catalog.pm, I believe the > suggestion was just to use that module as a stable interface to the > data. Looking at your patch, I'll mention that we have an idiom for > extracting #define'd OID symbols, e.g.: > > my $FirstBootstrapObjectId = Catalog::FindDefinedSymbol( > 'access/transam.h', \@include_path, 'FirstBootstrapObjectId'); While we're improving things, there's not really a reason this program should need to be run from any particular spot. It can detect where it is and act on that information. Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 04/25/2018 04:59 PM, David Fetter wrote: > While we're improving things, there's not really a reason this program > should need to be run from any particular spot. It can detect where it > is and act on that information. +1 I would appreciate this change, and if Stats does not feel like adding it to his patch I can write a patch for this myself. Andreas
On 04/25/2018 10:55 AM, John Naylor wrote: > On 4/25/18, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: >>> On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I think we should rewrite >>> both of them to use the Catalog.pm infrastructure. >> Okay, seems reasonable. I'll put shared code in Catalog.pm and >> update patch. > I don't think you need any new code in Catalog.pm, I believe the > suggestion was just to use that module as a stable interface to the > data. Looking at your patch, I'll mention that we have an idiom for > extracting #define'd OID symbols, e.g.: > > my $FirstBootstrapObjectId = Catalog::FindDefinedSymbol( > 'access/transam.h', \@include_path, 'FirstBootstrapObjectId'); > > This is preferred over using awk, which would have its own portability > issues (Windows for starters). > > While I'm thinking out loud, it might be worthwhile to patch genbki.pl > for the duplicate test, since they're run at the same time anyway (see > catalog/Makefile), and we've already read all the data. > The logic for getting the set of oids should be centralized, if not in Catalog.pm then in a script which serves both for dup0licate_oids and unused_oids. Here is something I cobbled together for the latter approach. It could probably improve by using Catalog::FindDefinedSymbol() cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
> On 25 Apr 2018, at 17:55, John Naylor <jcnaylor@gmail.com> wrote: > > On 4/25/18, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: >>> On 25 Apr 2018, at 17:18, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I think we should rewrite >>> both of them to use the Catalog.pm infrastructure. >> >> Okay, seems reasonable. I'll put shared code in Catalog.pm and >> update patch. > > I don't think you need any new code in Catalog.pm, I believe the > suggestion was just to use that module as a stable interface to the > data. Looking at your patch, I'll mention that we have an idiom for > extracting #define'd OID symbols, e.g.: > > my $FirstBootstrapObjectId = Catalog::FindDefinedSymbol( > 'access/transam.h', \@include_path, 'FirstBootstrapObjectId'); > > This is preferred over using awk, which would have its own portability > issues (Windows for starters). > > While I'm thinking out loud, it might be worthwhile to patch genbki.pl > for the duplicate test, since they're run at the same time anyway (see > catalog/Makefile), and we've already read all the data. > > -John Naylor > New version is attached. I've put iterator into Catalog.pm to avoid copy-pasting it over two scripts. Also instead of greping through *.dat and *.h files I've used parsers provided in Catalog module. That way should be more sustainable to format changes. Anyone who wanted to help with scripts -- feel free to rewrite. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 4/25/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> +many for rewriting in perl. Do you want to have a go at that? If not I >> will. > > +1 ... I was mildly astonished that this didn't already have to happen > as part of the bootstrap data conversion effort. It's certainly not > hard to imagine future extensions to the .dat file format that would > break this script, and duplicate_oids too. I think we should rewrite > both of them to use the Catalog.pm infrastructure. If we're going to use Catalog.pm for that, it seems more convenient to expose toast and index oids directly rather than in strings formatted specifically for the bki file, as in the attached. Thoughts? -John Naylor
Attachment
John Naylor <jcnaylor@gmail.com> writes: > On 4/25/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think we should rewrite >> both of them to use the Catalog.pm infrastructure. > If we're going to use Catalog.pm for that, it seems more convenient to > expose toast and index oids directly rather than in strings formatted > specifically for the bki file, as in the attached. Thoughts? Good idea, I like this better than what Stas did in his patch. I'm a bit inclined to preserve the old variable names (toast_name, toast_oid etc) as the hash keys because they seem more greppable than the generic "oid", "name" that you have here. But maybe that isn't an interesting consideration. regards, tom lane
On 4/26/18, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > New version is attached. I've put iterator into Catalog.pm to avoid > copy-pasting > it over two scripts. Also instead of greping through *.dat and *.h files > I've > used parsers provided in Catalog module. That way should be more > sustainable > to format changes. > > Anyone who wanted to help with scripts -- feel free to rewrite. Looks pretty good. Some comments: Just after you posted, I sent a patch that tweaks the API of Catalog.pm for toast and index oids. If you use that API in your patch, you can get rid of the extra bookkeeping you added for those oids. The original scripts skipped the relation and rowtype oids for bootstrap catalogs. You've replaced that with two data-level tests for pg_class plus pg_type composite types. I think the original test was a bit cleaner in this regard. For this type of call: +my @oids = @{ Catalog::FindAllOidsFromHeaders(@input_files) }; +foreach my $oid (@oids) this style is used by other callers of the module: +my $oids = Catalog::FindAllOidsFromHeaders(@input_files); +foreach my $oid (@{ $oids }) For those following along, these scripts still assume we're in the catalog directory. I can hack on that part tomorrow if no one else has. -John Naylor
John Naylor <jcnaylor@gmail.com> writes: > Just after you posted, I sent a patch that tweaks the API of > Catalog.pm for toast and index oids. If you use that API in your > patch, you can get rid of the extra bookkeeping you added for those > oids. I've adjusted these two patches to play together, and pushed them. > The original scripts skipped the relation and rowtype oids for > bootstrap catalogs. You've replaced that with two data-level tests for > pg_class plus pg_type composite types. I think the original test was a > bit cleaner in this regard. Yeah, I thought so too. Changed. > For those following along, these scripts still assume we're in the > catalog directory. I can hack on that part tomorrow if no one else > has. I didn't touch this point. I notice that duplicate_oids is now a good factor of 10 slower than it was before (~0.04 sec to ~0.4 sec, on my machine). While this doesn't seem like a problem for manual use, it seems annoying as part of the standard build process, especially on slower buildfarm critters. I think we should do what you said upthread and teach genbki.pl to complain about duplicate oids, so that we can remove duplicate_oids from the build process. I went ahead and pushed things as-is so we can confirm that duplicate_oids has no portability issues that the buildfarm could find. But let's try to get the build process change in place after a day or so. regards, tom lane
On 4/26/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > John Naylor <jcnaylor@gmail.com> writes: >> For those following along, these scripts still assume we're in the >> catalog directory. I can hack on that part tomorrow if no one else >> has. > > I didn't touch this point. Patch 0002 is my attempt at this. Not sure how portable this is. It's also clunkier than before in that you need to tell it where to find Catalog.pm, since it seems Perl won't compile unless "use lib" points to a string and not an expression. Suggestions welcome. I also edited the boilerplate comments. > I notice that duplicate_oids is now a good factor of 10 slower than it was > before (~0.04 sec to ~0.4 sec, on my machine). While this doesn't seem > like a problem for manual use, it seems annoying as part of the standard > build process, especially on slower buildfarm critters. If you're concerned about build speed, I think the generated *_d.h headers cause a lot more files to be rebuilt than before. I'll think about how to recover that. > I think we should > do what you said upthread and teach genbki.pl to complain about duplicate > oids, so that we can remove duplicate_oids from the build process. > > I went ahead and pushed things as-is so we can confirm that duplicate_oids > has no portability issues that the buildfarm could find. But let's try > to get the build process change in place after a day or so. Patch 0001 -moves the compile-time test into genbki.pl -removes a usability quirk from unused_oids -removes duplicate_oids (not sure if you intended this, but since unused_oids has already been committed to report duplicates, we may as well standardize on that) and cleans up after that fact -updates the documentation. -John Naylor
Attachment
On 04/26/2018 01:37 PM, John Naylor wrote:> Patch 0002 is my attempt at this. Not sure how portable this is. It's > also clunkier than before in that you need to tell it where to find > Catalog.pm, since it seems Perl won't compile unless "use lib" points > to a string and not an expression. Suggestions welcome. I also edited > the boilerplate comments. What about using FindBin (http://perldoc.perl.org/FindBin.html)? Andreas
Andreas Karlsson <andreas@proxel.se> writes: > What about using FindBin (http://perldoc.perl.org/FindBin.html)? A quick check suggests that that's present in the core perl distribution pretty far back, so I think it'll work. Question is, is it any better than John's "dirname(__FILE__)" solution? regards, tom lane
I wrote: > Andreas Karlsson <andreas@proxel.se> writes: >> What about using FindBin (http://perldoc.perl.org/FindBin.html)? > A quick check suggests that that's present in the core perl > distribution pretty far back, so I think it'll work. > Question is, is it any better than John's "dirname(__FILE__)" > solution? I experimented a bit, and found that actually what we seem to need is FindBin::RealBin. That's the only one of these solutions that gives the right answer if someone's put a symlink to unused_oids or duplicate_oids into their PATH --- the other ones give you the directory containing the symlink. Now, maybe that would be an uncommon usage, but it hardly seems out of the question. regards, tom lane
John Naylor <jcnaylor@gmail.com> writes: > On 4/26/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I notice that duplicate_oids is now a good factor of 10 slower than it was >> before (~0.04 sec to ~0.4 sec, on my machine). While this doesn't seem >> like a problem for manual use, it seems annoying as part of the standard >> build process, especially on slower buildfarm critters. > If you're concerned about build speed, I think the generated *_d.h > headers cause a lot more files to be rebuilt than before. I'll think > about how to recover that. Personally, I use ccache which doesn't seem to care too much, but I agree than in some usages, extra touches of headers would be costly. Perhaps it's worth adding logic to avoid overwriting an existing output file unless it changed? I'm not sure; that would cost something too. > -removes duplicate_oids (not sure if you intended this, but since > unused_oids has already been committed to report duplicates, we may as > well standardize on that) and cleans up after that fact I don't think anyone's asked for duplicate_oids to be removed altogether, and I'm afraid that doing so would break existing workflows. For that matter, now that I think about it, changing the behavior of unused_oids as we did yesterday was ill-considered as well, so I undid that part. I've pushed the current-directory-independence change by itself so that we can see if any buildfarm members think it's problematic. (They'll still be testing duplicate_oids, as things stand.) Once we have some confirmation on the FindBin stuff actually being portable, I'll push the changes to make genbki do duplicate checks and remove duplicate_oids from the build sequence. regards, tom lane
On Thu, Apr 26, 2018 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Personally, I use ccache which doesn't seem to care too much, but I agree > than in some usages, extra touches of headers would be costly. Perhaps > it's worth adding logic to avoid overwriting an existing output file > unless it changed? I'm not sure; that would cost something too. It seems like a good idea to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Apr 26, 2018 at 11:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Personally, I use ccache which doesn't seem to care too much, but I agree >> than in some usages, extra touches of headers would be costly. Perhaps >> it's worth adding logic to avoid overwriting an existing output file >> unless it changed? I'm not sure; that would cost something too. > It seems like a good idea to me. I took a quick look into this. It's very easy to do so far as the Perl code is concerned: just teach Catalog::RenameTempFile that if the target file already exists, run File::Compare::compare to see if its contents are identical to the temp file, and if so unlink the temp file rather than renaming. I've tried this, it works, and I can't measure any difference in the runtime of genbki.pl (at least on my usual dev machine). So it seems like there's little downside. However, RenameTempFile is also used by Gen_fmgrtab.pl, and having the same sort of no-touch semantics for fmgroids.h and friends would have some additional fallout. The makefiles would think they have to keep re-running Gen_fmgrtab.pl if fmgroids.h is older than the mod time on any input file, and that's certainly no good. We can fix that by inventing a stamp file for the Gen_fmgrtab.pl run, analogous to bki-stamp for the genbki.pl run, but that means changes in the makefiles that go a bit beyond the realm of triviality. A compromise answer is to arrange things so that genbki.pl has no-touch semantics but Gen_fmgrtab.pl doesn't, requiring either two support functions in Catalog.pm or an extra argument to RenameTempFile. But that's really ugly. Besides, if we're trying to avoid excess recompiles due to useless touches of common header files, then failing to have no-touch behavior for fmgroids.h is leaving a lot of money on the table. So I don't like that answer at all. Anyway, I'm happy to go make this happen; it looks like only another hour or so's worth of work to fix the makefiles. But I wonder if anyone will say this is too much churn for post-feature-freeze and we should shelve it till v12. regards, tom lane diff --git a/src/backend/catalog/Catalog.pm b/src/backend/catalog/Catalog.pm index eee7cb3..1b31cdd 100644 *** a/src/backend/catalog/Catalog.pm --- b/src/backend/catalog/Catalog.pm *************** package Catalog; *** 16,21 **** --- 16,24 ---- use strict; use warnings; + use File::Compare; + + # Parses a catalog header file into a data structure describing the schema # of the catalog. sub ParseHeader *************** sub AddDefaultValues *** 336,350 **** } # Rename temporary files to final names. ! # Call this function with the final file name and the .tmp extension # Note: recommended extension is ".tmp$$", so that parallel make steps ! # can't use the same temp files sub RenameTempFile { my $final_name = shift; my $extension = shift; my $temp_name = $final_name . $extension; ! rename($temp_name, $final_name) || die "rename: $temp_name: $!"; } # Find a symbol defined in a particular header file and extract the value. --- 339,367 ---- } # Rename temporary files to final names. ! # Call this function with the final file name and the .tmp extension. ! # ! # If the final file already exists and has identical contents, don't ! # overwrite it; this behavior avoids unnecessary recompiles due to updating ! # the mod date on header files. ! # # Note: recommended extension is ".tmp$$", so that parallel make steps ! # can't use the same temp files. sub RenameTempFile { my $final_name = shift; my $extension = shift; my $temp_name = $final_name . $extension; ! ! if (-f $final_name && ! compare($temp_name, $final_name) == 0) ! { ! unlink $temp_name || die "unlink: $temp_name: $!"; ! } ! else ! { ! rename($temp_name, $final_name) || die "rename: $temp_name: $!"; ! } } # Find a symbol defined in a particular header file and extract the value.
Tom Lane wrote: > However, RenameTempFile is also used by Gen_fmgrtab.pl, and having the > same sort of no-touch semantics for fmgroids.h and friends would have some > additional fallout. The makefiles would think they have to keep > re-running Gen_fmgrtab.pl if fmgroids.h is older than the mod time on any > input file, and that's certainly no good. We can fix that by inventing a > stamp file for the Gen_fmgrtab.pl run, analogous to bki-stamp for the > genbki.pl run, but that means changes in the makefiles that go a bit > beyond the realm of triviality. Sounds OK to me -- a stamp file is already established technique, so it shouldn't go *too much* beyond triviality. I do note that msvc/Solution.pm runs Gen_fmgrtab.pl, but it shouldn't require any changes anyhow. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, May 3, 2018 at 3:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Anyway, I'm happy to go make this happen; it looks like only another hour > or so's worth of work to fix the makefiles. But I wonder if anyone will > say this is too much churn for post-feature-freeze and we should shelve > it till v12. > I think we can be a bit more liberal about build system changes than we would be with core code. So I'd say go for it. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> However, RenameTempFile is also used by Gen_fmgrtab.pl, and having the >> same sort of no-touch semantics for fmgroids.h and friends would have some >> additional fallout. The makefiles would think they have to keep >> re-running Gen_fmgrtab.pl if fmgroids.h is older than the mod time on any >> input file, and that's certainly no good. We can fix that by inventing a >> stamp file for the Gen_fmgrtab.pl run, analogous to bki-stamp for the >> genbki.pl run, but that means changes in the makefiles that go a bit >> beyond the realm of triviality. > Sounds OK to me -- a stamp file is already established technique, so it > shouldn't go *too much* beyond triviality. Yeah, what I'm envisioning is to change the makefile rules around these files to look as much as possible like the ones around the BKI files, which are (we hope) already debugged. So it doesn't seem like a high risk change ... at least so far as the makefiles are concerned. > I do note that > msvc/Solution.pm runs Gen_fmgrtab.pl, but it shouldn't require any > changes anyhow. Hmm. Actually, given the IsNewer checks there, it looks like Solution.pm is basically hand-rolling makefile-like dependency checking, which means it would be fooled by no-touch updates in the same way as make is, causing rebuilds to do unnecessary work. We could live with that for awhile maybe, but ultimately Solution.pm would need to be fixed to use a stamp file like the makefile logic. I could take a stab at that, but I don't have any way to test it myself. regards, tom lane
On 5/3/18 15:37, Tom Lane wrote: > I took a quick look into this. It's very easy to do so far as the Perl > code is concerned: I think in order to introduce warts like that, we have to have really great savings. I haven't seen any actual analysis what the current problem is, other than one person expression a suspicion. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > On 5/3/18 15:37, Tom Lane wrote: >> I took a quick look into this. It's very easy to do so far as the Perl >> code is concerned: > I think in order to introduce warts like that, we have to have really > great savings. I haven't seen any actual analysis what the current > problem is, other than one person expression a suspicion. Well, the work's already done, and personally I think the code is cleaner after 9bf28f96c and bad51a49a regardless of whether there's any performance benefit. You could call 1f1cd9b5d a wart if you wanted. But we've done largely the same thing to pgindent, for one, in the past year or so, and I find that to be a usability improvement, independently of whether there's any build performance win. My editor gets cranky when files change under it for no reason. regards, tom lane
On 5/4/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Well, the work's already done, and personally I think the code is > cleaner after 9bf28f96c and bad51a49a regardless of whether there's any > performance benefit. You could call 1f1cd9b5d a wart if you wanted. > But we've done largely the same thing to pgindent, for one, in the past > year or so, and I find that to be a usability improvement, independently > of whether there's any build performance win. My editor gets cranky > when files change under it for no reason. Thanks for looking into that. I meant to do so, but it got stuck on the back burner. I'm happy to report that, running non-parallel make on my old laptop goes from over 5 minutes to about 5 seconds, when touching a catalog file. -John Naylor