Thread: BUG #15525: Build failures when compiling Postgres with Make parallelization
BUG #15525: Build failures when compiling Postgres with Make parallelization
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference: 15525
Logged by: Alyssa Ross
Email address: hi@alyssa.is
PostgreSQL version: 9.6.11
Operating system: macOS
Description:

There have been multiple reports to the Nix package manager that compiling PostgreSQL on macOS with Make's -j option have resulted in build failures. As far as we know, this has only happened on macOS. This appears to be an issue with PostgreSQL's build process, rather than with the package.

Here's what we know: https://github.com/NixOS/nixpkgs/issues/51093

Example of a build failure:

ar crs libpgtypes.a numeric.o datetime.o common.o dt_common.o timestamp.o interval.o pgstrcasecmp.o
echo 'Libs: -L/nix/store/7l8df1na81psj9cx56yjs4m9084nrayr-postgresql-9.6.10-lib/lib -lecpg_compat' >>libecpg_compat.pc
error: /nix/store/p1zg3dnaaglqm34pq12ynfa2pc3r2lq8-cctools-port-895/bin/ranlib: can't open file: libpgtypes.a (No such file or directory)
ar: internal ranlib command failed
make[5]: *** [../../../../src/Makefile.shlib:306: libpgtypes.a] Error 1
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Thomas Munro
Date:
On Wed, Nov 28, 2018 at 11:27 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> PostgreSQL on macOS with Make's -j option have resulted in build failures.
> As far as we know, this has only happened on macOS. This appears to be an
> issue with PostgreSQL's build process, rather than with the package.

I don't know the answer but one thought that occurred to me: are you using Apple's frozen-in-time GNU make?

$ make -v
GNU Make 3.81
Copyright (C) 2006 Free Software Foundation, Inc.

Could it have bugs that were later fixed? On the other hand, I build PostgreSQL myself quite regularly on a Mac with that version and -j8 and I have not seen that problem.

--
Thomas Munro
http://www.enterprisedb.com
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> There have been multiple reports to the Nix package manager that compiling
> PostgreSQL on macOS with Make's -j option have resulted in build failures.
> As far as we know, this has only happened on macOS.

This isn't too helpful if you don't mention which macOS version nor what sort of hardware exactly. Most PG developers use parallel builds routinely, so we know that it's not broken in general.

For me, trying 9.6 branch on macOS Mojave (10.14.1) on a 2018 6-core MBP, "make -j" is unusable because the OS fails to support an indefinite number of processes: lots of commands fall over with messages like

clang: error: unable to execute command: posix_spawn failed: Resource temporarily unavailable

This is not a Postgres bug; maybe you could make a case that it's make's fault, but I'm not sure. It looks like a lot of the fork failures happen outside of make's view.

However, if I use a more reasonable parallelism level like -j8, or even as high as -j25, it goes through fine. It doesn't look like there's any net reduction in build time above around -j10, so I'm not very excited about seeing whether it would fall over at some level short of what breaks the OS.

Having said that, we did do a round of patches in the v11 development cycle that addressed some parallel-make hazards. A lot of said hazards were new in v11 :-(, but I think that some of them were pre-existing problems. So you might find that PG 11 is more resistant to whatever is going on here.

BTW, are you using the Apple-supplied make, or some other version? In the past we've had to fight with parallelism bugs in old gmake versions ...

regards, tom lane
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Jack Kelly
Date:
Hi Tom & List,

OP reported the bug I encountered on my behalf, which is reported to nixpkgs at https://github.com/NixOS/nixpkgs/issues/51093#issuecomment-442301157 . I'm not a subscriber to this list and have constructed the threading headers manually (I hope this works). Please CC me in replies where relevant.

Tom Lane <tgl@sss.pgh.pa.us> writes:
> This isn't too helpful if you don't mention which macOS version nor what
> sort of hardware exactly. Most PG developers use parallel builds
> routinely, so we know that it's not broken in general.

To complicate matters further, I'm reporting the nixpkgs bug on behalf of my mac-using colleague. His machine overview:

Model Name: MacBook Pro
Model Identifier: MacBookPro14,3
Processor Name: Intel Core i7
Processor Speed: 2.9 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 8 MB
Memory: 16 GB
Boot ROM Version: 185.0.0.0.0
SMC Version (system): 2.45f0

The build is happening on an APFS file system, I believe.

> - snip discussion of process-spawn failures at high -j numbers -
> Having said that, we did do a round of patches in the v11 development
> cycle that addressed some parallel-make hazards. A lot of said hazards
> were new in v11 :-(, but I think that some of them were pre-existing
> problems. So you might find that PG 11 is more resistant to whatever
> is going on here.

A failing build log is attached to the nixpkgs GH issue at https://github.com/NixOS/nixpkgs/files/2617687/build.log I note that it calls `ar rcs libpgtypes.a ...` multiple times during the build, and I speculate that these `ar` invocations start racing each other.

@delroth on GH notes:

@delroth> Starting in 9.1.0 (postgres/postgres@19e231b) this is in the postgres codebase:
@delroth>
@delroth> ./src/interfaces/ecpg/compatlib/Makefile: $(MAKE) -C $(top_builddir)/src/interfaces/ecpg/pgtypeslib all
@delroth> ./src/interfaces/ecpg/ecpglib/Makefile: $(MAKE) -C $(top_builddir)/src/interfaces/ecpg/pgtypeslib all

> BTW, are you using the Apple-supplied make, or some other version?
> In the past we've had to fight with parallelism bugs in old gmake
> versions ...

Whatever nixpkgs asks for, which is probably gnumake, and it looks like nixpkgs has gnumake 4.2.1.

HTH,

-- Jack
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Tom Lane
Date:
Jack Kelly <jack@jackkelly.name> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> BTW, are you using the Apple-supplied make, or some other version?
>> In the past we've had to fight with parallelism bugs in old gmake
>> versions ...

> Whatever nixpkgs asks for, which is probably gnumake, and it looks like
> nixpkgs has gnumake 4.2.1.

Hmm. Maybe the critical combination is macOS plus a non-Apple version of gmake? Doesn't make a lot of sense ...

> A failing build log is attached to the nixpkgs GH issue at
> https://github.com/NixOS/nixpkgs/files/2617687/build.log I note that it
> calls `ar rcs libpgtypes.a ...` multiple times during the build, and I
> speculate that these `ar` invocations start racing each other.

After staring at that for awhile, I don't think that this is a bug in the PG makefiles. It looks like maybe it could be a clock skew problem. You can see that the pgtypeslib build is completing, and then ecpglib does submake-pgtypeslib which should find nothing to do, and indeed mostly it thinks it has nothing to do --- except it wants to rebuild libpgtypes.a. Which makes no sense, because that has the exact same dependencies as libpgtypes.so, which is not getting rebuilt. And concurrently, exactly the same thing is happening with libpq.a, but not libpq.so.

... and after contemplating my navel for awhile more, I believe I understand the problem. APFS has sub-second file timestamp resolution, which doesn't seem to be exposed in Apple's version of "ls", but you can find it out from stat(2). And what I'm seeing is that "ranlib" is truncating the timestamp of its output file to a one-second boundary:

$ ls -ltr
...
-rw-r--r--  1 tgl  admin   25888 Nov  2 11:37 interval.c
-rw-r--r--  1 tgl  admin   20692 Nov  2 11:37 timestamp.c
-rw-r--r--  1 tgl  admin  210640 Nov 28 21:04 libpgtypes.a
-rw-r--r--  1 tgl  admin   43232 Nov 28 21:04 numeric.o
-rw-r--r--  1 tgl  admin   18688 Nov 28 21:04 datetime.o
-rw-r--r--  1 tgl  admin    5916 Nov 28 21:04 common.o
-rw-r--r--  1 tgl  admin   76584 Nov 28 21:04 dt_common.o
...

$ ~/a.out numeric.o
mtime - Actual: 1543457047.155932
atime - Actual: 1543457047.651358
$ ~/a.out libpgtypes.a
mtime - Actual: 1543457047.000000
atime - Actual: 1543457047.000000

(a.out is a stupid little program I made to print out the extended timespec fields from stat(2).)

This is observable fact. Also observable fact is that Apple's gmake does not think libpgtypes.a needs to be rebuilt in this situation, which implies that it does its work with seconds-truncated file timestamps. Where I'm speculating a bit is to guess that nix's version of gmake thinks "whee, this filesystem has nanosecond timestamps, so I'll believe them". But given these facts, it's not much of a leap to conclude that nix's gmake is rebuilding the .a files based on them apparently being older than their inputs.

Recommendations:

1. File a bug with Apple to tell them it's not nice that ranlib produces a file that appears older than its input files.

2. Pending some action from Apple, nix's build of gmake should not trust sub-second timestamps on Darwin.

I suppose that this could be worked around with something like

 ifndef haslibarule
 $(stlib): $(OBJS) | $(SHLIB_PREREQS)
 	rm -f $@
 	$(LINK.static) $@ $^
 	$(RANLIB) $@
+	touch $@
 endif #haslibarule

but ick. Who's to say that ranlib is the only tool with such a problem?

regards, tom lane
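[Editor's sketch] The source of the small stat(2) helper mentioned above was not posted, so what follows is only a hypothetical Python stand-in for it; the names `format_ns` and `file_times` are invented here. It prints mtime and atime with their sub-second parts, which plain `ls -l` does not show.

```python
import os
import sys


def format_ns(ns):
    """Render an integer nanosecond timestamp as seconds.nanoseconds."""
    return f"{ns // 1_000_000_000}.{ns % 1_000_000_000:09d}"


def file_times(path):
    """Return (mtime, atime) for path as full-resolution strings."""
    st = os.stat(path)
    return format_ns(st.st_mtime_ns), format_ns(st.st_atime_ns)


if __name__ == "__main__":
    # Pass the files to inspect as arguments, e.g. numeric.o libpgtypes.a
    for path in sys.argv[1:]:
        mtime, atime = file_times(path)
        print(f"{path}\n  mtime: {mtime}\n  atime: {atime}")
```

A file whose fractional part is all zeros while its neighbours have non-zero fractions is a hint that some tool rewrote its timestamps at one-second granularity.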
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Thomas Munro
Date:
On Thu, Nov 29, 2018 at 3:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ... and after contemplating my navel for awhile more, I believe
> I understand the problem. APFS has sub-second file timestamp resolution,
> which doesn't seem to be exposed in Apple's version of "ls", but you can
> find it out from stat(2). And what I'm seeing is that "ranlib" is
> truncating the timestamp of its output file to a one-second boundary:
>
> $ ls -ltr
> ...
> -rw-r--r-- 1 tgl admin 25888 Nov 2 11:37 interval.c
> -rw-r--r-- 1 tgl admin 20692 Nov 2 11:37 timestamp.c
> -rw-r--r-- 1 tgl admin 210640 Nov 28 21:04 libpgtypes.a
> -rw-r--r-- 1 tgl admin 43232 Nov 28 21:04 numeric.o
> -rw-r--r-- 1 tgl admin 18688 Nov 28 21:04 datetime.o
> -rw-r--r-- 1 tgl admin 5916 Nov 28 21:04 common.o
> -rw-r--r-- 1 tgl admin 76584 Nov 28 21:04 dt_common.o
> ...
>
> $ ~/a.out numeric.o
> mtime - Actual: 1543457047.155932
> atime - Actual: 1543457047.651358
> $ ~/a.out libpgtypes.a
> mtime - Actual: 1543457047.000000
> atime - Actual: 1543457047.000000

Nice detective work. Possibly because libtool/ranlib whacks it with utime(), which only knows about time_t, here:

https://github.com/opensource-apple/cctools/blob/master/misc/libtool.c#L2779

--
Thomas Munro
http://www.enterprisedb.com
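[Editor's sketch] To illustrate the effect described in this message: the code below is not the cctools source. It only assumes that resetting a file's times through a whole-seconds interface drops the fractional part, and mimics that with `os.utime` and integer seconds. The helper names are invented, and the timestamp value is the one quoted in the thread.

```python
import os
import tempfile


def truncate_times_to_seconds(path):
    """Reset atime/mtime to whole seconds, as a time_t-only utime() call would."""
    st = os.stat(path)
    os.utime(path, (st.st_atime_ns // 10**9, st.st_mtime_ns // 10**9))


def demo():
    """Return (object_mtime_ns, archive_mtime_ns) after the truncation."""
    with tempfile.TemporaryDirectory() as d:
        obj = os.path.join(d, "numeric.o")
        lib = os.path.join(d, "libpgtypes.a")
        for p in (obj, lib):
            with open(p, "wb") as f:
                f.write(b"placeholder")
        # Give both files the same sub-second timestamp.
        stamp = 1543457047_155932000
        os.utime(obj, ns=(stamp, stamp))
        os.utime(lib, ns=(stamp, stamp))
        truncate_times_to_seconds(lib)
        return os.stat(obj).st_mtime_ns, os.stat(lib).st_mtime_ns


if __name__ == "__main__":
    obj_ns, lib_ns = demo()
    print("archive looks older than its input:", lib_ns < obj_ns)
```

On a filesystem that keeps sub-second timestamps, the archive ends up a fraction of a second older than an input it was stamped alongside, which is the condition that makes a timestamp-trusting make want to rebuild it.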
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Nice detective work. Possibly because libtool/ranlib whacks it with
> utime(), which only knows about time_t, here:
> https://github.com/opensource-apple/cctools/blob/master/misc/libtool.c#L2779

I suspected as much, but hadn't gone looking for the code. I wonder why it bothers with any of that... it's certainly not documented behavior per the man page.

regards, tom lane
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Thomas Munro
Date:
On Thu, Nov 29, 2018 at 4:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > Nice detective work. Possibly because libtool/ranlib whacks it with
> > utime(), which only knows about time_t, here:
> > https://github.com/opensource-apple/cctools/blob/master/misc/libtool.c#L2779
>
> I suspected as much, but hadn't gone looking for the code. I wonder
> why it bothers with any of that... it's certainly not documented
> behavior per the man page.

As for why Apple make doesn't have the problem, I think it's simply that high resolution timestamp support for Darwin came along ~5 years after Apple forked/froze their make due to the license change. Here's the commit:

https://github.com/mirror/make/commit/bfc3e1ca7c0c1504c9873ee1baacce73330b037e

As for what could be done about it, it seems like we (or the Nix project, in a local patch) could declare individual targets to have .LOW_RESOLUTION_TIME:

https://www.gnu.org/software/make/manual/html_node/Special-Targets.html

That doesn't seem any better than using "touch" to make a better mtime though. I'm kinda surprised that the Nix project doesn't have this problem on other projects, though, if they're always using a modern GNU make. What are they doing differently?

--
Thomas Munro
http://www.enterprisedb.com
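[Editor's sketch] A rough model of what .LOW_RESOLUTION_TIME changes, going by the GNU make manual's description rather than make's source: a target listed there is treated as up to date when its timestamp sits at the start of the same second as its prerequisite's. The function name and signature are invented for illustration; the timestamps are the ones quoted in the thread.

```python
NS_PER_SEC = 1_000_000_000


def needs_rebuild(target_ns, prereq_ns, low_resolution=False):
    """Simplified out-of-date check for one target and one prerequisite.

    Timestamps are integer nanoseconds. This models the documented
    behaviour only; it is not GNU make's actual implementation.
    """
    if low_resolution and target_ns % NS_PER_SEC == 0:
        # Compare at one-second granularity: a truncated target in the
        # same second as its prerequisite counts as up to date.
        return target_ns // NS_PER_SEC < prereq_ns // NS_PER_SEC
    return target_ns < prereq_ns


if __name__ == "__main__":
    archive = 1543457047_000000000  # libpgtypes.a after ranlib
    obj = 1543457047_155932000      # numeric.o
    print("plain comparison:", needs_rebuild(archive, obj))
    print("with .LOW_RESOLUTION_TIME:",
          needs_rebuild(archive, obj, low_resolution=True))
```

The trade-off in this model is that a prerequisite genuinely modified later within the same second would also go unnoticed for a low-resolution target.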
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Jack Kelly
Date:
Thanks Thomas and Tom for your detective work.

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> As for what could be done about it, it seems like we (or the Nix
> project, in a local patch) could declare individual targets to have
> .LOW_RESOLUTION_TIME:
>
> https://www.gnu.org/software/make/manual/html_node/Special-Targets.html
>
> That doesn't seem any better than using "touch" to make a better mtime
> though. I'm kinda surprised that the Nix project doesn't have this
> problem on other projects, though, if they're always using a modern
> GNU make. What are they doing differently?

I think this is slightly better than using "touch", because it's a Makefile-level fix instead of kludging around with the file system, and its designed purpose is to deal with broken tools.

IMHO, the Nth-degree "correct" thing for the postgresql build system would be to check if the most recent versions are vulnerable, and if so update the configure script to detect a high-resolution filesystem and a truncating ranlib, and if that is true for that build, then set a variable so the Makefiles can conditionally add static libraries to `.LOW_RESOLUTION_TIME` targets. This seems like a lot of work for marginal payoff, particularly if releases newer than 9.x are not brittle in this way.

On the Nix question: I'm an itinerant nixpkgs contributor, so I can't speak definitively, but I think there are at least a couple of things going on:

1. Nix on macOS is still a bit of a second-class citizen. (I recently had to fix code that assumed shared objects always ended in ".so", instead of substituting a variable that expanded to ".dylib" on macOS).

2. If Nix successfully builds postgres and adds it to a binary cache, most people will not run the build themselves.

3. This bug seems to be tickled because two different Makefiles attempt to build the same target at the same time, using a tool (macOS libtool/ranlib, albeit through a recursive $(MAKE) invocation) that doesn't support subsecond timestamps, on a filesystem that does (APFS). That's a bit of a corner case. I speculate that building a static library with a nonrecursive Makefile would only kick off one build of the `.a` file, because make will only invoke the command once as it walks the DAG of dependencies.

#1 and #2 mean the actual amount of building done is relatively low, and #3 means that it is actually somewhat hard to trip over.

I have filed https://github.com/NixOS/nixpkgs/issues/51221 with nixpkgs, and now @delroth is talking about patching cctools' ranlib.

Thanks again for all your help. I probably won't have time to dig into this further but if you need more information I'll see what I can do.

-- Jack
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Tom Lane
Date:
Jack Kelly <jack@jackkelly.name> writes:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> As for what could be done about it, it seems like we (or the Nix
>> project, in a local patch) could declare individual targets to have
>> .LOW_RESOLUTION_TIME:
>> https://www.gnu.org/software/make/manual/html_node/Special-Targets.html
>> That doesn't seem any better than using "touch" to make a better mtime
>> though.

In fact it's worse, because it opens you up to the same problems that sub-second timestamps were meant to fix.

After sleeping on this, I'm liking the idea of adding "touch" to our rule better. We shouldn't imagine that this problem exists in a vacuum: Apple got that ranlib code from some BSD or other, so it probably exists in similar form elsewhere. And filesystems with sub-second timestamps are getting more common. So it seems likely that this issue could manifest on other combinations than the one we see here.

> IMHO, The Nth-degree "correct" thing for the postgresql build system
> would be to check if the most recent versions are vulnerable, and if so
> update the configure script to detect a high-resolution filesystem and a
> truncating ranlib, and if that is true for that build, then set a
> variable so the Makefiles can conditionally add static libraries to
> `.LOW_RESOLUTION_TIME` targets. This seems like a lot of work for
> marginal payoff, particularly if releases newer than 9.x are not brittle
> in this way.

The issue is still there in the same form. I agree that this sketch of the "correct" thing is not going to happen, though. The "touch" fix seems like a far more appropriate level of effort, plus it actually fixes the problem rather than applying a band-aid. (I have checked that "touch" applies a sub-second timestamp on APFS, btw.)

> 3. This bug seems to be tickled because two different Makefiles
> attempt to build the same target at the same time, using a tool
> (macOS libtool/ranlib, albeit through a recursive $(MAKE) invocation)
> that doesn't support subsecond timestamps, on a filesystem that does
> (APFS). That's a bit of a corner case.

Yeah, this. Under typical circumstances, the worst that would happen is an extra rebuild of the .a file. We're unlucky because two such rebuilds could get launched in parallel, something that I bet is not that common.

regards, tom lane
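[Editor's sketch] One plausible interleaving behind the "can't open file: libpgtypes.a" error from the original report, written out as a deterministic sequence. It assumes the static-library recipe starts with `rm -f` before the archiver and ranlib run, and that two sub-makes execute it for the same target at once; the real timing in the failing build is not known. The helpers are invented stand-ins, not the real tools.

```python
import os
import tempfile


def rm_f(path):
    """Equivalent of `rm -f path`: ignore a missing file."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def fake_ar(path):
    """Stand-in for the archiver: just creates the file."""
    with open(path, "wb") as f:
        f.write(b"!<arch>\n")


def fake_ranlib(path):
    """Stand-in for ranlib: it must be able to open the existing archive."""
    with open(path, "r+b"):
        pass


def interleaved_build():
    """Step builds A and B by hand; return the error A's ranlib step hits."""
    with tempfile.TemporaryDirectory() as d:
        lib = os.path.join(d, "libpgtypes.a")
        rm_f(lib)      # build A: rm -f libpgtypes.a
        fake_ar(lib)   # build A: archive written
        rm_f(lib)      # build B starts: rm -f libpgtypes.a
        try:
            fake_ranlib(lib)  # build A: ranlib can no longer find the file
        except FileNotFoundError as exc:
            return exc
        return None


if __name__ == "__main__":
    print(interleaved_build())
```

With a single build of the target the sequence is harmless; it only breaks when a second, unnecessary rebuild of the same archive overlaps the first.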
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Tom Lane
Date:
I wrote:
> After sleeping on this, I'm liking the idea of adding "touch" to our
> rule better.

I've pushed a patch along that line:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=826eff57c4c23f77314ba7151d3dc506ce0fa24c

Possibly the nix PG package would want to absorb that before our next releases (which aren't scheduled until February).

regards, tom lane
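[Editor's sketch] The effect of ending the recipe with "touch" can be imitated as follows. This is not the pushed patch, only a simulation under stated assumptions: ranlib is replaced by a step that stamps the archive with the input's time truncated to whole seconds, and `os.utime(path, None)` plays the role of `touch`. The function name is invented.

```python
import os
import tempfile


def simulate_recipe(with_touch):
    """Return (archive_mtime_ns, input_mtime_ns) after a simulated recipe."""
    with tempfile.TemporaryDirectory() as d:
        obj = os.path.join(d, "numeric.o")
        lib = os.path.join(d, "libpgtypes.a")
        for p in (obj, lib):
            with open(p, "wb") as f:
                f.write(b"placeholder")
        # Sub-second input timestamp, value taken from the thread.
        stamp = 1543457047_155932000
        os.utime(obj, ns=(stamp, stamp))
        # Stand-in for ranlib: whole seconds only.
        secs = os.stat(obj).st_mtime_ns // 10**9
        os.utime(lib, (secs, secs))
        if with_touch:
            # Same effect as `touch libpgtypes.a`: set times to "now".
            os.utime(lib, None)
        return os.stat(lib).st_mtime_ns, os.stat(obj).st_mtime_ns


if __name__ == "__main__":
    for flag in (False, True):
        lib_ns, obj_ns = simulate_recipe(flag)
        print(f"with_touch={flag}: archive newer than input: {lib_ns > obj_ns}")
```

The final touch gives the archive a current timestamp, so it no longer compares as older than the objects it was built from.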
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Thomas Munro
Date:
On Fri, Nov 30, 2018 at 5:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jack Kelly <jack@jackkelly.name> writes:
> > Thomas Munro <thomas.munro@enterprisedb.com> writes:
> >> As for what could be done about it, it seems like we (or the Nix
> >> project, in a local patch) could declare individual targets to have
> >> .LOW_RESOLUTION_TIME:
> >> https://www.gnu.org/software/make/manual/html_node/Special-Targets.html
> >> That doesn't seem any better than using "touch" to make a better mtime
> >> though.
>
> In fact it's worse, because it opens you up to the same problems that
> sub-second timestamps were meant to fix.
>
> After sleeping on this, I'm liking the idea of adding "touch" to our
> rule better. We shouldn't imagine that this problem exists in a vacuum:
> Apple got that ranlib code from some BSD or other, so it probably
> exists in similar form elsewhere. And filesystems with sub-second
> timestamps are getting more common. So it seems likely that this issue
> could manifest on other combinations than the one we see here.

+1 for that solution (which I see you've just pushed).

But just for the record, while we're doing amateur software archeology: I'm pretty sure Apple's libtool/ranlib is not derived from BSD... it says it's from NeXT and has no University of California copyright. They probably needed something different to work with Mach-O objects, whereas ancient BSD used a.out and modern BSDen use ELF. It also supports their funky fat/universal libraries which NeXT and Apple used to change CPU architectures several times surprisingly smoothly. I don't see anything like that utime() in either modern FreeBSD (where it's been rewritten at least once) or ancient 4.4BSD lite sources.

--
Thomas Munro
http://www.enterprisedb.com
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> But just for the record, while we're doing amateur software
> archeology: I'm pretty sure Apple's libtool/ranlib is not derived
> from BSD... it says it's from NeXT and has no University of California
> copyright. They probably needed something different to work with
> Mach-O objects, whereas ancient BSD used a.out and modern BSDen use
> ELF. It also supports their funky fat/universal libraries which NeXT
> and Apple used to change CPU architectures several times surprisingly
> smoothly. I don't see anything like that utime() in either modern
> FreeBSD (where it's been rewritten at least once) or ancient 4.4BSD
> lite sources.

Interesting. There's definitely some funky behavior in Apple's ranlib. While testing this, I noted that sometimes it will produce a timestamp that seems to be max-of-the-input-timestamps (truncated to seconds), which can be much older than current time. Other times it will produce current time (truncated to seconds). No idea what's causing this difference in behavior.

regards, tom lane
Re: BUG #15525: Build failures when compiling Postgres with Make parallelization
From
Andrew Gierth
Date:
>>>>> "Thomas" == Thomas Munro <thomas.munro@enterprisedb.com> writes:

 Thomas> But just for the record, while we're doing amateur software
 Thomas> archeology: I'm pretty sure Apple's libtool/ranlib is not
 Thomas> derived from BSD... it says it's from NeXT and has no
 Thomas> University of California copyright. They probably needed
 Thomas> something different to work with Mach-O objects, whereas
 Thomas> ancient BSD used a.out and modern BSDen use ELF. It also
 Thomas> supports their funky fat/universal libraries which NeXT and
 Thomas> Apple used to change CPU architectures several times
 Thomas> surprisingly smoothly. I don't see anything like that utime()
 Thomas> in either modern FreeBSD (where it's been rewritten at least
 Thomas> once) or ancient 4.4BSD lite sources.

I also noticed that an Apple manpage mentions that the linker at one time compared the mod-time of the .a file with the embedded timestamp of its archive symbol table member, which is probably why the utime() call existed in the first place. I don't recall that behavior in other linkers, offhand.

--
Andrew (irc:RhodiumToad)