Thread: BUG #15525: Build failures when compiling Postgres with Make parallelization

BUG #15525: Build failures when compiling Postgres with Make parallelization

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15525
Logged by:          Alyssa Ross
Email address:      hi@alyssa.is
PostgreSQL version: 9.6.11
Operating system:   macOS
Description:

There have been multiple reports to the Nix package manager that compiling
PostgreSQL on macOS with Make's -j option has resulted in build failures.
As far as we know, this has only happened on macOS. This appears to be an
issue with PostgreSQL's build process, rather than with the package.

Here's what we know: https://github.com/NixOS/nixpkgs/issues/51093

Example of a build failure:

ar crs libpgtypes.a numeric.o datetime.o common.o dt_common.o timestamp.o
interval.o pgstrcasecmp.o
echo 'Libs:
-L/nix/store/7l8df1na81psj9cx56yjs4m9084nrayr-postgresql-9.6.10-lib/lib
-lecpg_compat' >>libecpg_compat.pc
error:
/nix/store/p1zg3dnaaglqm34pq12ynfa2pc3r2lq8-cctools-port-895/bin/ranlib:
can't open file: libpgtypes.a (No such file or directory)
ar: internal ranlib command failed
make[5]: *** [../../../../src/Makefile.shlib:306: libpgtypes.a] Error 1


Re: BUG #15525: Build failures when compiling Postgres with Make parallelization

From
Thomas Munro
Date:
On Wed, Nov 28, 2018 at 11:27 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
> PostgreSQL on macOS with Make's -j option has resulted in build failures.
> As far as we know, this has only happened on macOS. This appears to be an
> issue with PostgreSQL's build process, rather than with the package.

I don't know the answer but one thought that occurred to me: are you
using Apple's frozen-in-time GNU make?

$ make -v
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.

Could it have bugs that were later fixed?  On the other hand, I build
PostgreSQL myself quite regularly on a Mac with that version and -j8
and I have not seen that problem.

-- 
Thomas Munro
http://www.enterprisedb.com


PG Bug reporting form <noreply@postgresql.org> writes:
> There have been multiple reports to the Nix package manager that compiling
> PostgreSQL on macOS with Make's -j option has resulted in build failures.
> As far as we know, this has only happened on macOS.

This isn't too helpful if you don't mention which macOS version nor what
sort of hardware exactly.  Most PG developers use parallel builds
routinely, so we know that it's not broken in general.

For me, trying 9.6 branch on macOS Mojave (10.14.1) on a 2018 6-core MBP,
"make -j" is unusable because the OS fails to support an indefinite number
of processes: lots of commands fall over with messages like
    clang: error: unable to execute command: posix_spawn failed: Resource temporarily unavailable
This is not a Postgres bug; maybe you could make a case that it's
make's fault, but I'm not sure.  It looks like a lot of the fork
failures happen outside of make's view.

However, if I use a more reasonable parallelism level like -j8,
or even as high as -j25, it goes through fine.  It doesn't look
like there's any net reduction in build time above around -j10,
so I'm not very excited about seeing whether it would fall over
at some level short of what breaks the OS.

Having said that, we did do a round of patches in the v11 development
cycle that addressed some parallel-make hazards.  A lot of said hazards
were new in v11 :-(, but I think that some of them were pre-existing
problems.  So you might find that PG 11 is more resistant to whatever
is going on here.

BTW, are you using the Apple-supplied make, or some other version?
In the past we've had to fight with parallelism bugs in old gmake
versions ...

            regards, tom lane


Hi Tom & List,

OP reported the bug I encountered on my behalf, which is reported to
nixpkgs at
https://github.com/NixOS/nixpkgs/issues/51093#issuecomment-442301157 .
I'm not a subscriber to this list and have constructed the threading
headers manually (I hope this works). Please CC me in replies where
relevant.

Tom Lane <tgl@sss.pgh.pa.us> writes:

> This isn't too helpful if you don't mention which macOS version nor what
> sort of hardware exactly.  Most PG developers use parallel builds
> routinely, so we know that it's not broken in general.

To complicate matters further, I'm reporting the nixpkgs bug on behalf
of my mac-using colleague. His machine overview:

Model Name:    MacBook Pro
Model Identifier:    MacBookPro14,3
Processor Name:    Intel Core i7
Processor Speed:    2.9 GHz
Number of Processors:    1
Total Number of Cores:    4
L2 Cache (per Core):    256 KB
L3 Cache:    8 MB
Memory:    16 GB
Boot ROM Version:    185.0.0.0.0
SMC Version (system):    2.45f0

The build is happening on an APFS file system, I believe.

> - snip discussion of process-spawn failures at high -j numbers -

> Having said that, we did do a round of patches in the v11 development
> cycle that addressed some parallel-make hazards.  A lot of said hazards
> were new in v11 :-(, but I think that some of them were pre-existing
> problems.  So you might find that PG 11 is more resistant to whatever
> is going on here.

A failing build log is attached to the nixpkgs GH issue at
https://github.com/NixOS/nixpkgs/files/2617687/build.log I note that it
calls `ar rcs libpgtypes.a ...` multiple times during the build, and I
speculate that these `ar` invocations start racing each other.

@delroth on GH notes:

@delroth> Starting in 9.1.0 (postgres/postgres@19e231b) this is in the postgres codebase:
@delroth>
@delroth> ./src/interfaces/ecpg/compatlib/Makefile:    $(MAKE) -C $(top_builddir)/src/interfaces/ecpg/pgtypeslib all
@delroth> ./src/interfaces/ecpg/ecpglib/Makefile:    $(MAKE) -C $(top_builddir)/src/interfaces/ecpg/pgtypeslib all

> BTW, are you using the Apple-supplied make, or some other version?
> In the past we've had to fight with parallelism bugs in old gmake
> versions ...

Whatever nixpkgs asks for, which is probably gnumake, and it looks like
nixpkgs has gnumake 4.2.1.

HTH,

-- Jack


Jack Kelly <jack@jackkelly.name> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> BTW, are you using the Apple-supplied make, or some other version?
>> In the past we've had to fight with parallelism bugs in old gmake
>> versions ...

> Whatever nixpkgs asks for, which is probably gnumake, and it looks like
> nixpkgs has gnumake 4.2.1.

Hmm.  Maybe the critical combination is macOS plus a non-Apple
version of gmake?  Doesn't make a lot of sense ...

> A failing build log is attached to the nixpkgs GH issue at
> https://github.com/NixOS/nixpkgs/files/2617687/build.log I note that it
> calls `ar rcs libpgtypes.a ...` multiple times during the build, and I
> speculate that these `ar` invocations start racing each other.

After staring at that for awhile, I don't think that this is a bug in the
PG makefiles.  It looks like maybe it could be a clock skew problem.
You can see that the pgtypeslib build is completing, and then ecpglib
does submake-pgtypeslib which should find nothing to do, and indeed
mostly it thinks it has nothing to do --- except it wants to rebuild
libpgtypes.a.  Which makes no sense, because that has the exact same
dependencies as libpgtypes.so, which is not getting rebuilt.  And
concurrently, exactly the same thing is happening with libpq.a, but
not libpq.so.

... and after contemplating my navel for awhile more, I believe
I understand the problem.  APFS has sub-second file timestamp resolution,
which doesn't seem to be exposed in Apple's version of "ls", but you can
find it out from stat(2).  And what I'm seeing is that "ranlib" is
truncating the timestamp of its output file to a one-second boundary:

$ ls -ltr
...
-rw-r--r--  1 tgl  admin   25888 Nov  2 11:37 interval.c
-rw-r--r--  1 tgl  admin   20692 Nov  2 11:37 timestamp.c
-rw-r--r--  1 tgl  admin  210640 Nov 28 21:04 libpgtypes.a
-rw-r--r--  1 tgl  admin   43232 Nov 28 21:04 numeric.o
-rw-r--r--  1 tgl  admin   18688 Nov 28 21:04 datetime.o
-rw-r--r--  1 tgl  admin    5916 Nov 28 21:04 common.o
-rw-r--r--  1 tgl  admin   76584 Nov 28 21:04 dt_common.o
...

$ ~/a.out numeric.o
mtime - Actual: 1543457047.155932
atime - Actual: 1543457047.651358
$ ~/a.out libpgtypes.a
mtime - Actual: 1543457047.000000
atime - Actual: 1543457047.000000

(a.out is a stupid little program I made to print out the extended
timespec fields from stat(2).)
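
For anyone who wants to repeat this measurement without writing C, here is a
minimal Python sketch of a comparable helper. It is not the a.out program
used above; the name format_times is made up for illustration, and it relies
on os.stat() exposing the st_mtime_ns and st_atime_ns fields.

```python
import os


def format_times(path):
    """Return mtime/atime lines with microsecond precision, in the layout shown above."""
    st = os.stat(path)
    lines = []
    for label, ns in (("mtime", st.st_mtime_ns), ("atime", st.st_atime_ns)):
        # Split the nanosecond count into whole seconds and the sub-second rest.
        seconds, frac_ns = divmod(ns, 1_000_000_000)
        lines.append(f"{label} - Actual: {seconds}.{frac_ns // 1000:06d}")
    return "\n".join(lines)
```

Calling print(format_times("libpgtypes.a")) in the build directory would
print the two lines for that file.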

This is observable fact.  Also observable fact is that Apple's gmake
does not think libpgtypes.a needs to be rebuilt in this situation,
which implies that it does its work with seconds-truncated file
timestamps.  Where I'm speculating a bit is to guess that nix's
version of gmake thinks "whee, this filesystem has nanosecond
timestamps, so I'll believe them".  But given these facts, it's
not much of a leap to conclude that nix's gmake is rebuilding the .a
files based on them apparently being older than their inputs.
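
The difference between the two make builds can be illustrated with a small
Python sketch. This is a simplified model of the "is any prerequisite newer
than the target" test, not GNU make's actual code; the function name
needs_rebuild and its low_resolution flag are assumptions made for the
example. The timestamps are the ones printed by a.out above.

```python
NS_PER_SEC = 1_000_000_000


def needs_rebuild(target_ns, prereq_ns_list, low_resolution=False):
    """Simplified model: a target is stale if any prerequisite is strictly newer."""
    if low_resolution:
        # Drop the sub-second part on both sides before comparing.
        target_ns = target_ns // NS_PER_SEC * NS_PER_SEC
        prereq_ns_list = [ns // NS_PER_SEC * NS_PER_SEC for ns in prereq_ns_list]
    return any(ns > target_ns for ns in prereq_ns_list)


# Timestamps taken from the a.out output above, converted to nanoseconds.
numeric_o = 1543457047_155932000
libpgtypes_a = 1543457047_000000000

print(needs_rebuild(libpgtypes_a, [numeric_o]))                       # prints True
print(needs_rebuild(libpgtypes_a, [numeric_o], low_resolution=True))  # prints False
```

With sub-second comparison the archive looks 0.155932 s older than numeric.o
and is rebuilt; with whole-second comparison the two are equal and nothing
happens.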

Recommendations:

1. File a bug with Apple to tell them it's not nice that ranlib
produces a file that appears older than its input files.

2. Pending some action from Apple, nix's build of gmake should not
trust sub-second timestamps on Darwin.

I suppose that this could be worked around with something like

 ifndef haslibarule
 $(stlib): $(OBJS) | $(SHLIB_PREREQS)
     rm -f $@
     $(LINK.static) $@ $^
     $(RANLIB) $@
+    touch $@
 endif #haslibarule
 
but ick.  Who's to say that ranlib is the only tool with such a problem?
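
The effect of the extra "touch" can be sketched in Python. This only
illustrates the idea and is not part of the PostgreSQL build; the touch()
helper is a hypothetical stand-in for touch(1), and the file names and
timestamps are borrowed from the example above. Whether the first comparison
reports the archive as stale depends on the filesystem keeping sub-second
timestamps.

```python
import os
import tempfile


def touch(path):
    """Set atime and mtime to the current time, like touch(1) on an existing file."""
    os.utime(path, None)


with tempfile.TemporaryDirectory() as tmp:
    obj = os.path.join(tmp, "numeric.o")
    lib = os.path.join(tmp, "libpgtypes.a")
    for name in (obj, lib):
        with open(name, "wb") as f:
            f.write(b"placeholder")
    # Recreate the situation from the thread: the object has a fractional
    # mtime, the archive has the same second with the fraction cut off.
    os.utime(obj, ns=(1543457047_155932000, 1543457047_155932000))
    os.utime(lib, (1543457047, 1543457047))
    stale_before = os.stat(obj).st_mtime_ns > os.stat(lib).st_mtime_ns
    touch(lib)
    stale_after = os.stat(obj).st_mtime_ns > os.stat(lib).st_mtime_ns
    print(stale_before, stale_after)
```

After the touch the archive carries the current time, so it can no longer
appear older than the objects it was built from.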

            regards, tom lane


Re: BUG #15525: Build failures when compiling Postgres with Make parallelization

From
Thomas Munro
Date:
On Thu, Nov 29, 2018 at 3:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ... and after contemplating my navel for awhile more, I believe
> I understand the problem.  APFS has sub-second file timestamp resolution,
> which doesn't seem to be exposed in Apple's version of "ls", but you can
> find it out from stat(2).  And what I'm seeing is that "ranlib" is
> truncating the timestamp of its output file to a one-second boundary:
>
> $ ls -ltr
> ...
> -rw-r--r--  1 tgl  admin   25888 Nov  2 11:37 interval.c
> -rw-r--r--  1 tgl  admin   20692 Nov  2 11:37 timestamp.c
> -rw-r--r--  1 tgl  admin  210640 Nov 28 21:04 libpgtypes.a
> -rw-r--r--  1 tgl  admin   43232 Nov 28 21:04 numeric.o
> -rw-r--r--  1 tgl  admin   18688 Nov 28 21:04 datetime.o
> -rw-r--r--  1 tgl  admin    5916 Nov 28 21:04 common.o
> -rw-r--r--  1 tgl  admin   76584 Nov 28 21:04 dt_common.o
> ...
>
> $ ~/a.out numeric.o
> mtime - Actual: 1543457047.155932
> atime - Actual: 1543457047.651358
> $ ~/a.out libpgtypes.a
> mtime - Actual: 1543457047.000000
> atime - Actual: 1543457047.000000

Nice detective work.  Possibly because libtool/ranlib whacks it with
utime(), which only knows about time_t, here:

https://github.com/opensource-apple/cctools/blob/master/misc/libtool.c#L2779
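
To illustrate why a time_t-only interface loses the fraction, here is a small
Python sketch. It does not reproduce the cctools code; the helper name
set_times_like_utime and the file name libdemo.a are made up, and passing
integer seconds to os.utime() stands in for a call that only accepts whole
seconds.

```python
import os
import tempfile


def set_times_like_utime(path):
    """Re-apply a file's own times as whole seconds, as a time_t-only API would."""
    st = os.stat(path)
    # Integer arguments carry no fractional part, so the sub-second digits are lost.
    os.utime(path, (int(st.st_atime), int(st.st_mtime)))


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "libdemo.a")  # hypothetical file name
    with open(archive, "wb") as f:
        f.write(b"!<arch>\n")
    # Give the file a fractional mtime first (kept only if the filesystem supports it).
    os.utime(archive, ns=(1543457047_651358000, 1543457047_155932000))
    set_times_like_utime(archive)
    truncated_ns = os.stat(archive).st_mtime_ns
    print(truncated_ns % 1_000_000_000)  # prints 0
```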

-- 
Thomas Munro
http://www.enterprisedb.com


Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Nice detective work.  Possibly because libtool/ranlib whacks it with
> utime(), which only knows about time_t, here:
> https://github.com/opensource-apple/cctools/blob/master/misc/libtool.c#L2779

I suspected as much, but hadn't gone looking for the code.  I wonder
why it bothers with any of that... it's certainly not documented
behavior per the man page.

            regards, tom lane


Re: BUG #15525: Build failures when compiling Postgres with Make parallelization

From
Thomas Munro
Date:
On Thu, Nov 29, 2018 at 4:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > Nice detective work.  Possibly because libtool/ranlib whacks it with
> > utime(), which only knows about time_t, here:
> > https://github.com/opensource-apple/cctools/blob/master/misc/libtool.c#L2779
>
> I suspected as much, but hadn't gone looking for the code.  I wonder
> why it bothers with any of that... it's certainly not documented
> behavior per the man page.

As for why Apple make doesn't have the problem, I think it's simply
that high resolution timestamp support for Darwin came along ~5 years
after Apple forked/froze their make due to the license change.  Here's
the commit:

https://github.com/mirror/make/commit/bfc3e1ca7c0c1504c9873ee1baacce73330b037e

As for what could be done about it, it seems like we (or the Nix
project, in a local patch) could declare individual targets to have
.LOW_RESOLUTION_TIME:

https://www.gnu.org/software/make/manual/html_node/Special-Targets.html

That doesn't seem any better than using "touch" to make a better mtime
though.  I'm kinda surprised that the Nix project doesn't have this
problem on other projects, though, if they're always using a modern
GNU make.  What are they doing differently?

-- 
Thomas Munro
http://www.enterprisedb.com


Thanks Thomas and Tom for your detective work.

Thomas Munro <thomas.munro@enterprisedb.com> writes:

> As for what could be done about it, it seems like we (or the Nix
> project, in a local patch) could declare individual targets to have
> .LOW_RESOLUTION_TIME:
>
> https://www.gnu.org/software/make/manual/html_node/Special-Targets.html
>
> That doesn't seem any better than using "touch" to make a better mtime
> though.  I'm kinda surprised that the Nix project doesn't have this
> problem on other projects, though, if they're always using a modern
> GNU make.  What are they doing differently?

I think this is slightly better than using "touch", because it's a
Makefile-level fix instead of kludging around with the file system, and
its designed purpose is to deal with broken tools.

IMHO, the Nth-degree "correct" thing for the postgresql build system
would be to check if the most recent versions are vulnerable, and if so
update the configure script to detect a high-resolution filesystem and a
truncating ranlib, and if that is true for that build, then set a
variable so the Makefiles can conditionally add static libraries to
`.LOW_RESOLUTION_TIME` targets. This seems like a lot of work for
marginal payoff, particularly if releases newer than 9.x are not brittle
in this way.


On the Nix question: I'm an itinerant nixpkgs contributor, so I can't
speak definitively, but I think there are at least a couple of things
going on:

1. Nix on macOS is still a bit of a second-class citizen. (I recently
   had to fix code that assumed shared objects always ended in ".so",
   instead of substituting a variable that expanded to ".dylib" on
   macOS).

2. If Nix successfully builds postgres and adds it to a binary cache,
   most people will not run the build themselves.

3. This bug seems to be tickled because two different Makefiles
   attempt to build the same target at the same time, using a tool
   (macOS libtool/ranlib, albeit through a recursive $(MAKE) invocation)
   that doesn't support subsecond timestamps, on a filesystem that does
   (APFS). That's a bit of a corner case. I speculate that building a
   static library with a nonrecursive Makefile would only kick off one
   build of the `.a` file, because make will only invoke the command
   once as it walks the DAG of dependencies.

#1 and #2 mean the actual amount of building done is relatively low, and
#3 means that it is actually somewhat hard to trip over.

I have filed https://github.com/NixOS/nixpkgs/issues/51221 with nixpkgs,
and now @delroth is talking about patching cctools' ranlib.


Thanks again for all your help. I probably won't have time to dig into
this further but if you need more information I'll see what I can do.

-- Jack


Jack Kelly <jack@jackkelly.name> writes:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> As for what could be done about it, it seems like we (or the Nix
>> project, in a local patch) could declare individual targets to have
>> .LOW_RESOLUTION_TIME:
>> https://www.gnu.org/software/make/manual/html_node/Special-Targets.html
>> That doesn't seem any better than using "touch" to make a better mtime
>> though.

In fact it's worse, because it opens you up to the same problems that
sub-second timestamps were meant to fix.

After sleeping on this, I'm liking the idea of adding "touch" to our
rule better.  We shouldn't imagine that this problem exists in a vacuum:
Apple got that ranlib code from some BSD or other, so it probably
exists in similar form elsewhere.  And filesystems with sub-second
timestamps are getting more common.  So it seems likely that this issue
could manifest on other combinations than the one we see here.

> IMHO, the Nth-degree "correct" thing for the postgresql build system
> would be to check if the most recent versions are vulnerable, and if so
> update the configure script to detect a high-resolution filesystem and a
> truncating ranlib, and if that is true for that build, then set a
> variable so the Makefiles can conditionally add static libraries to
> `.LOW_RESOLUTION_TIME` targets. This seems like a lot of work for
> marginal payoff, particularly if releases newer than 9.x are not brittle
> in this way.

The issue is still there in the same form.  I agree that this sketch
of the "correct" thing is not going to happen, though.  The "touch"
fix seems like a far more appropriate level of effort, plus it actually
fixes the problem rather than applying a band-aid.  (I have checked
that "touch" applies a sub-second timestamp on APFS, btw.)

> 3. This bug seems to be tickled because two different Makefiles
>    attempt to build the same target at the same time, using a tool
>    (macOS libtool/ranlib, albeit through a recursive $(MAKE) invocation)
>    that doesn't support subsecond timestamps, on a filesystem that does
>    (APFS). That's a bit of a corner case.

Yeah, this.  Under typical circumstances, the worst that would happen
is an extra rebuild of the .a file.  We're unlucky because two such
rebuilds could get launched in parallel, something that I bet is not
that common.
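
How two parallel rebuilds can end in "can't open file" may be easier to see
in a deterministic Python sketch. The interleaving below is an assumption
that is merely consistent with the error in the original report, not
something established in this thread. The three helpers are hypothetical
stand-ins for the "rm -f", archive-creation and ranlib steps of the static
library rule.

```python
import os
import tempfile


def rm_f(path):
    """rm -f: remove the file if present, ignore it if absent."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def ar_create(path):
    """Stand-in for 'ar crs': write a new archive file."""
    with open(path, "wb") as f:
        f.write(b"!<arch>\n")


def ranlib(path):
    """Stand-in for ranlib: it must open an existing archive."""
    with open(path, "r+b"):
        pass


with tempfile.TemporaryDirectory() as tmp:
    lib = os.path.join(tmp, "libpgtypes.a")
    # Assumed interleaving of two sub-makes, A and B, running the same recipe.
    schedule = [("A", rm_f), ("A", ar_create), ("B", rm_f), ("A", ranlib)]
    failure = None
    for who, step in schedule:
        try:
            step(lib)
        except FileNotFoundError:
            # The archive vanished between A's creation step and A's ranlib step.
            failure = (who, step.__name__)
            break
    print(failure)  # prints ('A', 'ranlib')
```

A single unnecessary rebuild would run these steps in order and succeed; the
failure needs a second builder to remove the file in between.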

            regards, tom lane


I wrote:
> After sleeping on this, I'm liking the idea of adding "touch" to our
> rule better.

I've pushed a patch along that line:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=826eff57c4c23f77314ba7151d3dc506ce0fa24c

Possibly the nix PG package would want to absorb that before our next
releases (which aren't scheduled until February).

            regards, tom lane


Re: BUG #15525: Build failures when compiling Postgres with Make parallelization

From
Thomas Munro
Date:
On Fri, Nov 30, 2018 at 5:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jack Kelly <jack@jackkelly.name> writes:
> > Thomas Munro <thomas.munro@enterprisedb.com> writes:
> >> As for what could be done about it, it seems like we (or the Nix
> >> project, in a local patch) could declare individual targets to have
> >> .LOW_RESOLUTION_TIME:
> >> https://www.gnu.org/software/make/manual/html_node/Special-Targets.html
> >> That doesn't seem any better than using "touch" to make a better mtime
> >> though.
>
> In fact it's worse, because it opens you up to the same problems that
> sub-second timestamps were meant to fix.
>
> After sleeping on this, I'm liking the idea of adding "touch" to our
> rule better.  We shouldn't imagine that this problem exists in a vacuum:
> Apple got that ranlib code from some BSD or other, so it probably
> exists in similar form elsewhere.  And filesystems with sub-second
> timestamps are getting more common.  So it seems likely that this issue
> could manifest on other combinations than the one we see here.

+1 for that solution (which I see you've just pushed).

But just for the record, while we're doing amateur software
archeology:  I'm pretty sure Apple's libtool/ranlib is not derived
from BSD... it says it's from NeXT and has no University of California
copyright.  They probably needed something different to work with
Mach-O objects, whereas ancient BSD used a.out and modern BSDen use
ELF.  It also supports their funky fat/universal libraries which NeXT
and Apple used to change CPU architectures several times surprisingly
smoothly.  I don't see anything like that utime() in either modern
FreeBSD (where it's been rewritten at least once) or ancient 4.4BSD
lite sources.

-- 
Thomas Munro
http://www.enterprisedb.com


Thomas Munro <thomas.munro@enterprisedb.com> writes:
> But just for the record, while we're doing amateur software
> archeology:  I'm pretty sure Apple's libtool/ranlib is not derived
> from BSD... it says it's from NeXT and has no University of California
> copyright.  They probably needed something different to work with
> Mach-O objects, whereas ancient BSD used a.out and modern BSDen use
> ELF.  It also supports their funky fat/universal libraries which NeXT
> and Apple used to change CPU architectures several times surprisingly
> smoothly.  I don't see anything like that utime() in either modern
> FreeBSD (where it's been rewritten at least once) or ancient 4.4BSD
> lite sources.

Interesting.  There's definitely some funky behavior in Apple's ranlib.
While testing this, I noted that sometimes it will produce a timestamp
that seems to be max-of-the-input-timestamps (truncated to seconds),
which can be much older than current time.  Other times it will produce
current time (truncated to seconds).  No idea what's causing this
difference in behavior.

            regards, tom lane


Re: BUG #15525: Build failures when compiling Postgres with Make parallelization

From
Andrew Gierth
Date:
>>>>> "Thomas" == Thomas Munro <thomas.munro@enterprisedb.com> writes:

 Thomas> But just for the record, while we're doing amateur software
 Thomas> archeology: I'm pretty sure Apple's libtool/ranlib is not
 Thomas> derived from BSD... it says it's from NeXT and has no
 Thomas> University of California copyright. They probably needed
 Thomas> something different to work with Mach-O objects, whereas
 Thomas> ancient BSD used a.out and modern BSDen use ELF. It also
 Thomas> supports their funky fat/universal libraries which NeXT and
 Thomas> Apple used to change CPU architectures several times
 Thomas> surprisingly smoothly. I don't see anything like that utime()
 Thomas> in either modern FreeBSD (where it's been rewritten at least
 Thomas> once) or ancient 4.4BSD lite sources.

I also noticed that an Apple manpage mentions that the linker at one
time compared the mod-time of the .a file with the embedded timestamp of
its archive symbol table member, which is probably why the utime() call
existed in the first place. I don't recall that behavior in other
linkers, offhand.

-- 
Andrew (irc:RhodiumToad)