Thread: "pg_ctl: the PID file ... is empty" at end of make check
Hello, Today I saw a one-off case of $SUBJECT, on macOS. I can't reproduce it, but I noticed exactly the same thing on longfin the other day: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25%2005%3A39%3A04 Anyone know what that's about? -- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes: > Today I saw a one-off case of $SUBJECT, on macOS. I can't reproduce > it, but I noticed exactly the same thing on longfin the other day: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25%2005%3A39%3A04 > Anyone know what that's about? No :-(. I was wondering about that longfin failure too. What macOS version are you running? (longfin is up to date, 10.14.1. Is it worth noting that I jacked up the name and drove new hardware underneath it just last week? Hard to see how that could be related, but ...) regards, tom lane
On Wed, Nov 28, 2018 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: > > Today I saw a one-off case of $SUBJECT, on macOS. I can't reproduce > > it, but I noticed exactly the same thing on longfin the other day: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25%2005%3A39%3A04 > > Anyone know what that's about? > > No :-(. I was wondering about that longfin failure too. What macOS > version are you running? > > (longfin is up to date, 10.14.1. Is it worth noting that I jacked > up the name and drove new hardware underneath it just last week? > Hard to see how that could be related, but ...) I have not yet accepted the offer to update to 10.14.1, so I guess it must be 10.14.0, though the about box just says 10.14. According to sw_vers, it's 10.14 build 18A391. -- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes: > Today I saw a one-off case of $SUBJECT, on macOS. I can't reproduce > it, but I noticed exactly the same thing on longfin the other day: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25%2005%3A39%3A04 I trawled the buildfarm logs and discovered a second instance of exactly the same thing: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-19%2018%3A37%3A00 There have not been any other occurrences in the past 3 months, which is as far back as I went. (lorikeet has half a dozen occurrences of "could not stop postmaster", which is what I was grepping for, but they all are associated with that machine's intermittent postmaster crashes.) So that lets out the flaky-hardware theory: that occurrence is before longfin's hardware transplant. Also, I don't think I believe the OS-bug idea either, given that you saw it on 10.14.0. longfin's been running 10.14.something since 2018-09-26, and has accumulated circa 200 runs since then just on HEAD, never mind the back branches. It'd be pretty unlikely to see it only in the past week, and only on HEAD, if it were an OS bug introduced two months ago. So my theory is we broke something in HEAD a couple weeks ago. But what? The fsync changes you made are suspiciously close to this issue (ie one could explain it as written data not getting out), and were committed in the right time frame, but that change didn't affect writes to postmaster.pid did it? regards, tom lane
On Wed, Nov 28, 2018 at 5:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: > > Today I saw a one-off case of $SUBJECT, on macOS. I can't reproduce > > it, but I noticed exactly the same thing on longfin the other day: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25%2005%3A39%3A04 > > I trawled the buildfarm logs and discovered a second instance of exactly > the same thing: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-19%2018%3A37%3A00 > > There have not been any other occurrences in the past 3 months, which is > as far back as I went. (lorikeet has half a dozen occurrences of "could > not stop postmaster", which is what I was grepping for, but they all > are associated with that machine's intermittent postmaster crashes.) > > So that lets out the flaky-hardware theory: that occurrence is before > longfin's hardware transplant. > > Also, I don't think I believe the OS-bug idea either, given that you > saw it on 10.14.0. longfin's been running 10.14.something since > 2018-09-26, and has accumulated circa 200 runs since then just on HEAD, > never mind the back branches. It'd be pretty unlikely to see it only > in the past week, and only on HEAD, if it were an OS bug introduced two > months ago. Yeah, it'd be slightly easier to believe when High Sierra first came out and every hfs+ volume was silently migrated to the brand new apfs. But yeah, that idea seems like a long shot at this point. > So my theory is we broke something in HEAD a couple weeks ago. But what? Hmm. Not seeing it. I'm trying to do it again, with a make check loop. > The fsync changes you made are suspiciously close to this issue (ie one > could explain it as written data not getting out), and were committed in > the right time frame, but that change didn't affect writes to > postmaster.pid did it? Commit 9ccdd7f6 doesn't affect writes to anything. It just changes the elevel if certain fsync calls fail (and incidentally none near this code, and in any case there was no failure). -- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes: > On Wed, Nov 28, 2018 at 5:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So my theory is we broke something in HEAD a couple weeks ago. But what? > Hmm. Not seeing it. I'm trying to do it again, with a make check loop. >> The fsync changes you made are suspiciously close to this issue (ie one >> could explain it as written data not getting out), and were committed in >> the right time frame, but that change didn't affect writes to >> postmaster.pid did it? > Commit 9ccdd7f6 doesn't affect writes to anything. It just changes > the elevel if certain fsync calls fail (and incidentally none near > this code, and in any case there was no failure). Yeah ... the other weird thing about this is that the postmaster log shows normal shutdown. Is it possible that unlink() on APFS is not atomic? That is, the sequence of events is something like pg_ctl: open("postmaster.pid") postmaster: unlink("postmaster.pid") pg_ctl: reads file, gets zero bytes But that idea is in the OS-bug class, which we agree seems unlikely. regards, tom lane
On Wed, Nov 28, 2018 at 6:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Is it possible that unlink() on APFS is not atomic? That is, the > sequence of events is something like > > pg_ctl: open("postmaster.pid") > postmaster: unlink("postmaster.pid") > pg_ctl: reads file, gets zero bytes > > But that idea is in the OS-bug class, which we agree seems unlikely. I think you might be right. https://github.com/macdice/unlinktest -- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes: > On Wed, Nov 28, 2018 at 6:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Is it possible that unlink() on APFS is not atomic? > I think you might be right. > https://github.com/macdice/unlinktest Bleah. But you can do better than ask whether it's a bug: you can quote POSIX: The unlink() function shall remove a link to a file. If path names a symbolic link, unlink() shall remove the symbolic link named by path and shall not affect any file or directory named by the contents of the symbolic link. Otherwise, unlink() shall remove the link named by the pathname pointed to by path and shall decrement the link count of the file referenced by the link. When the file's link count becomes 0 and no process has the file open, the space occupied by the file shall be freed and the file shall no longer be accessible. If one or more processes have the file open when the last link is removed, the link shall be removed before unlink() returns, but the removal of the file contents shall be postponed until all references to the file are closed. Not a lot of wiggle room there. regards, tom lane
On Thu, Nov 29, 2018 at 3:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: > > On Wed, Nov 28, 2018 at 6:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> Is it possible that unlink() on APFS is not atomic? > > > I think you might be right. > > https://github.com/macdice/unlinktest > > Bleah. But you can do better than ask whether it's a bug: you can > quote POSIX: > > The unlink() function shall remove a link to a file. If path names a > symbolic link, unlink() shall remove the symbolic link named by path > and shall not affect any file or directory named by the contents of > the symbolic link. Otherwise, unlink() shall remove the link named by > the pathname pointed to by path and shall decrement the link count of > the file referenced by the link. > > When the file's link count becomes 0 and no process has the file open, > the space occupied by the file shall be freed and the file shall no > longer be accessible. If one or more processes have the file open when > the last link is removed, the link shall be removed before unlink() > returns, but the removal of the file contents shall be postponed until > all references to the file are closed. > > Not a lot of wiggle room there. Agreed. Secret non-shareable bug report filed. Fingers crossed. -- Thomas Munro http://www.enterprisedb.com
[ blast from the past dept. ] Thomas Munro <thomas.munro@enterprisedb.com> writes: > On Thu, Nov 29, 2018 at 3:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Thomas Munro <thomas.munro@enterprisedb.com> writes: >>> https://github.com/macdice/unlinktest >> Bleah. But you can do better than ask whether it's a bug: you can >> quote POSIX: >> ... >> Not a lot of wiggle room there. > Agreed. Secret non-shareable bug report filed. Fingers crossed. Since that conversation, longfin has shown the same symptom just once more: longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for ser ver to shut down......pg_ctl: the PID file "/Users/buildfarm/bf-data/REL_11_STAB LE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdat a/postmaster.pid" is empty and now prairiedog has shown it too: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-14%2021%3A45%3A47 which is positively fascinating, because prairiedog is running a bronze-age version of macOS that surely never heard of APFS. So this makes it look like this is a basic macOS bug that's not as filesystem-dependent as one might think. Trawling the buildfarm database finds no other matches in the last year, so whatever it is, it's pretty darn improbable. Did you ever get any reaction to that bug report? regards, tom lane
On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: > > Agreed. Secret non-shareable bug report filed. Fingers crossed. > > Since that conversation, longfin has shown the same symptom > just once more: > > longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for ser > ver to shut down......pg_ctl: the PID file "/Users/buildfarm/bf-data/REL_11_STAB > LE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdat > a/postmaster.pid" is empty > > and now prairiedog has shown it too: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-14%2021%3A45%3A47 > > which is positively fascinating, because prairiedog is running a > bronze-age version of macOS that surely never heard of APFS. > So this makes it look like this is a basic macOS bug that's not > as filesystem-dependent as one might think. Does https://github.com/macdice/unlinktest show the problem on that system? > Trawling the buildfarm database finds no other matches in the last year, > so whatever it is, it's pretty darn improbable. > > Did you ever get any reaction to that bug report? No acknowledgement. I did learn from an anonymous source that "it’s prioritized and in the queue".
Thomas Munro <thomas.munro@gmail.com> writes: > On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> and now prairiedog has shown it too: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-14%2021%3A45%3A47 >> which is positively fascinating, because prairiedog is running a >> bronze-age version of macOS that surely never heard of APFS. >> So this makes it look like this is a basic macOS bug that's not >> as filesystem-dependent as one might think. > Does https://github.com/macdice/unlinktest show the problem on that system? It does, though with a very low frequency: $ ./unlinktest $ ./unlinktest 10000 read 0 bytes, unexpected $ ./unlinktest 10000 read 0 bytes, unexpected read 0 bytes, unexpected $ ./unlinktest 10000 $ The failure rate on my recent-vintage laptop is more like one failure every five loops. regards, tom lane
On Fri, Oct 18, 2019 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> and now prairiedog has shown it too: > >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-14%2021%3A45%3A47 > >> which is positively fascinating, because prairiedog is running a > >> bronze-age version of macOS that surely never heard of APFS. > >> So this makes it look like this is a basic macOS bug that's not > >> as filesystem-dependent as one might think. > > > Does https://github.com/macdice/unlinktest show the problem on that system? > > It does, though with a very low frequency: > > $ ./unlinktest > $ ./unlinktest 10000 > read 0 bytes, unexpected > $ ./unlinktest 10000 > read 0 bytes, unexpected > read 0 bytes, unexpected > $ ./unlinktest 10000 > $ > > The failure rate on my recent-vintage laptop is more like one > failure every five loops. Wow. Ok, I'll add a note to the bug report to say it's reproducible on "Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007; root:xnu-792.24.17~1/RELEASE_PPC" next time I'm near an Apple device that will let me log into the bug thing. On the off-chance that someone from Apple stumbles on this and is interested, the Radar number is rdar://46318934 and the title is "unlink(2) is not atomic (kernel/file system bug)".
On Fri, Oct 18, 2019 at 2:26 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Oct 18, 2019 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Thomas Munro <thomas.munro@gmail.com> writes: > > > On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> and now prairiedog has shown it too: > > >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-14%2021%3A45%3A47 > > >> which is positively fascinating, because prairiedog is running a > > >> bronze-age version of macOS that surely never heard of APFS. > > >> So this makes it look like this is a basic macOS bug that's not > > >> as filesystem-dependent as one might think. > > > > > Does https://github.com/macdice/unlinktest show the problem on that system? > > > > It does, though with a very low frequency: > > > > $ ./unlinktest > > $ ./unlinktest 10000 > > read 0 bytes, unexpected > > $ ./unlinktest 10000 > > read 0 bytes, unexpected > > read 0 bytes, unexpected > > $ ./unlinktest 10000 > > $ > > > > The failure rate on my recent-vintage laptop is more like one > > failure every five loops. > > Wow. Ok, I'll add a note to the bug report to say it's reproducible > on "Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007; > root:xnu-792.24.17~1/RELEASE_PPC" next time I'm near an Apple device > that will let me log into the bug thing. On the off-chance that > someone from Apple stumbles on this and is interested, the Radar > number is rdar://46318934 and the title is "unlink(2) is not atomic > (kernel/file system bug)". My bug report got an automated-looking message telling me to retest in Big Sur beta 6 back in September, and I've just now upgraded an old x86 Mac to Big Sur 11.01 and I can no longer reproduce the problem, so it looks like it was fixed! Thanks, Apple.
Thomas Munro <thomas.munro@gmail.com> writes: > My bug report got an automated-looking message telling me to retest in > Big Sur beta 6 back in September, and I've just now upgraded an old > x86 Mac to Big Sur 11.01 and I can no longer reproduce the problem, so > it looks like it was fixed! Thanks, Apple. Hah! I've occasionally despaired that Apple actually reads anything I file there, so this is glad tidings indeed. regards, tom lane