Thread: "pg_ctl: the PID file ... is empty" at end of make check

"pg_ctl: the PID file ... is empty" at end of make check

From

Thomas Munro

Date:

28 November 2018, 01:57:18

Hello,

Today I saw a one-off case of $SUBJECT, on macOS.  I can't reproduce
it, but I noticed exactly the same thing on longfin the other day:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25%2005%3A39%3A04

Anyone know what that's about?

-- 
Thomas Munro
http://www.enterprisedb.com

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Tom Lane

Date:

28 November 2018, 03:10:26

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Today I saw a one-off case of $SUBJECT, on macOS.  I can't reproduce
> it, but I noticed exactly the same thing on longfin the other day:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25%2005%3A39%3A04
> Anyone know what that's about?

No :-(.  I was wondering about that longfin failure too.  What macOS
version are you running?

(longfin is up to date, 10.14.1.  Is it worth noting that I jacked
up the name and drove new hardware underneath it just last week?
Hard to see how that could be related, but ...)

            regards, tom lane

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Thomas Munro

Date:

28 November 2018, 03:18:35

On Wed, Nov 28, 2018 at 4:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > Today I saw a one-off case of $SUBJECT, on macOS.  I can't reproduce
> > it, but I noticed exactly the same thing on longfin the other day:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25%2005%3A39%3A04
> > Anyone know what that's about?
>
> No :-(.  I was wondering about that longfin failure too.  What macOS
> version are you running?
>
> (longfin is up to date, 10.14.1.  Is it worth noting that I jacked
> up the name and drove new hardware underneath it just last week?
> Hard to see how that could be related, but ...)

I have not yet accepted the offer to update to 10.14.1, so I guess it
must be 10.14.0, though the about box just says 10.14.  According to
sw_vers, it's 10.14 build 18A391.

-- 
Thomas Munro
http://www.enterprisedb.com

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Tom Lane

Date:

28 November 2018, 04:28:30

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Today I saw a one-off case of $SUBJECT, on macOS.  I can't reproduce
> it, but I noticed exactly the same thing on longfin the other day:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25%2005%3A39%3A04

I trawled the buildfarm logs and discovered a second instance of exactly
the same thing:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-19%2018%3A37%3A00

There have not been any other occurrences in the past 3 months, which is
as far back as I went.  (lorikeet has half a dozen occurrences of "could
not stop postmaster", which is what I was grepping for, but they all
are associated with that machine's intermittent postmaster crashes.)

So that lets out the flaky-hardware theory: that occurrence is before
longfin's hardware transplant.

Also, I don't think I believe the OS-bug idea either, given that you
saw it on 10.14.0.  longfin's been running 10.14.something since
2018-09-26, and has accumulated circa 200 runs since then just on HEAD,
never mind the back branches.  It'd be pretty unlikely to see it only
in the past week, and only on HEAD, if it were an OS bug introduced two
months ago.

So my theory is we broke something in HEAD a couple weeks ago.  But what?

The fsync changes you made are suspiciously close to this issue (ie one
could explain it as written data not getting out), and were committed in
the right time frame, but that change didn't affect writes to
postmaster.pid did it?

            regards, tom lane

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Thomas Munro

Date:

28 November 2018, 05:31:10

On Wed, Nov 28, 2018 at 5:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > Today I saw a one-off case of $SUBJECT, on macOS.  I can't reproduce
> > it, but I noticed exactly the same thing on longfin the other day:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-25%2005%3A39%3A04
>
> I trawled the buildfarm logs and discovered a second instance of exactly
> the same thing:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2018-11-19%2018%3A37%3A00
>
> There have not been any other occurrences in the past 3 months, which is
> as far back as I went.  (lorikeet has half a dozen occurrences of "could
> not stop postmaster", which is what I was grepping for, but they all
> are associated with that machine's intermittent postmaster crashes.)
>
> So that lets out the flaky-hardware theory: that occurrence is before
> longfin's hardware transplant.
>
> Also, I don't think I believe the OS-bug idea either, given that you
> saw it on 10.14.0.  longfin's been running 10.14.something since
> 2018-09-26, and has accumulated circa 200 runs since then just on HEAD,
> never mind the back branches.  It'd be pretty unlikely to see it only
> in the past week, and only on HEAD, if it were an OS bug introduced two
> months ago.

Yeah, it'd be slightly easier to believe when High Sierra first came
out and every hfs+ volume was silently migrated to the brand new apfs.
But yeah, that idea seems like a long shot at this point.

> So my theory is we broke something in HEAD a couple weeks ago.  But what?

Hmm.  Not seeing it.  I'm trying to do it again, with a make check loop.

> The fsync changes you made are suspiciously close to this issue (ie one
> could explain it as written data not getting out), and were committed in
> the right time frame, but that change didn't affect writes to
> postmaster.pid did it?

Commit 9ccdd7f6 doesn't affect writes to anything.  It just changes
the elevel if certain fsync calls fail (and incidentally none near
this code, and in any case there was no failure).

-- 
Thomas Munro
http://www.enterprisedb.com

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Tom Lane

Date:

28 November 2018, 05:47:14

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Wed, Nov 28, 2018 at 5:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So my theory is we broke something in HEAD a couple weeks ago.  But what?

> Hmm.  Not seeing it.  I'm trying to do it again, with a make check loop.

>> The fsync changes you made are suspiciously close to this issue (ie one
>> could explain it as written data not getting out), and were committed in
>> the right time frame, but that change didn't affect writes to
>> postmaster.pid did it?

> Commit 9ccdd7f6 doesn't affect writes to anything.  It just changes
> the elevel if certain fsync calls fail (and incidentally none near
> this code, and in any case there was no failure).

Yeah ... the other weird thing about this is that the postmaster log
shows normal shutdown.

Is it possible that unlink() on APFS is not atomic?  That is, the
sequence of events is something like

pg_ctl: open("postmaster.pid")
                                     postmaster: unlink("postmaster.pid")
pg_ctl: reads file, gets zero bytes

But that idea is in the OS-bug class, which we agree seems unlikely.

            regards, tom lane

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Thomas Munro

Date:

28 November 2018, 09:34:41

On Wed, Nov 28, 2018 at 6:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Is it possible that unlink() on APFS is not atomic?  That is, the
> sequence of events is something like
>
> pg_ctl: open("postmaster.pid")
>                                      postmaster: unlink("postmaster.pid")
> pg_ctl: reads file, gets zero bytes
>
> But that idea is in the OS-bug class, which we agree seems unlikely.

I think you might be right.

https://github.com/macdice/unlinktest

-- 
Thomas Munro
http://www.enterprisedb.com

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Tom Lane

Date:

28 November 2018, 14:30:21

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Wed, Nov 28, 2018 at 6:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Is it possible that unlink() on APFS is not atomic?

> I think you might be right.
> https://github.com/macdice/unlinktest

Bleah.  But you can do better than ask whether it's a bug: you can
quote POSIX:

    The unlink() function shall remove a link to a file. If path names a
    symbolic link, unlink() shall remove the symbolic link named by path
    and shall not affect any file or directory named by the contents of
    the symbolic link. Otherwise, unlink() shall remove the link named by
    the pathname pointed to by path and shall decrement the link count of
    the file referenced by the link.

    When the file's link count becomes 0 and no process has the file open,
    the space occupied by the file shall be freed and the file shall no
    longer be accessible. If one or more processes have the file open when
    the last link is removed, the link shall be removed before unlink()
    returns, but the removal of the file contents shall be postponed until
    all references to the file are closed.

Not a lot of wiggle room there.

            regards, tom lane

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Thomas Munro

Date:

28 November 2018, 21:19:30

On Thu, Nov 29, 2018 at 3:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > On Wed, Nov 28, 2018 at 6:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Is it possible that unlink() on APFS is not atomic?
>
> > I think you might be right.
> > https://github.com/macdice/unlinktest
>
> Bleah.  But you can do better than ask whether it's a bug: you can
> quote POSIX:
>
>     The unlink() function shall remove a link to a file. If path names a
>     symbolic link, unlink() shall remove the symbolic link named by path
>     and shall not affect any file or directory named by the contents of
>     the symbolic link. Otherwise, unlink() shall remove the link named by
>     the pathname pointed to by path and shall decrement the link count of
>     the file referenced by the link.
>
>     When the file's link count becomes 0 and no process has the file open,
>     the space occupied by the file shall be freed and the file shall no
>     longer be accessible. If one or more processes have the file open when
>     the last link is removed, the link shall be removed before unlink()
>     returns, but the removal of the file contents shall be postponed until
>     all references to the file are closed.
>
> Not a lot of wiggle room there.

Agreed.  Secret non-shareable bug report filed.  Fingers crossed.

-- 
Thomas Munro
http://www.enterprisedb.com

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Tom Lane

Date:

15 October 2019, 00:54:55

[ blast from the past dept. ]

Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Thu, Nov 29, 2018 at 3:30 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>>> https://github.com/macdice/unlinktest

>> Bleah.  But you can do better than ask whether it's a bug: you can
>> quote POSIX:
>> ...
>> Not a lot of wiggle room there.

> Agreed.  Secret non-shareable bug report filed.  Fingers crossed.

Since that conversation, longfin has shown the same symptom
just once more:

 longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for ser
ver to shut down......pg_ctl: the PID file "/Users/buildfarm/bf-data/REL_11_STAB
LE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdat
a/postmaster.pid" is empty

and now prairiedog has shown it too:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-14%2021%3A45%3A47

which is positively fascinating, because prairiedog is running a
bronze-age version of macOS that surely never heard of APFS.
So this makes it look like this is a basic macOS bug that's not
as filesystem-dependent as one might think.

Trawling the buildfarm database finds no other matches in the last year,
so whatever it is, it's pretty darn improbable.

Did you ever get any reaction to that bug report?

            regards, tom lane

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Thomas Munro

Date:

15 October 2019, 01:39:16

On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > Agreed.  Secret non-shareable bug report filed.  Fingers crossed.
>
> Since that conversation, longfin has shown the same symptom
> just once more:
>
>  longfin | REL_11_STABLE | 2019-07-28 22:29:03 | recoveryCheck | waiting for ser
> ver to shut down......pg_ctl: the PID file "/Users/buildfarm/bf-data/REL_11_STAB
> LE/pgsql.build/src/test/recovery/tmp_check/t_001_stream_rep_standby_2_data/pgdat
> a/postmaster.pid" is empty
>
> and now prairiedog has shown it too:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-14%2021%3A45%3A47
>
> which is positively fascinating, because prairiedog is running a
> bronze-age version of macOS that surely never heard of APFS.
> So this makes it look like this is a basic macOS bug that's not
> as filesystem-dependent as one might think.

Does https://github.com/macdice/unlinktest show the problem on that system?

> Trawling the buildfarm database finds no other matches in the last year,
> so whatever it is, it's pretty darn improbable.
>
> Did you ever get any reaction to that bug report?

No acknowledgement.   I did learn from an anonymous source that "it’s
prioritized and in the queue".

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Tom Lane

Date:

17 October 2019, 12:20:58

Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> and now prairiedog has shown it too:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-14%2021%3A45%3A47
>> which is positively fascinating, because prairiedog is running a
>> bronze-age version of macOS that surely never heard of APFS.
>> So this makes it look like this is a basic macOS bug that's not
>> as filesystem-dependent as one might think.

> Does https://github.com/macdice/unlinktest show the problem on that system?

It does, though with a very low frequency:

$ ./unlinktest
$ ./unlinktest 10000
read 0 bytes, unexpected
$ ./unlinktest 10000
read 0 bytes, unexpected
read 0 bytes, unexpected
$ ./unlinktest 10000
$

The failure rate on my recent-vintage laptop is more like one
failure every five loops.

            regards, tom lane

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Thomas Munro

Date:

18 October 2019, 01:26:37

On Fri, Oct 18, 2019 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> and now prairiedog has shown it too:
> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-14%2021%3A45%3A47
> >> which is positively fascinating, because prairiedog is running a
> >> bronze-age version of macOS that surely never heard of APFS.
> >> So this makes it look like this is a basic macOS bug that's not
> >> as filesystem-dependent as one might think.
>
> > Does https://github.com/macdice/unlinktest show the problem on that system?
>
> It does, though with a very low frequency:
>
> $ ./unlinktest
> $ ./unlinktest 10000
> read 0 bytes, unexpected
> $ ./unlinktest 10000
> read 0 bytes, unexpected
> read 0 bytes, unexpected
> $ ./unlinktest 10000
> $
>
> The failure rate on my recent-vintage laptop is more like one
> failure every five loops.

Wow.  Ok, I'll add a note to the bug report to say it's reproducible
on "Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007;
root:xnu-792.24.17~1/RELEASE_PPC" next time I'm near an Apple device
that will let me log into the bug thing.  On the off-chance that
someone from Apple stumbles on this and is interested, the Radar
number is rdar://46318934 and the title is "unlink(2) is not atomic
(kernel/file system bug)".

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Thomas Munro

Date:

14 November 2020, 03:14:14

On Fri, Oct 18, 2019 at 2:26 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Oct 18, 2019 at 1:21 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > On Tue, Oct 15, 2019 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> and now prairiedog has shown it too:
> > >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-14%2021%3A45%3A47
> > >> which is positively fascinating, because prairiedog is running a
> > >> bronze-age version of macOS that surely never heard of APFS.
> > >> So this makes it look like this is a basic macOS bug that's not
> > >> as filesystem-dependent as one might think.
> >
> > > Does https://github.com/macdice/unlinktest show the problem on that system?
> >
> > It does, though with a very low frequency:
> >
> > $ ./unlinktest
> > $ ./unlinktest 10000
> > read 0 bytes, unexpected
> > $ ./unlinktest 10000
> > read 0 bytes, unexpected
> > read 0 bytes, unexpected
> > $ ./unlinktest 10000
> > $
> >
> > The failure rate on my recent-vintage laptop is more like one
> > failure every five loops.
>
> Wow.  Ok, I'll add a note to the bug report to say it's reproducible
> on "Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007;
> root:xnu-792.24.17~1/RELEASE_PPC" next time I'm near an Apple device
> that will let me log into the bug thing.  On the off-chance that
> someone from Apple stumbles on this and is interested, the Radar
> number is rdar://46318934 and the title is "unlink(2) is not atomic
> (kernel/file system bug)".

My bug report got an automated-looking message telling me to retest in
Big Sur beta 6 back in September, and I've just now upgraded an old
x86 Mac to Big Sur 11.01 and I can no longer reproduce the problem, so
it looks like it was fixed!  Thanks, Apple.

Re: "pg_ctl: the PID file ... is empty" at end of make check

From

Tom Lane

Date:

14 November 2020, 03:23:51

Thomas Munro <thomas.munro@gmail.com> writes:
> My bug report got an automated-looking message telling me to retest in
> Big Sur beta 6 back in September, and I've just now upgraded an old
> x86 Mac to Big Sur 11.01 and I can no longer reproduce the problem, so
> it looks like it was fixed!  Thanks, Apple.

Hah!  I've occasionally despaired that Apple actually reads anything
I file there, so this is glad tidings indeed.

            regards, tom lane