Thread: What is happening on buildfarm member crake?
Since contrib/test_shm_mq went into the tree, crake has been crashing intermittently (maybe one time in three) at the contribcheck step. While there's not conclusive proof that test_shm_mq is at fault, the circumstantial evidence is pretty strong. For instance, in the postmaster log for the last failure,
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2014-01-19%2013%3A26%3A08
we see

LOG: registering background worker "test_shm_mq"
LOG: starting background worker process "test_shm_mq"
LOG: worker process: test_shm_mq (PID 12585) exited with exit code 1
LOG: unregistering background worker "test_shm_mq"
LOG: registering background worker "test_shm_mq"
LOG: registering background worker "test_shm_mq"
LOG: registering background worker "test_shm_mq"
LOG: starting background worker process "test_shm_mq"
LOG: starting background worker process "test_shm_mq"
LOG: starting background worker process "test_shm_mq"
LOG: worker process: test_shm_mq (PID 12588) exited with exit code 1
LOG: unregistering background worker "test_shm_mq"
LOG: worker process: test_shm_mq (PID 12586) exited with exit code 1
LOG: unregistering background worker "test_shm_mq"
LOG: worker process: test_shm_mq (PID 12587) exited with exit code 1
LOG: unregistering background worker "test_shm_mq"
LOG: server process (PID 12584) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes

Comparing the PIDs makes it seem pretty likely that what's crashing is the backend running the test_shm_mq test. Since the connected psql doesn't notice any problem, most likely the crash occurs during backend shutdown after psql has disconnected (which would also explain why no "current query" gets shown in the crash report).

crake is running F16 x86_64, which I don't have installed here. So I tried to reproduce this on RHEL6 and F19 x86_64 machines, with build options matching what crake says it's using, with absolutely no success. I don't think any other buildfarm critters are showing this either, which makes it look like it's possibly specific to the particular compiler version crake has got. Or maybe there's some other factor needed to trigger it.

Anyway, I wonder whether Andrew could get a stack trace from the core dump next time it happens.

regards, tom lane
I wrote:
> I don't think any other buildfarm critters are showing this either,

Scratch that --- closer inspection of the buildfarm logs shows that kouprey and lapwing have suffered identical failures, though much less often than crake. That lets out the theory that it's x86_64 specific, or even 64-bit specific. Still have no idea why I can't reproduce it here.

regards, tom lane
On 01/19/2014 04:14 PM, Tom Lane wrote:
> I wrote:
>> I don't think any other buildfarm critters are showing this either,
> Scratch that --- closer inspection of the buildfarm logs shows that
> kouprey and lapwing have suffered identical failures, though much less
> often than crake. That lets out the theory that it's x86_64 specific, or
> even 64-bit specific. Still have no idea why I can't reproduce it here.

I can give you an account on the machine if you want to play.

Also crake does produce backtraces on core dumps, and they are at the bottom of the buildfarm log. The latest failure backtrace is reproduced below.

================== stack trace: /home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==================
[New LWP 12584]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: buildfarm contrib_regression_test_shm_mq'.
Program terminated with signal 11, Segmentation fault.
#0  SetLatch (latch=0x1c) at pg_latch.c:509
509         if (latch->is_set)
#0  SetLatch (latch=0x1c) at pg_latch.c:509
#1  0x000000000064c04e in procsignal_sigusr1_handler (postgres_signal_arg=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
#2  <signal handler called>
#3  _dl_fini () at dl-fini.c:190
#4  0x000000361ba39931 in __run_exit_handlers (status=0, listp=0x361bdb1668, run_list_atexit=true) at exit.c:78
#5  0x000000361ba399b5 in __GI_exit (status=<optimized out>) at exit.c:100
#6  0x00000000006485a6 in proc_exit (code=0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
#7  0x0000000000663abb in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0x12b8170 "contrib_regression_test_shm_mq", username=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
#8  0x000000000062220f in BackendRun (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
#9  BackendStartup (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
#10 ServerLoop () at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
#11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
#12 0x000000000045e2e8 in main (argc=3, argv=0x12b7430) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205

cheers

andrew
On Sun, Jan 19, 2014 at 7:53 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Also crake does produce backtraces on core dumps, and they are at the
> bottom of the buildfarm log. The latest failure backtrace is reproduced
> below.
>
> ================== stack trace: /home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==================
> [New LWP 12584]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `postgres: buildfarm contrib_regression_test_shm_mq'.
> Program terminated with signal 11, Segmentation fault.
> #0  SetLatch (latch=0x1c) at pg_latch.c:509
> 509         if (latch->is_set)
> #0  SetLatch (latch=0x1c) at pg_latch.c:509
> #1  0x000000000064c04e in procsignal_sigusr1_handler (postgres_signal_arg=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
> #2  <signal handler called>
> #3  _dl_fini () at dl-fini.c:190
> #4  0x000000361ba39931 in __run_exit_handlers (status=0, listp=0x361bdb1668, run_list_atexit=true) at exit.c:78
> #5  0x000000361ba399b5 in __GI_exit (status=<optimized out>) at exit.c:100
> #6  0x00000000006485a6 in proc_exit (code=0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
> #7  0x0000000000663abb in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0x12b8170 "contrib_regression_test_shm_mq", username=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
> #8  0x000000000062220f in BackendRun (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
> #9  BackendStartup (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
> #10 ServerLoop () at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
> #11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
> #12 0x000000000045e2e8 in main (argc=3, argv=0x12b7430) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205

Hmm, that looks an awful lot like the SIGUSR1 signal handler is getting called after we've already completed shmem_exit. And indeed that seems like the sort of thing that would result in dying horribly in just this way. The obvious fix seems to be to check proc_exit_inprogress before doing anything that might touch shared memory, but there are a lot of other SIGUSR1 handlers that don't do that either. However, in those cases, the likely cause of a SIGUSR1 would be a sinval catchup interrupt or a recovery conflict, which aren't likely to be so far delayed that they arrive after we've already disconnected from shared memory. But the dynamic background workers stuff adds a new possible cause of SIGUSR1: the postmaster letting us know that a child has started or died. And that could happen even after we've detached shared memory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
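For illustration, the "check proc_exit_inprogress before doing anything that might touch shared memory" guard Robert describes could look roughly like the sketch below. The handler name is made up and the body is deliberately minimal; the real procsignal_sigusr1_handler does considerably more, and the fix eventually adopted takes a different shape (see later in the thread).

    #include "postgres.h"

    #include <errno.h>

    #include "storage/ipc.h"      /* proc_exit_inprogress */
    #include "storage/latch.h"    /* SetLatch */
    #include "storage/proc.h"     /* MyProc */

    /*
     * Sketch of a SIGUSR1 handler that refuses to touch shared memory once
     * proc_exit() has started tearing the process down.
     */
    static void
    example_sigusr1_handler(SIGNAL_ARGS)
    {
        int     save_errno = errno;

        /* proc_exit_inprogress is set early in proc_exit() */
        if (!proc_exit_inprogress && MyProc != NULL)
            SetLatch(&MyProc->procLatch);

        errno = save_errno;
    }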
On 01/19/2014 08:22 PM, Robert Haas wrote:
> Hmm, that looks an awful lot like the SIGUSR1 signal handler is
> getting called after we've already completed shmem_exit. And indeed
> that seems like the sort of thing that would result in dying horribly
> in just this way. The obvious fix seems to be to check
> proc_exit_inprogress before doing anything that might touch shared
> memory, but there are a lot of other SIGUSR1 handlers that don't do
> that either. However, in those cases, the likely cause of a SIGUSR1
> would be a sinval catchup interrupt or a recovery conflict, which
> aren't likely to be so far delayed that they arrive after we've
> already disconnected from shared memory. But the dynamic background
> workers stuff adds a new possible cause of SIGUSR1: the postmaster
> letting us know that a child has started or died. And that could
> happen even after we've detached shared memory.

Is anything happening about this? We're still getting quite a few of these:
<http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake>

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> On 01/19/2014 08:22 PM, Robert Haas wrote:
>> Hmm, that looks an awful lot like the SIGUSR1 signal handler is
>> getting called after we've already completed shmem_exit. And indeed
>> that seems like the sort of thing that would result in dying horribly
>> in just this way. The obvious fix seems to be to check
>> proc_exit_inprogress before doing anything that might touch shared
>> memory, but there are a lot of other SIGUSR1 handlers that don't do
>> that either. However, in those cases, the likely cause of a SIGUSR1
>> would be a sinval catchup interrupt or a recovery conflict, which
>> aren't likely to be so far delayed that they arrive after we've
>> already disconnected from shared memory. But the dynamic background
>> workers stuff adds a new possible cause of SIGUSR1: the postmaster
>> letting us know that a child has started or died. And that could
>> happen even after we've detached shared memory.

> Is anything happening about this? We're still getting quite a few of
> these:
> <http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake>

Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible, then this is really just one instance of a bug that's probably pretty widespread in our signal handlers. Somebody needs to go through 'em all and look for touches of shared memory.

I'm not sure if we can just disable signal response the moment the proc_exit_inprogress flag goes up, though. In some cases such as lock handling, it's likely that we need that functionality to keep working for some part of the shutdown process. We might end up having to disable individual signal handlers at appropriate places.

Ick.

regards, tom lane
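One way to "disable individual signal handlers at appropriate places", as Tom suggests, would be to swap SIG_IGN back in at the point where the handler can no longer run safely. The sketch below illustrates only that idea; the function names, and the choice of on_shmem_exit() as the hook, are assumptions made for the example rather than anything that was committed.

    #include "postgres.h"

    #include <signal.h>

    #include "storage/ipc.h"      /* on_shmem_exit */

    /*
     * Hypothetical callback: stop responding to SIGUSR1 so that a late
     * postmaster notification can't touch memory we no longer own.
     */
    static void
    disable_sigusr1_response(int code, Datum arg)
    {
        pqsignal(SIGUSR1, SIG_IGN);
    }

    /*
     * on_shmem_exit callbacks run in reverse order of registration, so
     * registering this one early in startup makes it fire late in the
     * shutdown sequence.
     */
    void
    register_sigusr1_disable(void)
    {
        on_shmem_exit(disable_sigusr1_response, (Datum) 0);
    }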
On 01/25/2014 05:04 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 01/19/2014 08:22 PM, Robert Haas wrote:
>>> Hmm, that looks an awful lot like the SIGUSR1 signal handler is
>>> getting called after we've already completed shmem_exit. And indeed
>>> that seems like the sort of thing that would result in dying horribly
>>> in just this way. The obvious fix seems to be to check
>>> proc_exit_inprogress before doing anything that might touch shared
>>> memory, but there are a lot of other SIGUSR1 handlers that don't do
>>> that either. However, in those cases, the likely cause of a SIGUSR1
>>> would be a sinval catchup interrupt or a recovery conflict, which
>>> aren't likely to be so far delayed that they arrive after we've
>>> already disconnected from shared memory. But the dynamic background
>>> workers stuff adds a new possible cause of SIGUSR1: the postmaster
>>> letting us know that a child has started or died. And that could
>>> happen even after we've detached shared memory.
>> Is anything happening about this? We're still getting quite a few of
>> these:
>> <http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake>
> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
> then this is really just one instance of a bug that's probably pretty
> widespread in our signal handlers. Somebody needs to go through 'em
> all and look for touches of shared memory.
>
> I'm not sure if we can just disable signal response the moment the
> proc_exit_inprogress flag goes up, though. In some cases such as lock
> handling, it's likely that we need that functionality to keep working
> for some part of the shutdown process. We might end up having to disable
> individual signal handlers at appropriate places.
>
> Ick.

Yeah. Since we're now providing for user-defined backends, maybe we need some mechanism for white-listing handlers that can run, in whole or in part, during shutdown.

cheers

andrew
On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
> then this is really just one instance of a bug that's probably pretty
> widespread in our signal handlers. Somebody needs to go through 'em
> all and look for touches of shared memory.

I haven't made a comprehensive study of every signal handler we have, but looking at procsignal_sigusr1_handler, the list of functions that can get called from here is quite short: CheckProcSignal(), RecoveryConflictInterrupt(), SetLatch(), and latch_sigusr1_handler(). Taking those in reverse order:

- latch_sigusr1_handler() is fine. Nothing down this path touches shared memory; moreover, if we've already disowned our latch, the waiting flag won't be set and this will do nothing at all.

- The call to SetLatch() is problematic as we already know. This is new code in 9.4.

- RecoveryConflictInterrupt() does nothing if proc_exit_inprogress is set. So it's fine.

- CheckProcSignal() also appears problematic. If we've already detached shared memory, MyProcSignalSlot will be pointing to garbage, but we'll try to dereference it anyway.

I think maybe the best fix is to *clear* MyProc in ProcKill/AuxiliaryProcKill and MyProcSignalSlot in CleanupProcSignalState, as shown in the attached patch. Most places that dereference those pointers already check that they aren't null, and we can easily add a NULL guard to the SetLatch() call in procsignal_sigusr1_handler, which the attached patch also does.

This might not be a complete fix to every problem of this type that exists anywhere in our code, but I think it's enough to make the world safe for procsignal_sigusr1_handler. We also have a *large* number of signal handlers that do little more than this:

    if (MyProc)
        SetLatch(&MyProc->procLatch);

...and this change should make all of those safe as well. So I think this is a pretty good start.

Assuming nobody objects too much to this basic approach, should I back-patch the parts of this that apply pre-9.4? The problem with CleanupProcSignalState, at least, goes all the way back to 9.0, when the signal-multiplexing infrastructure was introduced. But the probability of an actual crash must be pretty low, or I imagine we would have noticed this sooner.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
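The patch itself is attached to the original message rather than quoted in the thread, so here is a rough sketch of the shape of the fix Robert describes, reconstructed from his description. The _sketch names are placeholders, the real ProcKill and procsignal_sigusr1_handler do quite a bit more, and details may differ from the committed code.

    #include "postgres.h"

    #include "storage/latch.h"
    #include "storage/proc.h"

    /*
     * 1. In the exit path, clear the global pointer *before* releasing the
     *    structure it points to, so a signal handler firing later in
     *    shutdown sees NULL instead of a stale pointer.
     */
    static void
    ProcKill_sketch(int code, Datum arg)
    {
        PGPROC     *proc = MyProc;      /* keep a local copy for cleanup */

        MyProc = NULL;                  /* clear first ... */
        DisownLatch(&proc->procLatch);  /* ... then finish releasing it */
        /* (the real ProcKill also returns the PGPROC to a free list, etc.) */
    }

    /*
     * 2. In the handler, guard against the pointer having been cleared.
     *    (errno save/restore omitted here for brevity.)
     */
    static void
    sigusr1_handler_sketch(SIGNAL_ARGS)
    {
        if (MyProc != NULL)
            SetLatch(&MyProc->procLatch);
    }

The same ordering rule, clear the global and only then disown the latch, is what Tom later points out is missing from WalSndKill.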
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
>> then this is really just one instance of a bug that's probably pretty
>> widespread in our signal handlers. Somebody needs to go through 'em
>> all and look for touches of shared memory.

> I haven't made a comprehensive study of every signal handler we have,
> but looking at procsignal_sigusr1_handler, the list of functions that
> can get called from here is quite short: CheckProcSignal(),
> RecoveryConflictInterrupt(), SetLatch(), and latch_sigusr1_handler().

You forgot HandleCatchupInterrupt and HandleNotifyInterrupt, but those are also safe because they do nothing once proc_exit_inprogress is set.

> I think maybe the best fix is to *clear* MyProc in
> ProcKill/AuxiliaryProcKill and MyProcSignalSlot in
> CleanupProcSignalState, as shown in the attached patch.

This looks good to me in principle. A couple minor beefs:

* The addition to CleanupProcSignalState could use a comment, similar to the one you added in ProcKill.

* I think the code in ProcKill and AuxiliaryProcKill might be more readable if the new local variable was named "myproc" (lower case).

> and we can easily add a NULL guard to the SetLatch() call in
> procsignal_sigusr1_handler, which the attached patch also does.

Um ... no such change actually visible in patch, but it's clearly necessary.

> This might not be a complete fix to every problem of this type that
> exists anywhere in our code, but I think it's enough to make the world
> safe for procsignal_sigusr1_handler.

Yeah; at the least this should cut down on the buildfarm noise we are seeing ATM.

> Assuming nobody objects too much to this basic approach, should I
> back-patch the parts of this that apply pre-9.4?

Yes, I think so. We have seen some reports of irreproducible crashes at process exit, and maybe this explains them.

regards, tom lane
On Fri, Jan 31, 2014 at 6:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This looks good to me in principle. A couple minor beefs:
>
> * The addition to CleanupProcSignalState could use a comment,
> similar to the one you added in ProcKill.

OK.

> * I think the code in ProcKill and AuxiliaryProcKill might be more
> readable if the new local variable was named "myproc" (lower case).

grep indicates that naming is less common than what I picked, so I chose to stick with what I picked.

>> and we can easily add a NULL guard to the SetLatch() call in
>> procsignal_sigusr1_handler, which the attached patch also does.
>
> Um ... no such change actually visible in patch, but it's clearly
> necessary.

Fixed.

>> This might not be a complete fix to every problem of this type that
>> exists anywhere in our code, but I think it's enough to make the world
>> safe for procsignal_sigusr1_handler.
>
> Yeah; at the least this should cut down on the buildfarm noise we
> are seeing ATM.
>
>> Assuming nobody objects too much to this basic approach, should I
>> back-patch the parts of this that apply pre-9.4?
>
> Yes, I think so. We have seen some reports of irreproducible crashes
> at process exit, and maybe this explains them.

OK, done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
>> then this is really just one instance of a bug that's probably pretty
>> widespread in our signal handlers. Somebody needs to go through 'em
>> all and look for touches of shared memory.

> I haven't made a comprehensive study of every signal handler we have,
> [ but here's how to fix procsignal_sigusr1_handler ]

I've trawled all the remaining signal handlers (or at least everything declared with SIGNAL_ARGS, which hopefully is all of them). I find the following bugs:

1. A couple of signal handlers do

    if (MyWalSnd)
        SetLatch(&MyWalSnd->latch);

which is fine as far as it goes, but the shutdown sequence in WalSndKill has exactly the same bug you just fixed in ProcKill: it needs to clear MyWalSnd before disowning the latch, not after.

2. WalRcvSigUsr1Handler and worker_spi_sighup fail to preserve errno.

Will fix, but the latter bug is a tad disappointing considering that the coding rule about saving errno in signal handlers has been there for a *long* time.

regards, tom lane
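Neither of these fixes is shown in the thread itself, but both follow the patterns already established above. A hedged sketch of what they might look like, reconstructed from Tom's description rather than taken from the committed code (the _sketch names are placeholders):

    #include "postgres.h"

    #include <errno.h>

    #include "replication/walsender_private.h"   /* WalSnd, MyWalSnd */
    #include "storage/latch.h"

    /*
     * Bug 1: WalSndKill should clear MyWalSnd before disowning the latch,
     * so a handler doing "if (MyWalSnd) SetLatch(&MyWalSnd->latch)" can't
     * catch the slot halfway through being released.
     */
    static void
    WalSndKill_sketch(int code, Datum arg)
    {
        WalSnd     *walsnd = MyWalSnd;

        MyWalSnd = NULL;               /* clear first ... */
        DisownLatch(&walsnd->latch);   /* ... then release the latch */
        walsnd->pid = 0;               /* mark the slot as free */
    }

    /*
     * Bug 2: every handler must save and restore errno, since anything it
     * calls may clobber errno while the interrupted code still needs it.
     */
    static void
    walrcv_sigusr1_handler_sketch(SIGNAL_ARGS)
    {
        int     save_errno = errno;

        latch_sigusr1_handler();

        errno = save_errno;
    }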