Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Date
Msg-id 8671.1497468828@sss.pgh.pa.us
Whole thread Raw
In response to Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
List pgsql-hackers
Yesterday lorikeet failed the select_parallel test in a new way:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-13%2020%3A28%3A33

2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map dynamic shared memory segment
2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG:  worker process: parallel worker for PID 10072 (PID 11896) exited
withexit code 1 
2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG:  statement: select stringu1::int2 from tenk1 where unique1 = 1;
TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count
<=1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974) 
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG:  server process (PID 10072) was terminated by signal 6: Aborted
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL:  Failed process was running: select stringu1::int2 from tenk1
whereunique1 = 1; 
2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG:  terminating any other active server processes

So the first problem here is the lack of supporting information for the
'could not map' failure.  AFAICS, this must mean either that dsm_attach()
returned without ever calling dsm_impl_op() at all, or that
dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or
ERROR_ACCESS_DENIED.  It's far from clear why those cases should be
treated as a silent fail.  It's even less clear why dsm_attach's early
exit cases don't produce any messages.  But since we're left not knowing
what happened, the messaging design here is clearly inadequate.

The subsequent assertion crash came from here:
   /*    * If this is a parallel worker, check whether there are already too many    * parallel workers; if so, don't
registeranother one.  Our view of    * parallel_terminate_count may be slightly stale, but that doesn't really    *
matter:we would have gotten the same result if we'd arrived here    * slightly earlier anyway.  There's no help for it,
either,since the    * postmaster must not take locks; a memory barrier wouldn't guarantee    * anything useful.    */
if(parallel && (BackgroundWorkerData->parallel_register_count -
BackgroundWorkerData->parallel_terminate_count)>=       max_parallel_workers)   {
Assert(BackgroundWorkerData->parallel_register_count-              BackgroundWorkerData->parallel_terminate_count <=
         MAX_PARALLEL_WORKER_LIMIT);       LWLockRelease(BackgroundWorkerLock);       return false;   } 

What I suspect happened is that parallel_register_count was less than
parallel_terminate_count; since those counters are declared as uint32,
the negative difference would have been treated as a large unsigned value,
triggering the assertion.  It's not very clear how that happened, but my
bet is that the postmaster incremented parallel_terminate_count more than
once while cleaning up after the crashed worker.  It looks to me like
there's nothing stopping BackgroundWorkerStateChange from incrementing it
and then the eventual ForgetBackgroundWorker call from incrementing it
again.  I haven't traced through things to identify why this might only
occur in a worker-failure scenario, but surely we want to make sure that
the counter increment happens once and only once per worker.

I'm also noting that ForgetBackgroundWorker is lacking a memory barrier
between incrementing parallel_terminate_count and marking the slot free.
Maybe it doesn't need one, but since there is one in
BackgroundWorkerStateChange, it looks weird to not have it here.
        regards, tom lane



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [HACKERS] logical replication busy-waiting on a lock
Next
From: Robert Haas
Date:
Subject: Re: [HACKERS] Adding support for Default partition in partitioning