Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests |
Date | |
Msg-id | 8671.1497468828@sss.pgh.pa.us Whole thread Raw |
In response to | Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
|
List | pgsql-hackers |
Yesterday lorikeet failed the select_parallel test in a new way: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2017-06-13%2020%3A28%3A33 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR: could not map dynamic shared memory segment 2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG: worker process: parallel worker for PID 10072 (PID 11896) exited withexit code 1 2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG: statement: select stringu1::int2 from tenk1 where unique1 = 1; TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count <=1024)", File: "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c", Line: 974) 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG: server process (PID 10072) was terminated by signal 6: Aborted 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL: Failed process was running: select stringu1::int2 from tenk1 whereunique1 = 1; 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG: terminating any other active server processes So the first problem here is the lack of supporting information for the 'could not map' failure. AFAICS, this must mean either that dsm_attach() returned without ever calling dsm_impl_op() at all, or that dsm_impl_op()'s Windows codepath encountered ERROR_ALREADY_EXISTS or ERROR_ACCESS_DENIED. It's far from clear why those cases should be treated as a silent fail. It's even less clear why dsm_attach's early exit cases don't produce any messages. But since we're left not knowing what happened, the messaging design here is clearly inadequate. The subsequent assertion crash came from here: /* * If this is a parallel worker, check whether there are already too many * parallel workers; if so, don't registeranother one. Our view of * parallel_terminate_count may be slightly stale, but that doesn't really * matter:we would have gotten the same result if we'd arrived here * slightly earlier anyway. There's no help for it, either,since the * postmaster must not take locks; a memory barrier wouldn't guarantee * anything useful. */ if(parallel && (BackgroundWorkerData->parallel_register_count - BackgroundWorkerData->parallel_terminate_count)>= max_parallel_workers) { Assert(BackgroundWorkerData->parallel_register_count- BackgroundWorkerData->parallel_terminate_count <= MAX_PARALLEL_WORKER_LIMIT); LWLockRelease(BackgroundWorkerLock); return false; } What I suspect happened is that parallel_register_count was less than parallel_terminate_count; since those counters are declared as uint32, the negative difference would have been treated as a large unsigned value, triggering the assertion. It's not very clear how that happened, but my bet is that the postmaster incremented parallel_terminate_count more than once while cleaning up after the crashed worker. It looks to me like there's nothing stopping BackgroundWorkerStateChange from incrementing it and then the eventual ForgetBackgroundWorker call from incrementing it again. I haven't traced through things to identify why this might only occur in a worker-failure scenario, but surely we want to make sure that the counter increment happens once and only once per worker. I'm also noting that ForgetBackgroundWorker is lacking a memory barrier between incrementing parallel_terminate_count and marking the slot free. Maybe it doesn't need one, but since there is one in BackgroundWorkerStateChange, it looks weird to not have it here. regards, tom lane
pgsql-hackers by date: