Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid - Mailing list pgsql-hackers
From | Noah Misch |
---|---|
Subject | Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid |
Date | |
Msg-id | 20190408064141.GA2016666@rfd.leadboat.com Whole thread Raw |
In response to | Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid (Noah Misch <noah@leadboat.com>) |
Responses |
Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid
Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid |
List | pgsql-hackers |
On Thu, Apr 04, 2019 at 07:53:19AM -0700, Noah Misch wrote: > On Wed, Apr 03, 2019 at 07:05:43PM -0700, Noah Misch wrote: > > Pushed, but that broke two buildfarm members: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2019-04-04%2000%3A33%3A14 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2019-04-04%2000%3A33%3A13 > > > > I think the problem arose because these animals run on the same machine, and > > their test execution was synchronized to the second. Two copies of the new > > test ran concurrently. It doesn't tolerate that, owing to expectations about > > which shared memory keys are in use. My initial thought is to fix this by > > having a third postmaster that runs throughout the test and represents > > ownership of a given port. If that postmaster gets something other than the > > first shm key pertaining to its port, switch ports and try again. > > > > I'll also include fixes for the warnings Andres reported on the > > pgsql-committers thread. > > This thread's 2019-04-03 patches still break buildfarm members in multiple > ways. I plan to revert them. I'll wait a day or two before doing that, in > case more failure types show up. Notable classes of buildfarm failure: - AIX animals failed two ways. First, I missed a "use" statement such that poll_start() would fail if it needed more than one attempt. Second, I assumed $pid would be gone as soon as kill(9, $pid) returned[1]. - komodoensis and idiacanthus failed due to 16ee6ea not fully resolving the problems with concurrent execution. I reproduced the various concurrency bugs by setting up four vpath build trees and looping the one test in each: for dir in 0 1 2 3; do (until [ -f /tmp/stopprove ]; do make -C $dir/src/test/recovery installcheck PROVE_TESTS=t/017_shm.pl;done) & done; wait; rm /tmp/stopprove - elver failed due to semaphore exhaustion. I'm reducing max_connections. - lorikeet's FailedAssertion("!(vmq->mq_sender == ((void *)0))" looked suspicious, but this happened six other times in the past year[2], always on v10 lorikeet. - Commit 0aa0ccf, a wrong back-patch, saw 100% failure of the new test. While it didn't cause a buildfarm failure, I'm changing the non-test code to treat shmat() EACCESS as SHMSTATE_FOREIGN, so we ignore that key and move to another. In the previous version, I treated it as SHMSTATE_ANALYSIS_FAILURE and blocked startup. In HEAD today, shmat() failure blocks startup if and only if we got the shmid from postmaster.pid; there's no distinction between EACCES and other causes. Attached v4 fixes all the above. I've also attached the incremental diff versus the code I reverted. [1] POSIX says "sig or at least one pending unblocked signal shall be delivered to the sending thread before kill() returns." I doubt the postmaster had another signal pending often enough to explain the failures, so AIX probably doesn't follow POSIX in this respect. [2] All examples in the last year: sysname │ stage │ branch │ snapshot │ url ──────────┼────────────────┼───────────────┼─────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────── lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-04 09:49:55 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-04%2009:49:55 lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-05 13:15:24 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-05%2013:15:24 lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-06 09:33:35 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-06%2009:33:35 lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-15 20:52:36 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-15%2020:52:36 lorikeet │ Check │ REL_10_STABLE │ 2019-02-20 10:40:40 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2019-02-20%2010:40:40 lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2019-03-06 09:31:24 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2019-03-06%2009:31:24 lorikeet │ Check │ REL_10_STABLE │ 2019-04-04 09:47:02 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2019-04-04%2009:47:02
Attachment
pgsql-hackers by date: