Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid - Mailing list pgsql-hackers

From Noah Misch
Subject Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid
Date
Msg-id 20190408064141.GA2016666@rfd.leadboat.com
In response to Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
On Thu, Apr 04, 2019 at 07:53:19AM -0700, Noah Misch wrote:
> On Wed, Apr 03, 2019 at 07:05:43PM -0700, Noah Misch wrote:
> > Pushed, but that broke two buildfarm members:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=idiacanthus&dt=2019-04-04%2000%3A33%3A14
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2019-04-04%2000%3A33%3A13
> > 
> > I think the problem arose because these animals run on the same machine, and
> > their test execution was synchronized to the second.  Two copies of the new
> > test ran concurrently.  It doesn't tolerate that, owing to expectations about
> > which shared memory keys are in use.  My initial thought is to fix this by
> > having a third postmaster that runs throughout the test and represents
> > ownership of a given port.  If that postmaster gets something other than the
> > first shm key pertaining to its port, switch ports and try again.
> > 
> > I'll also include fixes for the warnings Andres reported on the
> > pgsql-committers thread.
> 
> This thread's 2019-04-03 patches still break buildfarm members in multiple
> ways.  I plan to revert them.  I'll wait a day or two before doing that, in
> case more failure types show up.

Notable classes of buildfarm failure:

- AIX animals failed two ways.  First, I missed a "use" statement such that
  poll_start() would fail if it needed more than one attempt.  Second, I
  assumed $pid would be gone as soon as kill(9, $pid) returned[1].
- komodoensis and idiacanthus failed due to 16ee6ea not fully resolving the
  problems with concurrent execution.  I reproduced the various concurrency
  bugs by setting up four vpath build trees and looping the one test in each:
    for dir in 0 1 2 3; do (until [ -f /tmp/stopprove ]; do make -C $dir/src/test/recovery installcheck PROVE_TESTS=t/017_shm.pl; done) & done; wait; rm /tmp/stopprove
- elver failed due to semaphore exhaustion.  I'm reducing max_connections.
- lorikeet's FailedAssertion("!(vmq->mq_sender == ((void *)0))" looked
  suspicious, but this happened six other times in the past year[2], always on
  v10 lorikeet.
- Commit 0aa0ccf, a wrong back-patch, saw 100% failure of the new test.
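
On the AIX item above, the general pattern of "signal, then poll until the
process is really gone" can be sketched in C as follows.  This is only an
illustration under stated assumptions, not code from the patch (the test
itself is Perl): kill9_and_wait() and its max_polls parameter are hypothetical
names, and the 100 ms poll interval is arbitrary.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): send SIGKILL to pid, then poll
 * until the process no longer exists, instead of assuming it is gone as soon
 * as kill() returns.
 *
 * Returns 1 once kill(pid, 0) reports ESRCH, 0 if the process still exists
 * after max_polls polls, and -1 on any other error (for example EPERM).
 */
static int
kill9_and_wait(pid_t pid, int max_polls)
{
	struct timespec delay = {0, 100 * 1000 * 1000};	/* 100 ms, arbitrary */
	int			i;

	/* Refuse nonsensical targets, including ourselves. */
	assert(pid > 0 && pid != getpid());

	if (kill(pid, SIGKILL) != 0 && errno != ESRCH)
		return -1;

	for (i = 0; i < max_polls; i++)
	{
		/*
		 * If pid is our own child, it lingers as a zombie until reaped, and
		 * kill(pid, 0) would keep succeeding.  Reap it if we can; for a
		 * non-child this fails with ECHILD, which is harmless here.
		 */
		(void) waitpid(pid, NULL, WNOHANG);

		/* Signal 0 performs only the existence and permission checks. */
		if (kill(pid, 0) != 0)
			return (errno == ESRCH) ? 1 : -1;

		nanosleep(&delay, NULL);
	}
	return 0;
}
```

A caller would treat a return value of 0 as "still running after the timeout"
and fail the test rather than proceed on the assumption that the PID is free.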

While it didn't cause a buildfarm failure, I'm changing the non-test code to
treat shmat() EACCES as SHMSTATE_FOREIGN, so we ignore that key and move to
another.  In the previous version, I treated it as SHMSTATE_ANALYSIS_FAILURE
and blocked startup.  In HEAD today, shmat() failure blocks startup if and
only if we got the shmid from postmaster.pid; there's no distinction between
EACCES and other causes.
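
The errno handling described above can be sketched roughly as follows.  This
is a simplified illustration, not the patch itself: SHMSTATE_FOREIGN and
SHMSTATE_ANALYSIS_FAILURE are the names used in this message, while
SketchShmState, SKETCH_ATTACHABLE, classify_shmat_errno() and probe_segment()
are hypothetical names introduced only for the example.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif

#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

typedef enum
{
	SKETCH_ATTACHABLE,			/* hypothetical: shmat() succeeded */
	SHMSTATE_FOREIGN,			/* someone else's segment: skip this key */
	SHMSTATE_ANALYSIS_FAILURE	/* cannot tell: caller should block startup */
} SketchShmState;

/*
 * Map the errno left by a failed shmat() to a state.  EACCES means we lack
 * permission on the segment, so it is treated as foreign and the caller can
 * move on to another key.  Every other failure stays inconclusive.
 */
static SketchShmState
classify_shmat_errno(int shmat_errno)
{
	/* Only meaningful after shmat() has actually failed. */
	assert(shmat_errno != 0);

	return (shmat_errno == EACCES) ? SHMSTATE_FOREIGN
		: SHMSTATE_ANALYSIS_FAILURE;
}

/*
 * Try to attach to the segment identified by shmid and classify the outcome.
 * On success, detach again; this sketch only probes and keeps no mapping.
 */
static SketchShmState
probe_segment(int shmid)
{
	void	   *addr = shmat(shmid, NULL, 0);

	if (addr == (void *) -1)
		return classify_shmat_errno(errno);

	(void) shmdt(addr);
	return SKETCH_ATTACHABLE;
}
```

Keeping the errno mapping in its own function makes the policy change (EACCES
moving from "analysis failure" to "foreign") a one-line difference that can be
checked without creating real segments.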

Attached v4 fixes all the above.  I've also attached the incremental diff
versus the code I reverted.


[1] POSIX says "sig or at least one pending unblocked signal shall be
delivered to the sending thread before kill() returns."  I doubt the
postmaster had another signal pending often enough to explain the failures, so
AIX probably doesn't follow POSIX in this respect.

[2] All examples in the last year:
 sysname  │     stage      │    branch     │      snapshot       │                                            url
──────────┼────────────────┼───────────────┼─────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────
 lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-04 09:49:55 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-04%2009:49:55
 lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-05 13:15:24 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-05%2013:15:24
 lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-06 09:33:35 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-06%2009:33:35
 lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2018-05-15 20:52:36 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-05-15%2020:52:36
 lorikeet │ Check          │ REL_10_STABLE │ 2019-02-20 10:40:40 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2019-02-20%2010:40:40
 lorikeet │ InstallCheck-C │ REL_10_STABLE │ 2019-03-06 09:31:24 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2019-03-06%2009:31:24
 lorikeet │ Check          │ REL_10_STABLE │ 2019-04-04 09:47:02 │ https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2019-04-04%2009:47:02
