Re: Unexpected "shared memory block is still in use" - Mailing list pgsql-hackers

From Noah Misch
Subject Re: Unexpected "shared memory block is still in use"
Date
Msg-id 20190510072213.GA1098093@rfd.leadboat.com
Whole thread Raw
In response to Re: Unexpected "shared memory block is still in use"  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Unexpected "shared memory block is still in use"
List pgsql-hackers
On Thu, May 09, 2019 at 06:47:58PM -0400, Tom Lane wrote:
> I wrote:
> > However, I have a new theory, after noticing that c09850992 moved the
> > check for shm_nattch == 0.  Previously, if a shmem segment had zero attach
> > count, it was unconditionally considered not-a-threat.  Now, we'll try
> > shmat() anyway, and if that fails for any reason other than EACCES, we say
> > SHMSTATE_ANALYSIS_FAILURE which leads to the described error report.
> > So I suspect that what we hit was a race condition whereby some other
> > parallel test was using the same shmem ID and we managed to see its
> > segment successfully in shmctl but then it was gone by the time we did
> > shmat.  This leads me to think that EINVAL and EIDRM failures from
> > shmat had better be considered SHMSTATE_ENOENT not
> > SHMSTATE_ANALYSIS_FAILURE.
> > In principle this is a longstanding race condition, but I wonder
> > whether we made it more probable by moving the shm_nattch check.
> 
> Hah --- this is a real race condition, and I can demonstrate it very
> easily by inserting a sleep right there, as in the attached
> for-testing-only patch.
> 
> The particular parallelism level I use is
> 
> make -s check-world -j4 PROVE_FLAGS='-j4 --quiet --nocolor --nocount'
> 
> on a dual-socket 4-cores-per-socket Xeon machine.  With that command and
> this patch, I frequently get multiple failures per run, and they all
> report either EINVAL or EIDRM.
> 
> The patch generally reports that nattch had been 1, so my thought that
> that change might've made it worse seems unfounded.  But we have
> absolutely got a hittable race condition here.  The real fix should
> be on the order of
> 
>         if (errno == EACCES)
>             return SHMSTATE_FOREIGN;
> +        else if (errno == EINVAL || errno == EIDRM)
> +            return SHMSTATE_ENOENT;
>         else
>             return SHMSTATE_ANALYSIS_FAILURE;
> 
> (plus comments of course).

Looks good.  That is basically a defect in commit c09850992; the race passed
from irrelevance to relevance when that commit subjected more segments to the
test.  Thanks for diagnosing it.



pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: POC: Cleaning up orphaned files using undo logs
Next
From: Noah Misch
Date:
Subject: Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid