Re: Weak-memory specific problem in ResetLatch/WaitLatch (follow-up analysis) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Weak-memory specific problem in ResetLatch/WaitLatch (follow-up analysis)
Date
Msg-id CA+TgmobGdJ_2hC5rQ06MVH7ZtTrp-QEFBt4a+3BEaDKa4+6XbQ@mail.gmail.com
Whole thread Raw
In response to Re: Weak-memory specific problem in ResetLatch/WaitLatch (follow-up analysis)  (Michael Tautschnig <mt@debian.org>)
List pgsql-hackers
On Sat, Mar 24, 2012 at 1:01 PM, Michael Tautschnig <mt@debian.org> wrote:
> Here, "the two writes on Worker 0" corresponds to lines 15 and 16. And indeed
> line 16 is exactly the call to SetLatch. For solving problem 1, the mp idiom,
> the following options are possible (in all cases stronger synchronisation
> primitives may be used, i.e., the strongest Power barrier, sync, may be used, or
> lwsync may be used instead of an address dependency):
>
> 1. An lwsync at the beginning of SetLatch, and lwsync in ResetLatch (preferably
> after the write).
> 2. An lwsync at the beginning of SetLatch, and an address dependency in
> ResetLatch.
>
> To address the second problem, the lb idiom, an address dependency has to be put
> either in WaitLatch or SetLatch.
>
> To fix both problems, the performance-wise cheapest option would thus be placing
> an address dependency in ResetLatch and an lwsync in SetLatch. For practical
> reasons, however, placing an lwsync in both places (at the beginning of SetLatch
> and after the write in ResetLatch) might be preferable, as address dependencies
> may be optimised away by the C compiler or require inline assembly in a form not
> as easy to factor out as lwsync, plus the interface of ResetLatch would have to
> be amended.
>
> In summary, we were thus able to show that both points marked with "XXX there
> really ought to be a memory barrier" in
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=4e15a4db5e65e43271f8d20750d6500ab12632d0
>
> are the appropriate points to place memory synchronisation primitives, and
> picking an lwsync-equivalent in both cases is sound and does not require any
> other modifications.

It's interesting that you've concluded that memory barriers are needed
in exactly the two places that Tom concluded they were needed.  I
think your analysis of what type of memory barrier is required is
faulty, however.  In your version of the code, setting the latch is
represented by a single store.  However, SetLatch() is more
complicated than that - it does a load, and depending on the results
of the load, it does a store and then maybe a system call.  I suspect
that if you change line 16 to read:

if (!latch[i+1 % WORKERS]) latch[i+1 % WORKERS] = 1;

...then your tool will tell you that you need a full barrier before
that rather than just a store/store barrier.

After staring at this for a while, I am fairly certain that
ResetLatch() needs a full barrier, too.  It seems to me that the
formal hazard you're guarding against by inserting an lwsync here is
the load/load dependency between loading the latch and loading the
flag.  In your version, that's probably OK, but in real life, it's
not, because WaitLatch() can wake up for a variety of reasons, not
just because the latch has been set.  So the following is possible:
worker #1 executes line 14, worker #2 wakes up after line 19 and
executes line 10 and then performs the load on line 12 before the
store on line 11 (since the lwsync you propose doesn't act as a
store/load barrier), worker #1 executes lines 15 and 16, worker #2 now
does the store on line 11.  At this point we are hosed, because worker
#2 has clobbered worker #1's attempt to set the latch without seeing
that the flag is set, and everybody goes into the tank and waits
forever (or until worker #2 receives another wake-up from some other
source, at which point we'll be back in business).

I proposed before that a barrier was also needed at the start of
WaitLatch(), to guard against a load/load dependency between loading
the flag (line 12) and loading the latch (line 19).  Our version of
WaitLatch doesn't busy-wait, so the concern is approximately that we
could do the load at line 19, conclude that no waiting is needed, then
do the load at line 12, do the store at line 14, and then go to sleep.That actually can't happen with this exact code,
becauseif we were
 
to execute line 14 then we'd also hit the proposed lwsync at line 16
and so the loads would happen in order.  But lines 15-16 need not be
there: the most common use of this machinery is to hand of requests
from a foreground process to a background process, and the background
process need not be friendly enough to supply a barrier after checking
the flag and before calling WaitLatch().  I think a load/load barrier
here would be enough, but we've actually got a full barrier, because
WaitLatch calls drainSelfPipe() before checking the flag, and as noted
in the comments we assume that a system call acts as a full barrier.
So there's no live bug here, but I think it's worth noting that you
can't conclude that WaitLatch() doesn't need a barrier on the basis of
this simplified example.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Re: [COMMITTERS] pgsql: Replace empty locale name with implied value in CREATE DATABASE
Next
From: Andrew Dunstan
Date:
Subject: Odd out of memory problem.