On 06/29/2017 08:11 AM, Thomas Munro wrote:
> On Thu, Jun 29, 2017 at 4:27 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> /*
>> * Acquiring the lock is not needed, the latch ensures proper
>> * barriers. If it looks like we're done, we must really be done,
>> * because once walsender changes the state to SYNC_REP_WAIT_COMPLETE,
>> * it will never update it again, so we can't be seeing a stale value
>> * in that case.
>> */
>
> Yeah, counting on the latch for free barriers doesn't work if you
> happen to see SYNC_REP_WAIT_COMPLETE first time through the loop, or
> if you see it after a spurious signal woke you and then it's
> immediately set to SYNC_REP_WAIT_COMPLETE. In those cases, the
> following Assert statement is making an assertion about cache
> coherency that doesn't work even on a friendly TSO system.
Ah yes, that's clearly broken.
> Can you reproduce the problem with this experimental patch applied?
I was able to reproduce this, after adding some sleeps (see attached).
It must be pretty hard to hit it normally, or we would've gotten reports
earlier. Although it only really shows with assertions enabled, so that
could also be why.
In your patch, I think the pg_read_barrier() is only necessary, if
assertions are enabled. Then again, better safe than sorry, and at least
on my x86-64 machine with gcc, it doesn't actually change the produced
machine code at all.
I committed your patch and backported to 9.5, with some additional
comments. Thanks for the report Const!
- Heikki
--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs