On Tue, Aug 11, 2020 at 7:55 AM Asim Praveen <pasim@vmware.com> wrote:
> There is no out-of-order execution hazard in the scenario you are describing, memory barriers don’t seem to fit.
> Using locks to synchronise checkpointer process and a committing backend process is the right way. We have made a
> conscious decision to bypass the lock, which looks correct in this case.
Yeah, I am not immediately seeing why a memory barrier would help anything here.
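
To make the pattern concrete, here is a self-contained sketch of what
we're talking about, with made-up names and a pthread mutex standing in
for SyncRepLock, so please don't read it as the actual source. The point
is that the unlocked read is only a fast exit; any backend that decides
to wait re-checks the flag after taking the lock, which is why a barrier
doesn't obviously buy us anything:

/*
 * Sketch only (hypothetical names, pthread mutex in place of an LWLock):
 * an unlocked fast-path read of the "sync standbys defined" flag, with a
 * re-check under the lock before the backend actually queues itself.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t sync_rep_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile bool sync_standbys_defined = false;

/* Committing backend: decide whether to wait for a sync standby. */
static bool
maybe_wait_for_sync_rep(void)
{
    /* Unlocked fast exit: a stale "false" just skips one wait. */
    if (!sync_standbys_defined)
        return false;

    pthread_mutex_lock(&sync_rep_lock);
    /* Re-check under the lock so flag and wait queue stay consistent. */
    if (!sync_standbys_defined)
    {
        pthread_mutex_unlock(&sync_rep_lock);
        return false;
    }
    /* ... add self to the wait queue here, then release and sleep ... */
    pthread_mutex_unlock(&sync_rep_lock);
    return true;
}

/* Checkpointer: flip the flag under the same lock. */
static void
update_sync_standbys_defined(bool newval)
{
    pthread_mutex_lock(&sync_rep_lock);
    /* ... if turning the flag off, wake any queued waiters first ... */
    sync_standbys_defined = newval;
    pthread_mutex_unlock(&sync_rep_lock);
}

The worst case for the unlocked read is seeing a slightly stale value of
the flag, which is the window you describe next.
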
> As an aside, there is a small (?) window where a change to synchronous_standby_names GUC is partially propagated
> among committing backends, checkpointer and walsender. Such a window may result in walsender declaring a standby as
> synchronous while a commit backend fails to wait for it in SyncRepWaitForLSN. The root cause is walsender uses
> sync_standby_priority, a per-walsender variable to tell if a standby is synchronous. It is updated when walsender
> processes a config change. Whereas sync_standbys_defined, a variable updated by checkpointer, is used by committing
> backends to determine if they need to wait. If checkpointer is busy flushing buffers, it may take longer than walsender
> to reflect a change in sync_standbys_defined. This is a low impact problem, should be ok to live with it.
I think this gets to the root of the issue. If we check the flag
without a lock, we might see a slightly stale value. But, considering
that there's no particular amount of time within which configuration
changes are guaranteed to take effect, maybe that's OK. However, there
is one potential gotcha here: if the walsender declares the standby to
be synchronous, a user can see that (in pg_stat_replication, say),
right? So maybe there's this
problem: a user sees that the standby is synchronous and expects a
transaction committing afterward to provoke a wait, but really it
doesn't. Now the user is unhappy, feeling that the system didn't
perform according to expectations.
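
To spell out the sequence I'm worried about, here is another toy sketch
(again made-up names, grossly simplified, nothing like the real code) of
the two reload paths getting out of step after synchronous_standby_names
changes:

#include <stdbool.h>

/* Per-walsender: drives what the user sees as the standby's sync state. */
static int my_sync_standby_priority = 0;

/* Shared flag: consulted by committing backends before they wait. */
static volatile bool sync_standbys_defined = false;

/* Walsender, on config reload: can run well before the checkpointer. */
static void
walsender_reload(void)
{
    my_sync_standby_priority = 1;   /* standby now reported as synchronous */
}

/* Checkpointer, possibly delayed by a long buffer flush. */
static void
checkpointer_reload(void)
{
    sync_standbys_defined = true;   /* only now do commits start waiting */
}

/* Committing backend: in the window after walsender_reload() but before
 * checkpointer_reload(), this returns false and the commit does not wait,
 * even though the standby is already visible as synchronous. */
static bool
backend_should_wait(void)
{
    return sync_standbys_defined;
}
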
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company