For failover to work correctly when someone changes the
synchronous_standby_names GUC to enable synchronous replication, we
need to know the exact moment at which backends begin to block, so that
we can correctly determine when it is safe to fail over without data
loss.
There's an older mailing list thread that discusses one aspect of this:
https://www.postgresql.org/message-id/flat/CABrsG8j3kPD%2Bkbbsx_isEpFvAgaOBNGyGpsqSjQ6L8vwVUaZAQ%40mail.gmail.com
I've also gone through the code for SyncRepWaitForLSN() and worked
backwards to where the checkpointer sets sync_standbys_defined, but I
have a question that I haven't been able to answer so far.
It looks like sync_standbys_defined is only updated by the checkpointer
process. Is there a short period of time where the pg_stat_replication
view would show sync_state=sync and state=streaming, but the
checkpointer has not yet updated sync_standbys_defined?
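To make the question concrete, the suspected window can be sketched as
a toy model. This is not PostgreSQL code: the class and method names
are hypothetical, and it assumes (which is exactly what is being asked)
that the view derives sync_state from the reloaded GUC while commits
are gated on the shared flag that only the checkpointer updates.

```python
from dataclasses import dataclass


@dataclass
class ToyPrimary:
    # Hypothetical stand-ins for illustration, not PostgreSQL structures.
    guc_sync_standby_names: str = ""     # value seen after a config reload
    sync_standbys_defined: bool = False  # shared flag, set by checkpointer only

    def reload_config(self, names: str) -> None:
        # Models the reload reaching the processes that compute sync_state.
        self.guc_sync_standby_names = names

    def checkpointer_update(self) -> None:
        # Models the checkpointer publishing the shared flag some time later.
        self.sync_standbys_defined = self.guc_sync_standby_names != ""

    def reported_sync_state(self, standby: str) -> str:
        # Assumption: the view is derived from the GUC, not from the flag.
        listed = self.guc_sync_standby_names.split(",")
        return "sync" if standby in listed else "async"

    def commit_would_wait(self) -> bool:
        # Models the early exit in SyncRepWaitForLSN() when the flag is unset.
        return self.sync_standbys_defined


primary = ToyPrimary()
primary.reload_config("standby1")
# Suspected window: view already says "sync", commits do not wait yet.
in_window = (primary.reported_sync_state("standby1") == "sync"
             and not primary.commit_would_wait())
primary.checkpointer_update()
```

In this model in_window is True between the reload and the
checkpointer's update; whether the real server can be observed in that
state is the open question.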
I'm wondering whether this is a race condition in which COMMITs are not
yet being blocked for replication, but external tools that rely on
pg_stat_replication would conclude that it's safe to fail over with
zero data loss.
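For reference, the kind of check an external tool might perform looks
roughly like the sketch below. The function name and the row shape
(dicts keyed by pg_stat_replication column names) are assumptions made
for illustration, not taken from any particular tool.

```python
def looks_safe_to_failover(rows):
    """Naive check: at least one standby is streaming and reported sync.

    `rows` is assumed to be a list of dicts holding the `state` and
    `sync_state` columns of pg_stat_replication.
    """
    return any(
        row["state"] == "streaming" and row["sync_state"] == "sync"
        for row in rows
    )


sample_rows = [
    {"application_name": "standby1", "state": "streaming", "sync_state": "sync"},
    {"application_name": "standby2", "state": "catchup", "sync_state": "async"},
]
naive_verdict = looks_safe_to_failover(sample_rows)
```

A check like this looks only at the view, so it cannot tell whether
backends have actually started waiting in SyncRepWaitForLSN().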
-Jeremy