On Wed, Jun 12, 2019 at 01:42:01PM -0400, Alvaro Herrera wrote:
> I looked at the buildfarm failures for the recoveryCheck stage. It
> looks like there is only one failure for branch master after this
> commit, which was chipmunk saying:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2019-05-12%2020%3A37%3A11
> AFAICS this is wholly unrelated to the problem at hand.
Yes, that's unrelated.
This failure is interesting, still it has happened only once per what
I can see. I think that this points out a rare race condition in test
007_sync_rep.pl because of this sequence which reorders the standbys:
# Stop and start standbys to rearrange the order of standbys
# in WalSnd array. Now, if standbys have the same priority,
# standby2 is selected preferentially and standby3 is next.
$node_standby_1->stop;
$node_standby_2->stop;
$node_standby_3->stop;
$node_standby_2->start;
$node_standby_3->start;
The failure actually indicates that even if standby3 has started
after standby2, standby3 has registered back into the WAL sender array
of the primary before standby2 had the occasion to do it, but we have
no guarantee that the ordering is actually right. So this has messed
up with the expected set of sync standbys when these are not directly
listed. I think that this can happen as well when initializing the
standbys at the top of the test, but the window is much, much
narrower and basically impossible to reach.
Using a combo of safe_psql('checkpoint') + wait_for_catchup() makes
the test faster, but that's much more costly than using just
poll_query_until() on pg_stat_replication to make sure that each
standby is registered on the primary's WAL sender array. In short,
something like the attached should improve the stability of the test.
Thoughts?
--
Michael