I just noticed something very interesting: in a couple of recent
buildfarm runs with this failure, the pg_stat_activity printout
no longer shows the extra walsender:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-03-24%2017%3A50%3A10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-03-23%2011%3A00%3A05
This is just two of the 33 such failures in the past ten days,
so maybe it's not surprising that we didn't see it already.
(I got bored before looking back further than that.)
What this suggests to me is that maybe the extra walsender is
indeed not blocked on anything, but is just taking its time
about exiting. In these two runs, as well as in all the
non-failing runs, it had enough time to do so.
I suggest that we add a couple-of-seconds sleep in front of
the query that collects walsender PIDs, and maybe a couple more
seconds before the pg_stat_activity probe in the failure path,
and see if the behavior changes at all. That should be enough
to confirm or disprove this idea pretty quickly. If it is
right, a permanent fix could be to wait for the basebackup's
walsender to disappear from node_primary3's pg_stat_activity
before we start the one for node_standby3.
regards, tom lane