Re: Race conditions in 019_replslot_limit.pl - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Race conditions in 019_replslot_limit.pl
Date
Msg-id 3042597.1648148740@sss.pgh.pa.us
Whole thread Raw
In response to Re: Race conditions in 019_replslot_limit.pl  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Race conditions in 019_replslot_limit.pl
List pgsql-hackers
I just noticed something very interesting: in a couple of recent
buildfarm runs with this failure, the pg_stat_activity printout
no longer shows the extra walsender:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-03-24%2017%3A50%3A10
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=xenodermus&dt=2022-03-23%2011%3A00%3A05

This is just two of the 33 such failures in the past ten days,
so maybe it's not surprising that we didn't see it already.
(I got bored before looking back further than that.)

What this suggests to me is that maybe the extra walsender is
indeed not blocked on anything, but is just taking its time
about exiting.  In these two runs, as well as in all the
non-failing runs, it had enough time to do so.

I suggest that we add a couple-of-seconds sleep in front of
the query that collects walsender PIDs, and maybe a couple more
seconds before the pg_stat_activity probe in the failure path,
and see if the behavior changes at all.  That should be enough
to confirm or disprove this idea pretty quickly.  If it is
right, a permanent fix could be to wait for the basebackup's
walsender to disappear from node_primary3's pg_stat_activity
before we start the one for node_standby3.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: pg_walinspect - a new extension to get raw WAL data and WAL stats
Next
From: Andrew Dunstan
Date:
Subject: Re: Granting SET and ALTER SYSTE privileges for GUCs