Re: Race conditions in 019_replslot_limit.pl - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Race conditions in 019_replslot_limit.pl
Date
Msg-id 20220325000234.uzsbjzk4evshdizl@alap3.anarazel.de
Whole thread Raw
In response to Re: Race conditions in 019_replslot_limit.pl  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2022-03-24 15:05:40 -0400, Tom Lane wrote:
> I just noticed something very interesting: in a couple of recent
> buildfarm runs with this failure, the pg_stat_activity printout
> no longer shows the extra walsender:

Oh. That is interesting. Thanks for catching that.


> What this suggests to me is that maybe the extra walsender is
> indeed not blocked on anything, but is just taking its time
> about exiting.  In these two runs, as well as in all the
> non-failing runs, it had enough time to do so.

Still odd that it started with the bugfix I committed. And the locking pattern
in [1] still seems suspicious. But the obove does seem to suggest those might
just have been red herrings.


I had previously tried to make process exit slow by inserting delays in
various points and only succeeded in tests failing due too many
connections. For some reason I didn't increase max_connections...

I now increased the test's max_connections and managed to get a kind of
similar failure by sticking
                elog(LOG, "received disconnect");
                pg_usleep(220000);
into the 'X' case in PostgresMain().

However, I always see three pids or the expected one pid, not two like the BF
cases. But it's not too surprising that such an artificial, uniform, slowdown
would have different symptoms than reality.


> I suggest that we add a couple-of-seconds sleep in front of
> the query that collects walsender PIDs, and maybe a couple more
> seconds before the pg_stat_activity probe in the failure path,
> and see if the behavior changes at all.

Makes sense. I was previously thinking that it'd make sense to sleep for a bit
before the ->stop('fast'); calls, so that we can see whether it's just the
shutdown unblocking things (which [1] seems to suggest).

How about the attached variation, which retries (for 15s, with 100ms sleeps)
if there are multiple walsenders, printing the debugging info each time? It'll
still fail the test if later iterations find just one walsender, which seems
the right behaviour for now.


> That should be enough to confirm or disprove this idea pretty quickly.  If
> it is right, a permanent fix could be to wait for the basebackup's walsender
> to disappear from node_primary3's pg_stat_activity before we start the one
> for node_standby3.

For some tests a "confimed disconnect" mode would be useful. Basically the
client waiting until it receives EOF after sending 'X'.

Greetings,

Andres Freund

[1] https://postgr.es/m/20220226072325.wtvkwvvga2wc3nkn%40alap3.anarazel.de

Attachment

pgsql-hackers by date:

Previous
From: Tatsuo Ishii
Date:
Subject: Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors