Re: Race conditions in 019_replslot_limit.pl - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Race conditions in 019_replslot_limit.pl
Date
Msg-id 20220327213219.smdvfkq2fl74flow@alap3.anarazel.de
Whole thread Raw
In response to Re: Race conditions in 019_replslot_limit.pl  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2022-03-27 15:28:05 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > How about the attached variation, which retries (for 15s, with 100ms sleeps)
> > if there are multiple walsenders, printing the debugging info each time? It'll
> > still fail the test if later iterations find just one walsender, which seems
> > the right behaviour for now.
> 
> We have now four instances of failures with this version of the test,
> and in every case the second iteration succeeded.  Is that enough
> evidence yet?

I still feel like there's something off here. But that's probably not enough
to keep causing failures. I'm inclined to leave the debugging in for a bit
longer, but not fail the test anymore?


The way the temporary slot removal hangs for a while seems just off:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2022-03-27%2017%3A04%3A18
  2022-03-27 13:30:56.993 EDT [2022-03-27 13:30:56 EDT 750695:20] 019_replslot_limit.pl DEBUG:  replication slot drop:
pg_basebackup_750695:removed on-disk
 
  ...
  2022-03-27 13:30:57.456 EDT [2022-03-27 13:30:57 EDT 750759:3] [unknown] LOG:  connection authorized: user=andrew
database=postgresapplication_name=019_replslot_limit.pl
 
  2022-03-27 13:30:57.466 EDT [2022-03-27 13:30:57 EDT 750759:4] 019_replslot_limit.pl LOG:  statement: SELECT * FROM
pg_stat_activity
  .
  2022-03-27 13:30:57.474 EDT [2022-03-27 13:30:56 EDT 750679:87] DEBUG:  server process (PID 750759) exited with exit
code0
 
  2022-03-27 13:30:57.507 EDT [2022-03-27 13:30:56 EDT 750695:21] 019_replslot_limit.pl DEBUG:  replication slot drop:
pg_basebackup_750695:done
 

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2009%3A00%3A09
  2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop:
pg_basebackup_4022819:begin
 
  2022-03-25 10:13:30.364 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop:
pg_basebackup_4022819:removed on-disk
 
  ...
  2022-03-25 10:13:31.038 CET [4022841][client backend][5/7:0][[unknown]] LOG:  connection authorized: user=bf
database=postgresapplication_name=019_replslot_limit.pl
 
  2022-03-25 10:13:31.039 CET [4022841][client backend][5/8:0][019_replslot_limit.pl] LOG:  statement: SELECT * FROM
pg_stat_activity
  ...
  2022-03-25 10:13:31.045 CET [4022807][postmaster][:0][] DEBUG:  server process (PID 4022841) exited with exit code 0
  2022-03-25 10:13:31.056 CET [4022819][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop:
pg_basebackup_4022819:done
 

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-03-25%2008%3A02%3A05
  2022-03-25 09:15:20.558 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop:
pg_basebackup_3730425:removed on-disk
 
  ...
  2022-03-25 09:15:20.803 CET [3730461][client backend][5/7:0][[unknown]] LOG:  connection authorized: user=bf
database=postgresapplication_name=019_replslot_limit.pl
 
  2022-03-25 09:15:20.804 CET [3730461][client backend][5/8:0][019_replslot_limit.pl] LOG:  statement: SELECT * FROM
pg_stat_activity
  ...
  2022-03-25 09:15:20.834 CET [3730381][postmaster][:0][] DEBUG:  server process (PID 3730461) exited with exit code 0
  2022-03-25 09:15:20.861 CET [3730425][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop:
pg_basebackup_3730425:done
 

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2022-03-25%2003%3A00%3A18
  2022-03-25 04:14:03.025 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop:
pg_basebackup_2674398:removed on-disk
 
  ...
  2022-03-25 04:14:03.264 CET [2674463][client backend][5/7:0][[unknown]] LOG:  connection authorized: user=bf
database=postgresapplication_name=019_replslot_limit.pl
 
  2022-03-25 04:14:03.265 CET [2674463][client backend][5/8:0][019_replslot_limit.pl] LOG:  statement: SELECT * FROM
pg_stat_activity
  ...
  2022-03-25 04:14:03.270 CET [2674384][postmaster][:0][] DEBUG:  server process (PID 2674463) exited with exit code 0
  ...
  2022-03-25 04:14:03.324 CET [2674398][walsender][4/0:0][019_replslot_limit.pl] DEBUG:  replication slot drop:
pg_basebackup_2674398:done
 

We are able to start / finish several new connections between the two debug
elog()sin ReplicationSlotDropPtr()?


I wonder if there's something odd going on with ConditionVariableBroadcast().

Might be useful to add another debug message before/after
ConditionVariableBroadcast() and rmtree(). If the delay is due to rmtree()
being slow under concurrent tests I'd feel less concerned (although that
machine has dual NVMe drives...).


I really wish I could reproduce the failure. I went through a few hundred
cycles of that test in a separate checkout on that machine.


> I'd like to silence this noise so that we can start tracking
> lower-probability failure modes, like say these:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2022-03-26%2002%3A59%3A03

That one was a missing compiler flag. I see that failure reproducibly locally
when using asan with clang, unless I use -fsanitize-address-use-after-return=never. gcc has a different default
for the option, which is why I hadn't configured it.

If I understand correctly, the problem is that
-fsanitize-address-use-after-return uses an alternative stack.  Sometimes our
stack depths functions get called with the "proper" stack, and sometimes with
a "shadow" stack. Which breaks our stack depth checking.


> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2022-03-26%2015%3A53%3A51

This one I have no idea about yet. I assume it's just a race in a new test...

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: [PATCH] Accept IP addresses in server certificate SANs
Next
From: Tom Lane
Date:
Subject: Re: Race conditions in 019_replslot_limit.pl