Re: Race conditions in 019_replslot_limit.pl - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Race conditions in 019_replslot_limit.pl
Date
Msg-id 3518071.1645060943@sss.pgh.pa.us
Whole thread Raw
In response to Re: Race conditions in 019_replslot_limit.pl  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
> I think the test is telling us that something may be broken. We shouldn't
> silence that without at least some understanding what it is.

I looked at the recent failure on komodoensis [1], and I think what is
happening is just that the walsender for the basebackup run (launched
at 019_replslot_limit.pl line 325) hasn't exited yet at the point where
we do a blind
"SELECT pid FROM pg_stat_activity WHERE backend_type = 'walsender'"
and expect that we're only going to see the walsender launched for
the standby at line 331.  The two PIDs reported in the failure
correspond to this postmaster log trace:

2022-02-16 23:06:29.596 CET [620d7565.38dd62:1] LOG:  connection received: host=[local]
2022-02-16 23:06:29.596 CET [620d7565.38dd62:2] LOG:  replication connection authorized: user=bf
application_name=019_replslot_limit.pl
2022-02-16 23:06:29.596 CET [620d7565.38dd62:3] LOG:  received replication command: SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:4] STATEMENT:  SHOW data_directory_mode
2022-02-16 23:06:29.596 CET [620d7565.38dd62:5] LOG:  received replication command: CREATE_REPLICATION_SLOT
"pg_basebackup_3726690"TEMPORARY PHYSICAL ( RESERVE_WAL) 
2022-02-16 23:06:29.596 CET [620d7565.38dd62:6] STATEMENT:  CREATE_REPLICATION_SLOT "pg_basebackup_3726690" TEMPORARY
PHYSICAL( RESERVE_WAL) 
2022-02-16 23:06:29.597 CET [620d7565.38dd62:7] LOG:  received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:8] STATEMENT:  IDENTIFY_SYSTEM
2022-02-16 23:06:29.597 CET [620d7565.38dd62:9] LOG:  received replication command: START_REPLICATION SLOT
"pg_basebackup_3726690"0/600000 TIMELINE 1 
2022-02-16 23:06:29.597 CET [620d7565.38dd62:10] STATEMENT:  START_REPLICATION SLOT "pg_basebackup_3726690" 0/600000
TIMELINE1 

and this one:

2022-02-16 23:06:29.687 CET [620d7565.38dd6f:1] LOG:  connection received: host=[local]
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:2] LOG:  replication connection authorized: user=bf
application_name=standby_3
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:3] LOG:  received replication command: IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:4] STATEMENT:  IDENTIFY_SYSTEM
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:5] LOG:  received replication command: START_REPLICATION SLOT "rep3"
0/700000TIMELINE 1 
2022-02-16 23:06:29.687 CET [620d7565.38dd6f:6] STATEMENT:  START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1

There's no disconnection log entry for either, which I suppose means
that somebody didn't bother logging disconnection for walsenders ...
shouldn't we fix that?  But in any case, I don't see anything
interesting here, just a query that needs to be more selective.
Perhaps we can look for application_name=standby_3?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2022-02-16%2021%3A16%3A04



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: Nonrandom scanned_pages distorts pg_class.reltuples set by VACUUM
Next
From: Tom Lane
Date:
Subject: Re: libpq async duplicate error results