Re: Race conditions in 019_replslot_limit.pl - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Race conditions in 019_replslot_limit.pl
Date
Msg-id 20220225192941.hqnvefgdzaro6gzg@alap3.anarazel.de
In response to Re: Race conditions in 019_replslot_limit.pl  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

On 2022-02-22 18:06:24 -0800, Andres Freund wrote:
> On 2022-02-18 15:14:15 -0800, Andres Freund wrote:
> > I'm running out of ideas for how to try to reproduce this. I think we might
> > need some additional debugging information to get more information from the
> > buildfarm.
> 
> > I'm thinking of adding log_min_messages=DEBUG2 to primary3, passing --verbose
> > to pg_basebackup in $node_primary3->backup(...).
> >
> > It might also be worth adding DEBUG2 messages to ReplicationSlotShmemExit(),
> > ReplicationSlotCleanup(), InvalidateObsoleteReplicationSlots().
> 
> Planning to commit something like the attached.

This did provide us a bit more detail.

Seems to suggest something is holding a problematic lock in a way that I do not understand yet:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check
2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST 1997084:6] 019_replslot_limit.pl LOG:  received replication command: CREATE_REPLICATION_SLOT "pg_basebackup_1997084" TEMPORARY PHYSICAL ( RESERVE_WAL)
...
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:14] 019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:15] 019_replslot_limit.pl DEBUG:  replication slot exit hook, without active slot
2022-02-23 09:09:52.518 EST [2022-02-23 09:09:52 EST 1997084:16] 019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin

That's the last message from 1997084 until the immediate shutdown. Just checking
for temporary replication slots that need to be dropped requires
ReplicationSlotControlLock; actually dropping one also requires
ReplicationSlotAllocationLock.

1997084 does have a temporary replication slot to clean up.
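
For context, the locking in that cleanup path looks roughly like the sketch
below. This is a simplified paraphrase of ReplicationSlotCleanup() /
ReplicationSlotDropPtr() in src/backend/replication/slot.c from memory, not the
exact code (the real function also takes each slot's spinlock while checking
active_pid and restarts the scan after a drop):

#include "postgres.h"
#include "miscadmin.h"
#include "replication/slot.h"
#include "storage/lwlock.h"

/* Simplified sketch only, to show which locks are involved. */
static void
TempSlotCleanupSketch(void)
{
    /* Scanning for our own temporary slots needs the control lock, shared. */
    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

        if (s->in_use && s->active_pid == MyProcPid)
        {
            LWLockRelease(ReplicationSlotControlLock);

            /*
             * The actual drop (ReplicationSlotDropPtr(), static in slot.c)
             * acquires ReplicationSlotAllocationLock exclusively, and later
             * ReplicationSlotControlLock again to mark the slot unused.
             * A backend waiting on the allocation lock would hang here,
             * i.e. after logging "temporary replication slot cleanup: begin"
             * but before "done".
             */
            ReplicationSlotDropPtr(s);
            return;
        }
    }

    LWLockRelease(ReplicationSlotControlLock);
}

If that sketch is roughly right, a process that logs "begin" but never "done"
is most plausibly blocked on one of those LWLock acquisitions.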


2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:35] 019_replslot_limit.pl DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to make
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:36] 019_replslot_limit.pl DEBUG:  replication slot exit hook, without active slot
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:37] 019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: begin
2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:38] 019_replslot_limit.pl DEBUG:  temporary replication slot cleanup: done

1997083 succeeds in doing the cleanup, but it does not have a temporary
replication slot, so it only needs ReplicationSlotControlLock. That makes it
look like ReplicationSlotAllocationLock somehow hasn't been released.


2022-02-23 09:09:52.519 EST [2022-02-23 09:09:52 EST 1997083:39] 019_replslot_limit.pl DEBUG:  shmem_exit(0): 7 on_shmem_exit callbacks to make
...
2022-02-23 09:09:53.076 EST [2022-02-23 09:09:52 EST 1997072:87] LOG:  received immediate shutdown request
...
2022-02-23 09:09:53.095 EST [2022-02-23 09:09:52 EST 1997072:90] DEBUG:  server process (PID 1997084) exited with exit code 2

It's *possible*, but not likely, that somehow 1997084 just doesn't get
scheduled for a prolonged amount of time.


We could be more certain if we shut down the cluster in fast rather than
immediate mode. So I'm thinking of doing something like this:

# We've seen occasional cases where multiple walsender pids are active. An
# immediate shutdown may hide evidence of a locking bug. So if multiple
# walsenders are observed, shut down in fast mode, and collect some more
# information.
if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
{
    my ($stdout, $stderr);
    $node_primary3->psql('postgres',
                         "\\a\\t\nSELECT * FROM pg_stat_activity",
                         stdout => \$stdout, stderr => \$stderr);
    diag $stdout, $stderr;
    $node_primary3->stop('fast');
    $node_standby3->stop('fast');
    die "could not determine walsender pid, can't continue";
}

Does that make sense? Better ideas?

Greetings,

Andres Freund


