Replication slot drop message is sent after pgstats shutdown. - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Replication slot drop message is sent after pgstats shutdown.
Msg-id CAD21AoBgSTF8gp1SKojKRu9dqzN4p1Ob6Mh=QgVhGfLO1NtUYA@mail.gmail.com
List pgsql-hackers
Hi all,

I found another path where we report stats after the stats collector
has shut down. The reproducer and the backtrace I got are below:

1. psql -c "begin; create table a (a int); select pg_sleep(30); commit;" &
2. pg_recvlogical --create-slot -S slot -d postgres &
3. stop the server

TRAP: FailedAssertion("pgstat_is_initialized && !pgstat_is_shutdown", File: "pgstat.c", Line: 4752, PID: 62789)
0   postgres                            0x000000010a8ed79a ExceptionalCondition + 234
1   postgres                            0x000000010a5e03d2 pgstat_assert_is_up + 66
2   postgres                            0x000000010a5e1dc4 pgstat_send + 20
3   postgres                            0x000000010a5e1d5c pgstat_report_replslot_drop + 108
4   postgres                            0x000000010a64c796 ReplicationSlotDropPtr + 838
5   postgres                            0x000000010a64c0e9 ReplicationSlotDropAcquired + 89
6   postgres                            0x000000010a64bf23 ReplicationSlotRelease + 99
7   postgres                            0x000000010a6d60ab ProcKill + 219
8   postgres                            0x000000010a6a350c shmem_exit + 444
9   postgres                            0x000000010a6a326a proc_exit_prepare + 122
10  postgres                            0x000000010a6a3163 proc_exit + 19
11  postgres                            0x000000010a8ee665 errfinish + 1109
12  postgres                            0x000000010a6e3535 ProcessInterrupts + 1445
13  postgres                            0x000000010a65f654 WalSndWaitForWal + 164
14  postgres                            0x000000010a65edb2 logical_read_xlog_page + 146
15  postgres                            0x000000010a22c336 ReadPageInternal + 518
16  postgres                            0x000000010a22b860 XLogReadRecord + 320
17  postgres                            0x000000010a619c67 DecodingContextFindStartpoint + 231
18  postgres                            0x000000010a65c105 CreateReplicationSlot + 1237
19  postgres                            0x000000010a65b64c exec_replication_command + 1180
20  postgres                            0x000000010a6e6d2b PostgresMain + 2459
21  postgres                            0x000000010a5ef1a9 BackendRun + 89
22  postgres                            0x000000010a5ee6fd BackendStartup + 557
23  postgres                            0x000000010a5ed487 ServerLoop + 759
24  postgres                            0x000000010a5eac22 PostmasterMain + 6610
25  postgres                            0x000000010a4c32d3 main + 819
26  libdyld.dylib                       0x00007fff73477cc9 start + 1

At step #2, the wal sender creates the replication slot and then waits
for the transaction started at step #1 to complete. When the server is
stopped while it is waiting, the wal sender drops the slot as part of
releasing it, since the slot is still RS_EPHEMERAL. After dropping the
slot, it reports the slot-drop message to the stats collector (see
ReplicationSlotDropPtr()). All of this happens in
ReplicationSlotRelease(), called by ProcKill(), which runs as an
on_shmem_exit callback. That is after pgstats has already been shut
down by a before_shmem_exit callback. I’ve not tested it yet, but I
think this can also happen when dropping a temporary slot, since
ProcKill() also calls ReplicationSlotCleanup() to clean up temporary
slots.
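
To make the ordering concrete, here is a toy model in C. It is not
PostgreSQL source code: every model_* name is made up for illustration,
and the real assertion is replaced by a counter so that the sketch does
not abort. The only behavior it relies on is that all before_shmem_exit
callbacks run before any on_shmem_exit callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model only; every name prefixed with model_ is hypothetical. */

typedef void (*exit_callback)(void);

#define MAX_CALLBACKS 8

static exit_callback before_list[MAX_CALLBACKS];
static int before_count;
static exit_callback on_list[MAX_CALLBACKS];
static int on_count;

static bool model_stats_is_shutdown;
static int model_late_reports;  /* reports attempted after stats shutdown */

static void model_before_shmem_exit(exit_callback cb)
{
    assert(before_count < MAX_CALLBACKS);
    before_list[before_count++] = cb;
}

static void model_on_shmem_exit(exit_callback cb)
{
    assert(on_count < MAX_CALLBACKS);
    on_list[on_count++] = cb;
}

/* Stands in for the pgstats shutdown callback. */
static void model_pgstat_shutdown(void)
{
    model_stats_is_shutdown = true;
}

/* Stands in for pgstat_report_replslot_drop(); counts instead of asserting. */
static void model_report_replslot_drop(void)
{
    if (model_stats_is_shutdown)
    {
        model_late_reports++;
        fprintf(stderr, "drop message attempted after stats shutdown\n");
        return;
    }
    puts("drop message sent");
}

/* Stands in for ProcKill(): releasing an ephemeral slot drops it. */
static void model_proc_kill(void)
{
    model_report_replslot_drop();
}

/* Runs both callback lists in order; returns the number of late reports. */
int model_backend_exit(void)
{
    before_count = on_count = 0;
    model_stats_is_shutdown = false;
    model_late_reports = 0;

    model_before_shmem_exit(model_pgstat_shutdown);
    model_on_shmem_exit(model_proc_kill);

    /* before_shmem_exit callbacks run first, most recently registered first */
    while (before_count > 0)
        before_list[--before_count]();
    /* on_shmem_exit callbacks run only after all of the above */
    while (on_count > 0)
        on_list[--on_count]();

    return model_late_reports;
}
```

Calling model_backend_exit() from a main() returns 1 in this model: the
single drop report is attempted only after the stats shutdown callback
has already run, which corresponds to the assertion failure above.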

There are some ideas to fix this issue, but I don’t think it’s a good
idea in this case to move either ProcKill() or the slot releasing code
to before_shmem_exit, as we did for other similar issues[1][2].
Reporting the drop message at the time the slot is dropped isn’t
strictly essential, since autovacuum periodically checks for
already-dropped slots and reports them so that their stats get
dropped. So another idea would be to move
pgstat_report_replslot_drop() to a higher layer, such as
ReplicationSlotDrop() and ReplicationSlotsDropDBSlots(), which are not
called during exit callbacks. With that change, a slot’s stats are
dropped immediately when the slot is dropped via commands such as
pg_drop_replication_slot() and DROP_REPLICATION_SLOT. For temporary
and ephemeral slots, on the other hand, we would rely on autovacuum to
drop their stats. Even though dropping the stats is delayed for those
slots, pg_stat_replication_slots doesn’t show stats for
already-dropped slots.
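
The shape of that idea can be sketched with another toy model in C.
Again, none of this is PostgreSQL code: the model_* functions, the
fixed-size slot array, and the sweep function are assumptions made for
illustration, and the sweep stands in for the periodic cleanup done by
a separate process. The point is only that the low-level drop, which is
reachable from exit callbacks, never touches stats.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; every name prefixed with model_ is hypothetical. */

#define MODEL_MAX_SLOTS 4

typedef struct
{
    bool slot_exists;   /* the replication slot itself */
    bool stats_exist;   /* the stats entry kept for that slot */
} ModelSlot;

static ModelSlot model_slots[MODEL_MAX_SLOTS];
static bool model_stats_up = true;

void model_create_slot(int i)
{
    assert(i >= 0 && i < MODEL_MAX_SLOTS);
    model_slots[i].slot_exists = true;
    model_slots[i].stats_exist = true;
}

void model_stats_shutdown(void)
{
    model_stats_up = false;
}

/* Low-level drop: may run from an exit callback, so it never reports. */
void model_drop_slot_low_level(int i)
{
    assert(i >= 0 && i < MODEL_MAX_SLOTS);
    model_slots[i].slot_exists = false;
}

/* Command-level drop: only reached while stats reporting is still up. */
void model_drop_slot_command(int i)
{
    model_drop_slot_low_level(i);
    assert(model_stats_up);
    model_slots[i].stats_exist = false;
}

/*
 * Periodic sweep, modeled as running in some other process: drops stats
 * entries whose slot is gone and returns how many it removed.
 */
int model_stats_sweep(void)
{
    int removed = 0;

    for (int i = 0; i < MODEL_MAX_SLOTS; i++)
    {
        if (!model_slots[i].slot_exists && model_slots[i].stats_exist)
        {
            model_slots[i].stats_exist = false;
            removed++;
        }
    }
    return removed;
}

bool model_stats_exist(int i)
{
    return model_slots[i].stats_exist;
}

/* The view shows stats only for slots that still exist. */
bool model_stats_visible(int i)
{
    return model_slots[i].slot_exists && model_slots[i].stats_exist;
}
```

In this model, a slot dropped through model_drop_slot_command() loses
its stats right away, while a slot dropped through
model_drop_slot_low_level() after model_stats_shutdown() keeps a stale
stats entry that is hidden from the view and removed by the next
model_stats_sweep().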

Any other ideas?

Regards,

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=675c945394b36c2db0e8c8c9f6209c131ce3f0a8
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=dcac5e7ac157964f71f15d81c7429130c69c3f9b

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


