more descriptive message for process termination due to max_slot_wal_keep_size - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject more descriptive message for process termination due to max_slot_wal_keep_size
Date
Msg-id 20211214.130456.2233153190058148084.horikyota.ntt@gmail.com
List pgsql-hackers
Hello.

As reported on pgsql-bugs [1], when a process is terminated due to
max_slot_wal_keep_size, the related messages don't mention the root
cause of *the termination*.  Note that the third message is not
shown for temporary replication slots.

[pid=a] LOG:  terminating process x to release replication slot "s"
[pid=x] LOG:  FATAL:  terminating connection due to administrator command
[pid=a] LOG:  invalidating slot "s" because its restart_lsn X/X exceeds max_slot_wal_keep_size

The attached patch adds a DETAIL line to the first message.

> [17605] LOG:  terminating process 17614 to release replication slot "s1"
+ [17605] DETAIL:  The slot's restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size.
> [17614] FATAL:  terminating connection due to administrator command
> [17605] LOG:  invalidating slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size
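
For readers who want to check the DETAIL text by hand, here is a minimal standalone sketch of how the "%X/%X" pair renders a 64-bit LSN: the high 32 bits, a slash, then the low 32 bits, both in hex without zero padding. It does not use PostgreSQL headers; lsn_t, format_lsn and format_detail are hypothetical names introduced only for this illustration, standing in for XLogRecPtr and the LSN_FORMAT_ARGS/errdetail combination in the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's XLogRecPtr (a 64-bit WAL position). */
typedef uint64_t lsn_t;

/*
 * Format an LSN the way the "%X/%X" pair in the patch does: high 32 bits,
 * a slash, then low 32 bits, both in upper-case hex without zero padding.
 * Returns the number of characters snprintf would write.
 */
static int
format_lsn(char *buf, size_t buflen, lsn_t lsn)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
                    (uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/* Build the DETAIL text proposed in the patch for a given restart_lsn. */
static int
format_detail(char *buf, size_t buflen, lsn_t restart_lsn)
{
    char        lsnbuf[32];

    format_lsn(lsnbuf, sizeof(lsnbuf), restart_lsn);
    return snprintf(buf, buflen,
                    "The slot's restart_lsn %s exceeds max_slot_wal_keep_size.",
                    lsnbuf);
}
```

Under these assumptions, passing 0x2C0000A0 to format_detail should reproduce the DETAIL line shown in the log excerpt above.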

The second and fourth lines look somewhat inconsistent with each other,
but I don't think that is much of a problem.  I don't think we want to
merge the first two lines into one message, as the result is a bit too long.

> LOG:  terminating process 17614 to release replication slot "s1" because its restart_lsn 0/2C0000A0 exceeds max_slot_wal_keep_size.

What do you think about this?

[1] https://www.postgresql.org/message-id/20211214.101137.379073733372253470.horikyota.ntt%40gmail.com

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b0c27dc80aff37ef984592b79f1dd20d052299fa Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Tue, 14 Dec 2021 10:50:00 +0900
Subject: [PATCH] Make an error message about process termination more
 descriptive

If checkpointer kills a process due to a temporary replication slot
exceeding max_slot_wal_keep_size, the messages fail to describe the
cause of the termination.  This is because the message that describes
the reason is emitted only for persistent slots, not for temporary
slots.  Add a DETAIL line to the message common to all types of slot
to describe the cause.

Reported-by: Alex Enachioaie <alex@altmetric.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
---
 src/backend/replication/slot.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 90ba9b417d..cba9a29113 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1254,7 +1254,8 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             {
                 ereport(LOG,
                         (errmsg("terminating process %d to release replication slot \"%s\"",
-                                active_pid, NameStr(slotname))));
+                                active_pid, NameStr(slotname)),
+                         errdetail("The slot's restart_lsn %X/%X exceeds max_slot_wal_keep_size.", LSN_FORMAT_ARGS(restart_lsn))));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
-- 
2.27.0

