Re: more descriptive message for process termination due to max_slot_wal_keep_size - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: more descriptive message for process termination due to max_slot_wal_keep_size
Date
Msg-id 20220302.153719.171492355527427161.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: more descriptive message for process termination due to max_slot_wal_keep_size  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: more descriptive message for process termination due to max_slot_wal_keep_size  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: more descriptive message for process termination due to max_slot_wal_keep_size  ("Drouvot, Bertrand" <bdrouvot@amazon.com>)
List pgsql-hackers
At Tue, 04 Jan 2022 10:29:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> So what do you say if I propose the following?
> 
> LOG:  terminating process %d to release replication slot \"%s\"
> because its restart_lsn %X/%X exceeds the limit %X/%X
> HINT: You might need to increase max_slot_wal_keep_size.

This version emits the following message.

[35785:checkpointer] LOG:  terminating process 36368 to release replication slot "s1" because its restart_lsn
0/1F000148exceeds the limit 0/21000000
 
[35785:checkpointer] HINT:  You might need to increase max_slot_wal_keep_size.
[36368:walsender] FATAL:  terminating connection due to administrator command
[36368:walsender] STATEMENT:  START_REPLICATION SLOT "s1" 0/1F000000 TIMELINE 1
[35785:checkpointer] LOG:  invalidating slot "s1" because its restart_lsn 0/1F000148 exceeds the limit 0/21000000
[35785:checkpointer] HINT:  You might need to increase max_slot_wal_keep_size.

We can omit the HINT line from the termination log for non-persistent
slots but I think we don't want to bother that considering its low
frequency.

The CI was confused by the mixed patches for multiple PG versions. In
this version the patchset for master are attached as .patch and that
for PG13 as .txt.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 4909db2893f0b33ab6bca1ffc3ad802cd159c577 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 24 Dec 2021 12:52:07 +0900
Subject: [PATCH v4 1/2] Make a message on process termination more dscriptive

The message at process termination due to slot limit doesn't provide
the reason. In the major scenario the message is followed by another
message about slot invalidatation, which shows the detail for the
termination.  However the second message is missing if the slot is
temporary one.

Augment the first message with the reason same as the second message.

Backpatch through to 13 where the message was introduced.

Reported-by: Alex Enachioaie <alex@altmetric.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
Backpatch-through: 13
---
 src/backend/replication/slot.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index caa6b29756..92f19bcb35 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1300,8 +1300,9 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             if (last_signaled_pid != active_pid)
             {
                 ereport(LOG,
-                        (errmsg("terminating process %d to release replication slot \"%s\"",
-                                active_pid, NameStr(slotname))));
+                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds max_slot_wal_keep_size",
 
+                                active_pid, NameStr(slotname),
+                                LSN_FORMAT_ARGS(restart_lsn))));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
-- 
2.27.0

From f31014f85c7dec160bd42e1c48f1c28a7dc22af2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 24 Dec 2021 12:58:25 +0900
Subject: [PATCH v4 2/2] Add detailed information to slot-invalidation messages

The messages says just "exceeds max_slot_wal_keep_size" but doesn't
tell the concrete LSN of the limit. That information helps operators'
understanding on the issue.

Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
---
 src/backend/replication/slot.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 92f19bcb35..be21b62add 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1300,9 +1300,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             if (last_signaled_pid != active_pid)
             {
                 ereport(LOG,
-                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds max_slot_wal_keep_size",
 
+                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds the limit %X/%X",
 
                                 active_pid, NameStr(slotname),
-                                LSN_FORMAT_ARGS(restart_lsn))));
+                                LSN_FORMAT_ARGS(restart_lsn),
+                                LSN_FORMAT_ARGS(oldestLSN)),
+                         errhint("You might need to increase max_slot_wal_keep_size.")));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
@@ -1345,9 +1347,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             ReplicationSlotRelease();
 
             ereport(LOG,
-                    (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds max_slot_wal_keep_size",
+                    (errmsg("invalidating slot \"%s\" because its restart_lsn %X/%X exceeds the limit %X/%X",
                             NameStr(slotname),
-                            LSN_FORMAT_ARGS(restart_lsn))));
+                            LSN_FORMAT_ARGS(restart_lsn),
+                            LSN_FORMAT_ARGS(oldestLSN)),
+                     errhint("You might need to increase max_slot_wal_keep_size.")));
 
             /* done with this slot for now */
             break;
-- 
2.27.0

From 6c5a680842521b26c0c899c3d0675bd53e58ac11 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 24 Dec 2021 13:23:54 +0900
Subject: [PATCH v4] Make a message on process termination more dscriptive

The message at process termination due to slot limit doesn't provide
the reason. In the major scenario the message is followed by another
message about slot invalidatation, which shows the detail for the
termination.  However the second message is missing if the slot is
temporary one.

Augment the first message with the reason same as the second message.

Backpatch through to 13 where the message was introduced.

Reported-by: Alex Enachioaie <alex@altmetric.com>
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://www.postgresql.org/message-id/17327-89d0efa8b9ae6271%40postgresql.org
Backpatch-through: 13
---
 src/backend/replication/slot.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 02047ea920..15b8934ae2 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1228,8 +1228,10 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlot *s, XLogRecPtr oldestLSN,
             if (last_signaled_pid != active_pid)
             {
                 ereport(LOG,
-                        (errmsg("terminating process %d to release replication slot \"%s\"",
-                                active_pid, NameStr(slotname))));
+                        (errmsg("terminating process %d to release replication slot \"%s\" because its restart_lsn
%X/%Xexceeds max_slot_wal_keep_size",
 
+                                active_pid, NameStr(slotname),
+                                (uint32) (restart_lsn >> 32),
+                                (uint32) restart_lsn)));
 
                 (void) kill(active_pid, SIGTERM);
                 last_signaled_pid = active_pid;
-- 
2.27.0


pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Use "WAL segment" instead of "log segment" consistently in user-facing messages
Next
From: Michael Paquier
Date:
Subject: Re: make tuplestore helper function