Re: Review for GetWALAvailability() - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Review for GetWALAvailability()
Date
Msg-id 20200625.172803.429475667684022055.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: Review for GetWALAvailability()  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: Review for GetWALAvailability()
List pgsql-hackers
At Thu, 25 Jun 2020 14:35:34 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2020/06/25 12:57, Alvaro Herrera wrote:
> > On 2020-Jun-25, Fujii Masao wrote:
> > 
> >>     /*
> >>      * Find the oldest extant segment file. We get 1 until checkpoint removes
> >>      * the first WAL segment file since startup, which causes the status being
> >>      * wrong under certain abnormal conditions but that doesn't actually harm.
> >>      */
> >>     oldestSeg = XLogGetLastRemovedSegno() + 1;
> >>
> >> I see the point of the above comment, but this can cause wal_status to
> >> be
> >> changed from "lost" to "unreserved" after the server restart. Isn't
> >> this
> >> really confusing? At least it seems better to document that behavior.
> > Hmm.
> >
> >> Or if we *can ensure* that the slot with invalidated_at set always
> >> means
> >> "lost" slot, we can judge that wal_status is "lost" without using
> >> fragile
> >> XLogGetLastRemovedSegno(). Thought?
> > Hmm, this sounds compelling -- I think it just means we need to ensure
> > we reset invalidated_at to zero if the slot's restart_lsn is set to a
> > correct position afterwards.
> 
> Yes.

It is error-prone restriction, as discussed before.

Without changing updator-side, invalid restart_lsn AND valid
invalidated_at can be regarded as the lost state. With the following
change suggested by Fujii-san we can avoid the confusing status.

With attached first patch on top of the slot-dirtify fix below, we get
"lost" for invalidated slots after restart.

> > I don't think we have any operation that
> > does that, so it should be safe -- hopefully I didn't overlook
> > anything?
> 
> We need to call ReplicationSlotMarkDirty() and ReplicationSlotSave()
> just after setting invalidated_at and restart_lsn in
> InvalidateObsoleteReplicationSlots()?
> Otherwise, restart_lsn can go back to the previous value after the
> restart.
> 
> diff --git a/src/backend/replication/slot.c
> b/src/backend/replication/slot.c
> index e8761f3a18..5584e5dd2c 100644
> --- a/src/backend/replication/slot.c
> +++ b/src/backend/replication/slot.c
> @@ -1229,6 +1229,13 @@ restart:
>                 s->data.invalidated_at = s->data.restart_lsn;
>                 s->data.restart_lsn = InvalidXLogRecPtr;
>                 SpinLockRelease(&s->mutex);
> +
> +               /*
> + * Save this invalidated slot to disk, to ensure that the slot
> +                * is still invalid even after the server restart.
> +                */
> +               ReplicationSlotMarkDirty();
> +               ReplicationSlotSave();
>                 ReplicationSlotRelease();
>                   /* if we did anything, start from scratch */
> 
> Maybe we don't need to do this if the slot is temporary?

The only difference of temprary slots from persistent one seems to be
an attribute "persistency". Actually,
create_physica_replication_slot() does the aboves for temporary slots.

> > Neither copy nor advance seem to work with a slot that has invalid
> > restart_lsn.
> > 
> >> Or XLogGetLastRemovedSegno() should be fixed so that it returns valid
> >> value even after the restart?
> > This seems more work to implement.
> 
> Yes.

The confusing status can be avoided without fixing it, but I prefer to
fix it.  As Fujii-san suggested upthread, couldn't we remember
lastRemovedSegNo in the contorl file? (Yeah, it cuases a bump of
PG_CONTROL_VERSION and CATALOG_VERSION_NO?).

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 7c8803f2bd7267d166f8a6f1a4ca5b129aa5ae96 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 25 Jun 2020 17:03:24 +0900
Subject: [PATCH 1/2] Make slot invalidation persistent

---
 src/backend/replication/slot.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index e8761f3a18..26f874e3cb 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1229,7 +1229,15 @@ restart:
         s->data.invalidated_at = s->data.restart_lsn;
         s->data.restart_lsn = InvalidXLogRecPtr;
         SpinLockRelease(&s->mutex);
+
+        /*
+         * Save this invalidated slot to disk, to ensure that the slot
+         * is still invalid even after the server restart.
+         */
+        ReplicationSlotMarkDirty();
+        ReplicationSlotSave();
         ReplicationSlotRelease();
+        
 
         /* if we did anything, start from scratch */
         CHECK_FOR_INTERRUPTS();
-- 
2.18.4

From 0792c41c915b87474958689a2acd5c214b393015 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 25 Jun 2020 17:04:01 +0900
Subject: [PATCH 2/2] Show correct value in pg_replication_slots.wal_status
 after restart

The column may show bogus staus until checkpoint removes at least one
WAL segment. This patch changes the logic to detect the state so that
the column shows the correct status after restart.
---
 src/backend/replication/slotfuncs.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index fca18ffae5..889d6f53b6 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -283,7 +283,6 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
         bool        nulls[PG_GET_REPLICATION_SLOTS_COLS];
         WALAvailability walstate;
         XLogSegNo    last_removed_seg;
-        XLogRecPtr    targetLSN;
         int            i;
 
         if (!slot->in_use)
@@ -348,10 +347,19 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
          * invalidated; otherwise slots would appear as invalid without any
          * more clues as to what happened.
          */
-        targetLSN = XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) ?
-            slot_contents.data.invalidated_at :
-            slot_contents.data.restart_lsn;
-        walstate = GetWALAvailability(targetLSN);
+
+        /*
+         * valid invalidated_at means the slot have invalidated before. If
+         * restart_lsn is invalid addition to that, that means the slot has
+         * lost reqruied segments. We could just pass invalidated_at to
+         * GetWALAvailability if lastRemovedSegNo has been initialized, but the
+         * result is not reliable until checkpoint removes some segments.
+         */
+        if (XLogRecPtrIsInvalid(slot_contents.data.restart_lsn) &&
+            !XLogRecPtrIsInvalid(slot_contents.data.invalidated_at))
+            walstate = WALAVAIL_REMOVED;
+        else
+            walstate = GetWALAvailability(slot_contents.data.restart_lsn);
 
         switch (walstate)
         {
-- 
2.18.4


pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Missing some ifndef FRONTEND at the top of logging.c and file_utils.c
Next
From: Vik Fearing
Date:
Subject: Re: Why forbid "INSERT INTO t () VALUES ();"