Re: [HACKERS] Restricting maximum keep segments by repslots - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: [HACKERS] Restricting maximum keep segments by repslots
Date
Msg-id 20200408.141956.891237856186513376.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: [HACKERS] Restricting maximum keep segments by repslots  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: [HACKERS] Restricting maximum keep segments by repslots  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: [HACKERS] Restricting maximum keep segments by repslots  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: [HACKERS] Restricting maximum keep segments by repslots  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > I pushed version 26, with a few further adjustments.
> > 
> > I think what we have now is sufficient, but if you want to attempt this
> > "invalidated" flag on top of what I pushed, be my guest.
> 
> I don't think the invalidation flag is essential but it can prevent
> unanticipated behavior, in other words, it makes us feel at ease:p
> 
> After the current master/HEAD, the following steps causes assertion
> failure in xlogreader.c.
..
> I will look at it.

Just avoiding starting replication when restart_lsn is invalid is
sufficient (the attached, which is equivalent to a part of what the
invalidated flag did). I thing that the error message needs a Hint but
it looks on the subscriber side as:

[22086] 2020-04-08 10:35:04.188 JST ERROR:  could not receive data from WAL stream: ERROR:  replication slot "s1" is
invalidated
        HINT:  The slot exceeds the limit by max_slot_wal_keep_size.

I don't think it is not clean.. Perhaps the subscriber should remove
the trailing line of the message from the publisher?

> On the other hand, physical replication doesn't break by invlidation.
> 
> Primary: postgres.conf
> max_slot_wal_keep_size=0
> Standby: postgres.conf
> primary_conninfo='connect to master'
> primary_slot_name='x1'
> 
> (start the primary)
> P=> select pg_create_physical_replication_slot('x1');
> (start the standby)
> S=> create table tt(); drop table tt; select pg_switch_wal(); checkpoint;

If we don't mind that standby can reconnect after a walsender
termination due to the invalidation, we don't need to do something for
this.  Restricting max_slot_wal_keep_size to be larger than a certain
threshold would reduce the chance we see that behavior.

I saw another issue, the following sequence on the primary freezes
when invalidation happens.

=# create table tt(); drop table tt; select pg_switch_wal();create table tt(); drop table tt; select
pg_switch_wal();createtable tt(); drop table tt; select pg_switch_wal(); checkpoint;
 

The last checkpoint command is waiting for CV on
CheckpointerShmem->start_cv in RequestCheckpoint(), while Checkpointer
is waiting for the next latch at the end of
CheckpointerMain. new_started doesn't move but it is the same value
with old_started.

That freeze didn't happen when I removed
ConditionVariableSleep(&s->active_cv) in
InvalidateObsoleteReplicationSlots.

I continue investigating it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b3f7e2d94b8ea9b5f3819fcf47c0e1ba57355b87 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 8 Apr 2020 14:03:01 +0900
Subject: [PATCH] walsender crash fix

---
 src/backend/replication/walsender.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 06e8b79036..707de65f4b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1170,6 +1170,13 @@ StartLogicalReplication(StartReplicationCmd *cmd)
     pq_flush();
 
     /* Start reading WAL from the oldest required WAL. */
+    if (MyReplicationSlot->data.restart_lsn == InvalidXLogRecPtr)
+        ereport(ERROR,
+                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                 errmsg("replication slot \"%s\" is invalidated",
+                        cmd->slotname),
+                 errhint("The slot exceeds the limit by max_slot_wal_keep_size.")));
+
     XLogBeginRead(logical_decoding_ctx->reader,
                   MyReplicationSlot->data.restart_lsn);
 
-- 
2.18.2


pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: backup manifests
Next
From: Amit Kapila
Date:
Subject: Re: pg_stat_statements issue with parallel maintenance (Was Re: WALusage calculation patch)