Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs

From Kyotaro Horiguchi
Subject Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
Msg-id 20221019.110032.2063205133677772338.horikyota.ntt@gmail.com
In response to Re: WAL segments removed from primary despite the fact that logical replication slot needs it.  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: WAL segments removed from primary despite the fact that logical replication slot needs it.
List pgsql-bugs
At Tue, 18 Oct 2022 16:51:26 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in 
> On Mon, Oct 17, 2022 at 12:23 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Sun, 16 Oct 2022 10:35:17 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > > On Wed, Oct 5, 2022 at 8:54 PM hubert depesz lubaczewski
> > > > So, 4 files are missing.
> > > >
> > > > These were archived properly, and I tried to restore them from archive, and put
> > > > them in pg_wal, but even then pg12 was rejecting tries to connect to focal14
> > > > slot with the same message about "requested WAL segment
> > > > 000000010000CA0A00000049 has already been removed"
> > > >
> > >
> > > I think you are seeing this behavior because we update the
> > > lastRemovedSegNo before removing files in RemoveOldXlogFiles() and
> > > then we use that to give the error you are seeing.
> >
> > lastRemovedSegNo is updated once per one segment of removal.  Four
> > files are lost in this case.
> >
> 
> I didn't understand your response. I was saying the one possible
> reason why even after restoring files from the archive the error
> appears is because of the lastRemovedSegNo related check in function
> CheckXLogRemoved() and we update its value while removing old xlog
> files. From this behavior, it appears that somehow the server has only
> removed those files even though the reason is not clear yet.

I meant that even if PostgreSQL did something wrong there (in a way I
don't understand at all), lastRemovedSegNo and the last segment
actually removed would differ by no more than one segment, so that
alone cannot account for four missing files.
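
To illustrate the "no more than one" argument, here is a toy Python
model of the removal loop. It is only a sketch, not the PostgreSQL
source: the function name, the dict-based state, the ascending
iteration order and the fail_at parameter are all assumptions made for
illustration. The one property taken from the discussion above is that
the bookkeeping value is advanced once per segment, just before that
segment is removed.

```python
def remove_old_segments(pg_wal, cutoff_segno, state, fail_at=None):
    """Toy model: advance last_removed_segno, then remove, one segment at a time.

    pg_wal       -- set of segment numbers present "on disk" (hypothetical model)
    cutoff_segno -- segments up to and including this number are removable
    state        -- dict holding "last_removed_segno" (stands in for shared memory)
    fail_at      -- hypothetical: simulate an interruption right after the
                    bookkeeping update for this segment, before its removal
    """
    for segno in sorted(pg_wal):          # assumption: ascending order
        if segno > cutoff_segno:
            continue
        # Bookkeeping is updated first ...
        state["last_removed_segno"] = max(state["last_removed_segno"], segno)
        if segno == fail_at:
            break                          # ... and the removal never happens
        # ... then the segment itself is removed.
        pg_wal.discard(segno)


pg_wal = {70, 71, 72, 73, 74}
state = {"last_removed_segno": 0}
remove_old_segments(pg_wal, cutoff_segno=73, state=state, fail_at=72)
```

In this model an interruption leaves last_removed_segno at 72 while
segment 72 is still on disk, so the bookkeeping is ahead of reality by
exactly one segment, never by four.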

CheckXLogRemoved raises the error for a segment that has already been
removed logically but could still be opened physically.
WalSndSegmentOpen, on the other hand, emits the same error before
CheckXLogRemoved gets the chance if opening the segment actually fails
with ENOENT, regardless of the value of lastRemovedSegNo.
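
The two paths to the same message can be sketched with a small Python
model. Again this is an illustration under stated assumptions, not the
actual C code: WalRemovedError, wal_segment_open, check_xlog_removed,
read_segment and the "source" attribute are hypothetical names, and
segments are plain integers rather than file names.

```python
class WalRemovedError(Exception):
    """Hypothetical error type; 'source' records which check raised it."""

    def __init__(self, segno, source):
        super().__init__(
            f"requested WAL segment {segno:#x} has already been removed"
        )
        self.source = source


def wal_segment_open(pg_wal, segno):
    # Models the open step: a missing file (ENOENT) is reported as
    # "already removed" without consulting last_removed_segno at all.
    if segno not in pg_wal:
        raise WalRemovedError(segno, source="open")


def check_xlog_removed(segno, last_removed_segno):
    # Models the bookkeeping check: the file may exist on disk, but it is
    # still rejected if it is at or below the last removed segment number.
    if segno <= last_removed_segno:
        raise WalRemovedError(segno, source="check")


def read_segment(pg_wal, segno, last_removed_segno):
    wal_segment_open(pg_wal, segno)                 # physical check comes first
    check_xlog_removed(segno, last_removed_segno)   # logical check afterwards
    return segno
```

In this model, a segment copied back from the archive into pg_wal passes
the open step but is still rejected by the bookkeeping check as long as
last_removed_segno is not below it, while a segment that is simply
absent is rejected by the open step whatever last_removed_segno is.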

One point that bothers me is that the walsender seems to have been
killed. If the file is removed while the walsender is working, the
logical replication worker receives the error and emits "ERROR: could
not receive data...has been removed" instead of being suddenly
disconnected as in this case.  Considering this together with the
possibility that the segments were removed by someone else, I
suspected virus scanners, but that has been found not to be the cause.
(I still don't know of any virus scanner that kills processes holding
a suspected-malicious file.)


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


