Re: Replication slot stats misgivings - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Replication slot stats misgivings
Msg-id CAD21AoAZ8aPHmCY+rcKRcB6680qUsgNU5f+X=w31xjM6WgDHVw@mail.gmail.com
In response to Re: Replication slot stats misgivings  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Replication slot stats misgivings  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Wed, Mar 24, 2021 at 7:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 23, 2021 at 10:54 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2021-03-23 23:37:14 +0900, Masahiko Sawada wrote:
> >
> > > > > Maybe we can compare the slot name in the
> > > > > received message to the name in the element of replSlotStats. If they
> > > > > don’t match, we swap entries in replSlotStats to synchronize the index
> > > > > of the replication slot in ReplicationSlotCtl->replication_slots and
> > > > > replSlotStats. If we cannot find the entry in replSlotStats that has
> > > > > the name in the received message, it probably means either it's a new
> > > > > slot or the previous create message is dropped, we can create the new
> > > > > stats for the slot. Is that what you mean, Andres?
> >
> > That doesn't seem great. Slot names are imo a poor identifier for
> > something happening asynchronously. The stats collector regularly
> > doesn't process incoming messages for periods of time because it is busy
> > writing out the stats file. That's also when messages to it are most
> > likely to be dropped (likely because the incoming buffer is full).
> >
>
> Leaving aside restart case, without some sort of such sanity checking,
> if both drop (of old slot) and create (of new slot) messages are lost
> then we will start accumulating stats in old slots. However, if only
> one of them is lost then there won't be any such problem.
>
> > Perhaps we could have RestoreSlotFromDisk() send something to the stats
> > collector ensuring the mapping makes sense?
> >
>
> Say if we send just the index location of each slot then probably we
> can setup replSlotStats. Now say before the restart if one of the drop
> messages was missed (by stats collector) and that happens to be at
> some middle location, then we would end up restoring some already
> dropped slot, leaving some of the still required ones. However, if
> there is some sanity identifier like name along with the index, then I
> think that would have worked for such a case.

Couldn't even such messages be lost? Since any message can be lost
over UDP, I don't think we can rely on a single message. Instead, I
think we need to synchronize the indexes loosely, on the assumption
that the indexes in replSlotStats and
ReplicationSlotCtl->replication_slots may be out of sync.
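
To illustrate, here is a rough, standalone sketch of what such loose
synchronization might look like if every stats message carried both the
slot index and the slot name. It is not code from any patch: the struct,
the array, the constants, and the function names (SlotStatsEntry,
slot_stats, sync_entry, and so on) are all hypothetical, and overwriting
a mismatched entry when the name is not found is a simplifying
assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SLOTS 8   /* hypothetical stand-in for max_replication_slots */
#define NAME_LEN  64  /* hypothetical name buffer size */

/* Hypothetical, simplified per-slot stats entry kept by the collector. */
typedef struct SlotStatsEntry
{
    bool in_use;
    char name[NAME_LEN];
    long spill_txns;    /* example counter */
} SlotStatsEntry;

static SlotStatsEntry slot_stats[MAX_SLOTS];

/* Return the index of the in-use entry with this name, or -1 if none. */
static int
find_entry_by_name(const char *name)
{
    for (int i = 0; i < MAX_SLOTS; i++)
        if (slot_stats[i].in_use && strcmp(slot_stats[i].name, name) == 0)
            return i;
    return -1;
}

/*
 * Given the (index, name) pair from a received message, return the
 * entry to update, repairing the index mapping if it has drifted.
 */
static SlotStatsEntry *
sync_entry(int idx, const char *name)
{
    SlotStatsEntry *entry;
    int other;

    assert(idx >= 0 && idx < MAX_SLOTS);
    entry = &slot_stats[idx];

    /* Common case: index and name already agree. */
    if (entry->in_use && strcmp(entry->name, name) == 0)
        return entry;

    /* The name exists at another index: swap so the indexes line up. */
    other = find_entry_by_name(name);
    if (other >= 0)
    {
        SlotStatsEntry tmp = *entry;

        *entry = slot_stats[other];
        slot_stats[other] = tmp;
        return entry;
    }

    /*
     * Name not found: a new slot, or its create message was lost.
     * Assumption: whatever occupied this index is stale (for example,
     * its drop message was lost), so start over with fresh stats.
     */
    memset(entry, 0, sizeof(*entry));
    entry->in_use = true;
    strncpy(entry->name, name, NAME_LEN - 1);  /* memset left the NUL */
    return entry;
}
```

Because the check runs on every message, no single message has to
arrive for the mapping to converge eventually. It still relies on the
slot name as the identifier, so the concern about names being a poor
identifier for asynchronous events applies to this sketch as well.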

>
> I think it would have been easier if we had some OID type of
> identifier for each slot. But without that, maybe the combination of
> the index location in ReplicationSlotCtl->replication_slots and the
> slot name can greatly reduce the chances of slot stats going wrong,
> even if not to zero.
> If not name, do we have anything else in a slot that can be used for
> some sort of sanity checking?

I don't see any useful information in a slot for sanity checking.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


