Re: Replication slot stats misgivings - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Replication slot stats misgivings
Date
Msg-id CAA4eK1Lyni4XaK+-dfy6Lix_X0JbfWqrH8mrtrx2h0QV_NvNpQ@mail.gmail.com
In response to Re: Replication slot stats misgivings  (vignesh C <vignesh21@gmail.com>)
Responses Re: Replication slot stats misgivings
List pgsql-hackers
On Thu, Apr 1, 2021 at 3:43 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, Mar 31, 2021 at 11:32 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Mar 30, 2021 at 11:00 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2021-03-30 10:13:29 +0530, vignesh C wrote:
> > > > On Tue, Mar 30, 2021 at 6:28 AM Andres Freund <andres@anarazel.de> wrote:
> > > > > Any chance you could write a tap test exercising a few of these cases?
> > > >
> > > > I can try to write a patch for this if nobody objects.
> > >
> > > Cool!
> > >
> >
> > Attached a patch which has the test for the first scenario.
> >
> > > > > E.g. things like:
> > > > >
> > > > > - create a few slots, drop one of them, shut down, start up, verify
> > > > >   stats are still sane
> > > > > - create a few slots, shut down, manually remove a slot, lower
> > > > >   max_replication_slots, start up
> > > >
> > > > Here by "manually remove a slot", do you mean to remove the slot
> > > > manually from the pg_replslot folder?
> > >
> > > Yep - thereby allowing max_replication_slots after the shutdown/start to
> > > be lower than the number of slots-stats objects.
> >
> > I have not included the 2nd test in the patch, as it fails with the
> > following warnings and also displays the statistics of the removed
> > slot:
> > WARNING:  problem in alloc set Statistics snapshot: detected write
> > past chunk end in block 0x55d038b8e410, chunk 0x55d038b8e438
> > WARNING:  problem in alloc set Statistics snapshot: detected write
> > past chunk end in block 0x55d038b8e410, chunk 0x55d038b8e438
> >
> > This happens because the statistics file has an additional slot
> > present even though the replication slot was removed.  I felt this
> > issue should be fixed. I will try to fix this issue and send the
> > second test along with the fix.
>
> I think the statistics collector process has no way to tell whether a
> replication slot still exists, because the collector does not have
> access to shared memory. If the collector independently walks its
> entries and removes the statistics of slots beyond
> max_replication_slots, it risks removing the statistics of a valid
> replication slot.
> Any thoughts on how we can identify a replication slot that has been dropped?
> Can someone point me to the shared stats patch with which message
> loss can be avoided? I want to check whether, with the shared stats
> patch, a slot can still be dropped without its statistics being
> updated, for example because of an immediate shutdown or the server
> going down abruptly.
>

I don't think it is easy to simulate a scenario where the 'drop'
message is lost, and I think that is why the test contains the step
to manually remove the slot. At this stage, you can probably provide a
test patch and a code-fix patch that just drops the extra slots
from the stats file. That will allow us to test it with the shared
memory stats patch on which Andres and Horiguchi-San are working. If
we continue to pursue the current approach, then, as Andres
suggested, we might send additional information from
RestoreSlotFromDisk to keep it in sync.
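
To make the "drop the extra slots from the stats file" idea concrete,
here is a rough standalone sketch, not the actual patch. It assumes the
set of slots that still exist is somehow made available to the code
doing the pruning (for example, reported at startup); how that list is
obtained is exactly the open question above. SlotStatsEntry,
slot_exists, prune_slot_stats and SLOT_NAME_LEN are hypothetical names
used only for illustration and are not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed name length for this sketch only; not taken from PostgreSQL. */
#define SLOT_NAME_LEN 64

/* Hypothetical, simplified stand-in for one per-slot stats entry. */
typedef struct SlotStatsEntry
{
    char name[SLOT_NAME_LEN];
    long spill_txns;
} SlotStatsEntry;

/* Return true if 'name' appears in the list of slots that still exist. */
bool
slot_exists(const char *name, const char *const *live, size_t nlive)
{
    for (size_t i = 0; i < nlive; i++)
    {
        if (strcmp(name, live[i]) == 0)
            return true;
    }
    return false;
}

/*
 * Compact 'stats' in place, keeping only entries whose slot is still
 * present in 'live'.  Relative order of surviving entries is preserved.
 * Returns the new number of entries.
 */
size_t
prune_slot_stats(SlotStatsEntry *stats, size_t nstats,
                 const char *const *live, size_t nlive)
{
    size_t kept = 0;

    assert(stats != NULL || nstats == 0);

    for (size_t i = 0; i < nstats; i++)
    {
        if (slot_exists(stats[i].name, live, nlive))
        {
            if (kept != i)
                stats[kept] = stats[i];
            kept++;
        }
    }
    return kept;
}
```

Pruning by name rather than by position avoids the drawback mentioned
above, where trimming everything beyond max_replication_slots could
discard the statistics of a valid slot.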

-- 
With Regards,
Amit Kapila.


