Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id ZdyOlIIm3yiw5+GA@ip-10-97-1-34.eu-west-3.compute.internal
Whole thread Raw
In response to Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hi,

On Mon, Feb 26, 2024 at 05:18:25PM +0530, Amit Kapila wrote:
> On Mon, Feb 26, 2024 at 12:59 PM Bertrand Drouvot
> <bertranddrouvot.pg@gmail.com> wrote:
> > > > 10 ===
> > > >
> > > > +                       if (slot->data.invalidated != RS_INVAL_NONE)
> > > > +                       {
> > > > +                               /*
> > > > +                                * Specified physical slot have been invalidated,
> > > > so no point
> > > > +                                * in waiting for it.
> > > >
> > > > We discovered in [1], that if the wal_status is "unreserved" then the slot is still
> > > > serving the standby. I think we should handle this case differently, thoughts?
> > >
> > > I think the 'invalidated' slot can still be used is a separate bug.
> > > Because
> > > once the slot is invalidated, it can neither protect WALs or ROWs from being
> > > removed even if the restart_lsn of the slot can be moved forward after being invalidated.
> > >
> > > If the standby can move restart_lsn forward for invalidated slots, then
> > > it should also set the 'invalidated' flag back to NONE, otherwise the slot
> > > cannot serve its purpose anymore. I also reported similar bug before[1].
> >
> ...
> >
> > My point is that I think we should behave like it's not a bug and then adapt the
> > code accordingly here (until the bug gets fixed).
> >
> 
> oh, I think this doesn't sound like a good idea to me. We should fix
> that bug independently rather than adding code in new features to
> consider the bug as a valid behavior.

Agree, but it all depends if there is a consensus of the other thread being a
bug or not.

I also think it is but there is this part of the code in pg_get_replication_slots()
that makes me think ones could think it is not.

"
            case WALAVAIL_REMOVED:

                /*
                 * If we read the restart_lsn long enough ago, maybe that file
                 * has been removed by now.  However, the walsender could have
                 * moved forward enough that it jumped to another file after
                 * we looked.  If checkpointer signalled the process to
                 * termination, then it's definitely lost; but if a process is
                 * still alive, then "unreserved" seems more appropriate.
                 *
"

Anyway, I also think it is a bug so agree to keep the check as it is currenlty (
and keep an eye on the other thread outcome too).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Nazir Bilal Yavuz
Date:
Subject: Re: Add missing error codes to PANIC/FATAL error reports in xlog.c and relcache.c
Next
From: Andy Fan
Date:
Subject: Re: Shared detoast Datum proposal