Re: Failure of subscription tests with topminnow - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Failure of subscription tests with topminnow
Date
Msg-id CAA4eK1+xqyWD9XZhyu0LjJ2L0oq4wgruPK70Ba7WOcZNG-_KQw@mail.gmail.com
In response to Re: Failure of subscription tests with topminnow  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Failure of subscription tests with topminnow
List pgsql-hackers
On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > > > >
> > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > > But will poll function still poll or exit? Have you tried that?
> > > > >
> > > > > I have forced that condition with a changed query and found that the
> > > > > poll will not exit in case of a NULL return.
> > > > >
> > > >
> > > > What if the query in a poll is fired just before we get an error
> > > > "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> > > > Won't both the old and new walsenders be present at that stage, so
> > > > the query might return true? You can check that via debugger by stopping
> > > > just before this error occurs and then checking the pg_stat_replication view.
> > >
> > > If this error happens, then the PID is NOT updated as the PID in the
> > > replication slot. I have checked this and explained it in my first
> > > email.
> > >
> >
> > Sorry about the above email, I misunderstood. I was looking at
> > pg_stat_replication_slot rather than pg_stat_replication hence the confusion.
> > Amit is correct: just prior to the walsender erroring out, it briefly
> > appears in pg_stat_replication, and that is why this error happens.
> > Sorry for the confusion.
> > I just confirmed it, got both the walsenders stopped in the debugger:
> >
> > postgres=# select pid from pg_stat_replication where application_name = 'sub';
> >  pid
> > ------
> >  7899
> >  7993
> > (2 rows)
>
> IIUC the query[1] used for polling returns two rows in this case: {t,
> f} or {f, t}. But did poll_query_until() return OK in this case even
> if we expected one row of 't'? My guess of how this issue happened is:
>

Yeah, we can check this but I guess as soon as it gets 't', the poll
query will exit.

> 1. the first polling query after "ALTER SUBSCRIPTION CONNECTION"
> passed (for some reason).
>

I think the reason for exit is that we get two rows with the same
application_name in pg_stat_replication.

> 2. all wal senders exited.
> 3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
> 4. the second polling query resulted in a syntax error since $oldpid is null.
>

Your understanding of steps is the same as mine.
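
To make the four steps above concrete, here is a minimal sketch of the
sequence. It is a Python simulation, not the actual Perl TAP test, and every
function name in it is hypothetical. It assumes the polling query has the
form "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name
= 'tap_sub'" and that polling stops as soon as a 't' shows up, which is only
the guess made in this thread and has not been checked against
poll_query_until(). The two PIDs are the ones from the debugger output quoted
above; which of them is the old walsender is also an assumption.

```python
from typing import Dict, List, Tuple

APP_NAME = "tap_sub"


def poll_rows(view: List[Dict], app: str, oldpid: int) -> List[str]:
    """Emulate the assumed polling query: one 't'/'f' per matching row."""
    return [
        "t" if row["pid"] != oldpid else "f"
        for row in view
        if row["application_name"] == app
    ]


def poll_passes(rows: List[str]) -> bool:
    # Assumption (the guess from this thread, unverified): the poll is
    # treated as successful as soon as any 't' appears in the result.
    return "t" in rows


def build_poll_query(oldpid: str, app: str) -> str:
    # Plain string interpolation, as a test script would do with $oldpid.
    return (
        f"SELECT pid != {oldpid} FROM pg_stat_replication "
        f"WHERE application_name = '{app}';"
    )


def simulate_race() -> Tuple[List[str], bool, str, str]:
    old_pid, new_pid = 7899, 7993  # assumed old/new assignment

    # Old and new walsender are briefly visible at the same time.
    view = [
        {"pid": old_pid, "application_name": APP_NAME},
        {"pid": new_pid, "application_name": APP_NAME},
    ]

    # Step 1: the first polling query passes because one row is 't'.
    rows = poll_rows(view, APP_NAME, old_pid)
    first_passed = poll_passes(rows)

    # Step 2: all walsenders exit.
    view.clear()

    # Step 3: fetching the walsender pid now returns nothing.
    pids = [r["pid"] for r in view if r["application_name"] == APP_NAME]
    oldpid = str(pids[0]) if pids else ""

    # Step 4: the next polling query is built with an empty $oldpid.
    query = build_poll_query(oldpid, APP_NAME)
    return rows, first_passed, oldpid, query


if __name__ == "__main__":
    rows, first_passed, oldpid, query = simulate_race()
    print(rows, first_passed, repr(oldpid))
    print(query)
```

In this sketch the query built in step 4 has nothing on the right-hand side of
"!=", which corresponds to the syntax error described in step 4.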


--
With Regards,
Amit Kapila.


