Re: Failure of subscription tests with topminnow - Mailing list pgsql-hackers

From Ajin Cherian
Subject Re: Failure of subscription tests with topminnow
Date
Msg-id CAFPTHDaaE5iadbWs1mFO=KskXq2SK9HdyX9AUmLBjFqd_smT1A@mail.gmail.com
Whole thread Raw
In response to Re: Failure of subscription tests with topminnow  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses Re: Failure of subscription tests with topminnow
List pgsql-hackers
On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I did a quick check with the following tap test code:
> > > >
> > > > $node_publisher->poll_query_until('postgres',
> > > >                                   qq(
> > > > select 1 != foo.column1 from (values(0), (1)) as foo;
> > > > ));
> > > >
> > > > The query returns {t, f} but poll_query_until() never finished. The
> > > > same is true when the query returns {f, t}.
> > > >
> >
> > Yes, this is true, I also see the same behaviour.
> >
> > >
> > > This means something different is going on in Ajin's setup. Ajin, can
> > > you please share how did you confirm your findings about poll_query?
> >
> > Relooking at my logs, I think what happens is this:
> >
> > 1. First walsender 'a' is running.
> > 2. Second walsender 'b' starts and attempts at acquiring the slot
> > finds that the slot is active for pid a.
> > 3. Now both walsenders are active, the query does not return.
> > 4. First walsender 'a' times out and exits.
> > 5. Now only the second walsender is active and the query returns OK
> > because pid != a.
> > 6. Second walsender exits with error.
> > 7. Another query attempts to get the pid of the running walsender for
> > tap_sub but returns null because both walsender exits.
> > 8. This null return value results in the next query erroring out and
> > the test failing.
>
> So this is slightly different than what we can see in the topminnow
> logs? According to the server logs, step #5 happened (at 18:44:38.016)
> before step #4 happened (at 18:44:38.043).
>

Luckily these logs have the disconnection times of the tap test client
sessions as well. (not sure why I don't see these when I run these
tests).

Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063
18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG:  statement: SELECT
pid != 16336 FROM pg_stat_replication WHERE application_name =
'tap_sub';
:
:
18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG:  disconnection:
session time: 0:00:00.063 user=nm database=postgres host=[local]

When the query starts both walsenders are present but when the query
completes both walsenders are gone, the actual query evaluation could
have happened any time in between. This is the rare timing window that
causes this problem.

regards,
Ajin Cherian
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: row filtering for logical replication
Next
From: "tanghy.fnst@fujitsu.com"
Date:
Subject: RE: Added schema level support for publication.