On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I did a quick check with the following tap test code:
> > > >
> > > > $node_publisher->poll_query_until('postgres',
> > > > qq(
> > > > select 1 != foo.column1 from (values(0), (1)) as foo;
> > > > ));
> > > >
> > > > The query returns {t, f} but poll_query_until() never finished. The
> > > > same is true when the query returns {f, t}.
> > > >
> >
> > Yes, this is true, I also see the same behaviour.
> >
> > >
> > > This means something different is going on in Ajin's setup. Ajin, can
> > > you please share how did you confirm your findings about poll_query?
> >
> > Relooking at my logs, I think what happens is this:
> >
> > 1. First walsender 'a' is running.
> > 2. Second walsender 'b' starts and attempts at acquiring the slot
> > finds that the slot is active for pid a.
> > 3. Now both walsenders are active, the query does not return.
> > 4. First walsender 'a' times out and exits.
> > 5. Now only the second walsender is active and the query returns OK
> > because pid != a.
> > 6. Second walsender exits with error.
> > 7. Another query attempts to get the pid of the running walsender for
> > tap_sub but returns null because both walsender exits.
> > 8. This null return value results in the next query erroring out and
> > the test failing.
>
> So this is slightly different than what we can see in the topminnow
> logs? According to the server logs, step #5 happened (at 18:44:38.016)
> before step #4 happened (at 18:44:38.043).
>
Luckily these logs have the disconnection times of the tap test client
sessions as well. (not sure why I don't see these when I run these
tests).
Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063
18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG: statement: SELECT
pid != 16336 FROM pg_stat_replication WHERE application_name =
'tap_sub';
:
:
18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG: disconnection:
session time: 0:00:00.063 user=nm database=postgres host=[local]
When the query starts both walsenders are present but when the query
completes both walsenders are gone, the actual query evaluation could
have happened any time in between. This is the rare timing window that
causes this problem.
regards,
Ajin Cherian
Fujitsu Australia