On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > I did a quick check with the following tap test code:
> > >
> > > $node_publisher->poll_query_until('postgres',
> > > qq(
> > > select 1 != foo.column1 from (values(0), (1)) as foo;
> > > ));
> > >
> > > The query returns {t, f} but poll_query_until() never finished. The
> > > same is true when the query returns {f, t}.
> > >
>
> Yes, this is true, I also see the same behaviour.
>
> >
> > This means something different is going on in Ajin's setup. Ajin, can
> > you please share how did you confirm your findings about poll_query?
>
> Relooking at my logs, I think what happens is this:
>
> 1. First walsender 'a' is running.
> 2. Second walsender 'b' starts and attempts at acquiring the slot
> finds that the slot is active for pid a.
> 3. Now both walsenders are active, the query does not return.
> 4. First walsender 'a' times out and exits.
> 5. Now only the second walsender is active and the query returns OK
> because pid != a.
> 6. Second walsender exits with error.
> 7. Another query attempts to get the pid of the running walsender for
> tap_sub but returns null because both walsender exits.
> 8. This null return value results in the next query erroring out and
> the test failing.
So this is slightly different than what we can see in the topminnow
logs? According to the server logs, step #5 happened (at 18:44:38.016)
before step #4 happened (at 18:44:38.043).
>
> >Can you additionally check the value of 'state' from
> >pg_stat_replication for both the old and new walsender sessions?
>
> Yes, will try this and post a patch tomorrow.
Thanks. I guess the state of the new walsender should be "startup"
whereas the old one should be "streaming".
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/