Thread: Failure of subscription tests with topminnow

Failure of subscription tests with topminnow

From

Michael Paquier

Date:

16 August 2021, 06:52:55

Hi all,

topminnow has just failed in a weird fashion:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-08-15%2013%3A24%3A58
# SELECT pid !=  FROM pg_stat_replication WHERE application_name = 'tap_sub';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# ERROR:  syntax error at or near "FROM"
# LINE 1: SELECT pid !=  FROM pg_stat_replication WHERE application_na...

Looking at the logs, it seems like the problem boils down to an active
slot when using ALTER SUBSCRIPTION tap_sub CONNECTION:
2021-08-15 18:44:38.027 CEST [16473:2] ERROR:  could not start WAL
streaming: ERROR:  replication slot "tap_sub" is active for PID 16336

There is only one place in 001_rep_changes.pl where this is used.

Thoughts?
--
Michael

Attachment

signature.asc

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

16 August 2021, 11:32:47

On Mon, Aug 16, 2021 at 9:24 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> Hi all,
>
> topminnow has just failed in a weird fashion:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2021-08-15%2013%3A24%3A58
> # SELECT pid !=  FROM pg_stat_replication WHERE application_name = 'tap_sub';
> # expecting this output:
> # t
> # last actual query output:
> #
> # with stderr:
> # ERROR:  syntax error at or near "FROM"
> # LINE 1: SELECT pid !=  FROM pg_stat_replication WHERE application_na...
>
> Looking at the logs, it seems like the problem boils down to an active
> slot when using ALTER SUBSCRIPTION tap_sub CONNECTION:
> 2021-08-15 18:44:38.027 CEST [16473:2] ERROR:  could not start WAL
> streaming: ERROR:  replication slot "tap_sub" is active for PID 16336
>

The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
walsender process. Now, here the problem seems to be that the previous
walsender process (16336) didn't exit and the new process started to
use the slot which leads to the error shown in the log. This is
evident from the below part of log where we can see that 16336 has
exited after new process started to use the slot.

2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG:  received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
0/16BEEB0 (proto_version '1', publication_names
'"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR:  replication
slot "tap_sub" is active for PID 16336
2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]

One idea to solve this is to first disable the subscription, wait for
the walsender process to exit, and then change the connection string
and re-enable the subscription.

-- 
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

24 August 2021, 08:34:40

On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
> walsender process. Now, here the problem seems to be that the previous
> walsender process (16336) didn't exit and the new process started to
> use the slot which leads to the error shown in the log. This is
> evident from the below part of log where we can see that 16336 has
> exited after new process started to use the slot.
>
> 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG:  received
> replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
> 0/16BEEB0 (proto_version '1', publication_names
> '"tap_pub","tap_pub_ins_only"')
> 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
> START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> publication_names '"tap_pub","tap_pub_ins_only"')
> 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR:  replication
> slot "tap_sub" is active for PID 16336
> 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
> START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> publication_names '"tap_pub","tap_pub_ins_only"')
> 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> session time: 0:00:00.036 user=nm database=postgres host=[local]
> 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> session time: 0:00:06.367 user=nm database=postgres host=[local]
>
> One idea to solve this is to first disable the subscription, wait for
> the walsender process to exit, and then change the connection string
> and re-enable the subscription.

I tried to simulate this by putting delays prior to the exit of the
walsender. Even though I see the same error
in the logs that the replication slot is active for the previous PID,
the test case does not fail in the way seen in this case here..

The new walsender tries to acquire the slot but seeing that there is
another PID that is active on the slot, it
errors and exits. At no point does this new failed walsender update
its pid for the slot. As a result, the below polled query
would not return a true value

$node_publisher->poll_query_until('postgres',
    "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub';"
) or die "Timed out while waiting for apply to restart";

In my runs I see that this query is repeated until a new walsender is
able to successfully acquire the slot.
I am not able to explain why this query returned true in the topminnow
tap test. This would imply that a walsender
was able to acquire the slot and update its pid but I don't see how.
We also know that if this polled query returns
a true (implying a pid other than $oldpid), then the next query in the
test is to try and identify the pid:

SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';

From the topminnow logs we know that this query returned a "NULL", as
the value extracted from this is used
to formulate the next query which errored out and eventually caused
the test case to fail.

 [16482:3] 001_rep_changes.pl ERROR:  syntax error at or near "FROM"
at character 16
 [16482:4] 001_rep_changes.pl STATEMENT:  SELECT pid !=  FROM
pg_stat_replication WHERE application_name = 'tap_sub';

I am not an expert in perl but I looked at the perl function used in
the tap test poll_query_until(), this would poll until the query
returns a 'true' (unless specified otherwise).
I don't see in my tests that the polled function exits if the query
returns a NULL. I don't know if differences in the installed perl can
cause this difference in
behaviour. Can a NULL set be construed as a true and cause the poll to exit?

 Any suggestions?

regards,
Ajin Cherian
Fujitsu Australia

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

24 August 2021, 14:08:20

On Tue, Aug 24, 2021 at 11:04 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
> > walsender process. Now, here the problem seems to be that the previous
> > walsender process (16336) didn't exit and the new process started to
> > use the slot which leads to the error shown in the log. This is
> > evident from the below part of log where we can see that 16336 has
> > exited after new process started to use the slot.
> >
> > 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG:  received
> > replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
> > 0/16BEEB0 (proto_version '1', publication_names
> > '"tap_pub","tap_pub_ins_only"')
> > 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
> > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> > publication_names '"tap_pub","tap_pub_ins_only"')
> > 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR:  replication
> > slot "tap_sub" is active for PID 16336
> > 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
> > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> > publication_names '"tap_pub","tap_pub_ins_only"')
> > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> > session time: 0:00:06.367 user=nm database=postgres host=[local]
> >
> > One idea to solve this is to first disable the subscription, wait for
> > the walsender process to exit, and then change the connection string
> > and re-enable the subscription.
>
> I tried to simulate this by putting delays prior to the exit of the
> walsender. Even though I see the same error
> in the logs that the replication slot is active for the previous PID,
> the test case does not fail in the way seen in this case here..
>
> The new walsender tries to acquire the slot but seeing that there is
> another PID that is active on the slot, it
> errors and exits. At no point does this new failed walsender update
> its pid for the slot. As a result, the below polled query
> would not return a true value
>
> $node_publisher->poll_query_until('postgres',
>     "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub';"
> ) or die "Timed out while waiting for apply to restart";
>
> In my runs I see that this query is repeated until a new walsender is
> able to successfully acquire the slot.
>

What happens when there is neither a new walsender nor an old
walsender is present? It means to run the above statement when a new
walsender is exited due to error "... slot is active ..." and before a
new walsender could start. Also, allow old walsender (due to which the
error occurs) to exit before executing the statement.

Also, it seems this failure happens on REL_11_STABLE branch, not sure
if that matters, but it is better to use the same branch if you are
using a different branch to reproduce the issue.

-- 
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

24 August 2021, 15:58:35

On Tue, Aug 24, 2021 at 9:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> What happens when there is neither a new walsender nor an old
> walsender is present? It means to run the above statement when a new
> walsender is exited due to error "... slot is active ..." and before a
> new walsender could start. Also, allow old walsender (due to which the
> error occurs) to exit before executing the statement.
>
I tried this, then the query returns a null instead of 'false' because
there is no entry for that application_name.

postgres=# select pid != 31876 from pg_stat_replication where
application_name = 'sub';
 ?column?
----------
(0 rows)


> Also, it seems this failure happens on REL_11_STABLE branch, not sure
> if that matters, but it is better to use the same branch if you are
> using a different branch to reproduce the issue.
>

Ok, I didn't realise this. I will try this.

regards,
Ajin Cherian
Fujitsu Australia

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

24 August 2021, 16:11:49

On Tue, Aug 24, 2021 at 6:28 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Aug 24, 2021 at 9:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > What happens when there is neither a new walsender nor an old
> > walsender is present? It means to run the above statement when a new
> > walsender is exited due to error "... slot is active ..." and before a
> > new walsender could start. Also, allow old walsender (due to which the
> > error occurs) to exit before executing the statement.
> >
> I tried this, then the query returns a null instead of 'false' because
> there is no entry for that application_name.
>
> postgres=# select pid != 31876 from pg_stat_replication where
> application_name = 'sub';
>  ?column?
> ----------
> (0 rows)
>

But will poll function still poll or exit? Have you tried that?
Because it is not clear from your explanation how in the first
statement it returns a valid value which leads poll to exit and then
in second statement it returns NULL or maybe nothing. Can you share
the log also when you are getting "replication slot "tap_sub" is
active for ..."? If you see in the below log [1], the STATEMENT is
printed twice, I want to see if you also get prints in a similar way
or is it something different? Do you know why it is printed twice?

[1]
2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG:  received
replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
0/16BEEB0 (proto_version '1', publication_names
'"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')
2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR:  replication
slot "tap_sub" is active for PID 16336
2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
publication_names '"tap_pub","tap_pub_ins_only"')

-- 
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

25 August 2021, 05:30:28

On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> But will poll function still poll or exit? Have you tried that?

I have forced that condition with a changed query and found that the
poll will not exit in case of a NULL return.

> Because it is not clear from your explanation how in the first
> statement it returns a valid value which leads poll to exit and then
> in second statement it returns NULL or maybe nothing. Can you share

I don't have an explanation for it either. Maybe things are different
in REL_11_STABLE

> the log also when you are getting "replication slot "tap_sub" is
> active for ..."? If you see in the below log [1], the STATEMENT is
> printed twice, I want to see if you also get prints in a similar way
> or is it something different? Do you know why it is printed twice?
>

Yes, I get the same. For every "LOG or ERROR" message, there is the
associated STATEMENT message with it.
First there is a LOG "received replication command" followed by the
STATEMENT and then the ERROR "replication slot .. is active.."
followed by the STATEMENT.

I will try with REL_11_STABLE and see if the behaviour is different.

regards,
Ajin Cherian
Fujitsu Australia

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

25 August 2021, 09:22:04

On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > But will poll function still poll or exit? Have you tried that?
>
> I have forced that condition with a changed query and found that the
> poll will not exit in case of a NULL return.
>

What if the query in a poll is fired just before we get an error
"tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
Won't at that stage both old and new walsender's are present, so the
query might return true. You can check that via debugger by stopping
just before this error occurs and then check pg_stat_replication view.

>
> I will try with REL_11_STABLE and see if the behaviour is different.
>

Okay, but I still suspect some timing issue here.

-- 
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

25 August 2021, 10:43:15

On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > But will poll function still poll or exit? Have you tried that?
> >
> > I have forced that condition with a changed query and found that the
> > poll will not exit in case of a NULL return.
> >
>
> What if the query in a poll is fired just before we get an error
> "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> Won't at that stage both old and new walsender's are present, so the
> query might return true. You can check that via debugger by stopping
> just before this error occurs and then check pg_stat_replication view.

If this error happens then the PID is NOT updated as the pid in the
Replication slot. I have checked this
and explained this in my first email itself

regards,
Ajin Cherian
Fujitsu Australia

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

25 August 2021, 12:53:01

On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > But will poll function still poll or exit? Have you tried that?
> > >
> > > I have forced that condition with a changed query and found that the
> > > poll will not exit in case of a NULL return.
> > >
> >
> > What if the query in a poll is fired just before we get an error
> > "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> > Won't at that stage both old and new walsender's are present, so the
> > query might return true. You can check that via debugger by stopping
> > just before this error occurs and then check pg_stat_replication view.
>
> If this error happens then the PID is NOT updated as the pid in the
> Replication slot. I have checked this
> and explained this in my first email itself
>

Sorry about the above email, I misunderstood. I was looking at
pg_stat_replication_slot rather than pg_stat_replication hence the confusion.
Amit is correct, just prior to the walsender erroring out, it briefly
appears in the
pg_stat_replication, and that is why this error happens. Sorry for the
confusion.
I just confirmed it, got both the walsenders stopped in the debugger:

postgres=# select pid from pg_stat_replication where application_name = 'sub';
 pid
------
 7899
 7993
(2 rows)


ajin      7896  3326  0 05:22 pts/2    00:00:00 psql -p 6972 postgres
ajin      7897  7882  0 05:22 ?        00:00:00 postgres: ajin
postgres [local] idle
ajin      7899  7882  0 05:22 ?        00:00:00 postgres: walsender
ajin ::1(37719) START_REPLICATION
ajin      7992  3647  0 05:24 ?        00:00:00 postgres: logical
replication worker for subscription 16426
ajin      7993  7882  0 05:24 ?        00:00:00 postgres: walsender
ajin ::1(37720) START_REPLICATION

regards,
Ajin Cherian
Fujitsu Australia

Re: Failure of subscription tests with topminnow

From

Masahiko Sawada

Date:

25 August 2021, 14:32:01

On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > > >
> > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > But will poll function still poll or exit? Have you tried that?
> > > >
> > > > I have forced that condition with a changed query and found that the
> > > > poll will not exit in case of a NULL return.
> > > >
> > >
> > > What if the query in a poll is fired just before we get an error
> > > "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> > > Won't at that stage both old and new walsender's are present, so the
> > > query might return true. You can check that via debugger by stopping
> > > just before this error occurs and then check pg_stat_replication view.
> >
> > If this error happens then the PID is NOT updated as the pid in the
> > Replication slot. I have checked this
> > and explained this in my first email itself
> >
>
> Sorry about the above email, I misunderstood. I was looking at
> pg_stat_replication_slot rather than pg_stat_replication hence the confusion.
> Amit is correct, just prior to the walsender erroring out, it briefly
> appears in the
> pg_stat_replication, and that is why this error happens. Sorry for the
> confusion.
> I just confirmed it, got both the walsenders stopped in the debugger:
>
> postgres=# select pid from pg_stat_replication where application_name = 'sub';
>  pid
> ------
>  7899
>  7993
> (2 rows)

IIUC the query[1] used for polling returns two rows in this case: {t,
f} or {f, t}. But did poll_query_until() returned OK in this case even
if we expected one row of 't'? My guess of how this issue happened is:

1. the first polling query after "ATLER SUBSCRIPTION CONNECTION"
passed (for some reason).
2. all wal senders exited.
3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
4. the second polling query resulted in a syntax error since $oldpid is null.

If the fact that two walsender with the same application_name could
present in pg_stat_replication view was the cause of this issue,
poll_query_until() should return OK even if we expected just 't'. I
might be missing something, though.

[1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname';"

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

25 August 2021, 15:23:31

On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > > > >
> > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > > But will poll function still poll or exit? Have you tried that?
> > > > >
> > > > > I have forced that condition with a changed query and found that the
> > > > > poll will not exit in case of a NULL return.
> > > > >
> > > >
> > > > What if the query in a poll is fired just before we get an error
> > > > "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> > > > Won't at that stage both old and new walsender's are present, so the
> > > > query might return true. You can check that via debugger by stopping
> > > > just before this error occurs and then check pg_stat_replication view.
> > >
> > > If this error happens then the PID is NOT updated as the pid in the
> > > Replication slot. I have checked this
> > > and explained this in my first email itself
> > >
> >
> > Sorry about the above email, I misunderstood. I was looking at
> > pg_stat_replication_slot rather than pg_stat_replication hence the confusion.
> > Amit is correct, just prior to the walsender erroring out, it briefly
> > appears in the
> > pg_stat_replication, and that is why this error happens. Sorry for the
> > confusion.
> > I just confirmed it, got both the walsenders stopped in the debugger:
> >
> > postgres=# select pid from pg_stat_replication where application_name = 'sub';
> >  pid
> > ------
> >  7899
> >  7993
> > (2 rows)
>
> IIUC the query[1] used for polling returns two rows in this case: {t,
> f} or {f, t}. But did poll_query_until() returned OK in this case even
> if we expected one row of 't'? My guess of how this issue happened is:
>

Yeah, we can check this but I guess as soon as it gets 't', the poll
query will exit.

> 1. the first polling query after "ATLER SUBSCRIPTION CONNECTION"
> passed (for some reason).
>

I think the reason for exit is that we get two rows with the same
application_name in pg_stat_replication.

> 2. all wal senders exited.
> 3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
> 4. the second polling query resulted in a syntax error since $oldpid is null.
>

Your understanding of steps is the same as mine.


--
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

25 August 2021, 15:24:06

On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

>
> IIUC the query[1] used for polling returns two rows in this case: {t,
> f} or {f, t}. But did poll_query_until() returned OK in this case even
> if we expected one row of 't'? My guess of how this issue happened is:
>
> 1. the first polling query after "ATLER SUBSCRIPTION CONNECTION"
> passed (for some reason).
> 2. all wal senders exited.
> 3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
> 4. the second polling query resulted in a syntax error since $oldpid is null.
>
> If the fact that two walsender with the same application_name could
> present in pg_stat_replication view was the cause of this issue,
> poll_query_until() should return OK even if we expected just 't'. I
> might be missing something, though.
>
> [1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = '$appname';"

Yes, the query [1] returns OK with a {f,t} or {t,f}

[1] - "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = '$appname';"

regards,
Ajin Cherian
Fujitsu Australia

Re: Failure of subscription tests with topminnow

From

Masahiko Sawada

Date:

25 August 2021, 15:40:12

On Wed, Aug 25, 2021 at 9:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 5:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 6:53 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > On Wed, Aug 25, 2021 at 5:43 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 25, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Wed, Aug 25, 2021 at 8:00 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Aug 24, 2021 at 11:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > >
> > > > > > > But will poll function still poll or exit? Have you tried that?
> > > > > >
> > > > > > I have forced that condition with a changed query and found that the
> > > > > > poll will not exit in case of a NULL return.
> > > > > >
> > > > >
> > > > > What if the query in a poll is fired just before we get an error
> > > > > "tap_sub ERROR:  replication slot "tap_sub" is active for PID 16336"?
> > > > > Won't at that stage both old and new walsender's are present, so the
> > > > > query might return true. You can check that via debugger by stopping
> > > > > just before this error occurs and then check pg_stat_replication view.
> > > >
> > > > If this error happens then the PID is NOT updated as the pid in the
> > > > Replication slot. I have checked this
> > > > and explained this in my first email itself
> > > >
> > >
> > > Sorry about the above email, I misunderstood. I was looking at
> > > pg_stat_replication_slot rather than pg_stat_replication hence the confusion.
> > > Amit is correct, just prior to the walsender erroring out, it briefly
> > > appears in the
> > > pg_stat_replication, and that is why this error happens. Sorry for the
> > > confusion.
> > > I just confirmed it, got both the walsenders stopped in the debugger:
> > >
> > > postgres=# select pid from pg_stat_replication where application_name = 'sub';
> > >  pid
> > > ------
> > >  7899
> > >  7993
> > > (2 rows)
> >
> > IIUC the query[1] used for polling returns two rows in this case: {t,
> > f} or {f, t}. But did poll_query_until() returned OK in this case even
> > if we expected one row of 't'? My guess of how this issue happened is:
> >
>
> Yeah, we can check this but I guess as soon as it gets 't', the poll
> query will exit.

I did a quick check with the following tap test code:

$node_publisher->poll_query_until('postgres',
                                  qq(
select 1 != foo.column1 from (values(0), (1)) as foo;
));

The query returns {t, f} but poll_query_until() never finished. The
same is true when the query returns {f, t}.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

25 August 2021, 16:13:49

On Wed, Aug 25, 2021 at 5:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 9:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >
> > IIUC the query[1] used for polling returns two rows in this case: {t,
> > f} or {f, t}. But did poll_query_until() returned OK in this case even
> > if we expected one row of 't'? My guess of how this issue happened is:
> >
> > 1. the first polling query after "ATLER SUBSCRIPTION CONNECTION"
> > passed (for some reason).
> > 2. all wal senders exited.
> > 3. get the pid of wal sender with application_name 'tap_sub' but got nothing.
> > 4. the second polling query resulted in a syntax error since $oldpid is null.
> >
> > If the fact that two walsender with the same application_name could
> > present in pg_stat_replication view was the cause of this issue,
> > poll_query_until() should return OK even if we expected just 't'. I
> > might be missing something, though.
> >
> > [1] "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> > application_name = '$appname';"
>
> Yes, the query [1] returns OK with a {f,t} or {t,f}
>
> [1] - "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = '$appname';"
>

Can you additionally check the value of 'state' from
pg_stat_replication for both the old and new walsender sessions?

-- 
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

25 August 2021, 16:17:05

On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I did a quick check with the following tap test code:
>
> $node_publisher->poll_query_until('postgres',
>                                   qq(
> select 1 != foo.column1 from (values(0), (1)) as foo;
> ));
>
> The query returns {t, f} but poll_query_until() never finished. The
> same is true when the query returns {f, t}.
>

This means something different is going on in Ajin's setup. Ajin, can
you please share how did you confirm your findings about poll_query?


-- 
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

25 August 2021, 17:04:23

On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I did a quick check with the following tap test code:
> >
> > $node_publisher->poll_query_until('postgres',
> >                                   qq(
> > select 1 != foo.column1 from (values(0), (1)) as foo;
> > ));
> >
> > The query returns {t, f} but poll_query_until() never finished. The
> > same is true when the query returns {f, t}.
> >

Yes, this is true, I also see the same behaviour.

>
> This means something different is going on in Ajin's setup. Ajin, can
> you please share how did you confirm your findings about poll_query?

Relooking at my logs, I think what happens is this:

1. First walsender 'a' is running.
2. Second walsender 'b' starts and attempts at acquiring the slot
finds that the slot is active for pid a.
3. Now both walsenders are active, the query does not return.
4. First walsender 'a' times out and exits.
5. Now only the second walsender is active and the query returns OK
because pid != a.
6. Second walsender exits with error.
7. Another query attempts to get the pid of the running walsender for
tap_sub but returns null because both walsender exits.
8. This null return value results in the next query erroring out and
the test failing.

>Can you additionally check the value of 'state' from
>pg_stat_replication for both the old and new walsender sessions?

Yes, will try this and post a patch tomorrow.

regards,
Ajin Cherian
Fujitsu Australia

Re: Failure of subscription tests with topminnow

From

Masahiko Sawada

Date:

26 August 2021, 04:01:56

On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > I did a quick check with the following tap test code:
> > >
> > > $node_publisher->poll_query_until('postgres',
> > >                                   qq(
> > > select 1 != foo.column1 from (values(0), (1)) as foo;
> > > ));
> > >
> > > The query returns {t, f} but poll_query_until() never finished. The
> > > same is true when the query returns {f, t}.
> > >
>
> Yes, this is true, I also see the same behaviour.
>
> >
> > This means something different is going on in Ajin's setup. Ajin, can
> > you please share how did you confirm your findings about poll_query?
>
> Relooking at my logs, I think what happens is this:
>
> 1. First walsender 'a' is running.
> 2. Second walsender 'b' starts and attempts at acquiring the slot
> finds that the slot is active for pid a.
> 3. Now both walsenders are active, the query does not return.
> 4. First walsender 'a' times out and exits.
> 5. Now only the second walsender is active and the query returns OK
> because pid != a.
> 6. Second walsender exits with error.
> 7. Another query attempts to get the pid of the running walsender for
> tap_sub but returns null because both walsender exits.
> 8. This null return value results in the next query erroring out and
> the test failing.

So this is slightly different than what we can see in the topminnow
logs? According to the server logs, step #5 happened (at 18:44:38.016)
before step #4 happened (at 18:44:38.043).

>
> >Can you additionally check the value of 'state' from
> >pg_stat_replication for both the old and new walsender sessions?
>
> Yes, will try this and post a patch tomorrow.

Thanks. I guess the state of the new walsender should be "startup"
whereas the old one should be "streaming".

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

26 August 2021, 05:07:53

On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Aug 25, 2021 at 11:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Wed, Aug 25, 2021 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Aug 25, 2021 at 6:10 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > I did a quick check with the following tap test code:
> > > >
> > > > $node_publisher->poll_query_until('postgres',
> > > >                                   qq(
> > > > select 1 != foo.column1 from (values(0), (1)) as foo;
> > > > ));
> > > >
> > > > The query returns {t, f} but poll_query_until() never finished. The
> > > > same is true when the query returns {f, t}.
> > > >
> >
> > Yes, this is true, I also see the same behaviour.
> >
> > >
> > > This means something different is going on in Ajin's setup. Ajin, can
> > > you please share how did you confirm your findings about poll_query?
> >
> > Relooking at my logs, I think what happens is this:
> >
> > 1. First walsender 'a' is running.
> > 2. Second walsender 'b' starts and attempts at acquiring the slot
> > finds that the slot is active for pid a.
> > 3. Now both walsenders are active, the query does not return.
> > 4. First walsender 'a' times out and exits.
> > 5. Now only the second walsender is active and the query returns OK
> > because pid != a.
> > 6. Second walsender exits with error.
> > 7. Another query attempts to get the pid of the running walsender for
> > tap_sub but returns null because both walsender exits.
> > 8. This null return value results in the next query erroring out and
> > the test failing.
>
> So this is slightly different than what we can see in the topminnow
> logs? According to the server logs, step #5 happened (at 18:44:38.016)
> before step #4 happened (at 18:44:38.043).
>

Luckily these logs have the disconnection times of the tap test client
sessions as well. (not sure why I don't see these when I run these
tests).

Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063
18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG:  statement: SELECT
pid != 16336 FROM pg_stat_replication WHERE application_name =
'tap_sub';
:
:
18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG:  disconnection:
session time: 0:00:00.063 user=nm database=postgres host=[local]

When the query starts both walsenders are present but when the query
completes both walsenders are gone, the actual query evaluation could
have happened any time in between. This is the rare timing window that
causes this problem.

regards,
Ajin Cherian
Fujitsu Australia

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

26 August 2021, 06:06:24

On Thu, Aug 26, 2021 at 7:38 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Thu, Aug 26, 2021 at 11:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
>
> Luckily these logs have the disconnection times of the tap test client
> sessions as well. (not sure why I don't see these when I run these
> tests).
>
> Step 5 could have happened anywhere between 18:44:38.016 and 18:44:38.063
> 18:44:38.016 CEST [16474:3] 001_rep_changes.pl LOG:  statement: SELECT
> pid != 16336 FROM pg_stat_replication WHERE application_name =
> 'tap_sub';
> :
> :
> 18:44:38.063 CEST [16474:4] 001_rep_changes.pl LOG:  disconnection:
> session time: 0:00:00.063 user=nm database=postgres host=[local]
>
> When the query starts both walsenders are present but when the query
> completes both walsenders are gone, the actual query evaluation could
> have happened any time in between. This is the rare timing window that
> causes this problem.
>

You have a point but if we see the below logs, it seems the second
walsender (#step6) seemed to exited before the first walsender
(#step4).

2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
session time: 0:00:00.036 user=nm database=postgres host=[local]
2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
session time: 0:00:06.367 user=nm database=postgres host=[local]

Isn't it possible that pid is cleared in the other order due to which
we are seeing this problem?

-- 
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

26 August 2021, 06:51:06

On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

>
> You have a point but if we see the below logs, it seems the second
> walsender (#step6) seemed to exited before the first walsender
> (#step4).
>
> 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> session time: 0:00:00.036 user=nm database=postgres host=[local]
> 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> session time: 0:00:06.367 user=nm database=postgres host=[local]
>
> Isn't it possible that pid is cleared in the other order due to which
> we are seeing this problem?

If the pid is cleared in the other order, wouldn't the query [1] return a false?

[1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
application_name =  'tap_sub';"

regards,
Ajin Cherian
Fujitsu Australia

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

26 August 2021, 06:54:08

On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > You have a point but if we see the below logs, it seems the second
> > walsender (#step6) seemed to exited before the first walsender
> > (#step4).
> >
> > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> > session time: 0:00:06.367 user=nm database=postgres host=[local]
> >
> > Isn't it possible that pid is cleared in the other order due to which
> > we are seeing this problem?
>
> If the pid is cleared in the other order, wouldn't the query [1] return a false?
>
> [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
> application_name =  'tap_sub';"
>

I think it should return true because pid for 16336 is cleared first
and the remaining one will be 16475.

-- 
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

26 August 2021, 06:59:22

On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > >
> > > You have a point but if we see the below logs, it seems the second
> > > walsender (#step6) seemed to exited before the first walsender
> > > (#step4).
> > >
> > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> > > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> > > session time: 0:00:06.367 user=nm database=postgres host=[local]
> > >
> > > Isn't it possible that pid is cleared in the other order due to which
> > > we are seeing this problem?
> >
> > If the pid is cleared in the other order, wouldn't the query [1] return a false?
> >
> > [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
> > application_name =  'tap_sub';"
> >
>
> I think it should return true because pid for 16336 is cleared first
> and the remaining one will be 16475.

Yes, that was what I explained as well. 16336 is PID 'a' (first
walsender) in my explanation. The first walsender should
be cleared first for this theory to work.

regards,
Ajin Cherian
Fujitsu Australia

Re: Failure of subscription tests with topminnow

From

Masahiko Sawada

Date:

26 August 2021, 07:45:21

On Thu, Aug 26, 2021 at 12:59 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Thu, Aug 26, 2021 at 1:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Aug 26, 2021 at 9:21 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > On Thu, Aug 26, 2021 at 1:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > >
> > > > You have a point but if we see the below logs, it seems the second
> > > > walsender (#step6) seemed to exited before the first walsender
> > > > (#step4).
> > > >
> > > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> > > > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> > > > session time: 0:00:06.367 user=nm database=postgres host=[local]
> > > >
> > > > Isn't it possible that pid is cleared in the other order due to which
> > > > we are seeing this problem?
> > >
> > > If the pid is cleared in the other order, wouldn't the query [1] return a false?
> > >
> > > [1] - " SELECT pid != 16336 FROM pg_stat_replication WHERE
> > > application_name =  'tap_sub';"
> > >
> >
> > I think it should return true because pid for 16336 is cleared first
> > and the remaining one will be 16475.
>
> Yes, that was what I explained as well. 16336 is PID 'a' (first
> walsender) in my explanation. The first walsender should
> be cleared first for this theory to work.

I think that it’s possible that the orders of the process writing
disconnections logs and setting 0 to walsender's pid are mismatched.
We set 0 to walsender's pid in WalSndKill() that is called during
on_shmem_exit callback. Once we set 0, pg_stat_replication doesn't
show the entry. On the other hand, disconnections logs are written by
log_disconnections() that is called during on_proc_exit callback. That
is, the following sequence could happen:

1. the second walsender (pid = 16475) raises an error as the slot is
already active (held by the first walsender).
2. the first walsender (pid = 16336) clears its pid on the shmem.
3. the polling query checks the walsender’s pid, and returns true
(since there is only the second walsender now).
4. the second walsender clears its pid on the shmem.
5. the second walsender writes disconnection log.
6. the first walsender writes disconneciton log.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

26 August 2021, 13:58:40

On Thu, Aug 26, 2021 at 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I think that it’s possible that the orders of the process writing
> disconnections logs and setting 0 to walsender's pid are mismatched.
> We set 0 to walsender's pid in WalSndKill() that is called during
> on_shmem_exit callback. Once we set 0, pg_stat_replication doesn't
> show the entry. On the other hand, disconnections logs are written by
> log_disconnections() that is called during on_proc_exit callback. That
> is, the following sequence could happen:
>
> 1. the second walsender (pid = 16475) raises an error as the slot is
> already active (held by the first walsender).
> 2. the first walsender (pid = 16336) clears its pid on the shmem.
> 3. the polling query checks the walsender’s pid, and returns true
> (since there is only the second walsender now).
> 4. the second walsender clears its pid on the shmem.
> 5. the second walsender writes disconnection log.
> 6. the first walsender writes disconneciton log.

I agree with this.

Attaching a patch on head that modifies this particular script to also
consider the state of the walsender.

regards,
Ajin Cherian
Fujitsu Australia

Attachment

v1-0001-fix-for-tap-test-failure-seen-in-001_rep_changes.patch

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

27 August 2021, 08:29:40

On Thu, Aug 26, 2021 at 4:28 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Thu, Aug 26, 2021 at 2:45 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Attaching a patch on head that modifies this particular script to also
> consider the state of the walsender.
>

I think the fix is correct but similar changes are required in
022_twophase_cascade.pl as well (search for $oldpid in tests). I am
not completely sure but I think it is better to make this test change
in back branches as well to make it stable and reduce such random
build farm failures.

-- 
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

27 August 2021, 14:03:17

On Fri, Aug 27, 2021 at 3:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> I think the fix is correct but similar changes are required in
> 022_twophase_cascade.pl as well (search for $oldpid in tests). I am
> not completely sure but I think it is better to make this test change
> in back branches as well to make it stable and reduce such random
> build farm failures.

I have made the changes in 022_twophase_cascade.pl for HEAD as well as
patches for older branches.

regards,
Ajin Cherian
Fujitsu Australia

Attachment

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

30 August 2021, 12:52:46

On Fri, Aug 27, 2021 at 4:33 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Fri, Aug 27, 2021 at 3:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > I think the fix is correct but similar changes are required in
> > 022_twophase_cascade.pl as well (search for $oldpid in tests). I am
> > not completely sure but I think it is better to make this test change
> > in back branches as well to make it stable and reduce such random
> > build farm failures.
>
> I have made the changes in 022_twophase_cascade.pl for HEAD as well as
> patches for older branches.
>

Isn't it better to check the streaming state when we are fetching
oldpid? If we don't add, then I suspect that the next time someone
adding tests on similar lines might get confused about where to check
the state and where not. Also, if you agree, add some comments before
the test on why it is important to check states.

For ex., in below queries, the queries used for $oldpid.
my $oldpid = $node_publisher->safe_psql('postgres',
        "SELECT pid FROM pg_stat_replication WHERE application_name =
'tap_sub';"
);
$node_subscriber->safe_psql('postgres',
        "ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
sslmode=disable'"
);
$node_publisher->poll_query_until('postgres',
        "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart";

$oldpid = $node_publisher->safe_psql('postgres',
        "SELECT pid FROM pg_stat_replication WHERE application_name =
'tap_sub';"
);
$node_subscriber->safe_psql('postgres',
        "ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only
WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
        "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"


-- 
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

30 August 2021, 15:18:06

On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> Isn't it better to check the streaming state when we are fetching
> oldpid? If we don't add, then I suspect that the next time someone
> adding tests on similar lines might get confused about where to check
> the state and where not. Also, if you agree, add some comments before
> the test on why it is important to check states.
>
> For ex., in below queries, the queries used for $oldpid.
> my $oldpid = $node_publisher->safe_psql('postgres',
>         "SELECT pid FROM pg_stat_replication WHERE application_name =
> 'tap_sub';"
> );
> $node_subscriber->safe_psql('postgres',
>         "ALTER SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr
> sslmode=disable'"
> );
> $node_publisher->poll_query_until('postgres',
>         "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub' AND state = 'streaming';"
> ) or die "Timed out while waiting for apply to restart";
>
> $oldpid = $node_publisher->safe_psql('postgres',
>         "SELECT pid FROM pg_stat_replication WHERE application_name =
> 'tap_sub';"
> );
> $node_subscriber->safe_psql('postgres',
>         "ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only
> WITH (copy_data = false)"
> );
> $node_publisher->poll_query_until('postgres',
>         "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub' AND state = 'streaming';"
>

I have made the above changes on HEAD.

regards,
Ajin Cherian
Fujitsu Australia

Attachment

v3-0001-Fix-the-random-test-failure-in-001_rep_changes.patch

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

31 August 2021, 06:11:41

On Mon, Aug 30, 2021 at 5:48 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have made the above changes on HEAD.
>

Thanks, this looks mostly good to me. I'll push and backpatch this
tomorrow unless you or someone else thinks otherwise.

Minor comments
==============
1.
 $oldpid = $node_publisher->safe_psql('postgres',
- "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
+ "SELECT pid FROM pg_stat_replication WHERE application_name =
'tap_sub' AND state = 'streaming';;"
 );

An extra semicolon at the end of the statement.

2.
+# restart of subscription workers. We check the state along with
application_name
+# to ensure that the walsender is (re)started.

It is better to keep application_name in an above comment in the
second line as that will make this line looks a bit more consistent
with other comments.

3. In commit message, the text: "The reason was that the test was
assuming the walsender started before it reaches the 'streaming' state
and The check to test whether the subscription workers were restarting
after a change in the subscription was failing." seems to be
repeated/redundant.

4. Kindly submit the patches for back-branches.

-- 
With Regards,
Amit Kapila.

Re: Failure of subscription tests with topminnow

From

Masahiko Sawada

Date:

31 August 2021, 08:47:10

On Tue, Aug 31, 2021 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Aug 30, 2021 at 5:48 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have made the above changes on HEAD.
> >
>
> Thanks, this looks mostly good to me. I'll push and backpatch this
> tomorrow unless you or someone else thinks otherwise.
>
> Minor comments
> ==============
> 1.
>  $oldpid = $node_publisher->safe_psql('postgres',
> - "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
> + "SELECT pid FROM pg_stat_replication WHERE application_name =
> 'tap_sub' AND state = 'streaming';;"
>  );
>
> An extra semicolon at the end of the statement.
>
> 2.
> +# restart of subscription workers. We check the state along with
> application_name
> +# to ensure that the walsender is (re)started.
>
> It is better to keep application_name in an above comment in the
> second line as that will make this line looks a bit more consistent
> with other comments.
>
> 3. In commit message, the text: "The reason was that the test was
> assuming the walsender started before it reaches the 'streaming' state
> and The check to test whether the subscription workers were restarting
> after a change in the subscription was failing." seems to be
> repeated/redundant.
>
> 4. Kindly submit the patches for back-branches.

The patch with the above comments looks good to me. One minor
suggestion is to change the two messages of "die" to make the
investigation a bit easier. For example,

$node_publisher->poll_query_until('postgres',
        "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart after changing
CONNECTION";

and

$node_publisher->poll_query_until('postgres',
        "SELECT pid != $oldpid FROM pg_stat_replication WHERE
application_name = 'tap_sub' AND state = 'streaming';"
) or die "Timed out while waiting for apply to restart after changing
PUBLICATION";

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: Failure of subscription tests with topminnow

From

Ajin Cherian

Date:

31 August 2021, 15:43:45

On Tue, Aug 31, 2021 at 3:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Aug 31, 2021 at 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Aug 30, 2021 at 5:48 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > On Mon, Aug 30, 2021 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > I have made the above changes on HEAD.
> > >
> >
> > Thanks, this looks mostly good to me. I'll push and backpatch this
> > tomorrow unless you or someone else thinks otherwise.
> >
> > Minor comments
> > ==============
> > 1.
> >  $oldpid = $node_publisher->safe_psql('postgres',
> > - "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
> > + "SELECT pid FROM pg_stat_replication WHERE application_name =
> > 'tap_sub' AND state = 'streaming';;"
> >  );
> >
> > An extra semicolon at the end of the statement.
> >
> > 2.
> > +# restart of subscription workers. We check the state along with
> > application_name
> > +# to ensure that the walsender is (re)started.
> >
> > It is better to keep application_name in an above comment in the
> > second line as that will make this line looks a bit more consistent
> > with other comments.
> >
> > 3. In commit message, the text: "The reason was that the test was
> > assuming the walsender started before it reaches the 'streaming' state
> > and The check to test whether the subscription workers were restarting
> > after a change in the subscription was failing." seems to be
> > repeated/redundant.
> >
> > 4. Kindly submit the patches for back-branches.
>
> The patch with the above comments looks good to me. One minor
> suggestion is to change the two messages of "die" to make the
> investigation a bit easier. For example,
>
> $node_publisher->poll_query_until('postgres',
>         "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub' AND state = 'streaming';"
> ) or die "Timed out while waiting for apply to restart after changing
> CONNECTION";
>
> and
>
> $node_publisher->poll_query_until('postgres',
>         "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub' AND state = 'streaming';"
> ) or die "Timed out while waiting for apply to restart after changing
> PUBLICATION";
>
> Regards,
>

Thanks Masahiko-san. I have included this change and made a new patch-set.

Hi Amit,

I have included your comments as well and also attached the patches
for the back-branches.

regards,
Ajin Cherian
Fujitsu Australia

Attachment

Re: Failure of subscription tests with topminnow

From

Amit Kapila

Date:

01 September 2021, 11:24:00

On Tue, Aug 31, 2021 at 6:13 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Aug 31, 2021 at 3:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Thanks Masahiko-san. I have included this change and made a new patch-set.
>
> Hi Amit,
>
> I have included your comments as well and also attached the patches
> for the back-branches.
>

You forgot to make changes in 022_twophase_cascade.pl in the head
version patch. I have made the required changes and pushed it. Thanks
to you and Sawada-San for diagnosing and fixing this problem.

-- 
With Regards,
Amit Kapila.