Re: Failure of subscription tests with topminnow - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Failure of subscription tests with topminnow
Date
Msg-id CAA4eK1KoycXH_OCO9iKbwr0pakPQ1+5RWq8UZPhw6MPFkU-MLQ@mail.gmail.com
Whole thread Raw
In response to Re: Failure of subscription tests with topminnow  (Ajin Cherian <itsajin@gmail.com>)
Responses Re: Failure of subscription tests with topminnow
List pgsql-hackers
On Tue, Aug 24, 2021 at 11:04 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the
> > walsender process. Now, here the problem seems to be that the previous
> > walsender process (16336) didn't exit and the new process started to
> > use the slot which leads to the error shown in the log. This is
> > evident from the below part of log where we can see that 16336 has
> > exited after new process started to use the slot.
> >
> > 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG:  received
> > replication command: START_REPLICATION SLOT "tap_sub" LOGICAL
> > 0/16BEEB0 (proto_version '1', publication_names
> > '"tap_pub","tap_pub_ins_only"')
> > 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT:
> > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> > publication_names '"tap_pub","tap_pub_ins_only"')
> > 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR:  replication
> > slot "tap_sub" is active for PID 16336
> > 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT:
> > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1',
> > publication_names '"tap_pub","tap_pub_ins_only"')
> > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG:  disconnection:
> > session time: 0:00:00.036 user=nm database=postgres host=[local]
> > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG:  disconnection:
> > session time: 0:00:06.367 user=nm database=postgres host=[local]
> >
> > One idea to solve this is to first disable the subscription, wait for
> > the walsender process to exit, and then change the connection string
> > and re-enable the subscription.
>
> I tried to simulate this by putting delays prior to the exit of the
> walsender. Even though I see the same error
> in the logs that the replication slot is active for the previous PID,
> the test case does not fail in the way seen in this case here..
>
> The new walsender tries to acquire the slot but seeing that there is
> another PID that is active on the slot, it
> errors and exits. At no point does this new failed walsender update
> its pid for the slot. As a result, the below polled query
> would not return a true value
>
> $node_publisher->poll_query_until('postgres',
>     "SELECT pid != $oldpid FROM pg_stat_replication WHERE
> application_name = 'tap_sub';"
> ) or die "Timed out while waiting for apply to restart";
>
> In my runs I see that this query is repeated until a new walsender is
> able to successfully acquire the slot.
>

What happens when there is neither a new walsender nor an old
walsender is present? It means to run the above statement when a new
walsender is exited due to error "... slot is active ..." and before a
new walsender could start. Also, allow old walsender (due to which the
error occurs) to exit before executing the statement.

Also, it seems this failure happens on REL_11_STABLE branch, not sure
if that matters, but it is better to use the same branch if you are
using a different branch to reproduce the issue.

-- 
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Laurenz Albe
Date:
Subject: Re: pgstat_send_connstats() introduces unnecessary timestamp and UDP overhead
Next
From: Ranier Vilela
Date:
Subject: Re: Showing I/O timings spent reading/writing temp buffers in EXPLAIN