Re: Failure of subscription tests with topminnow - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Failure of subscription tests with topminnow |
Date | |
Msg-id | CAA4eK1KoycXH_OCO9iKbwr0pakPQ1+5RWq8UZPhw6MPFkU-MLQ@mail.gmail.com Whole thread Raw |
In response to | Re: Failure of subscription tests with topminnow (Ajin Cherian <itsajin@gmail.com>) |
Responses |
Re: Failure of subscription tests with topminnow
|
List | pgsql-hackers |
On Tue, Aug 24, 2021 at 11:04 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Mon, Aug 16, 2021 at 6:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > The "ALTER SUBSCRIPTION tap_sub CONNECTION" would lead to restart the > > walsender process. Now, here the problem seems to be that the previous > > walsender process (16336) didn't exit and the new process started to > > use the slot which leads to the error shown in the log. This is > > evident from the below part of log where we can see that 16336 has > > exited after new process started to use the slot. > > > > 2021-08-15 18:44:38.027 CEST [16475:6] tap_sub LOG: received > > replication command: START_REPLICATION SLOT "tap_sub" LOGICAL > > 0/16BEEB0 (proto_version '1', publication_names > > '"tap_pub","tap_pub_ins_only"') > > 2021-08-15 18:44:38.027 CEST [16475:7] tap_sub STATEMENT: > > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', > > publication_names '"tap_pub","tap_pub_ins_only"') > > 2021-08-15 18:44:38.027 CEST [16475:8] tap_sub ERROR: replication > > slot "tap_sub" is active for PID 16336 > > 2021-08-15 18:44:38.027 CEST [16475:9] tap_sub STATEMENT: > > START_REPLICATION SLOT "tap_sub" LOGICAL 0/16BEEB0 (proto_version '1', > > publication_names '"tap_pub","tap_pub_ins_only"') > > 2021-08-15 18:44:38.041 CEST [16475:10] tap_sub LOG: disconnection: > > session time: 0:00:00.036 user=nm database=postgres host=[local] > > 2021-08-15 18:44:38.043 CEST [16336:14] tap_sub LOG: disconnection: > > session time: 0:00:06.367 user=nm database=postgres host=[local] > > > > One idea to solve this is to first disable the subscription, wait for > > the walsender process to exit, and then change the connection string > > and re-enable the subscription. > > I tried to simulate this by putting delays prior to the exit of the > walsender. Even though I see the same error > in the logs that the replication slot is active for the previous PID, > the test case does not fail in the way seen in this case here.. > > The new walsender tries to acquire the slot but seeing that there is > another PID that is active on the slot, it > errors and exits. At no point does this new failed walsender update > its pid for the slot. As a result, the below polled query > would not return a true value > > $node_publisher->poll_query_until('postgres', > "SELECT pid != $oldpid FROM pg_stat_replication WHERE > application_name = 'tap_sub';" > ) or die "Timed out while waiting for apply to restart"; > > In my runs I see that this query is repeated until a new walsender is > able to successfully acquire the slot. > What happens when there is neither a new walsender nor an old walsender is present? It means to run the above statement when a new walsender is exited due to error "... slot is active ..." and before a new walsender could start. Also, allow old walsender (due to which the error occurs) to exit before executing the statement. Also, it seems this failure happens on REL_11_STABLE branch, not sure if that matters, but it is better to use the same branch if you are using a different branch to reproduce the issue. -- With Regards, Amit Kapila.
pgsql-hackers by date: