Thread: Failure in subscription test 004_sync.pl
Hi all, wrasse has just failed with what looks like a timing error with a replication slot drop: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30 Here is the error: error running SQL: 'psql:<stdin>:1: ERROR: could not drop replication slot "tap_sub" on publisher: ERROR: replication slot "tap_sub" is active for PID 1641' It seems to me that this just lacks a poll_query_until() doing some slot monitoring? I have not checked in details if we need to do that in more places. The code path that failed has been added in 7c4f524 from 2017. Thanks, -- Michael
Attachment
On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <michael@paquier.xyz> wrote: > > wrasse has just failed with what looks like a timing error with a > replication slot drop: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30 > > Here is the error: > error running SQL: 'psql:<stdin>:1: ERROR: could not drop replication > slot "tap_sub" on publisher: ERROR: replication slot "tap_sub" is > active for PID 1641' > > It seems to me that this just lacks a poll_query_until() doing some > slot monitoring? > I think it is showing a race condition issue in the code. In DropSubscription, we first stop the worker that is receiving the WAL, and then in a separate connection with the publisher, it tries to drop the slot which leads to this error. The reason is that walsender is still active as we just wait for wal receiver (or apply worker) to stop. Normally, as soon as the apply worker is stopped the walsender detects it and exits but in this case, it took some time to exit, and in the meantime, we tried to drop the slot which is still in use by walsender. If we want to fix this, we might want to wait till the slot is active on the publisher before trying to drop it but not sure if it is a good idea. In the worst case, if the user retries this operation (Drop Subscription), it will succeed. -- With Regards, Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes: > On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <michael@paquier.xyz> wrote: >> wrasse has just failed with what looks like a timing error with a >> replication slot drop: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30 > If we want to fix this, we might want to wait till the slot is active > on the publisher before trying to drop it but not sure if it is a good > idea. In the worst case, if the user retries this operation (Drop > Subscription), it will succeed. wrasse's not the only animal reporting this type of failure. See also https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2021-06-12%2011%3A32%3A04 error running SQL: 'psql:<stdin>:1: ERROR: could not drop replication slot "pg_16387_sync_16384_6972886888894805957" onpublisher: ERROR: replication slot "pg_16387_sync_16384_6972886888894805957" is active for PID 2971625' while running 'psql -XAtq -d port=60321 host=/tmp/vdQmH7ijFI dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTIONtestsub2' at /home/bf/build/buildfarm-komodoensis/HEAD/pgsql.build/../pgsql/src/test/perl/PostgresNode.pm line1771. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-06-11%2020%3A30%3A28 error running SQL: 'psql:<stdin>:1: ERROR: could not drop replication slot "testsub2" on publisher: ERROR: replicationslot "testsub2" is active for PID 27175' while running 'psql -XAtq -d port=59579 host=/tmp/9Qchjsykek dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP SUBSCRIPTIONtestsub2' at /home/pgbf/buildroot/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pmline 1771. Those are both in the t/100_bugs.pl test script, but otherwise they look like the exact same thing. I don't think that it's optional to fix a problem that occurs as often as three times in 10 days in the buildfarm. regards, tom lane
On Sat, Jun 12, 2021 at 9:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > wrasse has just failed with what looks like a timing error with a > > replication slot drop: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30 > > > > Here is the error: > > error running SQL: 'psql:<stdin>:1: ERROR: could not drop replication > > slot "tap_sub" on publisher: ERROR: replication slot "tap_sub" is > > active for PID 1641' > > > > It seems to me that this just lacks a poll_query_until() doing some > > slot monitoring? > > > > I think it is showing a race condition issue in the code. In > DropSubscription, we first stop the worker that is receiving the WAL, > and then in a separate connection with the publisher, it tries to drop > the slot which leads to this error. The reason is that walsender is > still active as we just wait for wal receiver (or apply worker) to > stop. Normally, as soon as the apply worker is stopped the walsender > detects it and exits but in this case, it took some time to exit, and > in the meantime, we tried to drop the slot which is still in use by > walsender. There might be possible. That's weird since DROP SUBSCRIPTION executes DROP_REPLICATION_SLOT command with WAIT option. I found a bug that is possibly an oversight of commit 1632ea4368. The commit changed the code around the error as follows: if (active_pid != MyProcPid) { - if (behavior == SAB_Error) + if (!nowait) ereport(ERROR, (errcode(ERRCODE_OBJECT_IN_USE), errmsg("replication slot \"%s\" is active for PID %d", NameStr(s->data.name), active_pid))); - else if (behavior == SAB_Inquire) - return active_pid; /* Wait here until we get signaled, and then restart */ ConditionVariableSleep(&s->active_cv, The condition should be the opposite; we should raise the error when 'nowait' is true. I think this is the cause of the test failure. Even if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we don't wait but raise the error. Attached a small patch fixes it. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Attachment
On Mon, Jun 14, 2021 at 10:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I think it is showing a race condition issue in the code. In > > DropSubscription, we first stop the worker that is receiving the WAL, > > and then in a separate connection with the publisher, it tries to drop > > the slot which leads to this error. The reason is that walsender is > > still active as we just wait for wal receiver (or apply worker) to > > stop. Normally, as soon as the apply worker is stopped the walsender > > detects it and exits but in this case, it took some time to exit, and > > in the meantime, we tried to drop the slot which is still in use by > > walsender. > > There might be possible. > > That's weird since DROP SUBSCRIPTION executes DROP_REPLICATION_SLOT > command with WAIT option. I found a bug that is possibly an oversight > of commit 1632ea4368. > .. > > The condition should be the opposite; we should raise the error when > 'nowait' is true. I think this is the cause of the test failure. Even > if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we > don't wait but raise the error. > > Attached a small patch fixes it. > Yes, this should fix the recent buildfarm failures. Alvaro, would you like to take care of this? -- With Regards, Amit Kapila.
On 2021-Jun-14, Amit Kapila wrote: > On Mon, Jun 14, 2021 at 10:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > The condition should be the opposite; we should raise the error when > > 'nowait' is true. I think this is the cause of the test failure. Even > > if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we > > don't wait but raise the error. > > Yes, this should fix the recent buildfarm failures. Alvaro, would you > like to take care of this? Ugh, thanks for CCing me. Yes, I'll get this fixed ASAP. -- Álvaro Herrera Valdivia, Chile "Saca el libro que tu religión considere como el indicado para encontrar la oración que traiga paz a tu alma. Luego rebootea el computador y ve si funciona" (Carlos Duclós)
On 2021-Jun-14, Masahiko Sawada wrote: > The condition should be the opposite; we should raise the error when > 'nowait' is true. I think this is the cause of the test failure. Even > if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we > don't wait but raise the error. Hi, thanks for diagnosing this and producing a patch! I ended up turning the condition around, so that all three stanzas still test "!nowait"; which seems a bit easier to follow. TBH I'm quite troubled by the fact that this test only failed once on each animal; they all had a lot of successful runs after that. I wonder if this is because coverage is insufficient, or is it just bad luck. I also wonder if this bug is what caused the random failures in the test case I tried to add. I should look at that some more now ... -- Álvaro Herrera Valdivia, Chile Al principio era UNIX, y UNIX habló y dijo: "Hello world\n". No dijo "Hello New Jersey\n", ni "Hello USA\n".