[HACKERS] More race conditions in logical replication - Mailing list pgsql-hackers

From Tom Lane
Subject [HACKERS] More race conditions in logical replication
Date
Msg-id 11904.1499039688@sss.pgh.pa.us
Whole thread Raw
Responses Re: [HACKERS] More race conditions in logical replication
Re: [HACKERS] More race conditions in logical replication
List pgsql-hackers
I noticed a recent failure that looked suspiciously like a race condition:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07

The critical bit in the log file is

error running SQL: 'psql:<stdin>:1: ERROR:  could not drop the replication slot "tap_sub" on publisher
DETAIL:  The error was: ERROR:  replication slot "tap_sub" is active for PID 3866790'
while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP
SUBSCRIPTIONtap_sub' at
/home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pmline 1198. 

After poking at it a bit, I found that I can cause several different
failures of this ilk in the subscription tests by injecting delays at
the points where a slot's active_pid is about to be cleared, as in the
attached patch (which also adds some extra printouts for debugging
purposes; none of that is meant for commit).  It seems clear that there
is inadequate interlocking going on when we kill and restart a logical
rep worker: we're trying to start a new one before the old one has
gotten out of the slot.

I'm not particularly interested in fixing this myself, so I'm just
going to add it to the open items list.

            regards, tom lane

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index dc7de20..78cd416 100644
*** a/src/backend/replication/slot.c
--- b/src/backend/replication/slot.c
*************** ReplicationSlotAcquire(const char *name)
*** 358,364 ****
      if (active_pid != MyProcPid)
          ereport(ERROR,
                  (errcode(ERRCODE_OBJECT_IN_USE),
!                  errmsg("replication slot \"%s\" is active for PID %d",
                          name, active_pid)));

      /* We made this slot active, so it's ours now. */
--- 358,364 ----
      if (active_pid != MyProcPid)
          ereport(ERROR,
                  (errcode(ERRCODE_OBJECT_IN_USE),
!                  errmsg("replication slot \"%s\" is active for PID %d (acq)",
                          name, active_pid)));

      /* We made this slot active, so it's ours now. */
*************** ReplicationSlotRelease(void)
*** 391,396 ****
--- 391,398 ----
           * Mark persistent slot inactive.  We're not freeing it, just
           * disconnecting.
           */
+         pg_usleep(100000);
+         elog(LOG, "ReplicationSlotRelease: unmarking persistent slot");
          SpinLockAcquire(&slot->mutex);
          slot->active_pid = 0;
          SpinLockRelease(&slot->mutex);
*************** ReplicationSlotDropPtr(ReplicationSlot *
*** 523,528 ****
--- 525,532 ----
      {
          bool        fail_softly = slot->data.persistency != RS_PERSISTENT;

+         pg_usleep(100000);
+         elog(LOG, "ReplicationSlotDropPtr: unmarking slot after rename fail");
          SpinLockAcquire(&slot->mutex);
          slot->active_pid = 0;
          SpinLockRelease(&slot->mutex);
*************** ReplicationSlotDropPtr(ReplicationSlot *
*** 540,545 ****
--- 544,551 ----
       * and nobody can be attached to this slot and thus access it without
       * scanning the array.
       */
+     pg_usleep(100000);
+     elog(LOG, "ReplicationSlotDropPtr: unmarking slot");
      LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
      slot->active_pid = 0;
      slot->in_use = false;
*************** restart:
*** 876,882 ****
          if (active_pid)
              ereport(ERROR,
                      (errcode(ERRCODE_OBJECT_IN_USE),
!                      errmsg("replication slot \"%s\" is active for PID %d",
                              slotname, active_pid)));

          /*
--- 882,888 ----
          if (active_pid)
              ereport(ERROR,
                      (errcode(ERRCODE_OBJECT_IN_USE),
!                      errmsg("replication slot \"%s\" is active for PID %d (drop)",
                              slotname, active_pid)));

          /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: protocol version negotiation (Re: [HACKERS] Libpq PGRES_COPY_BOTH- version compatibility)
Next
From: Michael Paquier
Date:
Subject: [HACKERS] Re: pg_ctl wait exit code (was Re: [COMMITTERS] pgsql: Additionaltests for subtransactions in recovery)