[HACKERS] More race conditions in logical replication
From | Tom Lane |
---|---|
Subject | [HACKERS] More race conditions in logical replication |
Date | |
Msg-id | 11904.1499039688@sss.pgh.pa.us |
Responses | Re: [HACKERS] More race conditions in logical replication |
| Re: [HACKERS] More race conditions in logical replication |
List | pgsql-hackers |
I noticed a recent failure that looked suspiciously like a race condition:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07

The critical bit in the log file is

```
error running SQL: 'psql:<stdin>:1: ERROR:  could not drop the replication slot "tap_sub" on publisher
DETAIL:  The error was: ERROR:  replication slot "tap_sub" is active for PID 3866790'
while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1'
with sql 'DROP SUBSCRIPTION tap_sub'
at /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm line 1198.
```

After poking at it a bit, I found that I can cause several different failures of this ilk in the subscription tests by injecting delays at the points where a slot's active_pid is about to be cleared, as in the attached patch (which also adds some extra printouts for debugging purposes; none of that is meant for commit). It seems clear that there is inadequate interlocking going on when we kill and restart a logical rep worker: we're trying to start a new one before the old one has gotten out of the slot.

I'm not particularly interested in fixing this myself, so I'm just going to add it to the open items list.

			regards, tom lane

```
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index dc7de20..78cd416 100644
*** a/src/backend/replication/slot.c
--- b/src/backend/replication/slot.c
*************** ReplicationSlotAcquire(const char *name)
*** 358,364 ****
  	if (active_pid != MyProcPid)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_IN_USE),
! 				 errmsg("replication slot \"%s\" is active for PID %d",
  						name, active_pid)));
  
  	/* We made this slot active, so it's ours now. */
--- 358,364 ----
  	if (active_pid != MyProcPid)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_IN_USE),
! 				 errmsg("replication slot \"%s\" is active for PID %d (acq)",
  						name, active_pid)));
  
  	/* We made this slot active, so it's ours now. */
*************** ReplicationSlotRelease(void)
*** 391,396 ****
--- 391,398 ----
  	 * Mark persistent slot inactive.  We're not freeing it, just
  	 * disconnecting.
  	 */
+ 	pg_usleep(100000);
+ 	elog(LOG, "ReplicationSlotRelease: unmarking persistent slot");
  	SpinLockAcquire(&slot->mutex);
  	slot->active_pid = 0;
  	SpinLockRelease(&slot->mutex);
*************** ReplicationSlotDropPtr(ReplicationSlot *
*** 523,528 ****
--- 525,532 ----
  {
  	bool		fail_softly = slot->data.persistency != RS_PERSISTENT;
  
+ 	pg_usleep(100000);
+ 	elog(LOG, "ReplicationSlotDropPtr: unmarking slot after rename fail");
  	SpinLockAcquire(&slot->mutex);
  	slot->active_pid = 0;
  	SpinLockRelease(&slot->mutex);
*************** ReplicationSlotDropPtr(ReplicationSlot *
*** 540,545 ****
--- 544,551 ----
  	 * and nobody can be attached to this slot and thus access it without
  	 * scanning the array.
  	 */
+ 	pg_usleep(100000);
+ 	elog(LOG, "ReplicationSlotDropPtr: unmarking slot");
  	LWLockAcquire(ReplicationSlotControlLock, LW_EXCLUSIVE);
  	slot->active_pid = 0;
  	slot->in_use = false;
*************** restart:
*** 876,882 ****
  		if (active_pid)
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_IN_USE),
! 					 errmsg("replication slot \"%s\" is active for PID %d",
  							slotname, active_pid)));
  
  		/*
--- 882,888 ----
  		if (active_pid)
  			ereport(ERROR,
  					(errcode(ERRCODE_OBJECT_IN_USE),
! 					 errmsg("replication slot \"%s\" is active for PID %d (drop)",
  							slotname, active_pid)));
  
  		/*
```
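The patch above only demonstrates the race by widening the window. As for the missing interlock itself, one conceivable shape for it (purely a sketch, not part of the debugging patch and not existing slot.c code) would be for the acquiring backend to wait for the old owner to clear active_pid instead of erroring out immediately. The active_cv condition-variable field and the WaitForSlotToBeFree() helper below are hypothetical additions, assuming active_cv were added to ReplicationSlot and initialized alongside the slot's spinlock:

```c
/*
 * Hypothetical sketch, not existing slot.c code: wait for a slot's
 * previous owner to release it instead of failing immediately with
 * "replication slot ... is active for PID n".  Assumes a new
 * ConditionVariable field active_cv in ReplicationSlot, initialized
 * together with the slot's spinlock in ReplicationSlotsShmemInit().
 */
#include "postgres.h"

#include "pgstat.h"
#include "replication/slot.h"
#include "storage/condition_variable.h"
#include "storage/spin.h"

static void
WaitForSlotToBeFree(ReplicationSlot *slot)
{
	/* register as a waiter before checking, to avoid a lost wakeup */
	ConditionVariablePrepareToSleep(&slot->active_cv);	/* hypothetical field */

	for (;;)
	{
		pid_t		active_pid;

		SpinLockAcquire(&slot->mutex);
		active_pid = slot->active_pid;
		SpinLockRelease(&slot->mutex);

		if (active_pid == 0)
			break;				/* previous owner is gone; safe to acquire */

		/* sleep until the releasing process broadcasts on active_cv */
		ConditionVariableSleep(&slot->active_cv, PG_WAIT_IPC);
	}

	ConditionVariableCancelSleep();
}
```

For this to work, the releasing side (ReplicationSlotRelease() and ReplicationSlotDropPtr()) would have to do ConditionVariableBroadcast(&slot->active_cv) right after clearing active_pid, so a would-be owner wakes up only once the slot is genuinely free rather than retrying and failing against the still-exiting worker.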