Re: [HACKERS] More race conditions in logical replication - Mailing list pgsql-hackers

From Petr Jelinek
Subject Re: [HACKERS] More race conditions in logical replication
Date
Msg-id 46ad9139-10b8-f2cf-baaa-9bede4b5e038@2ndquadrant.com
In response to [HACKERS] More race conditions in logical replication  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] More race conditions in logical replication
List pgsql-hackers
On 03/07/17 01:54, Tom Lane wrote:
> I noticed a recent failure that looked suspiciously like a race condition:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-02%2018%3A02%3A07
> 
> The critical bit in the log file is
> 
> error running SQL: 'psql:<stdin>:1: ERROR:  could not drop the replication slot "tap_sub" on publisher
> DETAIL:  The error was: ERROR:  replication slot "tap_sub" is active for PID 3866790'
> while running 'psql -XAtq -d port=59543 host=/tmp/QpCJtafT7R dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql
> 'DROP SUBSCRIPTION tap_sub' at
> /home/nm/farm/xlc64/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pm line 1198.
> 
> After poking at it a bit, I found that I can cause several different
> failures of this ilk in the subscription tests by injecting delays at
> the points where a slot's active_pid is about to be cleared, as in the
> attached patch (which also adds some extra printouts for debugging
> purposes; none of that is meant for commit).  It seems clear that there
> is inadequate interlocking going on when we kill and restart a logical
> rep worker: we're trying to start a new one before the old one has
> gotten out of the slot.
> 

Thanks for the test case.

It's not actually that we start the new worker too fast. It's that we
try to drop the slot right after the worker process was killed, but if
the code that clears the slot's active_pid takes too long, the slot
still looks like it's in use. I am quite sure it's possible to make
this happen for physical replication as well when using slots.

This is not something that can be solved by locking on the subscriber.
ISTM we need to make pg_drop_replication_slot behave more nicely, for
example by making it wait for the slot to become available (either by
default or as an option).

I'll have to think about how to do it without rewriting half of
replication slots or reimplementing a lock queue, though, because
replication slots don't use normal catalog access, so there is no object
locking with a wait queue. We could use a latch wait with a small
timeout, but that seems ugly, as that function can be called by a user
without having dropped the slot before, so the wait can be quite long
(as in "forever").
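
To make the trade-off concrete, here is a minimal, self-contained sketch of
the "poll with a small timeout" idea. It is not PostgreSQL code: DemoSlot,
demo_wait_one_tick and demo_drop_slot are hypothetical names, and the
"tick" merely simulates one short latch wait during which the exiting
worker may get around to clearing active_pid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a slot; NOT PostgreSQL's ReplicationSlot. */
typedef struct DemoSlot
{
    int  active_pid;        /* 0 means no process is using the slot */
    int  ticks_until_free;  /* simulated cleanup delay; 0 = holder never lets go */
    bool dropped;
} DemoSlot;

typedef enum { DROP_OK, DROP_BUSY, DROP_TIMEOUT } DropResult;

/* Simulates one short wait: time passes and the old worker may finish cleanup. */
static void
demo_wait_one_tick(DemoSlot *slot)
{
    if (slot->active_pid != 0 && slot->ticks_until_free > 0 &&
        --slot->ticks_until_free == 0)
        slot->active_pid = 0;
}

/*
 * Drop the slot.  With wait = false this mirrors the current behaviour:
 * fail at once if the slot is active.  With wait = true it polls for up
 * to max_ticks short waits.  A negative max_ticks means "no limit", which
 * is exactly the case that can block forever if the holder never exits.
 */
static DropResult
demo_drop_slot(DemoSlot *slot, bool wait, int max_ticks)
{
    int waited = 0;

    assert(slot != NULL);

    while (slot->active_pid != 0)
    {
        if (!wait)
            return DROP_BUSY;
        if (max_ticks >= 0 && waited >= max_ticks)
            return DROP_TIMEOUT;
        demo_wait_one_tick(slot);
        waited++;
    }

    slot->dropped = true;
    return DROP_OK;
}
```

With a slot whose holder clears active_pid after a few ticks, the no-wait
call reports DROP_BUSY (the buildfarm failure above), while the waiting
call succeeds. With a holder that never goes away, the waiting call can
only end through the timeout, which is the ugly part.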

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


