undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Msg-id: d291bb50-12c4-e8af-2af2-7bb9bb4d8e3e@enterprisedb.com
Responses: Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
List: pgsql-hackers
Hi,

I decided to do some stress-testing of the built-in logical replication, as part of the sequence decoding work, and I soon ran into an undetected deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(

The attached bash script triggers it in a couple of seconds for me. The script looks complicated, but most of the code is waiting for the sync to complete, catchup, and that sort of thing. What the script does is pretty simple:

1) initialize two clusters and set them up as a publisher/subscriber pair

2) create some number of tables, add them to the publication, and wait for the sync to complete

3) start two pgbench runs in the background, modifying the publication (one removes+adds all tables in a single transaction, the other does that with one transaction per table)

4) run refresh.sh, which does ALTER SUBSCRIPTION ... REFRESH PUBLICATION in a loop (now that I think about it, this could be another pgbench script, but well ...)

5) run some consistency checks, but the lockup happens earlier, so this doesn't really matter

After a small number of refresh cycles (for me it's usually a couple dozen), we end up with a couple of stuck locks (I shortened the backend type strings a bit, for formatting reasons):

test=# select a.pid, classid, objid, backend_type, query
         from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
        where not granted;

   pid   | classid | objid |   backend_type   |                   query
---------+---------+-------+------------------+------------------------------------------
 2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s REFRESH PUBLICATION
 2691837 |    6100 | 16785 | tablesync worker |
 2691936 |    6100 | 16785 | tablesync worker |
(3 rows)

All these backends are waiting on 6100/16785, which is the subscription's row in pg_subscription. The tablesync workers are requesting AccessShareLock, but the client backend asks for AccessExclusiveLock. The entry is currently locked by:

test=# select a.pid, mode, backend_type
         from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
        where classid = 6100 and objid = 16785 and granted;

   pid   |      mode       |           backend_type
---------+-----------------+----------------------------------
 2690477 | AccessShareLock | logical replication apply worker
(1 row)

But the apply worker is not waiting for any locks, so what's going on?

Well, the problem is that the apply worker is waiting for a notification from the tablesync workers that the relation has been synced, which happens by updating the corresponding pg_subscription_rel row. That wait happens in wait_for_relation_state_change, which simply checks the row in a loop, sleeping via WaitLatch() in between. Unfortunately, the tablesync workers can't update the row, because the client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION sneaked in ahead of them and is waiting for an AccessExclusiveLock, so the tablesync workers' lock requests are stuck in the queue behind it.

So the client backend can't proceed, because it's waiting for a lock held by the apply worker. The tablesync workers can't proceed, because their lock requests are queued behind the AccessExclusiveLock request. And the apply worker can't proceed, because it's waiting for a status update from the tablesync workers. And the deadlock goes undetected, because the apply worker is not waiting on a lock but sleeping on a latch :-(

I don't know what the right solution is here. I wonder if the apply worker could release the lock before waiting for the update; that would solve this whole issue. Alternatively, ALTER SUBSCRIPTION ... REFRESH PUBLICATION might wait for the lock only for a limited amount of time and then try again (but then it'd be susceptible to starvation, of course).
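For illustration, that second approach can be approximated from the client side even today, without any server changes, by running the REFRESH under a lock_timeout and retrying when it gets cancelled. A rough sketch, assuming the subscription "s" and database "test" from the output above (the timeout values and connection details are just placeholders):

  # Retry the REFRESH whenever lock_timeout cancels it because the
  # AccessExclusiveLock on the pg_subscription row couldn't be acquired
  # in time.  This is a workaround sketch, not a fix.
  until PGOPTIONS="-c lock_timeout=1s" \
        psql test -c "ALTER SUBSCRIPTION s REFRESH PUBLICATION"
  do
      # The cancelled AccessExclusiveLock request leaves the wait queue,
      # so the tablesync workers' AccessShareLock requests can be granted
      # and the sync can finish; back off briefly and try again.
      sleep 2
  done

When the statement is cancelled, its lock request is removed from the wait queue, so the tablesync workers should be able to get their AccessShareLock, update pg_subscription_rel, and let the apply worker move on; a later retry of the REFRESH then succeeds. Of course this merely papers over the hang from the client side, and it has the same starvation risk as doing the bounded wait in the server.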
Or maybe there's a way to make this wait visible to the deadlock detector? That'd mean we occasionally get processes killed to resolve a deadlock, but that's still better than processes stuck indefinitely ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company