Undetected deadlock with pg_dump on hot_standby server - Mailing list pgsql-bugs

From Oliver Seemann
Subject Undetected deadlock with pg_dump on hot_standby server
Date
Msg-id CANCipfoOA=P0zoykZY-XCLVvpKONm+aNVCOmSY96SE6SyhwecA@mail.gmail.com
Whole thread Raw
List pgsql-bugs
Hi!

We observe the following unexpected behaviour:
We run pg_dump on the hot_standby slave every two hours and once in 2-3 weeks it
just hangs forever and the slave stops replaying logs.

Version: 9.3.11
OS: Debian GNU/Linux 7.9 (wheezy)
GUC hot_standby_feedback is off.

(See also attached pg_stat_activity_and_pg_locks.txt)
Looking at pg_stat_activity on the slave we see pg_dump trying to acquire a
lock, with state=active, waiting=t.

Looking at pg_locks we see an entry for pg_dump with granted=f on relation
31353 and lots of other entries for the pg_dump pid (570) on other relations
with granted=t. An AccessExclusiveLock on that relation 31353 is held by
another pid (2033) with virtualtransaction 1/0.

Pid 2033 is the postmaster process. strace says it's waiting in a loop:
  select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)

GDB Backtrace:

#0  0x00007f99d355edd3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f99d57a6c4e in pg_usleep ()
#2  0x00007f99d56836a0 in ?? ()
#3  0x00007f99d5683b43 in StandbyAcquireAccessExclusiveLock ()
#4  0x00007f99d5683f6c in standby_redo ()
#5  0x00007f99d54ed01d in StartupXLOG ()
#6  0x00007f99d565454f in StartupProcessMain ()
#7  0x00007f99d54f8d0d in AuxiliaryProcessMain ()
#8  0x00007f99d56512f3 in ?? ()
#9  0x00007f99d56537f7 in PostmasterMain ()
#10 0x00007f99d54886d0 in main ()

Looking at the disassembly and the registers it appears that
StandbyAcquireAccessExclusiveLock() is called with arguments dbOid=30850,
relOid=31262. Looking again at pg_locks, a lock on 31262 is already held by
pg_dump!

So it looks like a deadlock between pg_dump and standby_redo() in the
postmaster that is not detected, making both wait indefinitely. We
would expect the pg_dump session to die when a conflict is
encountered, but the fact that it and the postmaster just hang forever
makes it look like a bug.

We do not call pg_xlog_replay_pause()/resume() around the pg_dump call, we'll
try that now as a workaround.


Oliver

Attachment

pgsql-bugs by date:

Previous
From: Michael Paquier
Date:
Subject: Re: BUG #13755: pgwin32_is_service not checking if SECURITY_SERVICE_SID is disabled
Next
From: Nick Cleaton
Date:
Subject: Re: streaming replication master can fail to shut down