Hi!
We observe the following unexpected behaviour:
We run pg_dump on the hot_standby slave every two hours and once in 2-3 weeks it
just hangs forever and the slave stops replaying logs.
Version: 9.3.11
OS: Debian GNU/Linux 7.9 (wheezy)
GUC hot_standby_feedback is off.
(See also attached pg_stat_activity_and_pg_locks.txt)
Looking at pg_stat_activity on the slave we see pg_dump trying to acquire a
lock, with state=active, waiting=t.
Looking at pg_locks we see an entry for pg_dump with granted=f on relation
31353 and lots of other entries for the pg_dump pid (570) on other relations
with granted=t. An AccessExclusiveLock on that relation 31353 is held by
another pid (2033) with virtualtransaction 1/0.
Pid 2033 is the postmaster process. strace says it's waiting in a loop:
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
GDB Backtrace:
#0 0x00007f99d355edd3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f99d57a6c4e in pg_usleep ()
#2 0x00007f99d56836a0 in ?? ()
#3 0x00007f99d5683b43 in StandbyAcquireAccessExclusiveLock ()
#4 0x00007f99d5683f6c in standby_redo ()
#5 0x00007f99d54ed01d in StartupXLOG ()
#6 0x00007f99d565454f in StartupProcessMain ()
#7 0x00007f99d54f8d0d in AuxiliaryProcessMain ()
#8 0x00007f99d56512f3 in ?? ()
#9 0x00007f99d56537f7 in PostmasterMain ()
#10 0x00007f99d54886d0 in main ()
Looking at the disassembly and the registers it appears that
StandbyAcquireAccessExclusiveLock() is called with arguments dbOid=30850,
relOid=31262. Looking again at pg_locks, a lock on 31262 is already held by
pg_dump!
So it looks like a deadlock between pg_dump and standby_redo() in the
postmaster that is not detected, making both wait indefinitely. We
would expect the pg_dump session to die when a conflict is
encountered, but the fact that it and the postmaster just hang forever
makes it look like a bug.
We do not call pg_xlog_replay_pause()/resume() around the pg_dump call, we'll
try that now as a workaround.
Oliver