Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
Date
Msg-id CAB7nPqTAHrSKsxu2PwbXwcnQs_PBv-KPxS5roRBbZDsd=AucmA@mail.gmail.com
Whole thread Raw
In response to Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.  (Maxim Boguk <maxim.boguk@gmail.com>)
Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.  (Jeff Janes <jeff.janes@gmail.com>)
List pgsql-bugs
On Sun, Oct 4, 2015 at 4:26 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Oct 1, 2015 at 9:48 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>>
>>
>> On Thu, Oct 1, 2015 at 9:52 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
>> >> So wal replay got AccessExclusiveLock on relation 17987 and waiting for
>> >> something.
>> >> And query waiting for AccessShareLock on the same relation.
>> >>
>> > gdb backtrace from stuck startup process:
>> > (gdb) bt
>> > #0  0x00007f04ad862633 in select () from /lib/x86_64-linux-gnu/libc.so.6
>> > #1  0x00007f04af86488e in pg_usleep (microsec=<optimized out>) at
>> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/port/pgsleep.c:53
>> > #2  0x00007f04af7328ac in WaitExceedsMaxStandbyDelay () at
>> >
>> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:171
>> > #3  ResolveRecoveryConflictWithVirtualXIDs
>> > (reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=0x7f04b13ba2f0) at
>> >
>> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:232
>> > #4  ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x7f04b13ba2f0,
>> > reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at
>> >
>> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:191
>> > #5  0x00007f04af544445 in heap_xlog_clean (record=0x7f04b1395b80,
>> > lsn=107351881751648) at
>> >
>> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/heap/heapam.c:7329
>>
>> This backtrace is not indicating that this process is waiting on a
>> relation lock, it is resolving a recovery conflict while removing tuples,
>> killing the virtual transaction depending on if max_standby_streaming_delay
>> or max_standby_archive_delay are set if the conflict gets longer. Did you
>> change the default of those parameters, which is 30s, to -1? This would mean
>> that the standby waits indefinitely.
>
> While setting it to -1 gives the startup process permission to wait
> indefinitely for another back-end which is doing something, I don't think
> that that means it should have permission to deadlock indefinitely.  A
> deadlock is a different kind of thing that "someone started a transaction
> and then left on vacation for a month without closing it"

Looking at that more closely... We know that a startup process is the
only process in a standby that can take an exclusive lock on a
relation, per LockAcquireExtended in lock.c and that this lock is
taken after replaying XLOG_STANDBY_LOCK. The deadlock reported by
Maxim indicates that the deadlock is showing up in heap_xlog_clean,
which may happen in the case of a VACUUM for example. It seems to me
that we could have a deadlock if some WAL records had their order
swapped. For example, the EXCLUSIVE lock is replayed first and the
cleanup of relation pages happens before the lock has been released,
freezing the other backend using pg_relation_size on a standby waiting
for the access share lock indefinitely. This would indeed prevent
recovery conflict to work properly, because we should not hold an
exclusive lock on the relation while replaying heap_xlog_clean. Hence
it seems to me that we may have an issue with 68a2e52 (XLogInsert
multi-slot), and it has been introduced in 9.4.

We could perhaps think about adding some safeguards in
ResolveRecoveryConflictWithSnapshot.

Maxim, are you using commit_delay? It may be possible that this is
fixed by bab9599 that has been pushed after 9.4.4 was released. Could
you also check if HEAD or the soon-to-be-released 9.4.5 fix your
problem?
Regards,
--
Michael

pgsql-bugs by date:

Previous
From: Dave Page
Date:
Subject: Re: BUG #13662: Base Directory Slashes are FORWARD
Next
From: Maxim Boguk
Date:
Subject: Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.