Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. - Mailing list pgsql-bugs

From Jeff Janes
Subject Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
Date
Msg-id CAMkU=1zxgAf846Lrjk9z1iBAP3MrwHvr9p-XLvC1-N9_uNoKfw@mail.gmail.com
In response to Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-bugs
On Mon, Oct 5, 2015 at 1:15 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

> On Sun, Oct 4, 2015 at 4:26 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > On Thu, Oct 1, 2015 at 9:48 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> >>
> >>
> >>
> >> On Thu, Oct 1, 2015 at 9:52 PM, Maxim Boguk <maxim.boguk@gmail.com> wrote:
> >> >> So wal replay got AccessExclusiveLock on relation 17987 and waiting for
> >> >> something.
> >> >> And query waiting for AccessShareLock on the same relation.
> >> >>
> >> > gdb backtrace from stuck startup process:
> >> > (gdb) bt
> >> > #0  0x00007f04ad862633 in select () from /lib/x86_64-linux-gnu/libc.so.6
> >> > #1  0x00007f04af86488e in pg_usleep (microsec=<optimized out>) at
> >> >     /tmp/buildd/postgresql-9.4-9.4.4/build/../src/port/pgsleep.c:53
> >> > #2  0x00007f04af7328ac in WaitExceedsMaxStandbyDelay () at
> >> >     /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:171
> >> > #3  ResolveRecoveryConflictWithVirtualXIDs (reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=0x7f04b13ba2f0) at
> >> >     /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:232
> >> > #4  ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x7f04b13ba2f0, reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at
> >> >     /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:191
> >> > #5  0x00007f04af544445 in heap_xlog_clean (record=0x7f04b1395b80, lsn=107351881751648) at
> >> >     /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/heap/heapam.c:7329
> >>
> >> This backtrace does not indicate that this process is waiting on a
> >> relation lock; it is resolving a recovery conflict while removing
> >> tuples, and it will kill the conflicting virtual transaction once the
> >> conflict lasts longer than max_standby_streaming_delay or
> >> max_standby_archive_delay. Did you change the default of those
> >> parameters, which is 30s, to -1? That would mean that the standby
> >> waits indefinitely.
> >
> > While setting it to -1 gives the startup process permission to wait
> > indefinitely for another back-end which is doing something, I don't think
> > that means it should have permission to deadlock indefinitely.  A
> > deadlock is a different kind of thing than "someone started a transaction
> > and then left on vacation for a month without closing it".
>
> Looking at that more closely... We know that the startup process is the
> only process in a standby that can take an exclusive lock on a
> relation, per LockAcquireExtended in lock.c, and that this lock is
> taken when replaying XLOG_STANDBY_LOCK. The backtrace reported by
> Maxim indicates that the deadlock shows up in heap_xlog_clean,
> which may happen in the case of a VACUUM for example. It seems to me
> that we could have a deadlock if some WAL records had their order
> swapped. For example, the exclusive lock is replayed first and the
> cleanup of relation pages happens before the lock has been released,
> leaving the other backend, which is calling pg_relation_size on the
> standby, waiting indefinitely for the access share lock. This would
> indeed prevent recovery conflict resolution from working properly,
> because we should not hold an exclusive lock on the relation while
> replaying heap_xlog_clean. Hence it seems to me that we may have an
> issue with 68a2e52 (XLogInsert multi-slot), which was introduced in 9.4.
>
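
For reference, the lock state Maxim describes should be directly visible in
pg_locks on the standby.  Something along these lines (using the relation
oid 17987 from his report) ought to show the startup process holding the
AccessExclusiveLock while the backend's AccessShareLock request stays
ungranted:

    SELECT l.pid, l.mode, l.granted, c.relname
      FROM pg_locks l
      LEFT JOIN pg_class c ON c.oid = l.relation
     WHERE l.relation = 17987;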

I was assuming that the table being AccessExclusive locked and the table
being VACUUMed were two different tables, locked and vacuumed by two
different processes on the master, which means that no re-ordering of WAL
records would be necessary to hit this.  That said, I haven't yet been able
to reproduce the deadlock under this assumption.
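
The sort of sequence I had in mind is roughly the following (the table and
column names, the DELETE, and max_standby_streaming_delay = -1 on the
standby are my assumptions for illustration, not a confirmed reproducer):

    -- master, session 1: take the exclusive lock and leave the transaction open
    BEGIN;
    LOCK TABLE a IN ACCESS EXCLUSIVE MODE;

    -- master, session 2: generate heap-cleanup WAL for a different table
    DELETE FROM b WHERE id < 1000;
    VACUUM b;

    -- standby, one backend: take a snapshot the cleanup conflicts with, then
    -- request an AccessShareLock on the table the startup process has locked
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT count(*) FROM b;
    SELECT pg_relation_size('a');  -- should block behind the replayed AccessExclusiveLock

If the lock record for "a" is replayed before the cleanup records for "b",
the startup process would wait on the backend's snapshot while the backend
waits on the startup process's lock, which is the cycle described above.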

Cheers,

Jeff
