Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica. - Mailing list pgsql-bugs

From Maxim Boguk
Subject Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
Date
Msg-id CAK-MWwS=MvYN=4VQFYbxoMiqUsnCXpc0JF9ChyP+opvhkcrD6Q@mail.gmail.com
Whole thread Raw
In response to Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-bugs
On Mon, Oct 5, 2015 at 7:15 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

> On Sun, Oct 4, 2015 at 4:26 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > On Thu, Oct 1, 2015 at 9:48 PM, Michael Paquier <
> michael.paquier@gmail.com>
> > wrote:
> >>
> >>
> >>
> >> On Thu, Oct 1, 2015 at 9:52 PM, Maxim Boguk <maxim.boguk@gmail.com>
> wrote:
> >> >> So wal replay got AccessExclusiveLock on relation 17987 and waiting
> for
> >> >> something.
> >> >> And query waiting for AccessShareLock on the same relation.
> >> >>
> >> > gdb backtrace from stuck startup process:
> >> > (gdb) bt
> >> > #0  0x00007f04ad862633 in select () from
> /lib/x86_64-linux-gnu/libc.so.6
> >> > #1  0x00007f04af86488e in pg_usleep (microsec=<optimized out>) at
> >> > /tmp/buildd/postgresql-9.4-9.4.4/build/../src/port/pgsleep.c:53
> >> > #2  0x00007f04af7328ac in WaitExceedsMaxStandbyDelay () at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:171
> >> > #3  ResolveRecoveryConflictWithVirtualXIDs
> >> > (reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT, waitlist=0x7f04b13ba2f0)
> at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:232
> >> > #4  ResolveRecoveryConflictWithVirtualXIDs (waitlist=0x7f04b13ba2f0,
> >> > reason=PROCSIG_RECOVERY_CONFLICT_SNAPSHOT) at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/ipc/standby.c:191
> >> > #5  0x00007f04af544445 in heap_xlog_clean (record=0x7f04b1395b80,
> >> > lsn=107351881751648) at
> >> >
> >> >
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/heap/heapam.c:7329
> >>
> >> This backtrace is not indicating that this process is waiting on a
> >> relation lock, it is resolving a recovery conflict while removing
> tuples,
> >> killing the virtual transaction depending on if
> max_standby_streaming_delay
> >> or max_standby_archive_delay are set if the conflict gets longer. Did
> you
> >> change the default of those parameters, which is 30s, to -1? This would
> mean
> >> that the standby waits indefinitely.
> >
> > While setting it to -1 gives the startup process permission to wait
> > indefinitely for another back-end which is doing something, I don't think
> > that that means it should have permission to deadlock indefinitely.  A
> > deadlock is a different kind of thing that "someone started a transaction
> > and then left on vacation for a month without closing it"
>
> Looking eat that more closely... We know that a startup process is the
> only process in a standby that can take an exclusive lock on a
> relation, per LockAcquireExtended in lock.c and that this lock is
> taken after replaying XLOG_STANDBY_LOCK. The deadlock reported by
> Maxim indicates that the deadlock is showing up in heap_xlog_clean,
> which may happen in the case of a VACUUM for example. It seems to me
> that we could have a deadlock if some WAL records had their order
> swapped. For example, the EXCLUSIVE lock is replayed first and the
> cleanup of relation pages happens before the lock has been released,
> freezing the other backend using pg_relation_size on a standby waiting
> for the access share lock indefinitely. This would indeed prevent
> recovery conflict to work properly, because we should not hold an
> exclusive lock on the relation while replaying heap_xlog_clean. Hence
> it seems to me that we may have an issue with 68a2e52 (XLogInsert
> multi-slot), and it has been introducd in 9.4.
>
> We could perhaps think about adding some safeguards in
> ResolveRecoveryConflictWithSnapshot.
>
> Maxim, are you using commit_delay? It may be possible that this is
> fixed by bab9599 that has been pushed after 9.4.4 was released. Could
> you also check if HEAD or the soon-to-be-released 9.4.5 fix your
> problem?
> Regards,
> --
> Michael
>

Hi Michael,

commit_delay isn't used on master server or replicas.
I would like to check 9.4.5 but I prefer wait for release and not build it
by hands, and from my experience building from git tree could be tricky as
I don't have root access to install additional packages on the server.
However, I will try it in next few days (if I get an permission).


--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/ <http://www.postgresql-consulting.com/>

pgsql-bugs by date:

Previous
From: Michael Paquier
Date:
Subject: Re: BUG #13657: Some kind of undetected deadlock between query and "startup process" on replica.
Next
From: wjarl@hotmail.com
Date:
Subject: BUG #13663: uuid-ossp in multiple schemas same database