On 3 October 2017 at 15:30, Sokolov Yura <y.sokolov@postgrespro.ru> wrote:
> If hundreds of backends reach this timeout while trying to acquire an
> advisory lock on the same value, the system stalls hard for many
> seconds, because they all traverse the same huge lock graph under
> exclusive lock.
> During this stall no meaningful operation is possible (no new
> transaction can begin).
Well observed, we clearly need to improve this.
> The attached patch makes CheckDeadlock do two passes:
> - The first pass takes the lock hash partitions with LW_SHARED.
>   DeadLockCheck is called with a "readonly" flag, so it doesn't
>   modify anything.
> - If a possible "soft" or "hard" deadlock is detected, i.e. if the
>   lock graph may need to be modified, the partitions are relocked
>   with LW_EXCLUSIVE and DeadLockCheck is called again.
>
> It fixes the hard stall, because backends walk the lock graph under a
> shared lock and find that there is no real deadlock.
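
To make sure I understand the shape of it, here is a rough sketch of
the two-pass structure as I read the description. The partition-locking
loops mirror what CheckDeadLock() in proc.c does today; the extra
"readonly" argument to DeadLockCheck() is the patch's proposed addition
rather than the current signature, and CheckDeadLockTwoPass() is just a
name for the sketch:

#include "postgres.h"

#include "storage/lock.h"
#include "storage/lwlock.h"
#include "storage/proc.h"

static void
CheckDeadLockTwoPass(void)
{
    DeadLockState state;
    int         i;

    /* Pass 1: shared locks on all partitions, read-only graph walk. */
    for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
        LWLockAcquire(LockHashPartitionLockByIndex(i), LW_SHARED);

    /* second argument is the patch's proposed "readonly" flag */
    state = DeadLockCheck(MyProc, true);

    for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
        LWLockRelease(LockHashPartitionLockByIndex(i));

    if (state == DS_NO_DEADLOCK)
        return;             /* common case: nothing to modify, no stall */

    /* Pass 2: the graph may need rearranging, so relock exclusively. */
    for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
        LWLockAcquire(LockHashPartitionLockByIndex(i), LW_EXCLUSIVE);

    state = DeadLockCheck(MyProc, false);   /* full check, may modify queues */

    /* ... handle DS_SOFT_DEADLOCK / DS_HARD_DEADLOCK as today ... */

    for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
        LWLockRelease(LockHashPartitionLockByIndex(i));
}

The win is that in the common no-deadlock case we never take the
partitions exclusively at all.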
In phase 2, does this relock only the partitions required to reorder
the lock graph, or does it request all locks? Fewer locks would be
better.
If you decide to reorder the lock graph, then only one backend should
attempt this at a time. We should keep track of reorder-requests, so
if two backends arrive at the same conclusion then only one should
proceed to do this.
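
A cheap sketch of that bookkeeping, assuming a flag in shared memory:
the deadlockReorderRequested variable and the Try/End wrappers below are
hypothetical, while pg_atomic_test_set_flag()/pg_atomic_clear_flag() are
the existing primitives from port/atomics.h.

#include "port/atomics.h"

/* hypothetical flag, set up with pg_atomic_init_flag() in shared memory */
static pg_atomic_flag *deadlockReorderRequested;

static bool
TryBeginDeadLockReorder(void)
{
    /* returns true only for the first backend to claim the request */
    return pg_atomic_test_set_flag(deadlockReorderRequested);
}

static void
EndDeadLockReorder(void)
{
    pg_atomic_clear_flag(deadlockReorderRequested);
}

Backends that lose the race could simply go back to sleep on their lock
and retry after another deadlock_timeout, rather than all queueing up
for the exclusive pass at once.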
Many deadlocks happen between locks on the same table. It would also be
a useful optimization to check just one partition for lock graphs
before we check all partitions.
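
Something along these lines, where DeadLockCheckOnePartition() is a
hypothetical routine that walks only the part of the wait graph
reachable inside one partition and reports whether the check could be
completed there; LockTagHashCode(), LockHashPartition() and
LockHashPartitionLock() already exist:

#include "postgres.h"

#include "storage/lock.h"
#include "storage/lwlock.h"
#include "storage/proc.h"

static bool
DeadLockCheckLocalFirst(void)
{
    uint32      hashcode = LockTagHashCode(&MyProc->waitLock->tag);
    LWLock     *partitionLock = LockHashPartitionLock(hashcode);
    bool        done;

    LWLockAcquire(partitionLock, LW_SHARED);
    /* hypothetical: true if the whole graph stayed in this partition
     * and showed no deadlock */
    done = DeadLockCheckOnePartition(MyProc, LockHashPartition(hashcode));
    LWLockRelease(partitionLock);

    return done;    /* if false, fall back to the all-partition check */
}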
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services