Hi,
On 2018-04-13 19:13:07 +0300, Konstantin Knizhnik wrote:
> On 13.04.2018 18:41, Andres Freund wrote:
> > On 2018-04-13 16:43:09 +0300, Konstantin Knizhnik wrote:
> > > Updated patch is attached.
> > > +    /*
> > > +     * Ensure that only one backend is checking for deadlock.
> > > +     * Otherwise under high load cascade of deadlock timeout expirations can cause stuck of Postgres.
> > > +     */
> > > +    if (!pg_atomic_test_set_flag(&ProcGlobal->activeDeadlockCheck))
> > > +    {
> > > +        enable_timeout_after(DEADLOCK_TIMEOUT, DeadlockTimeout);
> > > +        return;
> > > +    }
> > > +    inside_deadlock_check = true;
> > I can't see that ever being accepted. This means there's absolutely no
> > bound for deadlock checks happening even under light concurrency, even
> > if there's no contention for a large fraction of the time.
>
> It may cause problems only if
> 1. There is a large number of active sessions
> 2. They perform deadlock-prone queries (so no attempts to avoid deadlocks at
> the application level)
> 3. The deadlock timeout is set to be very small (10 msec?)
That's just not true.
> Otherwise either the probability that all backends are repeatedly
> trying to check for deadlocks concurrently is very small (and can be reduced
> even further by using a random timeout for subsequent deadlock checks), or
> the system cannot function normally in any case because a large number of
> clients fall into deadlock.
Operating systems batch wakeups.
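To illustrate why that matters, here is a minimal standalone sketch, not
PostgreSQL code. It assumes, for the sake of argument, that all pending
deadlock timers fire in the same batch and that every backend tests the flag
while the winner is still running its check. The names
simulate_batched_wakeups, active_deadlock_check and MAX_BACKENDS are
hypothetical; a C11 atomic_flag stands in for ProcGlobal->activeDeadlockCheck.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_BACKENDS 64

/* Hypothetical stand-in for ProcGlobal->activeDeadlockCheck. */
static atomic_flag active_deadlock_check = ATOMIC_FLAG_INIT;

/*
 * Simulate nbackends lock waiters under the patch's logic.
 *
 * Assumption: in every round all pending timers fire together, so only the
 * first backend to test the flag wins; the rest re-arm their timer for one
 * more full timeout period and collide again in the next round.
 *
 * rounds_waited[i] receives the number of timeout periods backend i waited
 * before it could run its deadlock check. Returns the total number of rounds.
 */
static int
simulate_batched_wakeups(int nbackends, int rounds_waited[])
{
    bool done[MAX_BACKENDS] = {false};
    int  remaining = nbackends;
    int  round = 0;

    assert(nbackends > 0 && nbackends <= MAX_BACKENDS);

    while (remaining > 0)
    {
        round++;

        for (int i = 0; i < nbackends; i++)
        {
            if (done[i])
                continue;

            /*
             * atomic_flag_test_and_set returns the previous value, so "true"
             * here means another backend already holds the flag. This
             * corresponds to the !pg_atomic_test_set_flag(...) branch in the
             * quoted patch: re-arm the timer and return.
             */
            if (atomic_flag_test_and_set(&active_deadlock_check))
                continue;

            /* This backend won the flag and runs its deadlock check. */
            done[i] = true;
            rounds_waited[i] = round;
            remaining--;
        }

        /* The winner finishes and releases the flag before the next batch. */
        atomic_flag_clear(&active_deadlock_check);
    }

    return round;
}
```

Under these assumptions the last of N waiters only gets to check after N
timeout periods, so the delay before a deadlock is detected grows with the
number of waiters rather than being bounded by deadlock_timeout.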
> I completely agree that there are plenty of different approaches, but IMHO
> the currently used strategy is the worst one, because it can stall the system
> even if there are no deadlocks at all.
> I always think that a deadlock is a programmer's error rather than a normal
> situation. Maybe that is a wrong assumption
It is.
> So before implementing some complicated solution to the problem (too slow
> deadlock detection), I think that first it is necessary to understand
> whether there is such a problem at all and under which workload it can happen.
Sure. I'm not saying that you shouldn't experiment with a patch like the
one you sent. What I am saying is that that can't be the actual solution
that will be integrated.
Greetings,
Andres Freund