Hi,
On 2024-04-11 15:24:28 -0400, Robert Haas wrote:
> On Wed, Apr 10, 2024 at 9:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Maybe we should rip out the whole mechanism and hard-wire
> > spins_per_delay at 1000 or so.
>
> Or, rip out the whole, whole mechanism and just don't PANIC.
I continue to believe that that'd be quite a bad idea.
My suspicion is that most of the false positives are caused by lots of signals
interrupting the pg_usleep()s. Because we measure the number of delays, not
the actual time we've been waiting for the spinlock, signals interrupting
pg_usleep() can very significantly shorten the amount of time until we
consider a spinlock stuck. We should fix that.
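
To illustrate the idea, here is a minimal sketch of basing the stuck check on
elapsed wall-clock time instead of on the number of delays. It is not the
existing spinlock code: WaitClock, wait_clock_start(), wait_clock_is_stuck(),
interruptible_sleep_us() and STUCK_TIMEOUT_NS are hypothetical names, and the
60 second threshold is only an assumed value. With something like this, a
sleep cut short by a signal does not bring the stuck verdict any closer.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical threshold for this sketch, not a PostgreSQL constant. */
#define STUCK_TIMEOUT_NS (60LL * 1000000000LL)

/* Hypothetical state: remembers when the wait for the lock began. */
typedef struct WaitClock
{
	struct timespec start;
} WaitClock;

/* Nanoseconds between two monotonic timestamps. */
static int64_t
elapsed_ns(const struct timespec *start, const struct timespec *now)
{
	return (int64_t) (now->tv_sec - start->tv_sec) * 1000000000LL
		+ (int64_t) (now->tv_nsec - start->tv_nsec);
}

/* Record the start of the wait, once, before the first delay. */
static void
wait_clock_start(WaitClock *wc)
{
	clock_gettime(CLOCK_MONOTONIC, &wc->start);
}

/*
 * Decide "stuck" from the time actually spent waiting. How many sleeps
 * happened, or how many were interrupted, does not enter the decision.
 */
static bool
wait_clock_is_stuck(const WaitClock *wc, int64_t timeout_ns)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return elapsed_ns(&wc->start, &now) >= timeout_ns;
}

/*
 * Sleep for up to 'us' microseconds. Like an interruptible sleep, this may
 * return early if a signal arrives (nanosleep() fails with EINTR); the
 * caller simply re-checks the clock afterwards.
 */
static void
interruptible_sleep_us(long us)
{
	struct timespec req;

	req.tv_sec = us / 1000000L;
	req.tv_nsec = (us % 1000000L) * 1000L;
	(void) nanosleep(&req, NULL);
}
```

A wait loop would call wait_clock_start() once, then alternate between
retrying the lock, interruptible_sleep_us(), and
wait_clock_is_stuck(&wc, STUCK_TIMEOUT_NS). A signal storm then only causes
more loop iterations, not an earlier PANIC.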
> To believe that the PANIC is the right idea, we have to suppose that
> we have stuck-spinlock bugs that people actually hit, but that those
> people don't hit them often enough to care, as long as the system
> resets when the spinlock gets stuck, instead of hanging. I can't
> completely rule out the existence of either such bugs or such people,
> but I'm not aware of having encountered them.
I don't think that's a fair description of the situation. It supposes that the
alternative to the PANIC is that the problem is detected and resolved some
other way. But, depending on the spinlock, the problem will not be detected by
automated checks for the system being up. IME you end up with a system that's
degraded in a complicated, hard-to-understand way, rather than one that's just
down.
Greetings,
Andres Freund