Hi,

On 2024-04-11 16:11:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2024-04-11 15:24:28 -0400, Robert Haas wrote:
> >> Or, rip out the whole, whole mechanism and just don't PANIC.
>
> > I continue to believe that that'd be a quite bad idea.
>
> I'm warming to it myself.
>
> > My suspicion is that most of the false positives are caused by lots of signals
> > interrupting the pg_usleep()s. Because we measure the number of delays, not
> > the actual time since we've been waiting for the spinlock, signals
> > interrupting pg_usleep() can very significantly shorten the amount of
> > time until we consider a spinlock stuck. We should fix that.
>
> We wouldn't need to fix it, if we simply removed the NUM_DELAYS
> limit. Whatever kicked us off the sleep doesn't matter, we might
> as well go check the spinlock.

I suspect we should fix it regardless of whether we keep NUM_DELAYS. We
shouldn't increase cur_delay faster just because a lot of signals are coming
in. If it were just user-triggered signals it'd probably not be worth
worrying about, but we do sometimes send a lot of signals ourselves...
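
To make the distinction concrete, here is a rough sketch of what "measure
elapsed time, and don't grow the delay on an interrupted sleep" could look
like. This is not the code in s_lock.c: every name below (WaitState,
sleep_once, next_delay, wait_is_stuck, the *_USEC constants) is hypothetical,
the one-minute budget is an assumed value, and the plain doubling is a
simplification of the real backoff.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical constants, chosen only for illustration. */
#define MIN_DELAY_USEC      1000L
#define MAX_DELAY_USEC      1000000L
#define STUCK_TIMEOUT_USEC  INT64_C(60000000)	/* assumed 60s budget */

typedef struct WaitState
{
	int64_t		start_usec;		/* monotonic time when the wait began */
	long		cur_delay_usec;	/* length of the next sleep */
} WaitState;

/* Current monotonic time in microseconds. */
static int64_t
now_usec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

static void
wait_state_init(WaitState *ws)
{
	ws->start_usec = now_usec();
	ws->cur_delay_usec = MIN_DELAY_USEC;
}

/*
 * Sleep once. Returns true if the whole delay elapsed, false if the sleep
 * ended early (nanosleep() returns -1, typically with EINTR on a signal).
 */
static bool
sleep_once(long usec)
{
	struct timespec req = {
		.tv_sec = usec / 1000000,
		.tv_nsec = (usec % 1000000) * 1000
	};

	assert(usec >= 0);
	return nanosleep(&req, NULL) == 0;
}

/* Grow the delay only when the previous sleep actually ran to completion. */
static long
next_delay(long cur_delay_usec, bool slept_fully)
{
	if (!slept_fully)
		return cur_delay_usec;	/* interrupted: signals don't speed backoff */
	cur_delay_usec *= 2;		/* simplified; not the real growth rule */
	return cur_delay_usec > MAX_DELAY_USEC ? MAX_DELAY_USEC : cur_delay_usec;
}

/* "Stuck" is decided by wall-clock time waited, not by a count of sleeps. */
static bool
wait_is_stuck(const WaitState *ws, int64_t now)
{
	return now - ws->start_usec >= STUCK_TIMEOUT_USEC;
}
```

A caller would loop on "try the lock; if that fails, call sleep_once(),
feed its result to next_delay(), then check wait_is_stuck(ws, now_usec())".
With that shape, a burst of signals causes extra lock checks but neither
inflates cur_delay nor shortens the time until the lock is declared stuck.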

> Also, you propose in your other message replacing spinlocks with lwlocks.
> Whatever the other merits of that, I notice that we have no timeout or
> "stuck lwlock" detection.

True. And that's not great. But at least lwlocks can be identified in
pg_stat_activity, which does help some.

Greetings,
Andres Freund