Hi,
On 2024-04-11 23:15:38 -0400, Tom Lane wrote:
> I wrote:
> > ... But Robert's question remains: how does
> > PANIC'ing after awhile make anything better? I flat out don't
> > believe the idea that having a backend stuck on a spinlock
> > would otherwise go undetected.
>
> Oh, wait. After thinking a bit longer I believe I recall the argument
> for this behavior: it automates recovery from a genuinely stuck
> spinlock. If we waited forever, the only way out of that is for a
> DBA to kill -9 the stuck process, which has exactly the same end
> result as a PANIC, except that it takes a lot longer to put the system
> back in service and perhaps rousts somebody, or several somebodies,
> out of their warm beds to fix it. If you don't have a DBA on-call
> 24x7 then that answer looks even worse.
Precisely. And even if you have a DBA on call 24x7, they need to know that
they need to react to something.
Today you can automate getting notified of crash-restarts by using
  restart_after_crash = false
and restarting somewhere outside of postgres.
Imo that's the only sensible setting for larger production environments,
although I'm sure not everyone agrees with that.
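
To illustrate what "restarting somewhere outside of postgres" could look like, here is a minimal sketch of an external supervisor loop. It is not how any particular deployment does it (a service manager would be the more usual choice). The names supervise, notify, max_restarts and backoff_s are hypothetical, and it assumes the server process exits with a non-zero status once a crash has happened with restart_after_crash = false.

```python
import subprocess
import sys
import time


def supervise(cmd, notify, max_restarts=3, backoff_s=1.0):
    """Run cmd; on every non-zero exit call notify(), then restart it.

    Returns the number of restarts performed. A zero exit status is
    treated as a deliberate shutdown and ends supervision.
    """
    restarts = 0
    while True:
        rc = subprocess.run(cmd).returncode
        if rc == 0:
            return restarts
        # This is the hook where a pager or alerting system would go.
        notify(f"{cmd[0]} exited with status {rc}")
        if restarts >= max_restarts:
            return restarts  # give up; a human needs to look at this
        restarts += 1
        time.sleep(backoff_s)


if __name__ == "__main__":
    # Stand-in command that always fails, instead of a real server start.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    supervise(failing, print, max_restarts=1, backoff_s=0)
```

The point of the design is only that the notification happens in the same place as the restart, so nobody has to notice the crash on their own.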
> So there's that. But that's not an argument that we need to be in a
> hurry to timeout; if the built-in reaction time is less than perhaps
> 10 minutes you're still miles ahead of the manual solution.
The total duration of the current timeout is hard to determine, because the
wait times increase and wrap around, but it's normally longer than 60s, unless
you're interrupted by a lot of signals: 1000 sleeps of between 1000 and
1000000 us each.
I think we should make the timeout something predictable and probably somewhat
longer.
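
To make the "hard to determine" part concrete, here is a rough simulation of the total sleep time. The three constants come from the numbers above. The growth rule (each sleep grows by a random fraction of itself and wraps back to the minimum once it exceeds the maximum) is an assumption made for this sketch, and the time spent spinning between sleeps and sleeps cut short by signals are both ignored.

```python
import random

MIN_DELAY_US = 1_000       # shortest sleep
MAX_DELAY_US = 1_000_000   # longest sleep
NUM_DELAYS = 1_000         # sleeps before the lock is declared stuck


def total_wait_seconds(rng):
    """Sum the sleep time of one simulated run, in seconds."""
    cur = MIN_DELAY_US
    total_us = 0
    for _ in range(NUM_DELAYS):
        total_us += cur
        # Assumed rule: grow by a random fraction between 0% and 100%.
        cur += int(cur * rng.random() + 0.5)
        if cur > MAX_DELAY_US:
            cur = MIN_DELAY_US  # wrap around to the minimum
    return total_us / 1_000_000


def summarize(trials=200, seed=0):
    """Return (min, median, max) total wait in seconds over many runs."""
    rng = random.Random(seed)
    samples = sorted(total_wait_seconds(rng) for _ in range(trials))
    return samples[0], samples[len(samples) // 2], samples[-1]


if __name__ == "__main__":
    lo, med, hi = summarize()
    print(f"min {lo:.1f}s  median {med:.1f}s  max {hi:.1f}s")
```

Whatever the growth rule is, the total can only land between 1s (1000 sleeps of 1000 us) and 1000s (1000 sleeps of 1000000 us). Where it lands inside that range depends on the randomness, which is why the timeout is not predictable.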
> On the third hand, it's still true that we have no comparable
> behavior for any other source of system lockups, and it's difficult
> to make a case that stuck spinlocks really need more concern than
> other kinds of bugs.
Spinlocks are somewhat more finicky though, compared to e.g. lwlocks, which are
released on error. Lwlocks also take care to e.g. hold off interrupts, so code
doesn't just jump out of a section with lwlocks held.
Greetings,
Andres Freund