On Thu, Jan 4, 2024 at 6:06 PM Andres Freund <andres@anarazel.de> wrote:
> I think we should add infrastructure to detect bugs like this during
> development, but not PANICing when this happens in production seems completely
> non-viable.

I mean +1 for the infrastructure, but "completely non-viable"? Why?
I've only very rarely seen this PANIC occur, and in the few cases
where I've seen it, it was entirely unclear that the problem was due
to a bug where somebody failed to release a spinlock. It seemed more
likely that the machine was just not really functioning, and the PANIC
was a symptom of processes not getting scheduled rather than a PG bug.
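
(For context, the timeout in question lives in
src/backend/storage/lmgr/s_lock.c. From memory, the logic is roughly
the standalone sketch below -- the constants and the abort() are
illustrative stand-ins for the real values and for elog(PANIC), so
check the actual source before trusting any details.)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Illustrative stand-ins; see s_lock.c for the real values. */
#define SPINS_PER_DELAY 100        /* busy-spins between sleeps */
#define NUM_DELAYS      1000       /* sleeps before giving up */
#define MIN_DELAY_USEC  1000       /* first sleep: 1ms */
#define MAX_DELAY_USEC  1000000    /* longest sleep: 1s */

static void
acquire_spinlock(volatile int *lock)
{
    int     spins = 0;
    int     delays = 0;
    int     cur_delay = 0;

    while (__sync_lock_test_and_set(lock, 1))   /* test-and-set loop */
    {
        if (++spins < SPINS_PER_DELAY)
            continue;              /* keep busy-spinning for a while */

        if (++delays > NUM_DELAYS)
        {
            /* the real code does elog(PANIC, "stuck spinlock ...") */
            fprintf(stderr, "stuck spinlock detected\n");
            abort();
        }

        if (cur_delay == 0)
            cur_delay = MIN_DELAY_USEC;
        usleep(cur_delay);

        cur_delay += cur_delay / 2;     /* escalate the sleep ... */
        if (cur_delay > MAX_DELAY_USEC)
            cur_delay = MIN_DELAY_USEC; /* ... wrapping back around */

        spins = 0;
    }
}

int
main(void)
{
    volatile int lock = 0;

    acquire_spinlock(&lock);   /* uncontended: returns at once */
    /* if the holder never let go, we'd abort() after NUM_DELAYS sleeps */
    return 0;
}
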
And every time I tell a user that they might need to use a debugger
to, say, set VacuumCostActive = false, or to get a backtrace, or for
any other reason, I have to tell them to make sure to detach the
debugger in under 60 seconds, because in the unlikely event that they
attach while the process is holding a spinlock, failing to do so will
take their production system down for no reason. Now, if you're about
to say that people shouldn't need to use a debugger on their
production instance, I entirely agree ... but in the world I inhabit,
that's often the only way to solve a customer problem, and it probably
will continue to be until we have much better ways of getting
backtraces without using a debugger than we do today.
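
Concretely, the drill I have to walk users through is something like
this (12345 is a placeholder PID, and debug symbols are assumed):

$ gdb -p 12345                       # attach to the relevant backend
(gdb) set var VacuumCostActive = 0   # e.g. shut off cost-based vacuum delay
(gdb) bt                             # and/or grab a backtrace
(gdb) detach                         # within 60 seconds, or else!
(gdb) quit
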
Have you seen real cases where this PANIC got a system out of a hang?
If so, did that PANIC trace back to a bug in PostgreSQL? And why
didn't the user just keep hitting the same bug over and over, PANICing
in an endless loop?

I feel like this is one of those things that has just been this way
forever, and we don't question it because it's become an article of
faith that it's something we have to have. But I have a very hard time
explaining why it's even a net positive, let alone the unquestionable
good that you seem to think it is.

--
Robert Haas
EDB: http://www.enterprisedb.com