Re: the s_lock_stuck on perform_spin_delay - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: the s_lock_stuck on perform_spin_delay |
Date | |
Msg-id | 20240105191139.djeez5zdr6ehvs73@awork3.anarazel.de |
In response to | Re: the s_lock_stuck on perform_spin_delay (Robert Haas <robertmhaas@gmail.com>) |
Responses | Re: the s_lock_stuck on perform_spin_delay |
List | pgsql-hackers |
Hi,

On 2024-01-05 08:51:53 -0500, Robert Haas wrote:
> On Thu, Jan 4, 2024 at 6:06 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we should add infrastructure to detect bugs like this during
> > development, but not PANICing when this happens in production seems completely
> > non-viable.
>
> I mean +1 for the infrastructure, but "completely non-viable"? Why?
>
> I've only very rarely seen this PANIC occur, and in the few cases
> where I've seen it, it was entirely unclear that the problem was due
> to a bug where somebody failed to release a spinlock.

I see it fairly regularly. Including finding several related bugs that led to
stuck systems last year (signal handlers are a menace).

> It seemed more likely that the machine was just not really functioning, and
> the PANIC was a symptom of processes not getting scheduled rather than a PG
> bug.

If processes don't get scheduled for that long a crash-restart doesn't seem
that bad anymore :)

> And every time I tell a user that they might need to use a debugger to, say,
> set VacuumCostActive = false, or to get a backtrace, or any other reason, I
> have to tell them to make sure to detach the debugger in under 60 seconds,
> because in the unlikely event that they attach while the process is holding
> a spinlock, failure to detach in under 60 seconds will take their production
> system down for no reason.

Hm - isn't the stuck lock timeout more like 900s (MAX_DELAY_USEC * NUM_DELAYS =
1000s, but we start at a lower delay)? One issue with the code as-is is that
interrupted sleeps count towards the timeout, despite possibly sleeping
much shorter. We should probably fix that, and also report the time the lock
was stuck for in s_lock_stuck().

> Now, if you're about to say that people shouldn't need to use a debugger on
> their production instance, I entirely agree ... but in the world I inhabit,
> that's often the only way to solve a customer problem, and it probably will
> continue to be until we have much better ways of getting backtraces without
> using a debugger than is currently the case.
>
> Have you seen real cases where this PANIC prevents a hangup? If yes,
> that PANIC traced back to a bug in PostgreSQL? And why didn't the user
> just keep hitting the same bug over and PANICing in an endless loop?

Many, as hinted above. Some bugs in postgres, more bugs in extensions. IME
these bugs aren't hit commonly, so a crash-restart at least allows to hobble
along. The big issue with not crash-restarting is that often the system ends
up inaccessible, which makes it very hard to investigate the issue.

> I feel like this is one of those things that has just been this way
> forever and we don't question it because it's become an article of
> faith that it's something we have to have. But I have a very hard time
> explaining why it's even a net positive, let alone the unquestionable
> good that you seem to think.

I don't think it's an unquestionable good, I just think the alternative of
just endlessly spinning is way worse.

Greetings,

Andres Freund
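
The timeout arithmetic in the reply above (MAX_DELAY_USEC * NUM_DELAYS = 1000s, less the ramp-up from a lower starting delay) can be sketched roughly as follows. This is a minimal illustration and not the PostgreSQL implementation. The constant values are assumptions chosen so that the product matches the 1000s quoted in the message. The growth policy (doubling, then staying at the cap) is a simplification, and the real policy may well differ. The function names are hypothetical.

```python
"""Sketch of a count-based stuck-lock timeout. Not PostgreSQL source code."""

# Assumed values, picked so that NUM_DELAYS * MAX_DELAY_USEC = 1000 s
# as stated in the message. They are not copied from the source tree.
NUM_DELAYS = 1000            # sleeps allowed before the lock is declared stuck
MIN_DELAY_USEC = 1_000       # assumed starting delay (1 ms)
MAX_DELAY_USEC = 1_000_000   # assumed maximum delay (1 s)


def upper_bound_seconds() -> float:
    """Timeout if every one of the NUM_DELAYS sleeps used the maximum delay."""
    return NUM_DELAYS * MAX_DELAY_USEC / 1_000_000


def requested_sleep_seconds(growth: float = 2.0) -> float:
    """Total requested sleep when the delay starts low and grows to the cap.

    Assumption: the delay is multiplied by `growth` after each sleep and
    then held at MAX_DELAY_USEC. This is a simplified stand-in policy.
    """
    total_usec = 0
    delay = MIN_DELAY_USEC
    for _ in range(NUM_DELAYS):
        total_usec += delay
        delay = min(int(delay * growth), MAX_DELAY_USEC)
    return total_usec / 1_000_000


def slept_before_stuck(actual_fraction: float) -> float:
    """Seconds really slept before a count-based check gives up.

    Models the issue raised in the message: if each sleep is interrupted
    after `actual_fraction` of its requested length, the sleep still counts
    as one of the NUM_DELAYS attempts, so the timeout fires early.
    """
    return requested_sleep_seconds() * actual_fraction


if __name__ == "__main__":
    print(f"upper bound:          {upper_bound_seconds():.3f} s")
    print(f"with ramp-up:         {requested_sleep_seconds():.3f} s")
    print(f"sleeps cut to 10%:    {slept_before_stuck(0.1):.3f} s")
```

Under these assumptions the ramp-up removes only about 9 seconds from the 1000s bound, whereas sleeps that are consistently cut short shrink the effective timeout in proportion. Tracking elapsed wall-clock time instead of counting sleeps would avoid that, and would also give s_lock_stuck() a duration to report.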