From: Andres Freund
Subject: Re: the s_lock_stuck on perform_spin_delay
Msg-id: 20240105191139.djeez5zdr6ehvs73@awork3.anarazel.de
In response to: Re: the s_lock_stuck on perform_spin_delay (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2024-01-05 08:51:53 -0500, Robert Haas wrote:
> On Thu, Jan 4, 2024 at 6:06 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we should add infrastructure to detect bugs like this during
> > development, but not PANICing when this happens in production seems completely
> > non-viable.
> 
> I mean +1 for the infrastructure, but "completely non-viable"? Why?
> 
> I've only very rarely seen this PANIC occur, and in the few cases
> where I've seen it, it was entirely unclear that the problem was due
> to a bug where somebody failed to release a spinlock.

I see it fairly regularly. That includes finding several related bugs that led to
stuck systems last year (signal handlers are a menace).


> It seemed more likely that the machine was just not really functioning, and
> the PANIC was a symptom of processes not getting scheduled rather than a PG
> bug.

If processes don't get scheduled for that long, a crash-restart doesn't seem
that bad anymore :)


> And every time I tell a user that they might need to use a debugger to, say,
> set VacuumCostActive = false, or to get a backtrace, or any other reason, I
> have to tell them to make sure to detach the debugger in under 60 seconds,
> because in the unlikely event that they attach while the process is holding
> a spinlock, failure to detach in under 60 seconds will take their production
> system down for no reason.

Hm - isn't the stuck-lock timeout more like 900s (MAX_DELAY_USEC * NUM_DELAYS
= 1000s, but we start at a lower delay)?  One issue with the code as-is is
that interrupted sleeps count towards the timeout, despite possibly
sleeping for much less than the requested delay. We should probably fix that,
and also report how long the lock was stuck for in s_lock_stuck().


> Now, if you're about to say that people shouldn't need to use a debugger on
> their production instance, I entirely agree ... but in the world I inhabit,
> that's often the only way to solve a customer problem, and it probably will
> continue to be until we have much better ways of getting backtraces without
> using a debugger than is currently the case.
> 
> Have you seen real cases where this PANIC prevents a hangup? If yes,
> that PANIC traced back to a bug in PostgreSQL? And why didn't the user
> just keep hitting the same bug over and PANICing in an endless loop?

Many, as hinted above. Some were bugs in postgres, more were bugs in
extensions. IME these bugs aren't hit commonly, so a crash-restart at least
allows the system to hobble along. The big issue with not crash-restarting is
that the system often ends up inaccessible, which makes it very hard to
investigate the issue.


> I feel like this is one of those things that has just been this way
> forever and we don't question it because it's become an article of
> faith that it's something we have to have. But I have a very hard time
> explaining why it's even a net positive, let alone the unquestionable
> good that you seem to think.

I don't think it's an unquestionable good; I just think the alternative of
endlessly spinning is way worse.

Greetings,

Andres Freund


