Re: the s_lock_stuck on perform_spin_delay - Mailing list pgsql-hackers

From Andres Freund
Subject Re: the s_lock_stuck on perform_spin_delay
Date
Msg-id 20240104230628.j2wqqck3wrgol6ym@awork3.anarazel.de
Whole thread Raw
In response to the s_lock_stuck on perform_spin_delay  (Andy Fan <zhihuifan1213@163.com>)
Responses Re: the s_lock_stuck on perform_spin_delay
Re: the s_lock_stuck on perform_spin_delay
List pgsql-hackers
Hi,

On 2024-01-04 14:59:06 +0800, Andy Fan wrote:
> My question is if someone doesn't obey the rule by mistake (everyone
> can make mistake), shall we PANIC on a production environment? IMO I
> think it can be a WARNING on a production environment and be a stuck
> when 'ifdef USE_ASSERT_CHECKING'.
>
> [...]
>
> I notice this issue actually because of the patch "Cache relation
> sizes?" from Thomas Munro [1], where the latest patch[2] still have the
> following code.
> +        sr = smgr_alloc_sr();  <-- HERE a spin lock is hold
> +
> +        /* Upgrade to exclusive lock so we can create a mapping. */
> +        LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
>   operation is needed. it may take a long time.
>
> Our internal testing system found more misuses on our own PG version.

> I think a experienced engineer like Thomas can make this mistake and the
> patch was reviewed by 3 peoples, the bug is still there. It is not easy
> to say just don't do it.

I don't follow this argument - just ignoring the problem, which emitting a
WARNING basically is, doesn't reduce the impact of the bug, it *increases* the
impact, because now the system will not recover from the bug without explicit
operator intervention.  During that time the system might be essentially
unresponsive, because all backends end up contending for some spinlock, which
makes investigating such issues very hard.


I think we should add infrastructure to detect bugs like this during
development, but not PANICing when this happens in production seems completely
non-viable.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Melanie Plageman
Date:
Subject: Re: Emit fewer vacuum records by reaping removable tuples during pruning
Next
From: Michael Paquier
Date:
Subject: Re: Adding facility for injection points (or probe points?) for more advanced tests