Re: the s_lock_stuck on perform_spin_delay - Mailing list pgsql-hackers

From Andy Fan
Subject Re: the s_lock_stuck on perform_spin_delay
Date
Msg-id 87wmsoo56u.fsf@163.com
Whole thread Raw
In response to Re: the s_lock_stuck on perform_spin_delay  (Andres Freund <andres@anarazel.de>)
Responses Re: the s_lock_stuck on perform_spin_delay
List pgsql-hackers
Hi,

Andres Freund <andres@anarazel.de> writes:

>
> On 2024-01-04 14:59:06 +0800, Andy Fan wrote:
>> My question is if someone doesn't obey the rule by mistake (everyone
>> can make mistake), shall we PANIC on a production environment? IMO I
>> think it can be a WARNING on a production environment and be a stuck
>> when 'ifdef USE_ASSERT_CHECKING'.
>>
>> [...]
>>
>> I notice this issue actually because of the patch "Cache relation
>> sizes?" from Thomas Munro [1], where the latest patch[2] still have the
>> following code.
>> +        sr = smgr_alloc_sr();  <-- HERE a spin lock is hold
>> +
>> +        /* Upgrade to exclusive lock so we can create a mapping. */
>> +        LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
>>   operation is needed. it may take a long time.
>>
>> Our internal testing system found more misuses on our own PG version.
>
>> I think a experienced engineer like Thomas can make this mistake and the
>> patch was reviewed by 3 peoples, the bug is still there. It is not easy
>> to say just don't do it.
>
> I don't follow this argument - just ignoring the problem,

I agree with you but I'm feeling you ignored my post at [1], where I
said for the known issue, it should be fixed at the first chance.

> which emitting a
> WARNING basically is, doesn't reduce the impact of the bug, it *increases* the
> impact, because now the system will not recover from the bug without explicit
> operator intervention.  During that time the system might be essentially
> unresponsive, because all backends end up contending for some spinlock, which
> makes investigating such issues very hard.

Acutally they are doing pg_usleep at the most time.

Besides what Robert said, one more reason to question PANIC is that: PAINC
can't always make the system recovery faster because: a). In the most
system, PANIC makes a core dump which take times and spaces. b). After
the reboot, all the caches like relcache, plancache, fdcache need to be
rebuit. c). Customer needs to handle failure better or else they will be
hurt *more often*.  All of such sense cause slowness as well.

>
> I think we should add infrastructure to detect bugs like this during
> development,

The commit 2 in [1] does something like this. for the details, I missed the
check for memory allocation case as you suggested at [2], but checked
heavyweight lock as well. others should be same IIUC.

> but not PANICing when this happens in production seems completely
> non-viable.
>

Not sure what does *this* exactly means. If it means the bug in Thomas's 
patch, I absoluately agree with you(since it is a known bug and it
should be fixed). If it means the general *unknown* case, it's something
we talked above.

I'm also agree that some LLVM static checker should be pretty good
ideally, it just requires more knowledge base and future maintain
effort. I am willing to have a try shortly. 

[1] https://www.postgresql.org/message-id/871qaxp3ly.fsf%40163.com
[2]
https://www.postgresql.org/message-id/20240104225403.dgmbbfffmm3srpgq%40awork3.anarazel.de

-- 
Best Regards
Andy Fan




pgsql-hackers by date:

Previous
From: John Naylor
Date:
Subject: Re: add AVX2 support to simd.h
Next
From: Andres Freund
Date:
Subject: Re: the s_lock_stuck on perform_spin_delay