Re: the s_lock_stuck on perform_spin_delay - Mailing list pgsql-hackers
From | Andy Fan |
---|---|
Subject | Re: the s_lock_stuck on perform_spin_delay |
Date | |
Msg-id | 87wmsoo56u.fsf@163.com Whole thread Raw |
In response to | Re: the s_lock_stuck on perform_spin_delay (Andres Freund <andres@anarazel.de>) |
Responses |
Re: the s_lock_stuck on perform_spin_delay
|
List | pgsql-hackers |
Hi, Andres Freund <andres@anarazel.de> writes: > > On 2024-01-04 14:59:06 +0800, Andy Fan wrote: >> My question is if someone doesn't obey the rule by mistake (everyone >> can make mistake), shall we PANIC on a production environment? IMO I >> think it can be a WARNING on a production environment and be a stuck >> when 'ifdef USE_ASSERT_CHECKING'. >> >> [...] >> >> I notice this issue actually because of the patch "Cache relation >> sizes?" from Thomas Munro [1], where the latest patch[2] still have the >> following code. >> + sr = smgr_alloc_sr(); <-- HERE a spin lock is hold >> + >> + /* Upgrade to exclusive lock so we can create a mapping. */ >> + LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex >> operation is needed. it may take a long time. >> >> Our internal testing system found more misuses on our own PG version. > >> I think a experienced engineer like Thomas can make this mistake and the >> patch was reviewed by 3 peoples, the bug is still there. It is not easy >> to say just don't do it. > > I don't follow this argument - just ignoring the problem, I agree with you but I'm feeling you ignored my post at [1], where I said for the known issue, it should be fixed at the first chance. > which emitting a > WARNING basically is, doesn't reduce the impact of the bug, it *increases* the > impact, because now the system will not recover from the bug without explicit > operator intervention. During that time the system might be essentially > unresponsive, because all backends end up contending for some spinlock, which > makes investigating such issues very hard. Acutally they are doing pg_usleep at the most time. Besides what Robert said, one more reason to question PANIC is that: PAINC can't always make the system recovery faster because: a). In the most system, PANIC makes a core dump which take times and spaces. b). After the reboot, all the caches like relcache, plancache, fdcache need to be rebuit. c). Customer needs to handle failure better or else they will be hurt *more often*. All of such sense cause slowness as well. > > I think we should add infrastructure to detect bugs like this during > development, The commit 2 in [1] does something like this. for the details, I missed the check for memory allocation case as you suggested at [2], but checked heavyweight lock as well. others should be same IIUC. > but not PANICing when this happens in production seems completely > non-viable. > Not sure what does *this* exactly means. If it means the bug in Thomas's patch, I absoluately agree with you(since it is a known bug and it should be fixed). If it means the general *unknown* case, it's something we talked above. I'm also agree that some LLVM static checker should be pretty good ideally, it just requires more knowledge base and future maintain effort. I am willing to have a try shortly. [1] https://www.postgresql.org/message-id/871qaxp3ly.fsf%40163.com [2] https://www.postgresql.org/message-id/20240104225403.dgmbbfffmm3srpgq%40awork3.anarazel.de -- Best Regards Andy Fan
pgsql-hackers by date: