Re: the s_lock_stuck on perform_spin_delay - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: the s_lock_stuck on perform_spin_delay
Msg-id: 20240105192713.nrszmgxk3xgvdauq@awork3.anarazel.de
In response to: Re: the s_lock_stuck on perform_spin_delay (Andy Fan <zhihuifan1213@163.com>)
List: pgsql-hackers
Hi,

On 2024-01-05 10:20:39 +0800, Andy Fan wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2024-01-04 14:59:06 +0800, Andy Fan wrote:
> >> My question is: if someone doesn't obey the rule by mistake (everyone
> >> can make a mistake), shall we PANIC in a production environment? IMO
> >> it could be a WARNING in a production environment and only get stuck
> >> under 'ifdef USE_ASSERT_CHECKING'.
> >>
> >> [...]
> >>
> >> I noticed this issue because of the patch "Cache relation sizes?"
> >> from Thomas Munro [1], where the latest patch [2] still has the
> >> following code:
> >>
> >> +  sr = smgr_alloc_sr();  <-- HERE a spinlock is held
> >> +
> >> +  /* Upgrade to exclusive lock so we can create a mapping. */
> >> +  LWLockAcquire(mapping_lock, LW_EXCLUSIVE);  <-- HERE a complex
> >> operation is needed; it may take a long time.
> >>
> >> Our internal testing system found more misuses in our own PG version.
> >>
> >> I think even an experienced engineer like Thomas can make this
> >> mistake, and the patch was reviewed by 3 people; the bug is still
> >> there. It is not easy to say "just don't do it".
> >
> > I don't follow this argument - just ignoring the problem,
>
> I agree with you, but I feel you ignored my post at [1], where I said
> that known issues should be fixed at the first opportunity.

With "ignoring the problem" I was referencing emitting a WARNING instead
of a crash-restart.

IME stuck spinlocks are caused by issues like: not releasing a spinlock,
possibly due to returning early because of an error or such;
lock-nesting issues leading to deadlocks; acquiring spinlocks or lwlocks
in signal handlers; blocking in signal handlers while holding a spinlock
outside of the signal handlers; and many variations of those. The common
theme of these causes is that they don't resolve after some time. The
only way out of the situation is to crash-restart, either
"automatically" or by operator intervention.
> > which emitting a WARNING basically is, doesn't reduce the impact of
> > the bug, it *increases* the impact, because now the system will not
> > recover from the bug without explicit operator intervention. During
> > that time the system might be essentially unresponsive, because all
> > backends end up contending for some spinlock, which makes
> > investigating such issues very hard.
>
> Actually they are doing pg_usleep most of the time.

Sure - but that changes nothing about the problem. The concern isn't CPU
usage, the concern is that there's often no possible forward progress.

To take a recent-ish production issue I looked at: a buggy signal
handler led to a backend sleeping while holding a spinlock. Soon after,
the entire system got stuck, because other processes also tried to
acquire the spinlock. The person initially investigating the issue
contacted me because they couldn't even log into the system, because
connection establishment also acquired the spinlock (I'm not sure
anymore which spinlock it was, possibly xlog.c's info_lck?).

> > but not PANICing when this happens in production seems completely
> > non-viable.
>
> Not sure what *this* exactly means. If it means the bug in Thomas's
> patch, I absolutely agree with you (since it is a known bug and it
> should be fixed). If it means the general *unknown* case, it's
> something we talked about above.

I mean that I think not PANICing anymore would be a seriously bad idea
and would cause way more problems than the PANIC.

Greetings,

Andres Freund