Re: Issue with the PRNG used by Postgres - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Issue with the PRNG used by Postgres
Date
Msg-id 20240410190821.yhquanxyhpqtkett@awork3.anarazel.de
In response to Re: Issue with the PRNG used by Postgres  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Issue with the PRNG used by Postgres
List pgsql-hackers
Hi,

On 2024-04-10 14:02:20 -0400, Tom Lane wrote:
> On third thought ... while I still think this is a misuse of
> perform_spin_delay and we should change it, I'm not sure it'll do
> anything to address Parag's problem, because IIUC he's seeing actual
> "stuck spinlock" reports.  That implies that the inner loop of
> LWLockWaitListLock slept NUM_DELAYS times without ever seeing
> LW_FLAG_LOCKED clear.  What I'm suggesting would change the triggering
> condition to "NUM_DELAYS sleeps without acquiring the lock", which is
> strictly more likely to happen, so it's not going to help him.  It's
> certainly still well out in we-shouldn't-get-there territory, though.

I think it could exacerbate the issue. Parag reported ~7k connections on a
128 core machine. The buffer replacement logic in versions < 16 tries to lock
the old and new lock partitions at once. That can lead to quite bad "chains" of
dependent lwlocks, occasionally putting all the pressure on a single lwlock.
With 7k waiters on a single spinlock, a higher frequency of wakeups will make it
much more likely that the process holding the spinlock will be put to sleep.

This is greatly exacerbated by the issue fixed in a4adc31f690: once the
wait queue is long, the spinlock will be held for an extended amount of time.
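
To make the two triggering conditions from the quoted text concrete, here is a
toy model. It is not PostgreSQL code: Attempt, Policy and would_report_stuck()
are made-up names, and the real logic lives in LWLockWaitListLock() and
perform_spin_delay(). Each array entry stands for one sleep followed by one
look at the flag. RESET_ON_SEEN_FREE models "NUM_DELAYS sleeps without ever
seeing LW_FLAG_LOCKED clear", RESET_ON_ACQUIRE models "NUM_DELAYS sleeps
without acquiring the lock".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One sleep-then-look step of a waiter (hypothetical model type). */
typedef struct
{
    bool seen_free;   /* the lock flag was observed clear after the sleep */
    bool acquired;    /* our following atomic attempt actually won the lock */
} Attempt;

typedef enum
{
    RESET_ON_SEEN_FREE,   /* delay count restarts whenever the flag is seen clear */
    RESET_ON_ACQUIRE      /* delay count only ends when the lock is acquired */
} Policy;

/*
 * Returns true if the modeled waiter would hit the "stuck" threshold
 * (max_delays sleeps in a row under the given policy) before it acquires
 * the lock or runs out of recorded attempts.
 */
bool
would_report_stuck(const Attempt *attempts, size_t n, int max_delays, Policy policy)
{
    int delays = 0;

    assert(max_delays > 0);

    for (size_t i = 0; i < n; i++)
    {
        delays++;               /* one sleep precedes every observation */

        if (attempts[i].seen_free)
        {
            if (attempts[i].acquired)
                return false;   /* got the lock, nothing to report */
            if (policy == RESET_ON_SEEN_FREE)
                delays = 0;     /* lost the race, but the counter restarts */
        }

        if (delays >= max_delays)
            return true;
    }
    return false;
}
```

Feed it a waiter that sees the flag clear every third attempt but always loses
the race to another backend: with max_delays = 5 it is never reported stuck
under RESET_ON_SEEN_FREE, but is reported stuck on the fifth sleep under
RESET_ON_ACQUIRE. That is the "strictly more likely" part; with thousands of
waiters, losing the race repeatedly is the expected case.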

Greetings,

Andres Freund


