Re: Issue with the PRNG used by Postgres - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Issue with the PRNG used by Postgres
Msg-id 4090821.1712772140@sss.pgh.pa.us
In response to Re: Issue with the PRNG used by Postgres  (Andres Freund <andres@anarazel.de>)
Responses Re: Issue with the PRNG used by Postgres
List pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
> On 2024-04-10 13:03:05 -0400, Tom Lane wrote:
>> So I think we need something like the attached.

> LGTM.

On third thought ... while I still think this is a misuse of
perform_spin_delay and we should change it, I'm not sure it'll do
anything to address Parag's problem, because IIUC he's seeing actual
"stuck spinlock" reports.  That implies that the inner loop of
LWLockWaitListLock slept NUM_DELAYS times without ever seeing
LW_FLAG_LOCKED clear.  What I'm suggesting would change the triggering
condition to "NUM_DELAYS sleeps without acquiring the lock", which is
strictly more likely to happen, so it's not going to help him.  It's
certainly still well out in we-shouldn't-get-there territory, though.

Also, fooling around with the cur_delay adjustment doesn't affect
this at all: "stuck spinlock" is still going to be raised after
NUM_DELAYS failures to observe the lock clear or obtain the lock.
Increasing cur_delay won't change that, it'll just spread the
fixed number of attempts over a longer period; and there's no
reason to believe that does anything except make it take longer
to fail.  Per the header comment for s_lock.c:

 * We time out and declare error after NUM_DELAYS delays (thus, exactly
 * that many tries).  With the given settings, this will usually take 2 or
 * so minutes.  It seems better to fix the total number of tries (and thus
 * the probability of unintended failure) than to fix the total time
 * spent.

If you believe that waiting processes can be awakened close enough to
simultaneously to hit the behavior I posited earlier, then encouraging
them to have different cur_delay values will help; but Andres doesn't
believe it and I concede it seems like a stretch.

So I think fooling with the details in s_lock.c is pretty much beside
the point.  The most likely bet is that Parag's getting bit by the
bug fixed in a4adc31f690.  It's possible he's seeing the effect of
some different issue that causes lwlock.c to hold that lock a long
time at scale, but that's where I'd look first.

            regards, tom lane


