Re: Issue with the PRNG used by Postgres - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Issue with the PRNG used by Postgres
Date
Msg-id 20240412183317.m7ick24374mcj6do@awork3.anarazel.de
Whole thread Raw
In response to Re: Issue with the PRNG used by Postgres  (Andres Freund <andres@anarazel.de>)
Responses Re: Issue with the PRNG used by Postgres
List pgsql-hackers
Hi,

Given that I found several ways to spuriously trigger the stuck spinlock
logic, I think we need to backpatch a fix.


On 2024-04-11 21:41:39 -0700, Andres Freund wrote:
> Tracking the total amount of time spent sleeping doesn't require any
> additional syscalls, due to the nanosleep()...

I did quickly implement this, and it works nicely on unix. Unfortunately I
couldn't find something similar for windows. Which would mean we'd need to
record the time before every sleep.

Another issue with that approach is that we'd need to add a field to
SpinDelayStatus, which'd be an ABI break. Not sure there's any external code
using it, but if there were, it'd be a fairly nasty, rarely hit, path.

Afaict there's no padding that could be reused on 32 bit platforms [1].


I wonder if, for easy backpatching, the easiest solution is to just reset
errno before calling pg_usleep(), and only increment status->delays if
errno != EINTR. Given our pg_usleep() implementations, that'd preven the stuck
spinlock logic from triggering too fast.


If the ABI issue weren't there, we could record the current time the first
time we sleep during a spinlock acquisition (there's an existing branch for
that) and then only trigger the stuck spinlock detection when enough time has
passed.  That'd be fine from an efficiency POV.


Even if we decide to do so, I don't think it'd be a good idea to remove the
stuck logic alltogether in the backbranches. That might take out services that
have been running ok-ish for a long time.

Greetings,

Andres Freund


[1] But we are wasting a bunch of space on 64bit platforms due to switching
between pointers and 32bit integers, should probably fix that.



pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: using extended statistics to improve join estimates
Next
From: Nathan Bossart
Date:
Subject: Re: allow changing autovacuum_max_workers without restarting