Re: Issue with the PRNG used by Postgres - Mailing list pgsql-hackers

From Andrey M. Borodin
Subject Re: Issue with the PRNG used by Postgres
Date
Msg-id 239D257E-6740-4644-BBDA-9600A45592AF@yandex-team.ru
In response to Re: Issue with the PRNG used by Postgres  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
FWIW, yesterday we had one more reproduction of a stuck spinlock panic that does not appear to be an actual stuck spinlock.

I don’t see any valuable diagnostic information. The reproduction happened on a hot standby. There is a message in the
primary's logs at the same time, but it does not seem to be related:
"process 3918804 acquired ShareLock on transaction 909261926 after 2716.594 ms"
PostgreSQL 14.11
The VM hosting this node does not seem heavily loaded; according to monitoring, there were just 2 busy backends before
the panic shutdown.


> On 16 Apr 2024, at 20:54, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2024-04-15 10:54:16 -0400, Robert Haas wrote:
>> On Fri, Apr 12, 2024 at 3:33 PM Andres Freund <andres@anarazel.de> wrote:
>>> Here's a patch implementing this approach. I confirmed that before we trigger
>>> the stuck spinlock logic very quickly and after we don't. However, if most
>>> sleeps are interrupted, it can delay the stuck spinlock detection a good
>>> bit. But that seems much better than triggering it too quickly.
>>
>> +1 for doing something about this. I'm not sure if it goes far enough,
>> but it definitely seems much better than doing nothing.
>
> One thing I started to be worried about is whether a patch ought to prevent
> the timeout used by perform_spin_delay() from increasing when
> interrupted. Otherwise a few signals can trigger quite long waits.
>
> But as I can't quite see a way to make this accurate in the backbranches, I
> suspect something like what I posted is still a good first version.
>


What kind of inaccuracy do you see?
The code in perform_spin_delay() does not seem to differ much across REL_11_STABLE..REL_12_STABLE.
The only difference I see is how the random number is generated.
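
For reference, here is a minimal sketch of the backoff logic as I understand it, not the actual s_lock.c code. The constant names follow s_lock.c, but their values are written from memory and should be treated as assumptions. The sketch_* names are hypothetical, and the random fraction is passed in as a parameter so that the PRNG is left out of the picture. It illustrates that every delay is counted toward the limit regardless of how long the sleep really lasted, so sleeps cut short by signals could reach the limit much sooner than intended.

```c
#include <stddef.h>

/* Names follow s_lock.c; the values are assumptions for illustration. */
#define MIN_DELAY_USEC 1000L
#define MAX_DELAY_USEC 1000000L
#define NUM_DELAYS     1000

/* Hypothetical stand-in for the delay bookkeeping in SpinDelayStatus. */
typedef struct
{
    int  delays;     /* number of sleeps attempted so far */
    long cur_delay;  /* next sleep length in microseconds, 0 = not yet set */
} SketchDelayStatus;

/*
 * Grow the delay by a fraction frac in [0, 1) of itself, i.e. by a factor
 * between 1x and 2x, and wrap back to the minimum once the maximum is
 * exceeded.
 */
static long
sketch_next_delay(long cur_delay, double frac)
{
    long next = cur_delay + (long) (cur_delay * frac + 0.5);

    if (next > MAX_DELAY_USEC)
        next = MIN_DELAY_USEC;
    return next;
}

/*
 * One delay step. Returns 1 when the delay count exceeds NUM_DELAYS, which
 * is where the stuck spinlock report would fire. Note that the counter is
 * bumped whether or not the sleep ran for its full length.
 */
static int
sketch_count_delay(SketchDelayStatus *st, double frac)
{
    if (++st->delays > NUM_DELAYS)
        return 1;

    if (st->cur_delay == 0)
        st->cur_delay = MIN_DELAY_USEC;

    /* The real code sleeps for cur_delay microseconds here; a signal may cut that short. */

    st->cur_delay = sketch_next_delay(st->cur_delay, frac);
    return 0;
}

/* Count how many delay steps complete before the limit is hit. */
static int
sketch_delays_before_stuck(double frac)
{
    SketchDelayStatus st = {0, 0};
    int n = 0;

    while (!sketch_count_delay(&st, frac))
        n++;
    return n;
}
```

In this model the limit depends only on the number of delays, never on elapsed time, which is why interrupted sleeps matter.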

Thanks!


Best regards, Andrey Borodin.

