Re: Issue with the PRNG used by Postgres - Mailing list pgsql-hackers

From Parag Paul
Subject Re: Issue with the PRNG used by Postgres
Date
Msg-id CAA=PXp074Aab6xjyMfQoYhLUFFsDzT+0X=-JKbxyryXa9SH0eQ@mail.gmail.com
Whole thread Raw
In response to Re: Issue with the PRNG used by Postgres  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Issue with the PRNG used by Postgres
List pgsql-hackers
hi Tom, 
 First of all thanks for you response. I did not misread it. The 0.5 is added to the result of the multiplication which then uses C integer casting, which does not round off, but just drops the decimal portion.


status->cur_delay += (int) (status->cur_delay *
                                    pg_prng_double(&pg_global_prng_state) + 0.5);


So, if RNG generated 0.0000001 and cur_delay =1000.
Result will be
1000 + int(1000*0.000001 + 5) = (int)(1000 + (0.1+.5)) = (int)1000.6 = 1000 <--  back to the same value
and there is no change after that starts happening, if the RNG is in the low hamming index state. If avalanche happens soon, then it will correct it self, but in the mean time, we have a stuck_spin_lock PANIC that could bring down a production server. 
-Parag

On Wed, Apr 10, 2024 at 8:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm not convinced that we should try to improve the RNG, but surely we
> need to put parentheses around pg_prng_double(&pg_global_prng_state) +
> 0.5. IIUC, the current logic is making us multiply the spin delay by a
> value between 0 and 1 when what was intended was that it should be
> multiplied by a value between 0.5 and 1.5.

No, I think you are misreading it, because the assignment is += not =.
The present coding is

        /* increase delay by a random fraction between 1X and 2X */
        status->cur_delay += (int) (status->cur_delay *
                                    pg_prng_double(&pg_global_prng_state) + 0.5);

which looks fine to me.  The +0.5 is so that the conversion to integer
rounds rather than truncating.

In any case, I concur with Andres: if this behavior is anywhere near
critical then the right fix is to not be using spinlocks.

                        regards, tom lane

pgsql-hackers by date:

Previous
From: Kartyshov Ivan
Date:
Subject: Re: [HACKERS] make async slave to wait for lsn to be replayed
Next
From: Tom Lane
Date:
Subject: Re: Speed up clean meson builds by ~25%