[MASSMAIL]Issue with the PRNG used by Postgres - Mailing list pgsql-hackers

From Parag Paul
Subject [MASSMAIL]Issue with the PRNG used by Postgres
Date
Msg-id CAA=PXp2XTxw_ZoE+1L-qwXmuxYnqQ41y_PHpWja_4i83g-7QNQ@mail.gmail.com
Whole thread Raw
Responses Re: Issue with the PRNG used by Postgres
List pgsql-hackers
hi all, 
 We have an interesting problem, where PG went to PANIC due to stuck spinlock case. 
On careful analysis and hours of trying to reproduce this(something that showed up in production after almost 2 weeks of stress run), I did some statistical analysis on the RNG generator that PG uses to create the backoff for the spin locks. 

 My expectation is that, this RNG will be fair to all backends.
After studying the underlying hash function that PG uses,"xoroshiro" which is from the LFSR family, it seems, that it has failed a bunch of statistical tests and has been mentioned in various blogs. 
The one that caught my attention was :https://www.pcg-random.org/posts/xoshiro-repeat-flaws.html
This mentions the zeroland problem and the big belt when in low Hamming index. What that means, is that when the status reaches a low hamming index state(state where it is mainly all 0's and less 1's), it takes a while to get back to regular entropy. 

Here is the code from s_lock.c that shows how we add delay for the backoff, before the TAS is tried again.
Also, there is a bug here in all likelihood, as the author wanted to round off the value coming out of the equation by adding 0.5. Integer casting in c/c++ will drop the decimal part out((int)12.9999 = 12). 
And if the RNG keeps producing values close to 0, the delay will never increase(as it will keep integer casting to 0) and end up spinning 1000 times within 1second itself. That is one of the reasons why we could reproduce this by forcing low RNG numbers in our test cases. 
So, steps that I would recommend
1. Firstly, use ceiling for round off
2. cur_delay should increase by max(1000, RNG generated value)
3. Change the hashing function
4. Capture the new 0 state, and then try to reseed.

        /* increase delay by a random fraction between 1X and 2X */
        status->cur_delay += (int) (status->cur_delay *
                                    pg_prng_double(&pg_global_prng_state) + 0.5);

pgsql-hackers by date:

Previous
From: Andrei Lepikhov
Date:
Subject: Re: post-freeze damage control
Next
From: Michael Paquier
Date:
Subject: Re: Allow non-superuser to cancel superuser tasks.