Re: rand48 replacement - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: rand48 replacement
Date
Msg-id alpine.DEB.2.22.394.2107041718570.2359988@pseudo
Whole thread Raw
In response to Re: rand48 replacement  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: rand48 replacement
List pgsql-hackers
> Now suppose we want a random number in the range [0,6). This is what
> happens with your algorithm for each of the possible prng() return
> values:
>
>  prng() returns 0 -- OK
>  prng() returns 1 -- OK
>  prng() returns 2 -- OK
>  prng() returns 3 -- OK
>  prng() returns 4 -- OK
>  prng() returns 5 -- OK
>  prng() returns 6 -- out of range so use splitmix3() with initial state=6:
>    state=6 -> 3, return 7 -- still out of range, so repeat
>    state=3 -> 0, return 3 -- now OK
>  prng() returns 7 -- out of range so use splitmix3() with initial state=7:
>    state=7 -> 4, return 4 -- now OK
>
> So, assuming that prng() chooses each of the 8 possible values with
> equal probability (1/8), the overall result is that the values 0,1,2
> and 5 are returned with a probability of 1/8, whereas 3 and 4 are
> returned with a probability of 2/8.

Ok, I got that explanation.

> Using the correct implementation of the bitmask algorithm, each
> iteration calls prng() again, so in the end no particular return value
> is ever more likely than any other (hence it's unbiased).

Ok, you're taking into account the number of states of the PRNG, so this 
finite number implies some bias on some values if you actually enumerate 
all possible cases, as you do above.

I was reasoning "as if" the splitmix PRNG was an actual random function, 
which is obviously false, but is also somehow a usual (false) assumption 
with PRNGs, and with this false assumption my implementation is perfect:-)

The defect of the modulo method for range extraction is that even with an 
actual (real) random generator the results would be biased. The bias is in 
the method itself. Now you are arguing for a bias linked to the internals 
of the PRNG. They are not the same in nature, even if the effect is the 
same.

Also the bias is significant for close to the limit ranges, which is not 
the kind of use case I have in mind when looking at pgbench.

If you consider the PRNG internals, then splitmix extraction function 
could also be taken into account. If it is not invertible (I'm unsure), 
then assuming it is some kind of hash function, about 1/e of output values 
would not reachable, which is yet another bias that you could argue 
against.

Using the initial PRNG works better only because the underlying 128 bits 
state is much larger than the output value. Which is the point for having 
a larger state in the first place, anyway.

> As for determinism, the end result is still fully deterministic. For
> example, lets say that prng() returns the following sequence, for some
> initial state:
>
>  1,0,3,0,3,7,4,7,6,6,5,3,7,7,7,0,3,6,5,2,3,3,4,0,0,2,7,4,...
>
> then the bitmask method just returns that sequence with all the 6's
> and 7's removed:
>
>  1,0,3,0,3,4,5,3,0,3,5,2,3,3,4,0,0,2,4,...
>
> and that same sequence will always be returned, when starting from
> that initial seed.

Yes and no.

The result is indeed deterministic of you call the function with the same 
range. However, if you change the range value in one place then sometimes 
the state can advance differently, so the determinism is lost, meaning 
that it depends on actual range values.

Attached a v7 which does as you wish, but also looses the deterministic 
non-value dependent property I was seeking. I would work around that by 
deriving another 128 bit generator instead of splitmix 64 bit, but that is 
overkill.

-- 
Fabien.
Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: PostgreSQL-13.3 parser.y with positional references by named references
Next
From: Dean Rasheed
Date:
Subject: Re: rand48 replacement