Re: rand48 replacement - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: rand48 replacement
Date
Msg-id alpine.DEB.2.22.394.2107012212500.2248546@pseudo
Whole thread Raw
In response to Re: rand48 replacement  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses Re: rand48 replacement
List pgsql-hackers
Hello Dean,

> I haven't looked at the patch in detail, but one thing I object to is
> the code to choose a random integer in an arbitrary range.

Thanks for bringing up this interesting question!

> Currently, this is done in pgbench by getrand(), which has its
> problems.

Yes. That is one of the motivation for providing something hopefully 
better.

> However, this patch seems to be replacing that with a simple
> modulo operation, which is perhaps the worst possible way to do it.

I did it knowing this issue. Here is why:

The modulo operation is biased for large ranges close to the limit, sure. 
Also, the bias is somehow of the same magnitude as the FP multiplication 
approach used previously, so the "worst" has not changed much, it is 
really the same as before.

I thought it is not such an issue because for typical uses we are unlikely 
to be in these conditions, so the one-operation no branching approach 
seemed like a good precision vs performance compromise: I'd expect the 
typical largest ranges to be well below 40 bits (eg a key in a pretty 
large table in pgbench), which makes the bias well under 1/2**24 and ISTM 
that I can live with that. With the initial 48 bits state, obviously the 
situation was not the same.

> There's plenty of research out there on how to do it better -- see,
> for example, [1] for a nice summary.

Rejection methods include branches, thus may cost significantly more, as 
show by the performance figures in blog.

Also, it somehow breaks the sequence determinism when using range, which I 
found quite annoying: ISTM desirable that when generating a number the 
state advances once, and just once.

Also some methods have higher costs depending on the actual range, eg the 
bitmask approach: for range 129 the bitmask is 0xff and you have a nearly 
50% probability of iterating once, nearly 25% of iterating twice, and so 
on… I like performance to be uniform, not to depend on actual values.

Given these arguments I'd be inclined to keep the bias, but I'm open to 
more discussion.

> Also, I'd say that functions to choose random integers in an arbitrary 
> range ought to be part of the common API, as they are in almost every 
> language's random API.

That is a good point.

-- 
Fabien.

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: make world and install-world without docs
Next
From: Daniel Gustafsson
Date:
Subject: Re: SSL/TLS instead of SSL in docs