Hello,
> Yes but for a third thread (each on a physical core) it will be 1/40 +
> 1/40 and so on up to roughly 40/40 for 40 cores.
That is why I proposed a formula which depends on the number of threads.
> [...] But they aren't constant only close. It may or not show up in this 
> case but I've noticed that often the collision rate is a lot higher than 
> the probability would suggest, I'm not sure why,
If so, I would suggest that the probability is wrong, and try to 
understand why:-)
>>> Moreover  they will write to the same cache lines for every fprintf
>>> and this is very very bad even without atomic operations.
>>
>> We're talking of transactions that involve network messages and possibly
>> disk IOs on the server, so some cache issues within pgbench would not
>> be a priori the main performance driver.
> Sure but :
> - good measurement is hard and by adding locking in fprintf it make
> its timing more noisy.
This really depends on the probability of lock collisions. If it is 
small enough, the impact would be negligible.
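To make "small enough" concrete, here is a back-of-envelope sketch. It assumes that each thread holds the lock for a fixed fraction of its transaction time and that threads are independent, which real workloads may well violate. The function name and parameters are hypothetical, chosen for illustration only.

```c
#include <assert.h>

/*
 * Rough estimate of the chance that a thread finds the log lock already
 * taken: each of the other (nthreads - 1) threads is assumed to hold the
 * lock for lock_time out of every tx_time, independently of the others.
 * This is an illustrative upper bound, not a measured value.
 */
static double lock_busy_estimate(int nthreads, double lock_time, double tx_time)
{
    assert(nthreads >= 1 && lock_time >= 0.0 && tx_time > 0.0);

    double p = (nthreads - 1) * (lock_time / tx_time);

    return p > 1.0 ? 1.0 : p;   /* a probability cannot exceed 1 */
}
```

With a lock held 1/40 of the time, this gives 1/40 for two threads and 2/40 for three, matching the arithmetic quoted above. If the ratio lock_time / tx_time is much smaller, the estimate stays small even for many threads.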
> - it's against 'good practices' for scalable code.
> Trivial code can show that elapsed time for as low as four cores writing 
> to same cache line in a loop, without locking or synchronization, is 
> greater than the elapsed time for running these four loops sequentially 
> on one core. If they write to different cache lines it scales linearly.
I'm not arguing about general scalability principles, which may or may 
not be relevant to the case at hand.
I'm discussing whether the proposed feature can be implemented much more 
simply with a mutex instead of the current proposal, which is on the heavy 
side and thus induces more maintenance effort later.
Now I agree that if there is a mutex it must be held as briefly as possible 
and not hinder performance significantly for relevant use cases. Note that 
the overhead evaluation by Tomas is pessimistic as it only involves read-only 
transactions for which all transaction details are logged. Note also that 
if you have 1000 cores to run pgbench and locking becomes an issue, 
you could still use the per-thread logs.
The current discussion suggests that each thread should prepare the string 
off-lock (say with some sprintf) and then only lock when sending the 
string. This looks reasonable, but still needs to be validated (i.e. that 
the lock time would indeed be very small wrt the transaction time).
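
The "format off-lock, lock only for the write" idea could look roughly like the sketch below. This is not pgbench code: the SharedLog type, the shared_log_printf function and the 512-byte buffer are assumptions made for illustration, and it uses plain POSIX threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical shared log state; names are illustrative, not pgbench's. */
typedef struct
{
    FILE           *file;
    pthread_mutex_t lock;
} SharedLog;

/*
 * Format the log line into a thread-local buffer without holding the lock,
 * then take the mutex only around the write itself.
 * Returns the number of bytes written, or -1 on error.
 */
static int shared_log_printf(SharedLog *log, const char *fmt, ...)
{
    char    buf[512];           /* assumed large enough for one log line */
    va_list ap;
    int     len;
    size_t  written;

    assert(log != NULL && fmt != NULL);

    /* Off-lock: the costly formatting touches only this thread's stack. */
    va_start(ap, fmt);
    len = vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    if (len < 0)
        return -1;
    if ((size_t) len >= sizeof(buf))
        len = (int) sizeof(buf) - 1;    /* line was truncated */

    /* On-lock: a single write of an already formatted string. */
    pthread_mutex_lock(&log->lock);
    written = fwrite(buf, 1, (size_t) len, log->file);
    pthread_mutex_unlock(&log->lock);

    return written == (size_t) len ? len : -1;
}
```

Each thread would call something like shared_log_printf(&log, "%d %d %.3f\n", ...) once per transaction. On POSIX systems stdio calls already lock the FILE internally, so the explicit mutex here mostly illustrates the pattern. What remains to be measured is how long that critical section lasts compared to a transaction.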
-- 
Fabien.