Hello Andres,
>> With your worst-case figure and some rounding, it seems to look like:
>>
>> #threads   collision probability   performance impact
>>     2              1/40                 1/3200
>>     4              1/7                  1/533
>>     8              0.7                  < 0.01 (about 1%)
>>
>> This suggests that for a pessimistic (read-only load) fprintf overhead
>> ratio there would be a small impact even with 8 threads doing 20000 tps
>> each.
>
> I think math like this mostly disregards hardware realities.
Hmmm. In my mind, doing the maths helps understand what may be going on.
Note that this does not preclude checking afterwards that it does indeed
correspond to reality:-)
The key suggestion of the maths is that if p*t << 1 all is (seems) well.
> You don't actually need to have actual lock contention to notice
> overhead.
The overhead assumed is 1/40 of the transaction time, from Tomas' measurements.
Given the ~ 18000 tps (we are talking of an in-memory read-only load
probably on the same host), transaction time for pgbench seems to be about
0.06 ms, and fprintf seems to be about 0.0015 ms (1.5 µs).
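For the record, the arithmetic behind these two figures can be sketched in
a few lines of C. This is only a back-of-the-envelope check: the function
names are made up for illustration, and the 1/40 ratio is the assumption
discussed above, not something the code measures.

```c
#include <stdio.h>

/* Transaction time in microseconds for one thread running at `tps`. */
double txn_time_us(double tps)
{
    return 1e6 / tps;
}

/*
 * Logging cost in microseconds, ASSUMING it is a fixed fraction `ratio`
 * of the transaction time (1/40 in the discussion above).
 */
double log_cost_us(double tps, double ratio)
{
    return txn_time_us(tps) * ratio;
}

/* Print both estimates for a given throughput and overhead ratio. */
void print_estimate(double tps, double ratio)
{
    printf("transaction: %.1f us, logging: %.2f us\n",
           txn_time_us(tps), log_cost_us(tps, ratio));
}
```

Calling print_estimate(18000.0, 1.0 / 40.0) gives roughly 56 µs per
transaction and 1.4 µs per log line, which matches the rounded 0.06 ms and
1.5 µs above.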
> - frequently acquiring an *uncontended* lock that resides in another
> socket's cache and where the cacheline is dirty requires relatively
> expensive cross cpu transfers. That's all besides the overhead of doing
> a lock operation itself. A lock; xaddl;, or whatever you end up using,
> has a significant cost in itself. It implies a bus lock and cache flush,
> which is far from free.
Ok, I did not assume an additional "lock cost". Do you have a figure? A
quick search suggested a figure of around 100 ns for "lightweight mutexes",
but the test conditions were unclear. If that is about right, then adding
this overhead does not change the above maths much.
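To make the "does not change much" claim concrete, here is a small sketch
in C. The function name is hypothetical, and the 100 ns lock cost is the
unverified figure from my search, so take the result as an order of
magnitude only.

```c
/*
 * Fraction of the transaction time spent on logging, given the
 * transaction time, the formatting/output cost and an ASSUMED
 * uncontended lock cost, all in microseconds.
 */
double overhead_ratio(double txn_us, double log_us, double lock_us)
{
    return (log_us + lock_us) / txn_us;
}
```

With the rounded figures (60 µs per transaction, 1.5 µs for fprintf),
overhead_ratio(60.0, 1.5, 0.0) is 0.025, i.e. 1/40, and
overhead_ratio(60.0, 1.5, 0.1) is about 0.027, i.e. roughly 1/37. If the
real lock cost with cross-socket cacheline transfers is much higher than
100 ns, the conclusion would of course have to be revisited.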
> Additionally we're quite possibly talking about more than 8 threads.
> I've frequently used pgbench with hundreds of threads; for imo good
> reasons.
Good for you. I do not have access to a host on which this would make
sense:-)
> That all said, it's far from guaranteed that there's an actual problem
> here. If done right, i.e. the expensive formatting of the string is
> separated from the locked output to the kernel, it might end up being
> acceptable.
That is what I would like to assess. Probably snprintf into a local buffer
(to avoid mallocing anything) followed by fputs/write/whatever would indeed
help reduce the "contention" probability, if not the actual overhead.
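A minimal sketch of what I have in mind, assuming a pthread mutex shared
by all threads writing to the same log file. The function name, the line
format and the buffer size are made up for illustration and are not
pgbench's actual code or log format.

```c
#include <stdio.h>
#include <pthread.h>

/*
 * Format one log line into a stack buffer WITHOUT holding the lock,
 * then hold the lock only for the write to the shared stream.
 * Returns 0 on success, -1 on formatting, truncation or write error.
 */
int log_transaction(FILE *out, pthread_mutex_t *lock,
                    int client_id, long txn_no, double latency_us)
{
    char   buf[128];            /* assumed large enough for one line */
    int    len;
    size_t written;

    /* The expensive part: formatting, done outside the critical section. */
    len = snprintf(buf, sizeof(buf), "%d %ld %.3f\n",
                   client_id, txn_no, latency_us);
    if (len < 0 || (size_t) len >= sizeof(buf))
        return -1;

    /* The short part: hand the already formatted bytes to stdio. */
    pthread_mutex_lock(lock);
    written = fwrite(buf, 1, (size_t) len, out);
    pthread_mutex_unlock(lock);

    return written == (size_t) len ? 0 : -1;
}
```

The point is only that the time spent under the lock shrinks to a buffer
copy, so the collision probability should drop accordingly. Whether the
lock operation itself remains a measurable cost is exactly what needs to be
tested.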
--
Fabien.