On Tue, Aug 26, 2014 at 1:53 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
>
>
> Hello pgdevs,
>
> I've been playing with pg for some time now to try to reduce the maximum latency of simple requests, to have a responsive server under small to medium load.
>
> On an old computer with a software RAID5 HDD attached, the pgbench simple update script run for some time (scale 100, fillfactor 95)
>
> pgbench -M prepared -N -c 2 -T 500 -P 1 ...
>
> gives 300 tps. However this performance is really +1000 tps for a few seconds followed by 16 seconds at about 0 tps for the checkpoint induced IO storm. The server is totally unresponsive 75% of the time. That's bandwidth optimization for you. Hmmm... why not.
>
> Now, given this setup, if pgbench is throttled at 50 tps (1/6 the above max):
>
> pgbench -M prepared -N -c 2 -R 50.0 -T 500 -P 1 ...
>
> The same thing more or less happens in a delayed fashion... You get 50 tps for some time, followed by sections of 15 seconds at 0 tps for the checkpoint when the segments are full... the server is unresponsive about 10% of the time (one in ten transactions is late by more than 200 ms).

I think another thing to find out here is why exactly the checkpoint
storm causes tps to drop so steeply. One reason could be that
backends need to write more WAL due to full_page_writes; another
could be contention around the buffer content_lock.

To dig into the reason, the same tests can be rerun with
full_page_writes = off and/or synchronous_commit = off to see
whether WAL writes are what causes tps to go down.
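
As a rough sketch of that diagnostic run (for a throwaway test
instance only; the values just restate the two settings named above),
the postgresql.conf fragment could look like this:

```
# postgresql.conf, diagnostic run only
full_page_writes = off      # unsafe: a crash can leave torn pages, never use on data you care about
synchronous_commit = off    # commit returns before the WAL flush; recent commits can be lost on a crash
```

Both settings can be picked up with a reload, no restart needed.
Flipping them one at a time should show which one matters, if either
does.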

Similarly for checkpoints, use checkpoint_completion_target to
spread the checkpoint writes, as Jeff suggested as well, to see
whether that mitigates the problem.
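
For example (0.9 is only my assumption of a reasonable value to try,
not something measured on this setup):

```
# postgresql.conf
checkpoint_completion_target = 0.9   # aim to finish checkpoint writes within ~90% of the interval (default is 0.5 as of 9.4)
```

This setting is also picked up on reload, so it can be changed
between pgbench runs without restarting the server.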