Re: checkpointer continuous flushing - Mailing list pgsql-hackers
From | Fabien COELHO
Subject | Re: checkpointer continuous flushing
Date |
Msg-id | alpine.DEB.2.10.1506071638490.11135@sto
In response to | Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>)
Responses | Re: checkpointer continuous flushing; Re: checkpointer continuous flushing
List | pgsql-hackers
Hello Andres,

> They pretty much can't if you flush things frequently. That's why I
> think this won't be acceptable without the sorting in the checkpointer.

* VERSION 2 "WORK IN PROGRESS"

The implementation is more a proof-of-concept for getting feedback than clean code. What it does:

- as version 1: simplified asynchronous flush based on Andres Freund's patch, with sync_file_range/posix_fadvise used to hint the OS that the buffer must be sent to disk "now".

- added: checkpoint buffer sorting based on a 2007 patch by Takahiro Itagaki, but with a smaller and static buffer allocated once. Also, sorting is done by chunks in the current version.

- also added: sync/advise calls are now merged if possible, so fewer calls are used, especially when buffers are sorted, but also if there are few files.
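To make the sorting and the merged flush hints concrete, here is a small standalone sketch. It is not the patch code; all names and constants (DirtyBuf, flush_hint, the 8 kB block size, the scratch file path) are just illustrative. Dirty blocks are sorted by file and block number, and consecutive blocks of the same file are collapsed into a single sync_file_range (or posix_fadvise) call.

/*
 * Minimal sketch of sorted writes plus merged flush hints; illustrative
 * names only, not the checkpointer code.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192                    /* PostgreSQL default block size */

typedef struct
{
    int     fd;                        /* file the dirty block belongs to */
    long    blocknum;                  /* block number within that file */
} DirtyBuf;

/* sort by file first, then block number, so writes become sequential */
static int
cmp_dirty(const void *a, const void *b)
{
    const DirtyBuf *x = a;
    const DirtyBuf *y = b;

    if (x->fd != y->fd)
        return x->fd < y->fd ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return x->blocknum < y->blocknum ? -1 : 1;
    return 0;
}

/* hint the OS to start writing one contiguous range of blocks now */
static void
flush_hint(int fd, long first, long count)
{
#if defined(__linux__)
    (void) sync_file_range(fd, (off_t) first * BLCKSZ,
                           (off_t) count * BLCKSZ, SYNC_FILE_RANGE_WRITE);
#else
    (void) posix_fadvise(fd, (off_t) first * BLCKSZ,
                         (off_t) count * BLCKSZ, POSIX_FADV_DONTNEED);
#endif
}

/* issue one merged hint per run of adjacent blocks instead of one per block */
static void
flush_sorted(DirtyBuf *bufs, size_t n)
{
    size_t  start = 0;

    qsort(bufs, n, sizeof(DirtyBuf), cmp_dirty);

    for (size_t i = 1; i <= n; i++)
    {
        /* close the run when the file changes or blocks stop being adjacent */
        if (i == n ||
            bufs[i].fd != bufs[start].fd ||
            bufs[i].blocknum != bufs[i - 1].blocknum + 1)
        {
            flush_hint(bufs[start].fd, bufs[start].blocknum,
                       (long) (i - start));
            start = i;
        }
    }
}

int
main(void)
{
    /* toy example: three dirty blocks of one scratch file, out of order */
    int         fd = open("/tmp/flush_sketch.dat", O_CREAT | O_RDWR, 0600);
    DirtyBuf    bufs[] = {{fd, 7}, {fd, 5}, {fd, 6}};

    if (fd < 0)
        return 1;
    flush_sorted(bufs, 3);
    close(fd);
    return 0;
}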
* PERFORMANCE TESTS

Impact on "pgbench -M prepared -N -P 1" at scale 10 (simple-update pgbench, a mostly-write load), with checkpoint_completion_target=0.8 and shared_buffers=1GB. Contrary to v1, I have not tested bgwriter flushing, as its impact in the first round was close to nought. This does not mean that particular loads may not benefit from, or be harmed by, flushing from the bgwriter.

- 100 tps throttled, max 100 ms latency, over 6400 seconds, with checkpoint_timeout=30s

  flush | sort | late transactions
  off   | off  | 6.0 %
  off   | on   | 6.1 %
  on    | off  | 0.4 %
  on    | on   | 0.4 % (93% improvement)

- 100 tps throttled, max 100 ms latency, over 4000 seconds, with checkpoint_timeout=10min

  flush | sort | late transactions
  off   | off  | 1.5 %
  off   | on   | 0.6 % (?!)
  on    | off  | 0.8 %
  on    | on   | 0.6 % (60% improvement)

- 150 tps throttled, max 100 ms latency, over 19600 seconds (5.5 hours), with checkpoint_timeout=30s

  flush | sort | late transactions
  off   | off  | 8.5 %
  off   | on   | 8.1 %
  on    | off  | 0.5 %
  on    | on   | 0.4 % (95% improvement)

- full speed pgbench over 6400 seconds with checkpoint_timeout=30s

  flush | sort | tps performance over per-second data
  off   | off  |  676 +- 230
  off   | on   |  683 +- 213
  on    | off  |  712 +- 130
  on    | on   |  725 +- 116 (7.2% avg / 50% stddev improvements)

- full speed pgbench over 4000 seconds with checkpoint_timeout=10min

  flush | sort | tps performance over per-second data
  off   | off  |  885 +- 188
  off   | on   |  940 +- 120 (6% / 36%!)
  on    | off  |  778 +- 245 (hmmm... not very consistent?)
  on    | on   |  927 +- 108 (4.5% avg / 43% stddev improvements)

- full speed pgbench "-j2 -c4" over 6400 seconds with checkpoint_timeout=30s

  flush | sort | tps performance over per-second data
  off   | off  | 2012 +- 747
  off   | on   | 2086 +- 708
  on    | off  | 2099 +- 459
  on    | on   | 2114 +- 422 (5% avg / 44% stddev improvements)

* CONCLUSION

For all these HDD tests, when both options are activated the tps performance is improved, the latency is reduced, and the performance is more stable (smaller standard deviation).

Overall the effects of the two options are, not surprisingly, quite orthogonal (with exceptions):

- latency is essentially improved (60 to 95% reduction) by flushing
- throughput is improved (4 to 7% better) thanks to sorting

In detail, some loads may benefit more from only one option activated. Also, on SSDs both options would probably have limited benefit.

Usual caveat: these are only benchmarks on one host at a particular time and location, which may or may not be reproducible, nor representative as such of any other load. The good news is that all these tests tell the same thing.

* LOOKING FOR THOUGHTS

- The bgwriter flushing option seems ineffective; perhaps it could be removed from the patch?

- Move the fsync as early as possible, as suggested by Andres Freund?

  In these tests, when the flush option is activated, the fsync duration at the end of the checkpoint is small: out of more than 5525 checkpoint fsyncs, 0.5% are above 1 second when flush is on, but the figure rises to 24% when it is off. This suggests that doing the fsync as soon as possible would probably have no significant effect on these tests. My opinion is that this should be left out for the nonce.

- Take tablespaces into account, as pointed out by Andres Freund?

  The issue is that if writes are sorted, they are no longer distributed randomly over tablespaces, inducing lower performance on such setups. How to do it: while scanning shared_buffers, count dirty buffers for each tablespace, then start as many threads as there are tablespaces, each one doing its own independent throttling for its tablespace? For some obscure reason there are 2 tablespaces by default (pg_global and pg_default), so that would mean at least 2 threads. Alternatively, maybe it can be done from one thread, but it would probably involve some strange hocus-pocus to switch frequently between tablespaces; a rough sketch of that idea follows.
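Here is a toy sketch of that single-process alternative; the names (Tablespace, pick_next_tablespace) and numbers are hypothetical, not taken from the patch. It keeps a per-tablespace count of dirty buffers and, at each step, writes a buffer of whichever tablespace is furthest behind its own progress, so every tablespace advances at the same relative pace and the throttling stays balanced.

/*
 * Sketch of single-process interleaving across tablespaces;
 * hypothetical names, toy numbers.
 */
#include <stdio.h>

#define NTABLESPACES 3

typedef struct
{
    int     to_write;           /* dirty buffers counted for this tablespace */
    int     written;            /* buffers written so far */
} Tablespace;

/* choose the tablespace with the lowest completion fraction */
static int
pick_next_tablespace(Tablespace *ts, int n)
{
    int     best = -1;
    double  best_frac = 2.0;

    for (int i = 0; i < n; i++)
    {
        double  frac;

        if (ts[i].written >= ts[i].to_write)
            continue;           /* this tablespace is done */
        frac = (double) ts[i].written / ts[i].to_write;
        if (frac < best_frac)
        {
            best_frac = frac;
            best = i;
        }
    }
    return best;                /* -1 when everything has been written */
}

int
main(void)
{
    /* counts gathered while scanning shared buffers (toy numbers) */
    Tablespace  ts[NTABLESPACES] = {{10, 0}, {4, 0}, {1, 0}};
    int         next;

    while ((next = pick_next_tablespace(ts, NTABLESPACES)) >= 0)
    {
        /* a real checkpointer would write (and throttle) one buffer here */
        ts[next].written++;
        printf("write one buffer of tablespace %d\n", next);
    }
    return 0;
}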
--
Fabien.