Re: checkpointer continuous flushing - Mailing list pgsql-hackers
From: Fabien COELHO
Subject: Re: checkpointer continuous flushing
Date:
Msg-id: alpine.DEB.2.10.1506170803210.9794@sto
In response to: Re: checkpointer continuous flushing (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses: Re: checkpointer continuous flushing
List: pgsql-hackers
Hello,

Here is version 3, including many performance tests with various settings, representing about 100 hours of pgbench runs.

This patch aims at improving checkpoint I/O behavior so that tps throughput is improved, late transactions are less frequent, and overall performance is more stable.

* SOLILOQUIZING

> - The bgwriter flushing option seems ineffective, it could be removed
> from the patch?

I did that.

> - Move fsync as early as possible, suggested by Andres Freund?
>
> My opinion is that this should be left out for the nonce.

I did that.

> - Take into account tablespaces, as pointed out by Andres Freund?
>
> Alternatively, maybe it can be done from one thread, but it would
> probably involve some strange hocus-pocus to switch frequently between
> tablespaces.

I did the hocus-pocus approach, including a quasi-proof (not sure what this mathematical object is :-) in comments to show how/why it works.

* PATCH CONTENTS

- as version 1: simplified asynchronous flush based on Andres Freund's patch, with sync_file_range/posix_fadvise used to hint the OS that the buffer must be sent to disk "now" (see the first sketch after this list).

- as version 2: checkpoint buffer sorting based on a 2007 patch by Takahiro Itagaki, but with a smaller, static buffer allocated once. Also, sorting is done by chunks of 131072 pages in the current version, with a guc to change this value (second sketch below).

- as version 2: sync/advise calls are now merged if possible, so fewer calls are issued, especially when buffers are sorted, but also if few files are written (third sketch below).

- new: the checkpointer balances its page writes per tablespace. This is done by choosing to write pages for a tablespace whose progress ratio (written/to_write) lags behind the overall progress ratio for all tablespaces, and by doing that in a round-robin manner so that all tablespaces regularly get some attention. No threads (fourth sketch below).

- new: some more documentation is added.

- removed: "bgwriter_flush_to_write" is removed, as there was no clear benefit on the (simple) tests. It could be considered for another patch.

- question: I'm not sure I understand the checkpointer memory management. There is some exception handling in the checkpointer main loop. I wonder whether the allocated memory would be lost in such an event and should be reallocated. The patch currently assumes that the memory is kept.
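For illustration, the flush hint boils down to something like the sketch below. This is a simplified sketch, not the patch code: hint_flush_range is an illustrative name, and HAVE_SYNC_FILE_RANGE stands for the usual configure-time detection.

/*
 * Sketch of the flush hint: start kernel writeback of a file range
 * right now, without waiting for completion as fsync would.
 */
#define _GNU_SOURCE
#include <fcntl.h>

static int
hint_flush_range(int fd, off_t offset, off_t nbytes)
{
#if defined(HAVE_SYNC_FILE_RANGE)
    /* Linux: initiate write-out of the range, do not wait for it */
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#elif defined(POSIX_FADV_DONTNEED)
    /* fallback: advise that the range is not needed again soon,
     * which on many systems also triggers write-out of dirty pages */
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#else
    (void) fd; (void) offset; (void) nbytes;
    return 0;                   /* no hint available: do nothing */
#endif
}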
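The sorting item works along these lines (again a simplified sketch with illustrative type and function names, not the patch's):

/*
 * Order the pages to be written by tablespace, relation and block
 * number so that each file is written sequentially.  The array is
 * allocated once and sorted in chunks (131072 pages by default).
 */
#include <stdint.h>
#include <stdlib.h>

typedef struct CkptSortItem
{
    uint32_t    tablespace;     /* tablespace oid */
    uint32_t    relation;       /* relation file node */
    uint32_t    block;          /* block number within the relation */
    int         buf_id;         /* which shared buffer to write */
} CkptSortItem;

static int
ckpt_item_cmp(const void *pa, const void *pb)
{
    const CkptSortItem *a = pa;
    const CkptSortItem *b = pb;

    if (a->tablespace != b->tablespace)
        return a->tablespace < b->tablespace ? -1 : 1;
    if (a->relation != b->relation)
        return a->relation < b->relation ? -1 : 1;
    if (a->block != b->block)
        return a->block < b->block ? -1 : 1;
    return 0;
}

/* sort one chunk of the checkpoint write list */
static void
sort_checkpoint_chunk(CkptSortItem *items, size_t nitems)
{
    qsort(items, nitems, sizeof(CkptSortItem), ckpt_item_cmp);
}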
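Once pages are sorted, call merging falls out almost for free: contiguous blocks of the same file can be covered by a single call. A sketch, reusing the illustrative names above, and assuming BLCKSZ is the 8kB page size and that all items belong to the file behind fd:

#ifndef BLCKSZ
#define BLCKSZ 8192
#endif

static void
flush_file_run(int fd, const CkptSortItem *items, size_t nitems)
{
    size_t      i = 0;

    while (i < nitems)
    {
        size_t      j = i + 1;

        /* extend the range while block numbers are contiguous */
        while (j < nitems && items[j].block == items[j - 1].block + 1)
            j++;

        /* one call for the whole [i, j) range instead of j - i calls */
        (void) hint_flush_range(fd,
                                (off_t) items[i].block * BLCKSZ,
                                (off_t) (j - i) * BLCKSZ);
        i = j;
    }
}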
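Finally, a sketch of the balancing logic, with the gist of the quasi-proof: the overall ratio is a weighted average of the per-tablespace ratios, so as long as something remains to be written at least one unfinished tablespace is at or below it, and the moving cursor ensures all tablespaces regularly get some attention. Names are again illustrative:

typedef struct SpaceProgress
{
    int         written;        /* pages already written */
    int         to_write;       /* pages this tablespace must write */
} SpaceProgress;

/* return the next tablespace to write a page for, or -1 if all done */
static int
next_tablespace(SpaceProgress *ts, int nts, int *cursor,
                int total_written, int total_to_write)
{
    double      overall;

    if (total_written >= total_to_write)
        return -1;              /* checkpoint writes are all done */

    overall = (double) total_written / total_to_write;

    for (int i = 0; i < nts; i++)
    {
        int         c = (*cursor + i) % nts;

        /* pick the first unfinished tablespace not ahead of overall progress */
        if (ts[c].written < ts[c].to_write &&
            (double) ts[c].written / ts[c].to_write <= overall)
        {
            *cursor = (c + 1) % nts;
            return c;
        }
    }
    return -1;                  /* cannot happen while work remains */
}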
* PERFORMANCE TESTS

Impacts on "pgbench -M prepared -N -P 1 ..." (simple update test, mostly random write activity on one table), with checkpoint_completion_target=0.8 and different settings on a 16GB 8-core host:

. tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
. small: scale=120 shared_buffers=2GB checkpoint_timeout=300s time=4000s
. medium: scale=250 shared_buffers=4GB checkpoint_timeout=15min time=4000s
. large: scale=1000 shared_buffers=4GB checkpoint_timeout=40min time=7500s

Note: figures marked with a star (*) had various issues during their run, so the pgbench progress figures were more or less incorrect, and the standard deviation computation is not to be trusted beyond "pretty bad".

Caveat: these are only benchmarks on one host at a particular time and location, which may not be reproducible nor representative of any other load. The good news is that all these tests tell the same thing.

- full-speed 1-client

   options     |           tps performance over per second data
 flush | sort  |    tiny    |    small     |    medium    |    large
  off  |  off  | 687 +- 231 | 163 +- 280 * | 191 +- 626 * | 37.7 +- 25.6
  off  |  on   | 699 +- 223 | 457 +- 315   | 479 +- 319   | 48.4 +- 28.8
  on   |  off  | 740 +- 125 | 143 +- 387 * | 179 +- 501 * | 37.3 +- 13.3
  on   |  on   | 722 +- 119 | 550 +- 140   | 549 +- 180   | 47.2 +- 16.8

- full-speed 4-clients

   options     |    tps performance over per second data
 flush | sort  |    tiny     |     small     |    medium
  off  |  off  | 2006 +- 748 | 193 +- 1898 * | 205 +- 2465 *
  off  |  on   | 2086 +- 673 | 819 +- 905 *  | 807 +- 1029 *
  on   |  off  | 2212 +- 451 | 169 +- 1269 * | 160 +- 502 *
  on   |  on   | 2073 +- 437 | 743 +- 413    | 822 +- 467

- 100-tps 1-client max 100-ms latency

   options     | percent of late transactions
 flush | sort  | tiny | small | medium
  off  |  off  | 6.31 | 29.44 | 30.74
  off  |  on   | 6.23 |  8.93 |  7.12
  on   |  off  | 0.44 |  7.01 |  8.14
  on   |  on   | 0.59 |  0.83 |  1.84

- 200-tps 1-client max 100-ms latency

   options     | percent of late transactions
 flush | sort  | tiny  | small | medium
  off  |  off  | 10.00 | 50.61 | 45.51
  off  |  on   |  8.82 | 12.75 | 12.89
  on   |  off  |  0.59 | 40.48 | 42.64
  on   |  on   |  0.53 |  1.76 |  2.59

- 400-tps 1-client (or 4 clients for medium) max 100-ms latency

   options     | percent of late transactions
 flush | sort  | tiny | small | medium
  off  |  off  | 12.0 | 64.28 | 68.6
  off  |  on   | 11.3 | 22.05 | 22.6
  on   |  off  |  1.1 | 67.93 | 67.9
  on   |  on   |  0.6 |  3.24 |  3.1

* CONCLUSION

For most of these HDD tests, when both options are activated the tps throughput is improved (+3% to +300%), late transactions are reduced (by 91% to 97%), and overall the performance is more stable (tps standard deviation is typically halved).

The option effects are somewhat orthogonal:

- latency is essentially limited by flushing, although sorting also contributes.

- throughput is mostly improved thanks to sorting, with some occasional small positive or negative effect from flushing.

In detail, some loads may benefit more from only one option activated. In particular, flushing may have a small adverse effect on throughput in some conditions, although not always. With SSDs, both options would probably be of limited benefit.

-- 
Fabien.