checkpointer continuous flushing - Mailing list pgsql-hackers
From:     Fabien COELHO
Subject:  checkpointer continuous flushing
Date:
Msg-id:   alpine.DEB.2.10.1506011320000.28433@sto
List:     pgsql-hackers
Hello pg-devs,

This patch is a simplified and generalized version of Andres Freund's August 2014 patch for flushing while writing during checkpoints, with some documentation and configuration warnings added.

For the initial patch, see:
http://www.postgresql.org/message-id/20140827091922.GD21544@awork2.anarazel.de

For the whole thread:
http://www.postgresql.org/message-id/alpine.DEB.2.10.1408251900211.11151@sto

The objective is to help avoid PG stalling when fsyncing on checkpoints, and in general to get better latency-bound performance. Flushes are managed with pg's throttled writes instead of waiting for the checkpointer's final "fsync", which induces occasional stalls. From "pgbench -P 1 ...", such stalls look like this:

  progress: 35.0 s, 615.9 tps, lat 1.344 ms stddev 4.043     # ok
  progress: 36.0 s,   3.0 tps, lat 346.111 ms stddev 123.828 # stalled
  progress: 37.0 s,   4.0 tps, lat 252.462 ms stddev 29.346  # ...
  progress: 38.0 s, 161.0 tps, lat 6.968 ms stddev 32.964    # restart
  progress: 39.0 s, 701.0 tps, lat 1.421 ms stddev 3.326     # ok

I've seen similar behavior on FreeBSD with its native FS, so it is not a Linux-specific or ext4-specific issue, even if both factors may contribute.

There are two implementations: the first, based on "sync_file_range", is Linux-specific, while the other relies on "posix_fadvise". The tests below ran on Linux. If someone could test the posix_fadvise version on relevant platforms, that would be great...

The Linux-specific "sync_file_range" approach was suggested, among other ideas, by Theodore Ts'o on Robert Haas's blog in March 2014:
http://rhaas.blogspot.fr/2014/03/linuxs-fsync-woes-are-getting-some.html

Two GUC variables control whether the feature is activated for writes of dirty pages issued by the checkpointer and the bgwriter. Given that the settings may improve or degrade performance, having GUCs seems justified. In particular, the stalling issue disappears with SSDs.
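As a rough illustration of what such a flush hint can look like at the system-call level (a hedged sketch only, not the patch's actual code: the helper name, the batching and the #if guards are invented for this example):

  /*
   * Illustrative sketch, not the patch's code: after write()-ing dirty
   * buffers during a checkpoint, ask the kernel to start writing them
   * back immediately, so that the final fsync() has little left to do.
   */
  #define _GNU_SOURCE             /* sync_file_range() is Linux-only */
  #include <fcntl.h>

  static void
  hint_flush_range(int fd, off_t offset, off_t nbytes)
  {
  #if defined(__linux__)
      /* Start asynchronous writeback of just this file range. */
      (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
  #else
      /*
       * Portable fallback: POSIX_FADV_DONTNEED also triggers writeback,
       * at the cost of evicting the pages from the OS cache.
       */
      (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
  #endif
  }

In the patch, whether such hints are issued at all is decided by the two settings tested below, checkpoint_flush_to_disk and bgwriter_flush_to_disk.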
The effect is significant on the series of tests shown below, run with a scale-10 pgbench on an (old) dedicated host (8 GB memory, 8 cores, ext4 over hw RAID), with shared_buffers=1GB, checkpoint_completion_target=0.8 and checkpoint_timeout=30s, unless stated otherwise.

Note: I know that this checkpoint_timeout is too small for a normal configuration, but the point is to test how checkpoints behave, so the tests trigger as many checkpoints as possible, hence the minimal timeout setting. I have also done some tests with a larger timeout.

(1) THROTTLED PGBENCH

The objective of the patch is to reduce the latency of transactions under a moderate load. This first series of tests focuses on that point with the help of pgbench -R (rate) and -L (skip/count late transactions). The measure counts transactions which were skipped or beyond the expected latency limit while targeting a given transaction rate.

* "pgbench -M prepared -N -T 100 -P 1 -R 100 -L 100" (100 tps targeted during 100 seconds, latency limit 100 ms), over 256 runs, 7 hours per case:

       flush    | percent of skipped
     cp  | bgw  | & out-of-latency-limit transactions
     off | off  | 6.5 %
     off | on   | 6.1 %
     on  | off  | 0.4 %
     on  | on   | 0.4 %

* Same as above (100 tps target), over one run of 4000 seconds with shared_buffers=256MB and checkpoint_timeout=10min:

       flush    | percent of skipped
     cp  | bgw  | & out-of-latency-limit transactions
     off | off  | 1.3 %
     off | on   | 1.5 %
     on  | off  | 0.6 %
     on  | on   | 0.6 %

* Same as the first one but with "-R 150", i.e. targeting 150 tps, 256 runs:

       flush    | percent of skipped
     cp  | bgw  | & out-of-latency-limit transactions
     off | off  | 8.0 %
     off | on   | 8.0 %
     on  | off  | 0.4 %
     on  | on   | 0.4 %

* Same as above (150 tps target), over one run of 4000 seconds with shared_buffers=256MB and checkpoint_timeout=10min:

       flush    | percent of skipped
     cp  | bgw  | & out-of-latency-limit transactions
     off | off  | 1.7 %
     off | on   | 1.9 %
     on  | off  | 0.7 %
     on  | on   | 0.6 %

Turning "checkpoint_flush_to_disk = on" significantly reduces the number of late transactions. These late transactions are not uniformly distributed, but are rather clustered around times when pg is stalled, i.e. more or less unresponsive. bgwriter_flush_to_disk does not seem to have a significant impact on these tests, maybe because shared_buffers is much larger than the database, so the bgwriter is seldom active.

(2) FULL SPEED PGBENCH

This is not the target use case, but it seems necessary to assess the impact of these options on tps figures and their variability.

* "pgbench -M prepared -N -T 100 -P 1", over 512 runs, 14 hours per case:

       flush    | performance on ...
     cp  | bgw  | 512 100-second runs | 1s intervals (over 51200 seconds)
     off | off  | 691 +- 36 tps       | 691 +- 236 tps
     off | on   | 677 +- 29 tps       | 677 +- 230 tps
     on  | off  | 655 +- 23 tps       | 655 +- 130 tps
     on  | on   | 657 +- 22 tps       | 657 +- 130 tps

On this first test, setting checkpoint_flush_to_disk reduces performance by about 5%, but the per-second standard deviation is nearly halved, that is, the performance is more stable over the runs, although lower. The effect of bgwriter_flush_to_disk is inconclusive.

* "pgbench -M prepared -N -T 4000 -P 1" on only one (long) run, with checkpoint_timeout=10min and shared_buffers=256MB (at least 6 checkpoints during the run, probably more because segments are filled more often than every 10 minutes):

       flush    | performance
     cp  | bgw  | (average +- stddev of per-second tps)
     off | off  | 877 +- 179 tps
     off | on   | 880 +- 183 tps
     on  | off  | 896 +- 131 tps
     on  | on   | 888 +- 132 tps

On this second test, setting checkpoint_flush_to_disk seems to slightly improve performance (maybe 2%?) and significantly reduces variability, so it looks like a good move.

* "pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1", over 32 runs (4 clients):

       flush    | performance on ...
     cp  | bgw  | 32 100-second runs | 1s intervals (over 3200 seconds)
     off | off  | 1970 +- 60 tps     | 1970 +- 783 tps
     off | on   | 1928 +- 61 tps     | 1928 +- 813 tps
     on  | off  | 1578 +- 45 tps     | 1578 +- 631 tps
     on  | on   | 1594 +- 47 tps     | 1594 +- 618 tps

On this test, both the average and the standard deviation are reduced by about 20%. This does not look like a win.

CONCLUSION

This approach is simple and significantly improves pg's fsync behavior under moderate load, where the database stays mostly responsive. Under full load, the situation may be improved or degraded, it depends.

OTHER OPTIONS

Another idea suggested by Theodore Ts'o seems impractical: playing with the Linux io-scheduler priority (ioprio_set) is only relevant with the "cfq" scheduler on actual hard disks, but it does not work with other schedulers, especially "deadline", which seems more advisable for pg, nor with hardware RAID, which is a common setting.

Also, Theodore Ts'o suggested using "sync_file_range" to check whether the writes have reached the disk, and possibly to delay the actual fsync/checkpoint conclusion if they have not... I have not tried that: the implementation is not as trivial, and I'm not sure what to do when the completion target is coming, but it could be an interesting option to investigate.
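As an illustration of one possible reading of that idea (a rough sketch only, nothing implemented or measured; the helper name, chunk size and pacing are invented here): ranges whose writeback was started earlier with SYNC_FILE_RANGE_WRITE could be waited on in small chunks before the final fsync, so that the fsync itself finds almost nothing left to flush.

  /*
   * Sketch only: wait, in small chunks, for writeback that was started
   * earlier on this file, before issuing the final fsync().
   */
  #define _GNU_SOURCE
  #include <fcntl.h>

  static void
  wait_for_writeback(int fd, off_t len)
  {
      const off_t chunk = 1024 * 1024;    /* 1 MB at a time, arbitrary */
      off_t       offset;

      for (offset = 0; offset < len; offset += chunk)
      {
          /* Block until previously submitted writeback of this chunk is done. */
          (void) sync_file_range(fd, offset, chunk,
                                 SYNC_FILE_RANGE_WAIT_BEFORE);
          /* A small pause here could spread the waiting over time. */
      }

      /* The final fsync(fd) should then mostly have metadata left to do. */
  }

Note that SYNC_FILE_RANGE_WAIT_BEFORE only waits for writeback that has already been submitted, so this only helps for ranges previously pushed out with SYNC_FILE_RANGE_WRITE.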
Preliminary tests that added a sleep between the writes and the final fsync did not yield very good results. I've also played with numerous other options (changing checkpointer throttling parameters, reducing the checkpoint timeout to 1 second, playing around with various kernel settings), but none of that seemed very effective for the problem at hand.

I have also attached the test script I used, which can be adapted if someone wants to collect some performance data. I also have some basic scripts to extract and compute stats; ask if needed.

--
Fabien.