checkpointer continuous flushing

Hello pg-devs,

This patch is a simplified and generalized version of Andres Freund's 
August 2014 patch for flushing while writing during checkpoints, with some 
documentation and configuration warnings added.

For the initial patch, see:
  http://www.postgresql.org/message-id/20140827091922.GD21544@awork2.anarazel.de

For the whole thread:
  http://www.postgresql.org/message-id/alpine.DEB.2.10.1408251900211.11151@sto

The objective is to help avoid PG stalling when fsyncing on checkpoints, 
and in general to get better latency-bound performance.

Flushes are issued along with pg's throttled writes, instead of being deferred
to the checkpointer's final "fsync", which induces occasional stalls. From
"pgbench -P 1 ...", such stalls look like this:
  progress: 35.0 s, 615.9 tps, lat 1.344 ms stddev 4.043     # ok
  progress: 36.0 s, 3.0 tps, lat 346.111 ms stddev 123.828   # stalled
  progress: 37.0 s, 4.0 tps, lat 252.462 ms stddev 29.346    # ...
  progress: 38.0 s, 161.0 tps, lat 6.968 ms stddev 32.964    # restart
  progress: 39.0 s, 701.0 tps, lat 1.421 ms stddev 3.326     # ok
 

I've seen similar behavior on FreeBSD with its native FS, so it is not a
Linux-specific or ext4-specific issue, even if both factors may contribute.

There are two implementations: the first, based on "sync_file_range", is
Linux-specific, while the other relies on "posix_fadvise". The tests below ran
on Linux. If someone could test the posix_fadvise version on relevant
platforms, that would be great...

The Linux-specific "sync_file_range" approach was suggested, among other ideas,
by Theodore Ts'o on Robert Haas's blog in March 2014:
  http://rhaas.blogspot.fr/2014/03/linuxs-fsync-woes-are-getting-some.html
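
To make the two variants concrete, here is a minimal sketch (mine, not the
patch's code; the function name and flag choices are illustrative) of the kind
of flush hint each implementation issues after writing a batch of dirty pages:

  #define _GNU_SOURCE             /* for sync_file_range() on Linux */
  #include <fcntl.h>

  /*
   * Illustrative only: ask the kernel to start writing back a file range
   * that was just written, without waiting for the I/O to complete.
   */
  static int
  hint_flush(int fd, off_t offset, off_t nbytes)
  {
  #if defined(__linux__)
      /* Linux-specific: initiate write-out of the range, do not wait. */
      return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
  #else
      /*
       * Portable variant: advise the kernel that the range will not be
       * needed soon; depending on the OS this typically pushes dirty pages
       * towards the disk, at the cost of also evicting them from the cache.
       */
      return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
  #endif
  }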

Two GUC variables, checkpoint_flush_to_disk and bgwriter_flush_to_disk, control
whether the feature is activated for writes of dirty pages issued by the
checkpointer and the bgwriter respectively. Given that the settings may improve
or degrade performance, having GUCs seems justified; in particular, the
stalling issue disappears with SSDs.
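
For illustration only (a sketch: the GUC names are those used in the tests
below, while defaults and the exact setting context are not asserted here),
enabling the feature would look like this in postgresql.conf:

  # postgresql.conf excerpt, illustrative
  checkpoint_flush_to_disk = on    # flush hints for checkpointer writes
  bgwriter_flush_to_disk = on      # flush hints for bgwriter writes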

The effect is significant on the series of tests shown below, which use a
scale-10 pgbench on an (old) dedicated host (8 GB memory, 8 cores, ext4 over hw
RAID), with shared_buffers=1GB, checkpoint_completion_target=0.8 and
checkpoint_timeout=30s, unless stated otherwise.

Note: I know that this checkpoint_timeout is too small for a normal config, but
the point is to test how checkpoints behave, so the test triggers as many
checkpoints as possible, hence the minimum timeout setting. I have also done
some tests with a larger timeout.


(1) THROTTLED PGBENCH

The objective of the patch is to reduce the latency of transactions under a
moderate load. This first series of tests focuses on that point with the help
of pgbench -R (rate) and -L (skip/count late transactions). The measure counts
transactions which were skipped or beyond the expected latency limit while
targeting a given transaction rate.

* "pgbench -M prepared -N -T 100 -P 1 -R 100 -L 100" (100 tps targeted during  100 seconds, and latency limit is 100
ms),over 256 runs, 7 hours per case:
 
  flush     | percent of skipped  cp  | bgw | & out of latency limit transactions  off | off | 6.5 %  off |  on | 6.1 %
 on | off | 0.4 %   on |  on | 0.4 %
 

* Same as above (100 tps target) over one run of 4000 seconds with
  shared_buffers=256MB and checkpoint_timeout=10mn:

   flush     | percent of skipped
   cp  | bgw | & out of latency limit transactions
   off | off | 1.3 %
   off |  on | 1.5 %
    on | off | 0.6 %
    on |  on | 0.6 %
 

* Same as the first one but with "-R 150", i.e. targeting 150 tps, 256 runs:

   flush     | percent of skipped
   cp  | bgw | & out of latency limit transactions
   off | off | 8.0 %
   off |  on | 8.0 %
    on | off | 0.4 %
    on |  on | 0.4 %
 

* Same as above (150 tps target) over one run of 4000 seconds with
  shared_buffers=256MB and checkpoint_timeout=10mn:

   flush     | percent of skipped
   cp  | bgw | & out of latency limit transactions
   off | off | 1.7 %
   off |  on | 1.9 %
    on | off | 0.7 %
    on |  on | 0.6 %
 

Turning "checkpoint_flush_to_disk = on" significantly reduces the number of
late transactions. These late transactions are not uniformly distributed;
rather, they are clustered around times when pg is stalled, i.e. more or less
unresponsive.

bgwriter_flush_to_disk does not seem to have a significant impact on these
tests, maybe because the shared_buffers size is much larger than the database,
so the bgwriter is seldom active.


(2) FULL SPEED PGBENCH

This is not the target use case, but it seems necessary to assess the impact
of these options on tps figures and their variability.

* "pgbench -M prepared -N -T 100 -P 1" over 512 runs, 14 hours per case.
       flush   | performance on ...
    cp  | bgw  | 512 100-seconds runs | 1s intervals (over 51200 seconds)
    off | off  | 691 +- 36 tps        | 691 +- 236 tps
    off |  on  | 677 +- 29 tps        | 677 +- 230 tps
     on | off  | 655 +- 23 tps        | 655 +- 130 tps
     on |  on  | 657 +- 22 tps        | 657 +- 130 tps
 

On this first test, setting checkpoint_flush_to_disk reduces performance by
about 5%, but the per-second standard deviation is nearly halved: performance
is lower, but more stable across runs. The effect of bgwriter_flush_to_disk is
inconclusive.

* "pgbench -M prepared -N -T 4000 -P 1" on only 1 (long) run, with  checkpoint_timeout=10mn and shared_buffers=256MB
(atleast 6 checkpoints  during the run, probably more because segments are filled more often than  every 10mn):
 
       flush   | performance ... (stddev over per second tps)     off | off | 877 +- 179 tps     off |  on | 880 +- 183
tps     on | off | 896 +- 131 tps      on |  on | 888 +- 132 tps
 

On this second, shorter test (a single 4000-second run), setting
checkpoint_flush_to_disk seems to slightly improve performance (maybe 2%?) and
significantly reduces variability, so it looks like a good move.

* "pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients)
       flush   | performance on ...
    cp  | bgw  | 32 100-seconds runs | 1s intervals (over 3200 seconds)
    off | off  | 1970 +- 60 tps      | 1970 +- 783 tps
    off |  on  | 1928 +- 61 tps      | 1928 +- 813 tps
     on | off  | 1578 +- 45 tps      | 1578 +- 631 tps
     on |  on  | 1594 +- 47 tps      | 1594 +- 618 tps
 

On this test, both the average and the standard deviation are reduced by about
20%. This does not look like a win.


CONCLUSION

This approach is simple and significantly improves pg's fsync behavior under
moderate load, where the database stays mostly responsive. Under full load,
the situation may be improved or degraded; it depends.


OTHER OPTIONS

Another idea suggested by Theodore Ts'o seems impractical: playing with the
Linux I/O-scheduler priority (ioprio_set) is only relevant with the "cfq"
scheduler on actual hard disks; it does not work with other schedulers,
especially "deadline", which seems more advisable for pg, nor with hardware
RAID, which is a common setting.

Also, Theodore Ts'o suggested using "sync_file_range" to check whether the
writes have reached the disk, and possibly delaying the final fsync at the end
of the checkpoint if they have not... I have not tried that: the implementation
is not as trivial, and I'm not sure what to do when the completion target
deadline approaches, but it could be an interesting option to investigate.
Preliminary tests with a sleep added between the writes and the final fsync
did not yield very good results.
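
For what it's worth, a hedged sketch (not implemented, purely illustrative) of
what that wait could look like on Linux; as far as I know sync_file_range has
no non-blocking "query" mode, so the sketch simply blocks until the data of a
previously written range has reached the disk, before the final fsync:

  #define _GNU_SOURCE
  #include <fcntl.h>

  /*
   * Illustrative only: write out any remaining dirty pages of the range and
   * wait for the write-out to complete (data only, no file metadata), as
   * documented in sync_file_range(2).
   */
  static int
  wait_range_written(int fd, off_t offset, off_t nbytes)
  {
      return sync_file_range(fd, offset, nbytes,
                             SYNC_FILE_RANGE_WAIT_BEFORE |
                             SYNC_FILE_RANGE_WRITE |
                             SYNC_FILE_RANGE_WAIT_AFTER);
  }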

I've also played with numerous other options (changing checkpointer 
throttling parameters, reducing checkpoint timeout to 1 second, playing 
around with various kernel settings), but that did not seem to be very 
effective for the problem at hand.


I have also attached the test script I used, which can be adapted if someone
wants to collect some performance data. I also have some basic scripts to
extract and compute stats; ask if needed.

-- 
Fabien.
