Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Fabien COELHO
Subject Re: checkpointer continuous flushing
Date
Msg-id alpine.DEB.2.10.1506170803210.9794@sto
In response to Re: checkpointer continuous flushing  (Fabien COELHO <coelho@cri.ensmp.fr>)
List pgsql-hackers

Hello,

Here is version 3, including many performance tests with various settings, 
representing about 100 hours of pgbench runs. This patch aims at improving 
checkpoint I/O behavior so that tps throughput is improved, late 
transactions are less frequent, and overall performance is more stable.


* SOLILOQUIZING

> - The bgwriter flushing option seems ineffective, it could be removed
>  from the patch?

I did that.

> - Move fsync as early as possible, suggested by Andres Freund?
>
> My opinion is that this should be left out for the nonce.

I did that.

> - Take into account tablespaces, as pointed out by Andres Freund?
>
> Alternatively, maybe it can be done from one thread, but it would probably 
> involve some strange hocus-pocus to switch frequently between tablespaces.

I did the hocus-pocus approach, including a quasi-proof (not sure what this 
mathematical object is called :-) in comments to show how/why it works.


* PATCH CONTENTS
 - as version 1: simplified asynchronous flush based on Andres Freund's
   patch, with sync_file_range/posix_fadvise used to hint the OS that
   the buffer must be sent to disk "now".

 - as version 2: checkpoint buffer sorting based on a 2007 patch by
   Takahiro Itagaki, but with a smaller and static buffer allocated once.
   Also, sorting is done by chunks of 131072 pages in the current version,
   with a guc to change this value.

 - as version 2: sync/advise calls are now merged if possible, so fewer
   calls will be used, especially when buffers are sorted, but also if
   there are few files written.

 - new: the checkpointer balances its page writes per tablespace. This is
   done by choosing to write pages for a tablespace for which the progress
   ratio (written/to_write) is behind the overall progress ratio for all
   tablespaces, and by doing that in a round-robin manner so that all
   tablespaces regularly get some attention. No threads.

 - new: some more documentation is added.

 - removed: "bgwriter_flush_to_write" is removed, as there was no clear
   benefit on the (simple) tests. It could be considered for another patch.

 - question: I'm not sure I understand the checkpointer memory management.
   There is some exception handling in the checkpointer main. I wonder
   whether the allocated memory would be lost in such an event and should
   be reallocated. The patch currently assumes that the memory is kept.
 


* PERFORMANCE TESTS

Impacts on "pgbench -M prepared -N -P 1 ..." (simple update test, mostly
random write activity on one table), checkpoint_completion_target=0.8, with
different settings on a 16GB 8-core host:
 . tiny:   scale=10   shared_buffers=1GB checkpoint_timeout=30s   time=6400s
 . small:  scale=120  shared_buffers=2GB checkpoint_timeout=300s  time=4000s
 . medium: scale=250  shared_buffers=4GB checkpoint_timeout=15min time=4000s
 . large:  scale=1000 shared_buffers=4GB checkpoint_timeout=40min time=7500s
 

Note: figures noted with a star (*) had various issues during their run, so
pgbench progress figures were more or less incorrect, thus the standard
deviation computation is not to be trusted beyond "pretty bad".

Caveat: these are only benches on one host at a particular time and
location, which may or may not be reproducible nor be representative
as such of any other load.  The good news is that all these tests tell
the same thing.

- full-speed 1-client
      options   |          tps performance over per second data
  flush | sort |    tiny    |    small     |   medium     |    large
    off |  off | 687 +- 231 | 163 +- 280 * | 191 +- 626 * | 37.7 +- 25.6
    off |   on | 699 +- 223 | 457 +- 315   | 479 +- 319   | 48.4 +- 28.8
     on |  off | 740 +- 125 | 143 +- 387 * | 179 +- 501 * | 37.3 +- 13.3
     on |   on | 722 +- 119 | 550 +- 140   | 549 +- 180   | 47.2 +- 16.8
 

- full-speed 4-clients

      options  |     tps performance over per second data
  flush | sort |    tiny     |     small     |    medium
    off |  off | 2006 +- 748 | 193 +- 1898 * | 205 +- 2465 *
    off |   on | 2086 +- 673 | 819 +-  905 * | 807 +- 1029 *
     on |  off | 2212 +- 451 | 169 +- 1269 * | 160 +-  502 *
     on |   on | 2073 +- 437 | 743 +-  413   | 822 +-  467
 

- 100-tps 1-client max 100-ms latency
      options   | percent of late transactions
  flush | sort |  tiny | small | medium
    off |  off |  6.31 | 29.44 | 30.74
    off |   on |  6.23 |  8.93 |  7.12
     on |  off |  0.44 |  7.01 |  8.14
     on |   on |  0.59 |  0.83 |  1.84
 

- 200-tps 1-client max 100-ms latency
      options   | percent of late transactions
  flush | sort |  tiny | small | medium
    off |  off | 10.00 | 50.61 | 45.51
    off |   on |  8.82 | 12.75 | 12.89
     on |  off |  0.59 | 40.48 | 42.64
     on |   on |  0.53 |  1.76 |  2.59
 

- 400-tps 1-client (or 4 for medium) max 100-ms latency
      options   | percent of late transactions
  flush | sort | tiny | small | medium
    off |  off | 12.0 | 64.28 | 68.6
    off |   on | 11.3 | 22.05 | 22.6
     on |  off |  1.1 | 67.93 | 67.9
     on |   on |  0.6 |  3.24 |  3.1
 


* CONCLUSION:

For most of these HDD tests, when both options are activated the tps 
throughput is improved (+3 to +300%), late transactions are reduced (by 
91% to 97%) and overall the performance is more stable (tps standard 
deviation is typically halved).

The option effects are somewhat orthogonal:
 - latency is essentially limited by flushing, although sorting also
   contributes.
 - throughput is mostly improved thanks to sorting, with some occasional
   small positive or negative effect from flushing.

In detail, some loads may benefit more from only one option activated. In 
particular, flushing may have a small adverse effect on throughput in some 
conditions, although not always. With SSDs, both options would probably 
have limited benefit.

-- 
Fabien.
