Re: checkpointer continuous flushing

From: Fabien COELHO
Subject: Re: checkpointer continuous flushing
In response to: Re: checkpointer continuous flushing (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers

Hello Andres,

> They pretty much can't if you flush things frequently. That's why I
> think this won't be acceptable without the sorting in the checkpointer.


* VERSION 2 "WORK IN PROGRESS"

The implementation is more a proof of concept for getting feedback than
clean code. What it does:
 - as version 1: simplified asynchronous flush based on Andres Freund's
   patch, with sync_file_range/posix_fadvise used to hint the OS that
   the buffer must be sent to disk "now" (see the flush sketch after
   this list).
 
 - added: checkpoint buffer sorting based on a 2007 patch by Takahiro
   Itagaki, but with a smaller and static buffer allocated once. Also,
   sorting is done by chunks in the current version (see the comparator
   sketch after this list).
 
 - also added: sync/advise calls are now merged if possible, so fewer
   calls are used, especially when buffers are sorted, but also if there
   are few files (the flush sketch below merges contiguous ranges this
   way).
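
To illustrate the flush-and-merge part, here is a minimal sketch, not the
actual patch code: it accumulates contiguous blocks of the same file and
pushes them to the kernel in one call. The FlushRange struct, the
function names and the HAVE_* guard macros are illustrative assumptions;
sync_file_range() is Linux-specific, and posix_fadvise() serves as a
weaker portable hint elsewhere.

#define _GNU_SOURCE
#include <fcntl.h>

typedef struct FlushRange
{
    int     fd;         /* file currently accumulated, -1 if none */
    off_t   start;      /* byte offset of the first pending block */
    off_t   len;        /* bytes accumulated so far */
} FlushRange;

/* Push the pending range to the kernel and reset the accumulator. */
static void
flush_pending(FlushRange *r)
{
    if (r->fd < 0 || r->len == 0)
        return;
#if defined(HAVE_SYNC_FILE_RANGE)
    /* start writeback of these pages now, without waiting for it */
    (void) sync_file_range(r->fd, r->start, r->len, SYNC_FILE_RANGE_WRITE);
#elif defined(HAVE_POSIX_FADVISE)
    /* weaker hint: tell the kernel we are done with these pages */
    (void) posix_fadvise(r->fd, r->start, r->len, POSIX_FADV_DONTNEED);
#endif
    r->fd = -1;
    r->len = 0;
}

/* After each buffer write: extend the pending range if the new block is
 * contiguous in the same file, otherwise flush and start a new range. */
static void
record_write(FlushRange *r, int fd, off_t offset, off_t blocksz)
{
    if (r->fd == fd && offset == r->start + r->len)
        r->len += blocksz;      /* contiguous: merge into one call */
    else
    {
        flush_pending(r);
        r->fd = fd;
        r->start = offset;
        r->len = blocksz;
    }
}

For the sorting part, the comparator is essentially this kind of thing,
assuming PostgreSQL's RelFileNode, ForkNumber and BlockNumber types from
the backend headers; the struct and the suggested qsort call are made up
for the example.

/* one entry per dirty buffer to be written by the checkpoint */
typedef struct BufferIdAndTag
{
    int         buf_id;
    RelFileNode rnode;          /* tablespace / database / relation */
    ForkNumber  forkNum;
    BlockNumber blockNum;
} BufferIdAndTag;

/* order buffers by file, then by block, so writes hit each file
 * sequentially */
static int
cmp_buffer_tags(const void *a, const void *b)
{
    const BufferIdAndTag *x = a;
    const BufferIdAndTag *y = b;

    if (x->rnode.spcNode != y->rnode.spcNode)
        return x->rnode.spcNode < y->rnode.spcNode ? -1 : 1;
    if (x->rnode.dbNode != y->rnode.dbNode)
        return x->rnode.dbNode < y->rnode.dbNode ? -1 : 1;
    if (x->rnode.relNode != y->rnode.relNode)
        return x->rnode.relNode < y->rnode.relNode ? -1 : 1;
    if (x->forkNum != y->forkNum)
        return x->forkNum < y->forkNum ? -1 : 1;
    if (x->blockNum != y->blockNum)
        return x->blockNum < y->blockNum ? -1 : 1;
    return 0;
}

/* usage: qsort(to_write, n_to_write, sizeof(BufferIdAndTag),
 *              cmp_buffer_tags); */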
 


* PERFORMANCE TESTS

Impacts on "pgbench -M prepared -N -P 1" scale 10  (simple update pgbench
with a mostly-write activity),  with checkpoint_completion_target=0.8
and shared_buffers=1GB.

Contrary to v1, I have not tested bgwriter flushing, as its impact in
the first round was close to nought. This does not mean that particular
loads may not benefit from, or be harmed by, flushing from the bgwriter.

- 100 tps throttled, max 100 ms latency, over 6400 seconds,
  with checkpoint_timeout=30s

  flush | sort | late transactions
    off |  off | 6.0 %
    off |   on | 6.1 %
     on |  off | 0.4 %
     on |   on | 0.4 % (93% improvement)

- 100 tps throttled, max 100 ms latency, over 4000 seconds,
  with checkpoint_timeout=10min

  flush | sort | late transactions
    off |  off | 1.5 %
    off |   on | 0.6 % (?!)
     on |  off | 0.8 %
     on |   on | 0.6 % (60% improvement)

- 150 tps throttled, max 100 ms latency, over 19600 seconds (5.5 hours),
  with checkpoint_timeout=30s

  flush | sort | late transactions
    off |  off | 8.5 %
    off |   on | 8.1 %
     on |  off | 0.5 %
     on |   on | 0.4 % (95% improvement)

- full speed pgbench over 6400 seconds with checkpoint_timeout=30s

  flush | sort | tps performance over per-second data
    off |  off |  676 +- 230
    off |   on |  683 +- 213
     on |  off |  712 +- 130
     on |   on |  725 +- 116 (7.2% avg / 50% stddev improvements)

- full speed pgbench over 4000 seconds with checkpoint_timeout=10min

  flush | sort | tps performance over per-second data
    off |  off |  885 +- 188
    off |   on |  940 +- 120 (6%/36%!)
     on |  off |  778 +- 245 (hmmm... not very consistent?)
     on |   on |  927 +- 108 (4.5% avg / 43% stddev improvements)

- full speed bgbench "-j2 -c4" over 6400 seconds with checkpoint_timeout=30s
  flush | sort | tps performance over per second data    off |  off | 2012 +- 747    off |   on | 2086 +- 708     on |
off| 2099 +- 459     on |   on | 2114 +- 422 (5% avg/44% stddev improvements)
 


* CONCLUSION

For all these HDD tests, when both options are activated the tps performance
is improved, the latency is reduced and the performance is more stable
(smaller standard deviation).

Overall the option effects are, not surprisingly, quite orthogonal
(with exceptions):
 - latency is essentially improved (60 to 95% reduction) by flushing;
 - throughput is improved (4 to 7% better) thanks to sorting.

In detail, some loads may benefit more with only one of the two options
activated. Also, on SSDs both options would probably have limited benefit.

Usual caveat: these are only benchmarks run on one host at a particular
time and location, which may or may not be reproducible, and which may
not be representative of any other load. The good news is that all these
tests tell the same story.


* LOOKING FOR THOUGHTS

- The bgwriter flushing option seems ineffective; should it be removed
  from the patch?

- Move fsync as early as possible, as suggested by Andres Freund?

In these tests, when the flush option is activated, the fsync durations
at the end of the checkpoint are small: out of more than 5525 checkpoint
fsyncs, 0.5% took over 1 second when flush was on, but the figure rises
to 24% when it is off. This suggests that doing the fsync as soon as
possible would probably have no significant effect on these tests.

My opinion is that this should be left out for the nonce.


- Take into account tablespaces, as pointed out by Andres Freund?

The issue is that if writes are sorted, they are no longer distributed
randomly over tablespaces, which induces lower performance on such
multi-tablespace systems.

How to do it: while scanning shared_buffers, count dirty buffers for each
tablespace. Then start as many threads as there are tablespaces, each one
doing its own independent throttling for its tablespace? For some obscure
reason there are 2 tablespaces by default (pg_global and pg_default), so
that would mean at least 2 threads.

Alternatively, maybe it can be done from one thread, but it would probably
involve some strange hocus-pocus to switch frequently between tablespaces;
a rough sketch of that balancing idea is shown below.
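
As an illustration of the single-thread alternative, here is a sketch of
per-tablespace balancing; all names are made up, and the per-tablespace
counts would come from the shared_buffers scan described above. Each
round, the checkpointer writes the next buffer of the tablespace that is
least advanced relative to its own total, so that all tablespaces finish
at about the same time.

typedef struct TsWriteState
{
    int     num_to_write;   /* dirty buffers belonging to this tablespace */
    int     num_written;    /* buffers written so far */
    int     next_index;     /* position in the sorted per-tablespace list */
} TsWriteState;

/* Pick the tablespace with the smallest completed fraction. */
static int
pick_next_tablespace(TsWriteState *ts, int nts)
{
    int     best = -1;
    double  best_progress = 2.0;    /* above any possible fraction */
    int     i;

    for (i = 0; i < nts; i++)
    {
        double  progress;

        if (ts[i].num_written >= ts[i].num_to_write)
            continue;               /* this tablespace is done */

        progress = (double) ts[i].num_written / ts[i].num_to_write;
        if (progress < best_progress)
        {
            best_progress = progress;
            best = i;
        }
    }
    return best;                    /* -1 when every tablespace is done */
}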

-- 
Fabien.
