Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Andres Freund
Subject Re: checkpointer continuous flushing
Msg-id 20160111134516.imdpaeynxpfggdvx@alap3.anarazel.de
In response to Re: checkpointer continuous flushing  (Fabien COELHO <coelho@cri.ensmp.fr>)
List pgsql-hackers
On 2016-01-09 16:49:56 +0100, Fabien COELHO wrote:
> 
> Hello Andres,
> 
> >Hm. New theory: The current flush interface does the flushing inside
> >FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The
> >problem with that is that at that point we (need to) hold a content lock
> >on the buffer!
> 
> You are worrying that FlushBuffer is holding a lock on a buffer and that the
> "sync_file_range" call is issued at that moment.
> 
> Although I agree that it is not that good, I would be surprised if that were
> the explanation for a performance regression, because the sync_file_range
> call with the chosen parameters is asynchronous: it "advises" the OS to write
> out the file, but it does not wait for that to be completed.

I frequently see sync_file_range blocking - it waits until it can
submit the writes into the I/O queues. On a system bottlenecked on I/O
that's not always possible immediately.
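
For concreteness, here's a minimal standalone sketch of the call in question
(Linux-only, plain C, not PostgreSQL code; the command-line handling and
error reporting are just illustrative). SYNC_FILE_RANGE_WRITE only initiates
writeback and does not wait for the data to reach disk, but the call itself
can block while submitting the requests if the device's I/O queue is full:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int         fd;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    fd = open(argv[1], O_WRONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /*
     * Ask the kernel to start writing back the whole file (nbytes = 0 means
     * "from offset to end of file").  This does not wait for the data to
     * reach stable storage, but it may block while queueing the I/O.
     */
    if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
        perror("sync_file_range");

    close(fd);
    return 0;
}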

> Also, maybe you could answer a question I had about the performance
> regression you observed; I could not find the post where you gave the
> detailed information about it, so that I could try reproducing it: what are
> the exact settings and conditions (shared_buffers, pgbench scaling, host
> memory, ...), what is the observed regression (tps? other?), and what is the
> responsiveness of the database under the regression (e.g. % of seconds with 0
> tps, or something like that).

I measured it in a number of different cases, both on SSDs and spinning
rust. I just reproduced it with:

postgres-ckpt14 \
    -D /srv/temp/pgdev-dev-800/ \
    -c maintenance_work_mem=2GB \
    -c fsync=on \
    -c synchronous_commit=off \
    -c shared_buffers=2GB \
    -c wal_level=hot_standby \
    -c max_wal_senders=10 \
    -c max_wal_size=100GB \
    -c checkpoint_timeout=30s

Using a fresh cluster each time (copied from a "template" to save time)
and using
pgbench -M prepared -c 16 -j16 -T 300 -P 1
I get

My laptop (1x EVO 840, 1x i7-4800MQ, 16GB RAM):
master:
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1155733
latency average: 4.151 ms
latency stddev: 8.712 ms
tps = 3851.242965 (including connections establishing)
tps = 3851.725856 (excluding connections establishing)

ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 855156
latency average: 5.612 ms
latency stddev: 7.896 ms
tps = 2849.876327 (including connections establishing)
tps = 2849.912015 (excluding connections establishing)

My laptop (1x 850 PRO, 1x i7-4800MQ, 16GB RAM):
master:
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2104781
latency average: 2.280 ms
latency stddev: 9.868 ms
tps = 7010.397938 (including connections establishing)
tps = 7010.475848 (excluding connections establishing)

ckpt-14 (flushing by backends disabled):
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1930716
latency average: 2.484 ms
latency stddev: 7.303 ms
tps = 6434.785605 (including connections establishing)
tps = 6435.177773 (excluding connections establishing)

In neither case are there periods of 0 tps, but both have stretches of <
1000 tps with noticeably increased latency.


The end results are similar with a sane checkpoint timeout - the tests
just take much longer to give meaningful results. Constantly running
long tests on prosumer-level SSDs isn't nice - I've now killed 5 SSDs
with postgres testing...

As you can see, there's roughly a 30% performance regression on the
slower SSD and ~9% on the faster one. HDD results are similar (but I
can't repeat them on the laptop right now since the 2nd HDD is now an SSD).


My working copy of checkpoint sorting & flushing currently results in:
My laptop (1x EVO 840, 1x i7-4800MQ, 16GB RAM):
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 1136260
latency average: 4.223 ms
latency stddev: 8.298 ms
tps = 3786.696499 (including connections establishing)
tps = 3786.778875 (excluding connections establishing)

My laptop (1x 850 PRO, 1x i7-4800MQ, 16GB RAM):
transaction type: TPC-B (sort of)
scaling factor: 800
query mode: prepared
number of clients: 16
number of threads: 16
duration: 300 s
number of transactions actually processed: 2050661
latency average: 2.339 ms
latency stddev: 7.708 ms
tps = 6833.593170 (including connections establishing)
tps = 6833.680391 (excluding connections establishing)

My version of the patch currently addresses various points, which need
to be separated and benchmarked separately:
* Different approach to the background writer, trying to make backends write
  less. While that proves to be beneficial in isolation, on its own it doesn't
  address the performance regression.
* Different flushing API, done outside the lock (roughly along the lines
  sketched below)
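
To illustrate the second point: this is not the actual patch, just a rough
sketch of the general idea, with made-up type and function names. While the
buffer's content lock is held we only record which file range will need
flushing; the potentially blocking sync_file_range() calls are issued later,
once the lock has been released:

#define _GNU_SOURCE
#include <fcntl.h>

typedef struct PendingFlush
{
    int         fd;             /* file whose range needs writeback */
    off_t       offset;         /* start of the dirty range */
    off_t       nbytes;         /* length of the dirty range */
} PendingFlush;

#define MAX_PENDING_FLUSHES 32

static PendingFlush pending[MAX_PENDING_FLUSHES];
static int  npending = 0;

/* Called while the content lock is still held: only remember the range. */
static void
schedule_flush(int fd, off_t offset, off_t nbytes)
{
    if (npending < MAX_PENDING_FLUSHES)
    {
        pending[npending].fd = fd;
        pending[npending].offset = offset;
        pending[npending].nbytes = nbytes;
        npending++;
    }
}

/* Called after the lock has been released: issue the writeback hints. */
static void
issue_pending_flushes(void)
{
    int         i;

    for (i = 0; i < npending; i++)
        (void) sync_file_range(pending[i].fd, pending[i].offset,
                               pending[i].nbytes, SYNC_FILE_RANGE_WRITE);
    npending = 0;
}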

So this partially addresses the performance problems, but not yet
completely.

Greetings,

Andres Freund


