Thread: PGSQL, checkpoints, and file system syncs

PGSQL, checkpoints, and file system syncs

From: Reza Taheri
Date:

Hello PGSQL performance community,

You might remember that I pinged you in July 2012 to introduce the TPC-V benchmark, and to ask for a feature similar to clustered indexes. I am now back with more data, and a question about checkpoints. As for the plans for the benchmark, we are hoping to release a benchmarking kit for multi-VM servers this year (and of course one can always simply configure it to run on one database).

 

Anyway, I am facing a situation that affects performance when checkpoints end. This becomes a big problem when many VMs share the same underlying storage, but I have reduced it to a single VM/single database here to make it easier to discuss.

 

Complete config info is in the attached files. Briefly, it is a 6-vCPU VM with 91G of memory, and 70GB in PGSQL shared buffers. The host has 512GB of memory and 4 sockets of Westmere (E7-4870) processors with HT enabled.

 

The data tablespace is on an ext4 file system on a (virtual) disk which is striped on 16 SSD drives in RAID 0. This is obviously overkill for the load we are putting on this VM, but in the usual config, the 16 SSDs are shared by 24 VMs. Log is on an ext3 file system on 4 spinning drives in RAID 1.

 

We are running PGSQL version 9.2 on RHEL 6.4, and here are some parameters of interest (postgresql.conf in the attachment):

checkpoint_segments = 1200
checkpoint_timeout = 360s
checkpoint_completion_target = 0.8
wal_sync_method = open_datasync
wal_buffers = 16MB
wal_writer_delay = 10ms
effective_io_concurrency = 10
effective_cache_size = 1024MB

 

When running tests, I noticed that when a checkpoint completes, we have a big burst of writes to the data disk. The log disk has a very steady write rate that is not affected by checkpoints except for the known phenomenon of more bytes in each log write when a new checkpoint period starts. In a multi-VM config with all VMs sharing the same data disks, when these write bursts happen, all VMs take a hit.

 

So I set out to see what causes this write burst.  After playing around with PGSQL parameters and observing its behavior, it appears that the bursts aren’t produced by the database engine; they are produced by the file system. I suspect PGSQL has to issue a sync(2)/fsync(2)/sync_file_range(2) system call at the completion of the checkpoint to ensure that all blocks are flushed to disk before creating a checkpoint marker. To test this, I ran a loop to call sync(8) once a second.
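
For illustration, a minimal sketch of such a loop (the exact invocation I used may have differed slightly):

while true; do sync; sleep 1; done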

 

The PDF labeled “Chart 280” has the throughput, data disk activity, and checkpoint start/completion timestamps for the baseline case. You can see that the checkpoint completion, the write burst, and the throughput dip all occur at the same time, so much so that it is hard to see the checkpoint completion line under the graph of writes. It looks like the file system does a mini flush every 30 seconds. The PDF labeled “Chart 274” is the case with sync commands running in the background. You can see that everything is much smoother.

 

Is there something I can set in the PGSQL parameters or in the file system parameters to force a steady flow of writes to disk rather than waiting for a sync system call? Mounting with “commit=1” did not make a difference.
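
For reference, that was a remount along these lines, with /data standing in for the actual data file system mount point:

mount -o remount,commit=1 /data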

 

Thanks,

Reza

 


Re: PGSQL, checkpoints, and file system syncs

From: Shaun Thomas
Date:
> Is there something I can set in the PGSQL parameters or in the file system
> parameters to force a steady flow of writes to disk rather than waiting for
> a sync system call? Mounting with "commit=1" did not make a difference.

The PostgreSQL devs actually had a long talk with the Linux kernel devs over this exact issue. While we wait
for the results of that to bear some fruit, I'd recommend using the dirty_background_bytes and dirty_bytes settings both
on the VM side and on the host server. To avoid excessive flushes, you want to avoid having more dirty memory than the
system can handle in one gulp.

The dirty_bytes setting will begin flushing disks synchronously when the amount of dirty memory reaches that amount,
while dirty_background_bytes will flush in the background when the amount of dirty memory hits the specified limit. It's
the background flushing that will prevent your current problems, and it should be set at the same level as the amount of
write cache your system has available.

So if you are on a 1GB RAID card, set it to 1GB. Once you have 1GB of dirty memory (from a checkpoint or whatever),
Linux will begin flushing.
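
For illustration only, assuming a 1GB controller write cache (adjust to what your hardware actually has), the settings might look like this in /etc/sysctl.conf:

# begin background writeback once 1GB of memory is dirty
vm.dirty_background_bytes = 1073741824
# force synchronous flushing once 2GB of memory is dirty
vm.dirty_bytes = 2147483648

Apply with "sysctl -p", or echo the values into /proc/sys/vm/ to test them without making them permanent.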

This is a pretty well-known issue on Linux systems with large amounts of RAM. Most VM servers fit that profile, so I'm
not surprised it's hurting you.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 400 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: PGSQL, checkpoints, and file system syncs

From: Bruce Momjian
Date:
On Tue, Apr 8, 2014 at 02:00:05PM +0000, Shaun Thomas wrote:
> So if you are on a 1GB RAID card, set it to 1GB. Once you have 1GB
> of dirty memory (from a checkpoint or whatever), Linux will begin
> flushing.
>
> This is a pretty well-known issue on Linux systems with large amounts
> of RAM. Most VM servers fit that profile, so I'm not surprised it's
> hurting you.

Agreed.  The Linux kernel dirty defaults are too high for systems with
large amounts of memory.  See sysctl -a | grep dirty for a list.
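
For example, on a stock RHEL 6 kernel the relevant entries and their usual
defaults look roughly like this (your values may differ):

$ sysctl -a | grep dirty
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500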

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +