Can PostgreSQL create new WAL files instead of reusing old ones? - Mailing list pgsql-general

From David Pacheco
Subject Can PostgreSQL create new WAL files instead of reusing old ones?
Msg-id CACukRjO7DJvub8e2AijOayj8BfKK3XXBTwu3KKARiTr67M3E3w@mail.gmail.com
tl;dr: We've found that under many conditions, PostgreSQL's re-use of old WAL
files appears to significantly degrade query latency on ZFS.  The reason is
involved; details are below.  Has it been considered to make this behavior
tunable, so that PostgreSQL always creates new WAL files instead of re-using
old ones?

Context: we're running a large fleet of PostgreSQL shards.  Each shard consists
of a primary, a synchronous standby, and an asynchronous standby using chained
replication.  For this problem, we can consider only the primary and
synchronous standby.

PostgreSQL: 9.6.3
OS: illumos (SmartOS, mixed versions, but all from 2017 or later)
FS: ZFS over mirrored HDDs (not SSDs), using an 8K recordsize to match
PostgreSQL's 8K page size.  We have an SSD log device so that synchronous
writes complete very quickly.
WAL files are 16MB each, and we're keeping a very large number of segments.
(There's likely a lot of improvement for WAL tuning here.)
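For concreteness, a sketch of the setup described above — the pool, dataset,
and device names here are hypothetical, not our actual configuration:

```shell
# Hypothetical pool: HDD mirror for data, SSD as a separate ZFS intent log
# (SLOG) so synchronous writes are acknowledged from the SSD.
zpool create tank mirror c0t0d0 c0t1d0 log c0t2d0

# Match the ZFS recordsize to PostgreSQL's 8K page size on the data dataset.
zfs create -o recordsize=8k tank/pgdata
zfs get recordsize tank/pgdata   # confirm: 8K
```

The 8K recordsize helps the main data files, but as described next, it
interacts badly with how PostgreSQL rewrites recycled WAL segments.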

Since we're using an 8K record size, when PostgreSQL writes small (or
non-aligned) records to the WAL files, ZFS has to read the old contents in
order to write the new 8K record (read-modify-write).  If that's in cache,
that's not a big deal.  But when PostgreSQL decides to reuse an old WAL file
whose contents have been evicted from the cache (because they haven't been used
in hours), this turns what should be a workload bottlenecked by synchronous write
performance (that's well-optimized with our SSD log device) into a random read
workload (that's much more expensive for any system).
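To illustrate the write pattern that triggers this: a small in-place overwrite
of part of an existing file — which is what appending WAL records to a
recycled segment looks like to the filesystem.  On ZFS with an 8K recordsize,
if the touched record is not cached, the partial overwrite forces a
read-modify-write of the whole record.  (The filename below is made up for
the demonstration; the sizes match a 16MB WAL segment.)

```shell
# Create a 16MB file standing in for a recycled WAL segment (2048 x 8K records).
dd if=/dev/zero of=walfile bs=8192 count=2048 2>/dev/null

# Write 4K at offset 12K without truncating: this covers only half of the
# second 8K record, so ZFS must read the rest of that record to rewrite it.
dd if=/dev/zero of=walfile bs=4096 count=1 seek=3 conv=notrunc 2>/dev/null

wc -c < walfile   # size unchanged: the write landed inside an existing record
```

A freshly created (zero-length) WAL file would not have this problem: every
record written to it is new, so there is nothing old to read back.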

What's significantly worse is that we saw this on synchronous standbys.  When
that happened, the WAL receiver was blocked on a random read from disk, and
since the receiver is single-threaded, all write queries on the primary
stalled until the random read finished.  This is particularly bad for us when
the standby is doing other I/O (e.g., for an autovacuum or a database backup)
that pushes disk read latency to hundreds of milliseconds.

Reusing old WAL files seems like an optimization intended for filesystems that
allocate disk blocks up front.  With copy-on-write, it doesn't seem to make
much sense, since new data goes to freshly allocated blocks either way.  If,
instead of reusing an old WAL file, PostgreSQL simply created a new one, no
random reads would be required to complete these operations, and we believe
we'd completely eliminate our latency outliers.
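As a sketch, such a tunable could look like a single boolean in
postgresql.conf.  (This setting did not exist in the 9.6 series this post
concerns; PostgreSQL 12 later added a GUC of this shape, wal_recycle,
defaulting to on.)

```shell
# postgresql.conf sketch — hypothetical for 9.6:
#   wal_recycle = off   # create new WAL segments instead of renaming old ones
```

With recycling off, each segment is created fresh, so WAL writes never touch
cold blocks from an hours-old file.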

Thanks,
Dave
