Thread: Can PostgreSQL create new WAL files instead of reusing old ones?
tl;dr: We've found that under many conditions, PostgreSQL's re-use of old WAL
files appears to significantly degrade query latency on ZFS. The reason is
complicated and I have details below. Has it been considered to make this
behavior tunable, to cause PostgreSQL to always create new WAL files instead of
re-using old ones?
Context: we're running a large fleet of PostgreSQL shards. Each shard consists
of a primary, a synchronous standby, and an asynchronous standby using chained
replication. For this problem, we can consider only the primary and
synchronous standby.
PostgreSQL: 9.6.3
OS: illumos (SmartOS, mixed versions, but all from 2017 or later)
FS: ZFS over mirrored HDDs (not SSDs), using a record size of 8K to match
PostgreSQL's 8K page size. We have an SSD log device so that synchronous
writes complete very quickly.
WAL files are 16MB each, and we're keeping a very large number of segments.
(There's likely a lot of improvement for WAL tuning here.)
Since we're using an 8K record size, when PostgreSQL writes small (or
non-aligned) records to the WAL files, ZFS has to read the old contents in
order to write the new 8K record (read-modify-write). If that's in cache,
that's not a big deal. But when PostgreSQL decides to reuse an old WAL file
whose contents have been evicted from the cache (because they haven't been used
in hours), this turns what should be a workload bottlenecked by synchronous write
performance (that's well-optimized with our SSD log device) into a random read
workload (that's much more expensive for any system).
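To make the read-modify-write pattern above concrete, here is a minimal C sketch of the two write patterns being compared: overwriting a pre-existing 16MB file in sub-recordsize chunks (the recycled-WAL case) versus writing a brand-new file. The paths, chunk size, and dataset layout are illustrative assumptions, not taken from the thread, and this is not PostgreSQL code -- just a way to reproduce the access pattern on a ZFS dataset with recordsize=8k.

/*
 * Sketch of the two WAL write patterns discussed above.  Assumptions
 * (not from the thread): run on a ZFS dataset with recordsize=8k,
 * mounted at /tank/waltest; the "recycled" file must already exist
 * (e.g. created earlier and since evicted from the ARC), while the
 * "new" file must not exist yet.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE   (16 * 1024 * 1024)   /* WAL segment size */
#define CHUNK_SIZE 4096                 /* smaller than the 8K recordsize */

static void write_chunks(const char *path, int flags)
{
    char  buf[CHUNK_SIZE];
    off_t off;
    int   fd = open(path, flags, 0600);

    if (fd < 0) {
        perror(path);
        exit(1);
    }
    memset(buf, 'x', sizeof(buf));

    /* Write the segment in sub-recordsize chunks, syncing as the WAL does. */
    for (off = 0; off < SEG_SIZE; off += CHUNK_SIZE) {
        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t) sizeof(buf)) {
            perror("pwrite");
            exit(1);
        }
        if (fsync(fd) != 0) {
            perror("fsync");
            exit(1);
        }
    }
    close(fd);
}

int main(void)
{
    /* "Recycled" segment: overwriting an existing, uncached file forces
     * ZFS to read each old 8K record before it can write the new data. */
    write_chunks("/tank/waltest/recycled-segment", O_WRONLY);

    /* "New" segment: a freshly created file has no old records to read. */
    write_chunks("/tank/waltest/new-segment", O_WRONLY | O_CREAT | O_EXCL);

    return 0;
}

Timing the two calls (with time(1), iostat, or DTrace) after the first file's blocks have fallen out of the ARC should show the first run dominated by random reads and the second bounded by synchronous write latency.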
What's significantly worse is that we saw this on synchronous standbys. When
that happened, the WAL receiver was blocked on a random read from disk, and
since it's single-threaded, all write queries on the primary stopped until the
random read finished. This is particularly bad for us when the synchronous
standby is doing other I/O (e.g., for an autovacuum or a database backup) that
causes disk reads to take hundreds of milliseconds.
Reusing old WAL files seems like an optimization intended for filesystems that
allocate disk blocks up front. With copy-on-write, it doesn't seem to make
much sense. If, instead of reusing an old WAL file, PostgreSQL just created a
new one, there would be no random reads required to complete these
operations, and we believe we'd completely eliminate our latency outliers.
Thanks,
Dave
David Pacheco wrote:
> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
> WAL files appears to significantly degrade query latency on ZFS. The
> reason is complicated and I have details below. Has it been considered to
> make this behavior tunable, to cause PostgreSQL to always create new WAL
> files instead of re-using old ones?

I don't think this has ever been proposed, because there was no use case
for it. Maybe you want to work on a patch for it?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> David Pacheco wrote:
>> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
>> WAL files appears to significantly degrade query latency on ZFS. The
>> reason is complicated and I have details below. Has it been considered to
>> make this behavior tunable, to cause PostgreSQL to always create new WAL
>> files instead of re-using old ones?

> I don't think this has ever been proposed, because there was no use case
> for it. Maybe you want to work on a patch for it?

I think possibly the OP doesn't understand why it's designed that way.
The point is not really to "recycle old WAL files", it's to avoid having
disk space allocation occur during the critical section where we must
PANIC on failure. Now, of course, that doesn't really work if the
filesystem is COW underneath, because it's allocating fresh disk space
anyway even though semantically we're overwriting existing data.

But what I'd like to see is a fix that deals with that somehow, rather
than continue to accept the possibility of ENOSPC occurring inside WAL
writes on these file systems. I have no idea what such a fix would
look like :-(

			regards, tom lane
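To spell out the rationale Tom describes for readers following along: on a filesystem that overwrites blocks in place, a segment that was zero-filled when first created already has every block allocated, so later writes into it (including after it is renamed and recycled) cannot fail with ENOSPC. Below is a minimal sketch of that kind of zero-fill preallocation, using an arbitrary path; it is an illustration of the idea, not the actual code in xlog.c.

/*
 * Illustration of zero-fill preallocation (not the actual xlog.c code).
 * After this completes on a filesystem that overwrites blocks in place,
 * every block of the 16MB segment is already allocated, so later
 * in-place writes to any offset cannot fail with ENOSPC.  On a
 * copy-on-write filesystem such as ZFS, each later overwrite still
 * allocates fresh space, so this guarantee does not hold.  The path is
 * an arbitrary example.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)
#define BLCKSZ   8192

int main(void)
{
    char  zeroes[BLCKSZ];
    off_t off;
    int   fd = open("/var/tmp/prealloc-segment",
                    O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(zeroes, 0, sizeof(zeroes));

    /* Any out-of-space error surfaces here, before the file is ever
     * used for WAL, i.e. outside the PANIC-on-failure critical section. */
    for (off = 0; off < SEG_SIZE; off += BLCKSZ) {
        if (write(fd, zeroes, sizeof(zeroes)) != (ssize_t) sizeof(zeroes)) {
            perror("write");
            return 1;
        }
    }
    if (fsync(fd) != 0) {
        perror("fsync");
        return 1;
    }
    close(fd);
    return 0;
}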
On Tue, Apr 17, 2018 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> David Pacheco wrote:
>>> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
>>> WAL files appears to significantly degrade query latency on ZFS. The
>>> reason is complicated and I have details below. Has it been considered to
>>> make this behavior tunable, to cause PostgreSQL to always create new WAL
>>> files instead of re-using old ones?
>
>> I don't think this has ever been proposed, because there was no use case
>> for it. Maybe you want to work on a patch for it?
>
> I think possibly the OP doesn't understand why it's designed that way.
> The point is not really to "recycle old WAL files", it's to avoid having
> disk space allocation occur during the critical section where we must
> PANIC on failure. Now, of course, that doesn't really work if the
> filesystem is COW underneath, because it's allocating fresh disk space
> anyway even though semantically we're overwriting existing data.
>
> But what I'd like to see is a fix that deals with that somehow, rather
> than continue to accept the possibility of ENOSPC occurring inside WAL
> writes on these file systems. I have no idea what such a fix would
> look like :-(
I think I do understand, but as you've observed, recycling WAL files to avoid allocation relies on the implementation details of the filesystem -- details that I'd expect not to be true of any copy-on-write filesystem. On such systems, there may not be a way to avoid ENOSPC in special critical sections. (And that's not necessarily such a big deal -- to paraphrase a colleague, ensuring that the system doesn't run out of space does not seem like a particularly surprising or heavy burden for the operator. It's great that PostgreSQL can survive this event better on some systems, but the associated tradeoffs may not be worthwhile for everybody.)

And given that, it seems worthwhile to provide the operator an option where they take on the risk that the database might crash if it runs out of space (assuming the result isn't data corruption) in exchange for a potentially tremendous improvement in tail latency and overall throughput.
To quantify this: in a recent incident, transaction latency on the primary was degraded about 2-3x (from a p90 of about 45ms to upwards of 120ms, with outliers exceeding 1s). Over 95% of the outliers above 1s spent over 90% of their time blocked on synchronous replication (based on tracing with DTrace). On the synchronous standby, almost 10% of the WAL receiver's wall clock time was spent blocked on disk reads in this read-modify-write path. The rest of the time was essentially idle -- there was plenty of headroom in other dimensions (CPU, synchronous write performance).
Thanks,
Dave
On Tue, Apr 17, 2018 at 02:57:03PM -0400, Tom Lane wrote:
> I think possibly the OP doesn't understand why it's designed that way.
> The point is not really to "recycle old WAL files", it's to avoid having
> disk space allocation occur during the critical section where we must
> PANIC on failure. Now, of course, that doesn't really work if the
> filesystem is COW underneath, because it's allocating fresh disk space
> anyway even though semantically we're overwriting existing data.
>
> But what I'd like to see is a fix that deals with that somehow, rather
> than continue to accept the possibility of ENOSPC occurring inside WAL
> writes on these file systems. I have no idea what such a fix would
> look like :-(

That looks like a rather difficult problem to solve in PostgreSQL itself,
as the operator running the cluster is in charge of setting up the FS
options which would control the COW behavior, so it seems to me that there
is room as well for an in-core option to tell the checkpointer to enforce
the removal of past files instead of simply recycling them, because this
actually breaks max_wal_size. max_wal_size is of course a soft limit, and
it has been discussed a couple of times that it would be nice to turn it
into a hard limit, but it is a really hard problem to keep the system from
slowing down or even stopping its I/O when that hard limit is close by or
reached.

--
Michael
On Wed, Apr 18, 2018 at 12:35 AM, Michael Paquier <michael@paquier.xyz> wrote:
> That looks like a rather difficult problem to solve in PostgreSQL
> itself, as the operator running the cluster is in charge of setting up
> the FS options which would control the COW behavior, so it seems to me
You cannot turn off CoW on ZFS. What other behavior would you refer to here?
I suppose one could make a dedicated data set for the WAL and have ZFS make a reservation for about 2x the total expected WAL size. It would require careful attention to detail if you increase WAL segments configuration, though, and if you had any kind of hiccup with streaming replication that caused the segments to stick around longer than expected (but that's no different from any other file system).
On Wed, Apr 18, 2018 at 09:34:50AM -0400, Vick Khera wrote:
> On Wed, Apr 18, 2018 at 12:35 AM, Michael Paquier <michael@paquier.xyz> wrote:
>
> That looks like a rather difficult problem to solve in PostgreSQL
> itself, as the operator running the cluster is in charge of setting up
> the FS options which would control the COW behavior, so it seems to me
>
> You cannot turn off CoW on ZFS. What other behavior would you refer to here?
>
> I suppose one could make a dedicated data set for the WAL and have ZFS make a
> reservation for about 2x the total expected WAL size. It would require careful
> attention to detail if you increase WAL segments configuration, though, and if
> you had any kind of hiccup with streaming replication that caused the segments
> to stick around longer than expected (but that's no different from any other
> file system).

Uh, at the risk of asking an obvious question, why is the WAL file COW
if it was renamed? No one has the old WAL file open, as far as I know.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
Fabio Ugo Venchiarutti wrote:
I was wondering the same. The WAL writer is overwriting the same inodes again
and again, so block COW logic should only happen once: at allocation. I'm no
expert: does XFS track COW based on path (ugh?)

Maybe I'm crazy, but here's a possible workaround if the problem is
effectively at that level: the OP could use the archive_command to
deliberately allocate a new segment and switch the old one with it before
returning zero to the archiver. The WAL writer will then recycle what it
thinks is the same inode, not realising it's your impostor.

I'm rather confident this should work OK, but you may want to make sure with
the hackers that no file descriptors are open on ready-to-archive segments
while you shuffle things around in your command (or some other weird
implication I'm missing).

On 27/04/18 17:28, Bruce Momjian wrote:
> On Wed, Apr 18, 2018 at 09:34:50AM -0400, Vick Khera wrote:
>> On Wed, Apr 18, 2018 at 12:35 AM, Michael Paquier <michael@paquier.xyz> wrote:
>>
>> That looks like a rather difficult problem to solve in PostgreSQL
>> itself, as the operator running the cluster is in charge of setting up
>> the FS options which would control the COW behavior, so it seems to me
>>
>> You cannot turn off CoW on ZFS. What other behavior would you refer to here?
>>
>> I suppose one could make a dedicated data set for the WAL and have ZFS make a
>> reservation for about 2x the total expected WAL size. It would require careful
>> attention to detail if you increase WAL segments configuration, though, and if
>> you had any kind of hiccup with streaming replication that caused the segments
>> to stick around longer than expected (but that's no different from any other
>> file system).
>
> Uh, at the risk of asking an obvious question, why is the WAL file COW
> if it was renamed? No one has the old WAL file open, as far as I know.

--
Regards

Fabio Ugo Venchiarutti
Data Services Department
Ocado Technology
On 2018-04-27 12:28:25 -0400, Bruce Momjian wrote:
> On Wed, Apr 18, 2018 at 09:34:50AM -0400, Vick Khera wrote:
> > On Wed, Apr 18, 2018 at 12:35 AM, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > That looks like a rather difficult problem to solve in PostgreSQL
> > itself, as the operator running the cluster is in charge of setting up
> > the FS options which would control the COW behavior, so it seems to me
> >
> > You cannot turn off CoW on ZFS. What other behavior would you refer to here?
> >
> > I suppose one could make a dedicated data set for the WAL and have ZFS make a
> > reservation for about 2x the total expected WAL size. It would require careful
> > attention to detail if you increase WAL segments configuration, though, and if
> > you had any kind of hiccup with streaming replication that caused the segments
> > to stick around longer than expected (but that's no different from any other
> > file system).
>
> Uh, at the risk of asking an obvious question, why is the WAL file COW
> if it was renamed? No one has the old WAL file open, as far as I know.

Because it's a COW filesystem that doesn't overwrite in place. That's how
snapshots etc are implemented.

Greetings,

Andres Freund
As Dave described in his original email on this topic, we'd like to avoid recycling WAL files since that can cause performance issues when we have a read-modify-write on a file that has dropped out of the cache.
I have implemented a small change to allow WAL recycling to be disabled. It is
visible at: https://cr.joyent.us/#/c/4263/
I'd appreciate getting any feedback on this.
Thanks,
Jerry
On Wed, Jun 20, 2018 at 10:35 AM, Jerry Jelinek <jerry.jelinek@joyent.com> wrote:
> As Dave described in his original email on this topic, we'd like to avoid
> recycling WAL files since that can cause performance issues when we have a
> read-modify-write on a file that has dropped out of the cache.
>
> I have implemented a small change to allow WAL recycling to be disabled.
> It is visible at:
> https://cr.joyent.us/#/c/4263/
>
> I'd appreciate getting any feedback on this.
>
> Thanks,
> Jerry
For reference, there's more context in this thread from several months ago:
I'll repeat the relevant summary here:
tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
WAL files appears to significantly degrade query latency on ZFS. The reason is
complicated and I have details below. Has it been considered to make this
behavior tunable, to cause PostgreSQL to always create new WAL files
instead of re-using old ones?
Thanks,
Dave
On Fri, Jun 22, 2018 at 11:22 AM, David Pacheco <dap@joyent.com> wrote:
> On Wed, Jun 20, 2018 at 10:35 AM, Jerry Jelinek <jerry.jelinek@joyent.com>
> wrote:
>> I have implemented a small change to allow WAL recycling to be disabled.
>> It is visible at:
>> https://cr.joyent.us/#/c/4263/
>>
>> I'd appreciate getting any feedback on this.
>
>> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
>> WAL files appears to significantly degrade query latency on ZFS.

I haven't tested it, but it looks reasonable to me.  It needs documentation
in doc/src/sgml/config.sgml.  It should be listed in
src/backend/utils/misc/postgresql.conf.sample.  We'd want a patch
against our master branch.  Could you please register it in
commitfest.postgresql.org so we don't lose track of it?

Hey, a question about PostgreSQL on ZFS: what do you guys think about
pg_flush_data() in fd.c?  It does mmap(), msync(), munmap() to try to
influence writeback?  I wonder if at least on some operating systems
that schlepps a bunch of data out of ZFS ARC into OS page cache, kinda
trashing the latter?

--
Thomas Munro
http://www.enterprisedb.com
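For readers unfamiliar with the flush hint Thomas is asking about, the pattern is roughly: map the file range, request asynchronous writeback with msync(), then unmap. Below is a small standalone C sketch of that sequence; the file name and length are arbitrary assumptions, and this is only an illustration of the pattern, not the actual pg_flush_data() code in fd.c.

/*
 * Standalone sketch of the mmap()/msync()/munmap() writeback hint that
 * Thomas describes -- an illustration of the pattern only, not the
 * code in fd.c.  The file name and length are arbitrary assumptions.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 8192 * 128;      /* a 1MB region, arbitrary */
    char   buf[8192];
    void  *p;
    int    fd = open("/var/tmp/flush-demo", O_RDWR | O_CREAT, 0600);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ftruncate(fd, (off_t) len) != 0) {
        perror("ftruncate");
        return 1;
    }

    /* Dirty part of the file through an ordinary write. */
    memset(buf, 'x', sizeof(buf));
    if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf)) {
        perror("pwrite");
        return 1;
    }

    /* Map the range, request asynchronous writeback, and unmap again. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    if (msync(p, len, MS_ASYNC) != 0)
        perror("msync");
    if (munmap(p, len) != 0)
        perror("munmap");

    close(fd);
    return 0;
}

Thomas's question is whether going through a mapping like this on some platforms pulls data that already lives in the ZFS ARC into the OS page cache as well, effectively double-caching it.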
Thomas,
Thanks for taking a look at this. I'll work on getting a patch together for the master branch. I'll also take a look at the other question you raised and get back to you once I have more information.
Thanks again,
Jerry
On Thu, Jun 21, 2018 at 10:20 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Fri, Jun 22, 2018 at 11:22 AM, David Pacheco <dap@joyent.com> wrote:
>> On Wed, Jun 20, 2018 at 10:35 AM, Jerry Jelinek <jerry.jelinek@joyent.com>
>> wrote:
>>> I have implemented a small change to allow WAL recycling to be disabled.
>>> It is visible at:
>>> https://cr.joyent.us/#/c/4263/
>>>
>>> I'd appreciate getting any feedback on this.
>>
>>> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
>>> WAL files appears to significantly degrade query latency on ZFS.
>
> I haven't tested it, but it looks reasonable to me.  It needs documentation
> in doc/src/sgml/config.sgml.  It should be listed in
> src/backend/utils/misc/postgresql.conf.sample.  We'd want a patch
> against our master branch.  Could you please register it in
> commitfest.postgresql.org so we don't lose track of it?
>
> Hey, a question about PostgreSQL on ZFS: what do you guys think about
> pg_flush_data() in fd.c?  It does mmap(), msync(), munmap() to try to
> influence writeback?  I wonder if at least on some operating systems
> that schlepps a bunch of data out of ZFS ARC into OS page cache, kinda
> trashing the latter?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
On Wed, Jun 20, 2018 at 1:35 PM, Jerry Jelinek <jerry.jelinek@joyent.com> wrote:
> As Dave described in his original email on this topic, we'd like to avoid
> recycling WAL files since that can cause performance issues when we have a
> read-modify-write on a file that has dropped out of the cache.
>
> I have implemented a small change to allow WAL recycling to be disabled.
> It is visible at:
> https://cr.joyent.us/#/c/4263/
>
> I'd appreciate getting any feedback on this.
This looks so simple, yet so beneficial. Thanks for making it. Is there some other mechanism that already cleans out the old unneeded WAL files? I recall there is something that does it when you start up after changing the number of files to keep, but I don't recall if that runs in some regular loop.
Is there some way to make it auto-detect when it should be enabled? If not, please document that it should be used on ZFS and any other file system with CoW properties on files.
> Is there some way to make it auto-detect when it should be enabled? If not, please document that it should be used on ZFS and any other file system with CoW properties on files.
In addition to this, wondering what type of performance regression this would show on something like ext4 (if any).
Hi,

On 2018-06-22 11:41:45 -0400, Adam Brusselback wrote:
> > Is there some way to make it auto-detect when it should be enabled? If
> > not, please document that it should be used on ZFS and any other file
> > system with CoW properties on files.
>
> In addition to this, wondering what type of performance regression this
> would show on something like ext4 (if any).

It's a *massive* regression on ext4 & xfs. You can very trivially compare
the performance of a new cluster (which doesn't have files to recycle)
against one that's running for a while.

Greetings,

Andres Freund