Thread: Can PostgreSQL create new WAL files instead of reusing old ones?
tl;dr: We've found that under many conditions, PostgreSQL's re-use of old WAL
files appears to significantly degrade query latency on ZFS. The reason is
complicated and I have details below. Has it been considered to make this
behavior tunable, to cause PostgreSQL to always create new WAL files instead of
re-using old ones?
Context: we're running a large fleet of PostgreSQL shards. Each shard consists
of a primary, a synchronous standby, and an asynchronous standby using chained
replication. For this problem, we can consider only the primary and
synchronous standby.
PostgreSQL: 9.6.3
OS: illumos (SmartOS, mixed versions, but all from 2017 or later)
FS: ZFS over mirrored HDDs (not SSDs), using a record size of 8K to match
PostgreSQL's 8K page size. We have an SSD log device so that synchronous
writes complete very quickly.
WAL files are 16MB each, and we're keeping a very large number of segments.
(There's likely a lot of improvement for WAL tuning here.)
Since we're using an 8K record size, when PostgreSQL writes small (or
non-aligned) records to the WAL files, ZFS has to read the old contents in
order to write the new 8K record (read-modify-write). If that's in cache,
that's not a big deal. But when PostgreSQL decides to reuse an old WAL file
whose contents have been evicted from the cache (because they haven't been used
in hours), this turns what should be a workload bottlenecked by synchronous write
performance (that's well-optimized with our SSD log device) into a random read
workload (that's much more expensive for any system).
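To make the read-modify-write pattern above concrete, here is a minimal C sketch of the two write patterns being compared: overwriting a pre-existing 16MB file in sub-recordsize chunks (the recycled-WAL case) versus writing a brand-new file. The paths, chunk size, and dataset layout are illustrative assumptions, not taken from the thread, and this is not PostgreSQL code -- just a way to reproduce the access pattern on a ZFS dataset with recordsize=8k.

/*
 * Sketch of the two WAL write patterns discussed above.  Assumptions
 * (not from the thread): run on a ZFS dataset with recordsize=8k,
 * mounted at /tank/waltest; the "recycled" file must already exist
 * (e.g. created earlier and since evicted from the ARC), while the
 * "new" file must not exist yet.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE   (16 * 1024 * 1024)   /* WAL segment size */
#define CHUNK_SIZE 4096                 /* smaller than the 8K recordsize */

static void write_chunks(const char *path, int flags)
{
    char  buf[CHUNK_SIZE];
    off_t off;
    int   fd = open(path, flags, 0600);

    if (fd < 0) {
        perror(path);
        exit(1);
    }
    memset(buf, 'x', sizeof(buf));

    /* Write the segment in sub-recordsize chunks, syncing as the WAL does. */
    for (off = 0; off < SEG_SIZE; off += CHUNK_SIZE) {
        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t) sizeof(buf)) {
            perror("pwrite");
            exit(1);
        }
        if (fsync(fd) != 0) {
            perror("fsync");
            exit(1);
        }
    }
    close(fd);
}

int main(void)
{
    /* "Recycled" segment: overwriting an existing, uncached file forces
     * ZFS to read each old 8K record before it can write the new data. */
    write_chunks("/tank/waltest/recycled-segment", O_WRONLY);

    /* "New" segment: a freshly created file has no old records to read. */
    write_chunks("/tank/waltest/new-segment", O_WRONLY | O_CREAT | O_EXCL);

    return 0;
}

Timing the two calls (with time(1), iostat, or DTrace) after the first file's blocks have fallen out of the ARC should show the first run dominated by random reads and the second bounded by synchronous write latency.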
What's significantly worse is that we saw this on synchronous standbys. When
that happened, the WAL receiver was blocked on a random read from disk, and
since it's single-threaded, all write queries on the primary stopped until the
random read finished. This is particularly bad for us when the synchronous
standby is doing other I/O (e.g., for an autovacuum or a database backup) that
causes disk reads to take hundreds of milliseconds.
Reusing old WAL files seems like an optimization intended for filesystems that
allocate disk blocks up front. With copy-on-write, it doesn't seem to make
much sense. If, instead of reusing an old WAL file, PostgreSQL just created a
new one, there would be no random reads required to complete these
operations, and we believe we'd completely eliminate our latency outliers.
Thanks,
Dave
David Pacheco wrote:
> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
> WAL files appears to significantly degrade query latency on ZFS. The
> reason is complicated and I have details below. Has it been considered to
> make this behavior tunable, to cause PostgreSQL to always create new WAL
> files instead of re-using old ones?

I don't think this has ever been proposed, because there was no use case
for it. Maybe you want to work on a patch for it?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> David Pacheco wrote:
>> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
>> WAL files appears to significantly degrade query latency on ZFS. The
>> reason is complicated and I have details below. Has it been considered to
>> make this behavior tunable, to cause PostgreSQL to always create new WAL
>> files instead of re-using old ones?

> I don't think this has ever been proposed, because there was no use case
> for it. Maybe you want to work on a patch for it?

I think possibly the OP doesn't understand why it's designed that way.
The point is not really to "recycle old WAL files", it's to avoid having
disk space allocation occur during the critical section where we must
PANIC on failure. Now, of course, that doesn't really work if the
filesystem is COW underneath, because it's allocating fresh disk space
anyway even though semantically we're overwriting existing data.

But what I'd like to see is a fix that deals with that somehow, rather
than continue to accept the possibility of ENOSPC occurring inside WAL
writes on these file systems. I have no idea what such a fix would
look like :-(

			regards, tom lane
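To spell out the rationale Tom describes for readers following along: on a filesystem that overwrites blocks in place, a segment that was zero-filled when first created already has every block allocated, so later writes into it (including after it is renamed and recycled) cannot fail with ENOSPC. Below is a minimal sketch of that kind of zero-fill preallocation, using an arbitrary path; it is an illustration of the idea, not the actual code in xlog.c.

/*
 * Illustration of zero-fill preallocation (not the actual xlog.c code).
 * After this completes on a filesystem that overwrites blocks in place,
 * every block of the 16MB segment is already allocated, so later
 * in-place writes to any offset cannot fail with ENOSPC.  On a
 * copy-on-write filesystem such as ZFS, each later overwrite still
 * allocates fresh space, so this guarantee does not hold.  The path is
 * an arbitrary example.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SEG_SIZE (16 * 1024 * 1024)
#define BLCKSZ   8192

int main(void)
{
    char  zeroes[BLCKSZ];
    off_t off;
    int   fd = open("/var/tmp/prealloc-segment",
                    O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(zeroes, 0, sizeof(zeroes));

    /* Any out-of-space error surfaces here, before the file is ever
     * used for WAL, i.e. outside the PANIC-on-failure critical section. */
    for (off = 0; off < SEG_SIZE; off += BLCKSZ) {
        if (write(fd, zeroes, sizeof(zeroes)) != (ssize_t) sizeof(zeroes)) {
            perror("write");
            return 1;
        }
    }
    if (fsync(fd) != 0) {
        perror("fsync");
        return 1;
    }
    close(fd);
    return 0;
}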
On Tue, Apr 17, 2018 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> David Pacheco wrote:
>>> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
>>> WAL files appears to significantly degrade query latency on ZFS. The
>>> reason is complicated and I have details below. Has it been considered to
>>> make this behavior tunable, to cause PostgreSQL to always create new WAL
>>> files instead of re-using old ones?
>
>> I don't think this has ever been proposed, because there was no use case
>> for it. Maybe you want to work on a patch for it?
>
> I think possibly the OP doesn't understand why it's designed that way.
> The point is not really to "recycle old WAL files", it's to avoid having
> disk space allocation occur during the critical section where we must
> PANIC on failure. Now, of course, that doesn't really work if the
> filesystem is COW underneath, because it's allocating fresh disk space
> anyway even though semantically we're overwriting existing data.
>
> But what I'd like to see is a fix that deals with that somehow, rather
> than continue to accept the possibility of ENOSPC occurring inside WAL
> writes on these file systems. I have no idea what such a fix would
> look like :-(
I think I do understand, but as you've observed, recycling WAL files to avoid allocation relies on the implementation details of the filesystem -- details that I'd expect not to be true of any copy-on-write filesystem. On such systems, there may not be a way to avoid ENOSPC in special critical sections. (And that's not necessarily such a big deal -- to paraphrase a colleague, ensuring that the system doesn't run out of space does not seem like a particularly surprising or heavy burden for the operator. It's great that PostgreSQL can survive this event better on some systems, but the associated tradeoffs may not be worthwhile for everybody.)

And given that, it seems worthwhile to provide the operator an option where they take on the risk that the database might crash if it runs out of space (assuming the result isn't data corruption) in exchange for a potentially tremendous improvement in tail latency and overall throughput.
To quantify this: in a recent incident, transaction latency on the primary was degraded about 2-3x (from a p90 of about 45ms to upwards of 120ms, with outliers exceeding 1s). Over 95% of the outliers above 1s spent over 90% of their time blocked on synchronous replication (based on tracing with DTrace). On the synchronous standby, almost 10% of the WAL receiver's wall clock time was spent blocked on disk reads in this read-modify-write path. The rest of the time was essentially idle -- there was plenty of headroom in other dimensions (CPU, synchronous write performance).
Thanks,
Dave
On Tue, Apr 17, 2018 at 02:57:03PM -0400, Tom Lane wrote:
> I think possibly the OP doesn't understand why it's designed that way.
> The point is not really to "recycle old WAL files", it's to avoid having
> disk space allocation occur during the critical section where we must
> PANIC on failure. Now, of course, that doesn't really work if the
> filesystem is COW underneath, because it's allocating fresh disk space
> anyway even though semantically we're overwriting existing data.
>
> But what I'd like to see is a fix that deals with that somehow, rather
> than continue to accept the possibility of ENOSPC occurring inside WAL
> writes on these file systems. I have no idea what such a fix would
> look like :-(

That looks like a rather difficult problem to solve in PostgreSQL itself,
as the operator running the cluster is in charge of setting up the FS
options which would control the COW behavior, so it seems to me that there
is room as well for an in-core option to tell the checkpointer to enforce
the removal of past files instead of simply recycling them, because this
actually breaks max_wal_size. max_wal_size is of course a soft limit, and
it has been discussed a couple of times that it would be nice to turn it
into a hard limit, but it is a really hard problem to keep the system from
slowing down or even stopping its I/O when that hard limit is close by or
reached.

--
Michael
On Wed, Apr 18, 2018 at 12:35 AM, Michael Paquier <michael@paquier.xyz> wrote:
> That looks like a rather difficult problem to solve in PostgreSQL
> itself, as the operator running the cluster is in charge of setting up
> the FS options which would control the COW behavior, so it seems to me
You cannot turn off CoW on ZFS. What other behavior would you refer to here?
I suppose one could make a dedicated data set for the WAL and have ZFS make a reservation for about 2x the total expected WAL size. It would require careful attention to detail if you increase WAL segments configuration, though, and if you had any kind of hiccup with streaming replication that caused the segments to stick around longer than expected (but that's no different from any other file system).
On Wed, Apr 18, 2018 at 09:34:50AM -0400, Vick Khera wrote:
> On Wed, Apr 18, 2018 at 12:35 AM, Michael Paquier <michael@paquier.xyz> wrote:
>
> That looks like a rather difficult problem to solve in PostgreSQL
> itself, as the operator running the cluster is in charge of setting up
> the FS options which would control the COW behavior, so it seems to me
>
> You cannot turn off CoW on ZFS. What other behavior would you refer to here?
>
> I suppose one could make a dedicated data set for the WAL and have ZFS make a
> reservation for about 2x the total expected WAL size. It would require careful
> attention to detail if you increase WAL segments configuration, though, and if
> you had any kind of hiccup with streaming replication that caused the segments
> to stick around longer than expected (but that's no different from any other
> file system).

Uh, at the risk of asking an obvious question, why is the WAL file COW
if it was renamed? No one has the old WAL file open, as far as I know.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
Fabio Ugo Venchiarutti wrote:
I was wondering the same. The WAL writer is overwriting the same inodes again
and again, so block COW logic should only happen once: at allocation. I'm no
expert: does XFS track COW based on path (ugh?)

Maybe I'm crazy, but here's a possible workaround if the problem is
effectively at that level: the OP could use the archive_command to
deliberately allocate a new segment and switch the old one with it before
returning zero to the archiver. The WAL writer will then recycle what it
thinks is the same inode, not realising it's your impostor.

I'm rather confident this should work OK, but you may want to make sure with
the hackers that no file descriptors are open on ready-to-archive segments
while you shuffle things around in your command (or some other weird
implication I'm missing).

On 27/04/18 17:28, Bruce Momjian wrote:
> On Wed, Apr 18, 2018 at 09:34:50AM -0400, Vick Khera wrote:
>> On Wed, Apr 18, 2018 at 12:35 AM, Michael Paquier <michael@paquier.xyz> wrote:
>>
>> That looks like a rather difficult problem to solve in PostgreSQL
>> itself, as the operator running the cluster is in charge of setting up
>> the FS options which would control the COW behavior, so it seems to me
>>
>> You cannot turn off CoW on ZFS. What other behavior would you refer to here?
>>
>> I suppose one could make a dedicated data set for the WAL and have ZFS make a
>> reservation for about 2x the total expected WAL size. It would require careful
>> attention to detail if you increase WAL segments configuration, though, and if
>> you had any kind of hiccup with streaming replication that caused the segments
>> to stick around longer than expected (but that's no different from any other
>> file system).
>
> Uh, at the risk of asking an obvious question, why is the WAL file COW
> if it was renamed? No one has the old WAL file open, as far as I know.

--
Regards

Fabio Ugo Venchiarutti
Data Services Department
Ocado Technology
On 2018-04-27 12:28:25 -0400, Bruce Momjian wrote:
> On Wed, Apr 18, 2018 at 09:34:50AM -0400, Vick Khera wrote:
> > On Wed, Apr 18, 2018 at 12:35 AM, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > That looks like a rather difficult problem to solve in PostgreSQL
> > itself, as the operator running the cluster is in charge of setting up
> > the FS options which would control the COW behavior, so it seems to me
> >
> > You cannot turn off CoW on ZFS. What other behavior would you refer to here?
> >
> > I suppose one could make a dedicated data set for the WAL and have ZFS make a
> > reservation for about 2x the total expected WAL size. It would require careful
> > attention to detail if you increase WAL segments configuration, though, and if
> > you had any kind of hiccup with streaming replication that caused the segments
> > to stick around longer than expected (but that's no different from any other
> > file system).
>
> Uh, at the risk of asking an obvious question, why is the WAL file COW
> if it was renamed? No one has the old WAL file open, as far as I know.

Because it's a COW filesystem that doesn't overwrite in place. That's how
snapshots etc are implemented.

Greetings,

Andres Freund
As Dave described in his original email on this topic, we'd like to avoid recycling WAL files since that can cause performance issues when we have a read-modify-write on a file that has dropped out of the cache.
I have implemented a small change to allow WAL recycling to be disabled. It is
visible at: https://cr.joyent.us/#/c/4263/
I'd appreciate getting any feedback on this.
Thanks,
Jerry
On Wed, Jun 20, 2018 at 10:35 AM, Jerry Jelinek <jerry.jelinek@joyent.com> wrote:
> As Dave described in his original email on this topic, we'd like to avoid
> recycling WAL files since that can cause performance issues when we have a
> read-modify-write on a file that has dropped out of the cache.
>
> I have implemented a small change to allow WAL recycling to be disabled.
> It is visible at:
> https://cr.joyent.us/#/c/4263/
>
> I'd appreciate getting any feedback on this.
>
> Thanks,
> Jerry
For reference, there's more context in this thread from several months ago:
I'll repeat the relevant summary here:
tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
WAL files appears to significantly degrade query latency on ZFS. The reason is
complicated and I have details below. Has it been considered to make this
behavior tunable, to cause PostgreSQL to always create new WAL files
instead of re-using old ones?
Thanks,
Dave
On Fri, Jun 22, 2018 at 11:22 AM, David Pacheco <dap@joyent.com> wrote:
> On Wed, Jun 20, 2018 at 10:35 AM, Jerry Jelinek <jerry.jelinek@joyent.com>
> wrote:
>> I have implemented a small change to allow WAL recycling to be disabled.
>> It is visible at:
>> https://cr.joyent.us/#/c/4263/
>>
>> I'd appreciate getting any feedback on this.
>
>> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
>> WAL files appears to significantly degrade query latency on ZFS.

I haven't tested it, but it looks reasonable to me.  It needs documentation
in doc/src/sgml/config.sgml.  It should be listed in
src/backend/utils/misc/postgresql.conf.sample.  We'd want a patch
against our master branch.  Could you please register it in
commitfest.postgresql.org so we don't lose track of it?

Hey, a question about PostgreSQL on ZFS: what do you guys think about
pg_flush_data() in fd.c?  It does mmap(), msync(), munmap() to try to
influence writeback?  I wonder if at least on some operating systems
that schlepps a bunch of data out of ZFS ARC into OS page cache, kinda
trashing the latter?

--
Thomas Munro
http://www.enterprisedb.com
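For readers unfamiliar with the flush hint Thomas is asking about, the pattern is roughly: map the file range, request asynchronous writeback with msync(), then unmap. Below is a small standalone C sketch of that sequence; the file name and length are arbitrary assumptions, and this is only an illustration of the pattern, not the actual pg_flush_data() code in fd.c.

/*
 * Standalone sketch of the mmap()/msync()/munmap() writeback hint that
 * Thomas describes -- an illustration of the pattern only, not the
 * code in fd.c.  The file name and length are arbitrary assumptions.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 8192 * 128;      /* a 1MB region, arbitrary */
    char   buf[8192];
    void  *p;
    int    fd = open("/var/tmp/flush-demo", O_RDWR | O_CREAT, 0600);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ftruncate(fd, (off_t) len) != 0) {
        perror("ftruncate");
        return 1;
    }

    /* Dirty part of the file through an ordinary write. */
    memset(buf, 'x', sizeof(buf));
    if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf)) {
        perror("pwrite");
        return 1;
    }

    /* Map the range, request asynchronous writeback, and unmap again. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    if (msync(p, len, MS_ASYNC) != 0)
        perror("msync");
    if (munmap(p, len) != 0)
        perror("munmap");

    close(fd);
    return 0;
}

Thomas's question is whether going through a mapping like this on some platforms pulls data that already lives in the ZFS ARC into the OS page cache as well, effectively double-caching it.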
Thomas,
Thanks for taking a look at this. I'll work on getting a patch together for the master branch. I'll also take a look at the other question you raised and get back to you once I have more information.
Thanks again,
Jerry
On Thu, Jun 21, 2018 at 10:20 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Fri, Jun 22, 2018 at 11:22 AM, David Pacheco <dap@joyent.com> wrote:
>> On Wed, Jun 20, 2018 at 10:35 AM, Jerry Jelinek <jerry.jelinek@joyent.com>
>> wrote:
>>> I have implemented a small change to allow WAL recycling to be disabled.
>>> It is visible at:
>>> https://cr.joyent.us/#/c/4263/
>>>
>>> I'd appreciate getting any feedback on this.
>>
>>> tl;dr: We've found that under many conditions, PostgreSQL's re-use of old
>>> WAL files appears to significantly degrade query latency on ZFS.
>
> I haven't tested it, but it looks reasonable to me.  It needs documentation
> in doc/src/sgml/config.sgml.  It should be listed in
> src/backend/utils/misc/postgresql.conf.sample.  We'd want a patch
> against our master branch.  Could you please register it in
> commitfest.postgresql.org so we don't lose track of it?
>
> Hey, a question about PostgreSQL on ZFS: what do you guys think about
> pg_flush_data() in fd.c?  It does mmap(), msync(), munmap() to try to
> influence writeback?  I wonder if at least on some operating systems
> that schlepps a bunch of data out of ZFS ARC into OS page cache, kinda
> trashing the latter?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
On Wed, Jun 20, 2018 at 1:35 PM, Jerry Jelinek <jerry.jelinek@joyent.com> wrote:
> As Dave described in his original email on this topic, we'd like to avoid
> recycling WAL files since that can cause performance issues when we have a
> read-modify-write on a file that has dropped out of the cache.
>
> I have implemented a small change to allow WAL recycling to be disabled.
> It is visible at:
> https://cr.joyent.us/#/c/4263/
>
> I'd appreciate getting any feedback on this.
This looks so simple, yet so beneficial. Thanks for making it. Is there some other mechanism that already cleans out the old unneeded WAL files? I recall there is something that does it when you start up after changing the number of files to keep, but I don't recall if that runs in some regular loop.
Is there some way to make it auto-detect when it should be enabled? If not, please document that it should be used on ZFS and any other file system with CoW properties on files.
> Is there some way to make it auto-detect when it should be enabled? If not, please document that it should be used on ZFS and any other file system with CoW properties on files.
In addition to this, wondering what type of performance regression this would show on something like ext4 (if any).
Hi,

On 2018-06-22 11:41:45 -0400, Adam Brusselback wrote:
> > Is there some way to make it auto-detect when it should be enabled? If
> > not, please document that it should be used on ZFS and any other file
> > system with CoW properties on files.
>
> In addition to this, wondering what type of performance regression this
> would show on something like ext4 (if any).

It's a *massive* regression on ext4 & xfs. You can very trivially compare
the performance of a new cluster (which doesn't have files to recycle)
against one that's running for a while.

Greetings,

Andres Freund