Thread: Double-writes, take two?

Double-writes, take two?

From: Michael Paquier
Hi all,

Back in 2012, Dan Scales, who was working on VMware Postgres, posted
a patch aimed at removing the need for full-page writes by introducing
the concept of double writes, using a double-write buffer approach to
fix torn-page problems:
https://www.postgresql.org/message-id/1962493974.656458.1327703514780.JavaMail.root%40zimbra-prod-mbox-4.vmware.com

A patch was published on that thread, and it has roughly the
following characteristics:
- Double writes happen when a dirty buffer is evicted.
- Two double-write buffers are used, one for the checkpointer and one for
other processes.
- LWLocks are used, with a bunch of memcpy calls, to maintain the
batches of double writes in a consistent state, and that's heavy.
- Double-write buffers use a predetermined number of pages (32 for the
checkpointer, 128 divided into 4 buckets for the backends), which are
synced to disk once each batch is full.
- The double-write file of the checkpointer orders pages by file and
block number to minimize the number of syncs needed, using a custom
sequential I/O algorithm.
- The last point is aimed at improving performance.  Processes wanting
to write a page to the double-write file actually push pages to the
buffer first, which also forces processes doing smgrread() activity or
the like to look at the double-write buffer.
- A custom page-level checksum was used to make sure that a page in the
double-write file is not torn.  Page checksums are not normally
mandatory, and they were not yet implemented in Postgres at the time.
- The implementation relies heavily on LWLocks, which kind of sucks for
concurrency.
- If one looks at the patch, the number of fsyncs done is actually
pretty high, and the patch uses an approach close to what WAL does...
More on that downthread.
- In order to identify each block in the double-write file, a 4k header
is used to store each page's metadata, limiting the number of pages
which can be stored in a single double-write file.
- There is a performance hit when using smgrread and smgrsync, as double
writes could be on their way to the DW file, so it is necessary to look
at the active batches and see if a wanted page is still there.
- IO_DIRECT is used for the double-write files, which is normally not
mandatory.  Peter G actually reminded me that the fork of Postgres
which VMware had was using IO_DIRECT, but this was dropped when the
switch to pure upstream happened.  There is also a trace of that
matter on the mailing lists:
https://www.postgresql.org/message-id/529EEC1C.2040207@vmware.com
- At recovery, files are replayed and truncated.  There is one file per
batch of pages, written in a dedicated folder.  If a page's checksum is
inconsistent in the double-write file, then it is discarded.  If the
page is consistent but the original page of the data file is not, then
the block from the double-write file is copied back in place.
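
To make that last recovery step more concrete, here is a minimal,
standalone sketch (not code from the patch) of how one page of a
double-write batch could be replayed at crash recovery;
dw_page_header_t and page_checksum() are made-up names for
illustration, and the real patch uses its own checksum algorithm and
file layout:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Hypothetical per-page metadata kept in the 4k header of a DW file. */
typedef struct dw_page_header_t
{
    uint32_t    relfile_id;     /* data file the page belongs to */
    uint32_t    block_num;      /* block number inside that file */
    uint32_t    checksum;       /* custom page-level checksum */
} dw_page_header_t;

/* Placeholder checksum, standing in for the patch's own algorithm. */
static uint32_t
page_checksum(const char *page)
{
    uint32_t    sum = 0;

    for (int i = 0; i < BLCKSZ; i++)
        sum = (sum << 1) ^ (uint8_t) page[i];
    return sum;
}

/*
 * Replay one page of a double-write batch: a torn copy in the DW file
 * is simply discarded, and a good DW copy overwrites the data file's
 * page only if that page itself is torn.
 */
static bool
dw_replay_page(const dw_page_header_t *hdr,
               const char *dw_page,     /* copy read from the DW file */
               char *data_page,         /* copy read from the data file */
               bool data_page_ok)       /* does the data file copy verify? */
{
    if (page_checksum(dw_page) != hdr->checksum)
        return false;           /* DW copy itself is torn: discard it */

    if (!data_page_ok)
        memcpy(data_page, dw_page, BLCKSZ);     /* caller writes it back */
    return true;
}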

I have spent some time studying the patch, and I am getting fairly
sure that the proposed approach has a lot of downsides and still
performs rather badly for cases where there is a large number of dirty
page evictions.  OLTP loads would be less prone to that, but analytics
workloads would take a hit, like large scans with aggregates.  One case
which I imagine would be rather bad is a post-checkpoint SELECT where
hint bits need to be set.

We already have wal_log_hints, which has a similar performance impact,
but from my reading of the code and the proposed approach, the way the
double writes are handled is way less than optimal, and we already have
battle-proven facilities that can be reused.

One database system which is known for tackling torn page problems using
double writes is InnoDB, a storage engine for MySQL/MariaDB.  In this
case, the main portion of the code is here:
storage/innobase/buf/buf0dblwr.c
storage/innobase/include/buf0dblwr.h
And here are the docs:
https://mariadb.com/kb/en/library/xtradbinnodb-doublewrite-buffer/
The approach used by those folks is a single-file approach, whose
concurrency is controlled by a set of mutex locks.

I was thinking about this problem, and it looks like one approach for
double-writes would be to introduce it as a secondary WAL stream
independent from the main one:
- Once a buffer is evicted from shared buffers and is dirty, write it to
the double-write stream and to the data file, and only sync it to the
double-write stream (see the sketch after this list).
- The low-level WAL APIs need some refactoring, as the basic idea would
be to (ideally?) allow the initialization of a wanted WAL facility
using an API layer similar to what has been introduced for SLRUs, which
are used for many facilities in the backend code.
- Compression of evicted pages can be supported the same way as we do
now for full-page writes using wal_compression.
- At recovery, replay the WAL stream for double-writes first.
Truncation and/or recycling of those files happens in a way similar to
the normal WAL stream and is controlled by checkpoints.
- At checkpoint, truncate the double-write files which are not needed
anymore, as the corresponding data files have been synced.
- Backups are a problem, so a first, clean approach to make sure that
backups are consistent is to still enforce full-page writes while a
backup is taken, which is what currently happens internally in Postgres,
and then resume the double writes once the backup is done.  Rewind is a
second problem, as a rewind would need proper tracking of the blocks
modified since the last common checkpoint before WAL forked, so that
operation would be unsupported at first.  Actually, that is not
completely true either, as it seems to me that it could be possible to
support both operations with a double-write WAL stream by making sure
that the stream is consistent with what's taken for backups.
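
As a rough illustration of the first point above, here is a standalone
sketch of what the eviction path could look like; dw_stream_t,
dw_stream_write() and evict_dirty_buffer() are hypothetical names, not
existing backend APIs:

#include <stdint.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Hypothetical handle for the secondary, double-write WAL stream. */
typedef struct dw_stream_t
{
    int         fd;             /* currently open double-write segment */
    off_t       insert_pos;     /* append position inside the segment */
} dw_stream_t;

/* Append one full page image to the double-write stream. */
static int
dw_stream_write(dw_stream_t *stream, const char *page)
{
    if (pwrite(stream->fd, page, BLCKSZ, stream->insert_pos) != BLCKSZ)
        return -1;
    stream->insert_pos += BLCKSZ;
    return 0;
}

/*
 * Eviction of a dirty buffer: the page image goes to the double-write
 * stream and is synced there, then the data file is written but not
 * synced immediately (that sync is left to the next checkpoint).
 */
static int
evict_dirty_buffer(dw_stream_t *stream, int data_fd, off_t data_offset,
                   const char *page)
{
    if (dw_stream_write(stream, page) < 0)
        return -1;
    if (fsync(stream->fd) < 0)  /* only the double-write stream is synced */
        return -1;
    if (pwrite(data_fd, page, BLCKSZ, data_offset) != BLCKSZ)
        return -1;
    return 0;
}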

I understand that this set of ideas is sort of crazy, but I wanted to
brainstorm a bit on the -hackers list, and I have had this set of ideas
for some time now, as there are many workloads, particularly OLTP-like
ones, where full-page writes are a large portion of the WAL stream
traffic.

(I am still participating in the war effort to stabilize and test v11 of
course, don't worry about that.)

Thanks,
--
Michael


Re: Double-writes, take two?

From: Fabien COELHO
Bonjour Michaël,

> - Double-write buffers use a predetermined number of pages (32 for the
> checkpointer, 128 divided into 4 buckets for the backends), which are
> synced to disk once each batch is full.

> - The double-write file of the checkpointer orders pages by file and
> block number to minimize the number of syncs needed, using a custom
> sequential I/O algorithm.

I'm not sure from reading the descriptions.

Are these particular features related/similar to 9cd00c4 "Checkpoint 
sorting and balancing" and 428b1d6 "Allow to trigger kernel writeback 
after a configurable number of writes", committed in February 2016?

-- 
Fabien.

Re: Double-writes, take two?

From: Robert Haas
On Wed, Apr 18, 2018 at 2:22 AM, Michael Paquier <michael@paquier.xyz> wrote:
> I was thinking about this problem, and it looks like one approach for
> double-writes would be to introduce it as a secondary WAL stream
> independent from the main one:
> - Once a buffer is evicted from shared buffers and is dirty, write it to
> the double-write stream and to the data file, and only sync it to the
> double-write stream.
> - At recovery, replay the WAL stream for double-writes first.

I don't really think that this can work.  If we're in archive recovery
(i.e. recovery of *indefinite* duration), what does it mean to replay
the double-writes "first"?

What I think probably needs to happen instead is that the secondary
WAL stream contains a bunch of records of the form < LSN, block ID,
page image >.  When recovery replays the WAL record for an LSN, it
also restores any double-write images for that LSN.  So in effect that
WAL format stays the way it is now, but the full page images are moved
out of line.
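
Roughly, a record of that secondary stream could look like this (a
standalone sketch with made-up field names, not an existing record
format):

#include <stdint.h>

#define BLCKSZ 8192

/*
 * The main WAL record keeps its current format; the full page image
 * lives out of line here, keyed by the LSN of the main record.
 */
typedef struct dw_record_t
{
    uint64_t    lsn;            /* LSN of the corresponding main WAL record */
    uint32_t    spc_id;         /* tablespace of the block */
    uint32_t    db_id;          /* database of the block */
    uint32_t    rel_id;         /* relation of the block */
    uint32_t    fork_num;       /* fork number */
    uint32_t    block_num;      /* block number inside the fork */
    char        page[BLCKSZ];   /* full page image */
} dw_record_t;

When recovery replays the main WAL record at a given LSN, it would
first restore any such image carrying that same LSN.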

If this is all done right, the standby should be able to regenerate
the double-write stream without receiving it from the master.  That
would be good, because then the volume of WAL from master to standby
would drop by a large amount.

However, it's hard to see how this would perform well.  The
double-write stream would have to obey the WAL-before-data rule; that
is, every eviction from shared buffers would have to flush the WAL
*and the double-write buffer*.  Unless we're running on hardware where
fsync() is very cheap, such as NVRAM, that increase in the total
number of fsyncs is probably going to pinch.  You'd probably want to
have a dwbuf_writer process like wal_writer so that the fsyncs can be
issued concurrently, but I suspect that the filesystem will execute
them sequentially anyway, hence the pinch.
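
To spell out that ordering, here is a minimal sketch with hypothetical
callbacks (flush_wal_upto(), dw_stream_flush() and write_data_page()
are made up for illustration):

#include <stdint.h>

/*
 * WAL-before-data with two streams: a single dirty eviction needs the
 * main WAL flushed up to the page's LSN *and* the double-write stream
 * flushed before the data file write may happen, hence two fsyncs.
 */
static int
evict_with_wal_before_data(uint64_t page_lsn,
                           int (*flush_wal_upto)(uint64_t lsn),
                           int (*dw_stream_flush)(void),
                           int (*write_data_page)(void))
{
    if (flush_wal_upto(page_lsn) < 0)   /* fsync #1: main WAL */
        return -1;
    if (dw_stream_flush() < 0)          /* fsync #2: double-write stream */
        return -1;
    return write_data_page();           /* only now is the data write safe */
}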

I think this is an interesting topic, but I don't plan to work on it
because I have no confidence that I could do it well enough to come
out ahead vs. the status quo.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Double-writes, take two?

From: Michael Paquier
On Wed, Apr 18, 2018 at 11:40:51AM +0200, Fabien COELHO wrote:
>> - Double-write buffers use a predetermined number of pages (32 for the
>> checkpointer, 128 divided into 4 buckets for the backends), which are
>> synced to disk once each batch is full.
>
>> - The double-write file of the checkpointer orders pages by file and
>> block number to minimize the number of syncs needed, using a custom
>> sequential I/O algorithm.
>
> I'm not sure from reading the descriptions.
>
> Are these particular features related/similar to 9cd00c4 "Checkpoint sorting
> and balancing" and 428b1d6 "Allow to trigger kernel writeback after a
> configurable number of writes", committed in February 2016?

No real direct links that I know of, but the work which has been done
could be beneficial for checkpoints which need to handle double-write
streams.
--
Michael


Re: Double-writes, take two?

From: Michael Paquier
On Thu, Apr 19, 2018 at 06:28:01PM -0400, Robert Haas wrote:
> On Wed, Apr 18, 2018 at 2:22 AM, Michael Paquier <michael@paquier.xyz> wrote:
>> I was thinking about this problem, and it looks like one approach for
>> double-writes would be to introduce it as a secondary WAL stream
>> independent from the main one:
>> - Once a buffer is evicted from shared buffers and is dirty, write it to
>> the double-write stream and to the data file, and only sync it to the
>> double-write stream.
>> - At recovery, replay the WAL stream for double-writes first.
>
> I don't really think that this can work.  If we're in archive recovery
> (i.e. recovery of *indefinite* duration), what does it mean to replay
> the double-writes "first"?

Ditto.  I really meant crash recovery for this description here.  The
former double-write patch suffers from the same limitation.

> What I think probably needs to happen instead is that the secondary
> WAL stream contains a bunch of records of the form < LSN, block ID,
> page image >.  When recovery replays the WAL record for an LSN, it
> also restores any double-write images for that LSN.  So in effect that
> WAL format stays the way it is now, but the full page images are moved
> out of line.
>
> If this is all done right, the standby should be able to regenerate
> the double-write stream without receiving it from the master.  That
> would be good, because then the volume of WAL from master to standby
> would drop by a large amount.

Agreed.  Actually you would need the same kind of logic for a base
backup, where both streams are received in parallel using two WAL
receivers.  After that, a new class of fun problems can come up:
- Parallel redo using multiple streams.
- Parallel redo using one WAL stream.

> However, it's hard to see how this would perform well.  The
> double-write stream would have to obey the WAL-before-data rule; that
> is, every eviction from shared buffers would have to flush the WAL
> *and the double-write buffer*.  Unless we're running on hardware where
> fsync() is very cheap, such as NVRAM, that increase in the total
> number of fsyncs is probably going to pinch.  You'd probably want to
> have a dwbuf_writer process like wal_writer so that the fsyncs can be
> issued concurrently, but I suspect that the filesystem will execute
> them sequentially anyway, hence the pinch.
>
> I think this is an interesting topic, but I don't plan to work on it
> because I have no confidence that I could do it well enough to come
> out ahead vs. the status quo.

Actually, I was thinking about all that, and it could actually be easy
enough to come up with a prototype patch if you just focus on the
following things and apply some restrictions:
- No support for replication and rewind.  Backups dynamically switch
full-page writes on, which is what happens now.
- Support for compression of double-write pages works the same way as in
the current WAL: skip the hole in the page if necessary, and honor
wal_compression.
- Tweak the XLogInsert interface so that it is able to route a generated
WAL record to a wanted stream at insertion time; in this case, use a
specific double-write record which is built using the same interface as
current WAL records, and insert it into either the "main" stream or the
"double-write" stream.

That would be enough to prove whether this approach has value: we
could run a battery of tests first and see whether something like that
is worth pursuing.

It could even be possible to come up with a patch which could be
presented; there are a bunch of embedded PostgreSQL boxes which do not
use replication by default but enable it later if the user decides to
do so, and where the backup frequency does not justify having full-page
writes always on.
--
Michael
