Thread: Double-writes, take two?
Hi all,

Back in 2012, Dan Scales, who was working on VMware Postgres, posted a
patch aimed at removing the need for full-page writes by introducing
double writes, using a double-write buffer approach to fix torn-page
problems:
https://www.postgresql.org/message-id/1962493974.656458.1327703514780.JavaMail.root%40zimbra-prod-mbox-4.vmware.com

The patch published on that thread has roughly the following
characteristics:
- Double writes happen when a dirty buffer is evicted.
- Two double-write buffers are used, one for the checkpointer and one
for other processes.
- LWLocks are used, with a bunch of memcpy calls, to keep the batches of
double writes in a consistent state, and that's heavy.
- The double-write buffers use a pre-decided number of pages (32 for
the checkpointer, 128 divided into 4 buckets for the backends), which
are synced to disk once each batch is full.
- The double-write file of the checkpointer orders pages by file and
block number to minimize the number of syncs, using a custom
sequential I/O algorithm. This last point is aimed at improving
performance.
- Processes willing to write a page to the double-write file actually
push pages to the buffer first, which also forces processes doing some
smgrread() activity or such to look at the double-write buffer.
- A custom page-level checksum is used to make sure that a page in the
double-write file is not torn. Page checksums are normally not
mandatory, and they were not yet implemented in Postgres at the time.
- The implementation relies heavily on LWLocks, which kind of sucks for
concurrency.
- If one looks at the patch, the amount of fsyncs done is actually
pretty high, and the patch uses an approach close to what WAL does...
More on that downthread.
- In order to identify each block in the double-write file, a 4kB
header is used to store each page's metadata, limiting the number of
pages which can be stored in a single double-write file.
- There is a performance hit in smgrread and smgrsync, as double writes
could be on the way to the DW file, so it is necessary to look at the
active batches and see if a wanted page is still there.
- O_DIRECT is used for the double-write files, which is normally not
mandatory. Peter G has actually reminded me that the fork of Postgres
which VMware had was using O_DIRECT, but this was dropped when the
switch to pure upstream happened. There is also a trace of that matter
on the mailing lists:
https://www.postgresql.org/message-id/529EEC1C.2040207@vmware.com
- At recovery, the files are replayed and truncated. There is one file
per batch of pages written, in a dedicated folder. If a page's checksum
is inconsistent in the double-write file, then it is discarded. If the
page is consistent but the original page of the data file is not, then
the block from the double-write file is copied back in place (a rough
sketch of that check is below).

I have spent some time studying the patch, and I am getting pretty much
sure that the approach proposed has a lot of downsides and still
performs rather badly for cases with a large number of dirty page
evictions. OLTP loads would be less prone to that, but analytics
workloads would take a hit, like large scans with aggregates. One case
which would be rather bad, I imagine, is a post-checkpoint SELECT where
hint bits need to be set.
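To make that last recovery step a bit more concrete, here is a rough
sketch of the per-page decision as I understand it from the patch.
dw_page_is_consistent() is only a placeholder for the patch's custom
checksum verification; none of the names below come from the patch:

#include <stdbool.h>
#include <stdint.h>

/* Placeholder for the patch's custom page-level checksum check, which
 * verifies a page image against the checksum embedded in it. */
extern bool dw_page_is_consistent(const char *page, uint32_t block_no);

/*
 * Decide what to do with one page stored in a double-write file during
 * crash recovery.  Returns true if the copy kept in the double-write
 * file should be written back over the data file's block.
 */
bool
dw_page_needs_restore(const char *dw_page,    /* copy from the DW file */
                      const char *data_page,  /* copy from the data file */
                      uint32_t block_no)
{
    /*
     * A torn copy inside the double-write file itself is simply
     * discarded: the data file was not yet overwritten for that batch,
     * so its current copy is the valid one.
     */
    if (!dw_page_is_consistent(dw_page, block_no))
        return false;

    /* The DW copy is intact, so copy it back only if the data file's
     * copy is torn. */
    return !dw_page_is_consistent(data_page, block_no);
}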
We already have wal_log_hints, which has a similar performance impact,
but from my read of the code and of the proposed approach, the way the
double writes are handled is way less than optimal, and we already have
battle-proven facilities that can be reused.

One database system which is known for tackling torn-page problems
using double writes is InnoDB, a storage engine for MySQL/MariaDB. In
this case, the main portion of the code is here:
storage/innobase/buf/buf0dblwr.c
storage/innobase/include/buf0dblwr.h
And here are the docs:
https://mariadb.com/kb/en/library/xtradbinnodb-doublewrite-buffer/
The approach used by those folks is a single-file approach, whose
concurrency is controlled by a set of mutex locks.

I was thinking about this problem, and it looks like one approach for
double writes would be to introduce it as a secondary WAL stream,
independent from the main one:
- Once a dirty buffer is evicted from shared buffers, write it to the
double-write stream and to the data file, and only sync the
double-write stream (a sketch of that ordering is in the PS below).
- The low-level WAL APIs need some refactoring, as the basic idea would
be to (ideally?) allow initialization of a wanted WAL facility using an
API layer similar to what has been introduced for SLRUs, which is used
by many facilities in the backend code.
- Compression of evicted pages can be supported the same way as we do
now for full-page writes, using wal_compression.
- At recovery, replay the WAL stream for double writes first.
Truncation and/or recycling of those files happens in a way similar to
the normal WAL stream and is controlled by checkpoints.
- At checkpoint, truncate the double-write files which are not needed
anymore, as the corresponding data files have been synced.
- Backups are a problem, so a first, clean approach to make sure that
backups are consistent is to still enforce full-page writes while a
backup is taken, which is what currently happens internally in
Postgres, and then resume the double writes once the backup is done.
Rewind is a second problem, as a rewind would need proper tracking of
the blocks modified since the last checkpoint where WAL has forked, so
the operation would be unsupported. Actually, this is not completely
true either; it seems to me that it could be possible to support both
operations with a double-write WAL stream by making sure that the
stream is consistent with what's taken for backups.

I understand that this set of ideas is sort of crazy, but I wanted to
brainstorm a bit on the -hackers list. I have had this set of ideas in
mind for some time now, as there are many loads, particularly OLTP-like
ones, where full-page writes are a large portion of the WAL traffic.
(I am still participating in the war effort to stabilize and test v11
of course, don't worry about that.)

Thanks,
--
Michael
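PS: To make the eviction ordering in the first point above a bit more
concrete, here is a rough sketch. None of the dw_* functions exist;
they only stand in for "append to the double-write stream", "flush that
stream" and the usual smgr write path:

#include <stdint.h>

typedef char *Page;                   /* stands in for an 8kB page image */
typedef struct
{
    uint32_t    spc;                  /* tablespace */
    uint32_t    db;                   /* database */
    uint32_t    rel;                  /* relation file */
} FileId;                             /* stands in for a relation file identifier */

/* Hypothetical double-write stream API, loosely shaped like the WAL one. */
extern uint64_t dw_stream_append(FileId file, uint32_t blkno, Page page);
extern void     dw_stream_flush(uint64_t upto);
extern void     data_file_write(FileId file, uint32_t blkno, Page page);

void
evict_dirty_buffer(FileId file, uint32_t blkno, Page page)
{
    /* 1. Append the full page image to the double-write stream. */
    uint64_t    dw_lsn = dw_stream_append(file, blkno, page);

    /* 2. The only sync on this path: flush the double-write stream up
     *    to the position of the record just appended. */
    dw_stream_flush(dw_lsn);

    /* 3. Write the page to its data file with no immediate sync.  The
     *    data file is synced later by the checkpointer, after which
     *    the double-write segments covering it can be truncated or
     *    recycled. */
    data_file_write(file, blkno, page);
}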
Bonjour Michaël,

> - The double-write buffers use a pre-decided number of pages (32 for
> the checkpointer, 128 divided into 4 buckets for the backends), which
> are synced to disk once each batch is full.

> - The double-write file of the checkpointer orders pages by file and
> block number to minimize the number of syncs, using a custom
> sequential I/O algorithm.

I'm not sure from reading the descriptions.

Are these particular features related/similar to 9cd00c4 "Checkpoint
sorting and balancing" and 428b1d6 "Allow to trigger kernel writeback
after a configurable number of writes", committed in February 2016?

--
Fabien.
On Wed, Apr 18, 2018 at 2:22 AM, Michael Paquier <michael@paquier.xyz> wrote:
> I was thinking about this problem, and it looks like one approach for
> double writes would be to introduce it as a secondary WAL stream,
> independent from the main one:
> - Once a dirty buffer is evicted from shared buffers, write it to the
> double-write stream and to the data file, and only sync the
> double-write stream.
> - At recovery, replay the WAL stream for double writes first.

I don't really think that this can work.  If we're in archive recovery
(i.e. recovery of *indefinite* duration), what does it mean to replay
the double-writes "first"?

What I think probably needs to happen instead is that the secondary
WAL stream contains a bunch of records of the form <LSN, block ID,
page image>.  When recovery replays the WAL record for an LSN, it also
restores any double-write images for that LSN.  So in effect that WAL
format stays the way it is now, but the full page images are moved out
of line.

If this is all done right, the standby should be able to regenerate
the double-write stream without receiving it from the master.  That
would be good, because then the volume of WAL from master to standby
would drop by a large amount.

However, it's hard to see how this would perform well.  The
double-write stream would have to obey the WAL-before-data rule; that
is, every eviction from shared buffers would have to flush the WAL
*and the double-write buffer*.  Unless we're running on hardware where
fsync() is very cheap, such as NVRAM, that increase in the total
number of fsyncs is probably going to pinch.  You'd probably want to
have a dwbuf_writer process like wal_writer so that the fsyncs can be
issued concurrently, but I suspect that the filesystem will execute
them sequentially anyway, hence the pinch.

I think this is an interesting topic, but I don't plan to work on it
because I have no confidence that I could do it well enough to come
out ahead vs. the status quo.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
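For illustration, here is one possible on-disk shape for the <LSN,
block ID, page image> records described above. Every name below is
invented; nothing like this exists in the tree today:

#include <stdint.h>

#define DW_BLCKSZ 8192              /* standard Postgres block size */

/*
 * One record of the secondary stream: the LSN of the main-WAL record
 * the image belongs to, the block ID, and the page image itself kept
 * out of line from the main WAL.
 */
typedef struct DoubleWriteRecord
{
    uint64_t    main_lsn;           /* LSN of the main-WAL record */
    uint32_t    tablespace_oid;     /* block ID: tablespace ... */
    uint32_t    database_oid;       /* ... database ... */
    uint32_t    relfile_oid;        /* ... relation file ... */
    uint32_t    fork_number;        /* ... fork (main, FSM, VM) ... */
    uint32_t    block_number;       /* ... and block within the fork */
    uint32_t    image_length;       /* DW_BLCKSZ, or less with a hole skipped */
    char        image[DW_BLCKSZ];   /* page image, image_length bytes used */
} DoubleWriteRecord;

With a layout along those lines, recovery would restore any images
whose main_lsn matches the record it is about to replay, and a standby
could rebuild the stream locally from its own replay instead of
receiving it from the master.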
On Wed, Apr 18, 2018 at 11:40:51AM +0200, Fabien COELHO wrote:
>> - The double-write buffers use a pre-decided number of pages (32 for
>> the checkpointer, 128 divided into 4 buckets for the backends), which
>> are synced to disk once each batch is full.
>
>> - The double-write file of the checkpointer orders pages by file and
>> block number to minimize the number of syncs, using a custom
>> sequential I/O algorithm.
>
> I'm not sure from reading the descriptions.
>
> Are these particular features related/similar to 9cd00c4 "Checkpoint
> sorting and balancing" and 428b1d6 "Allow to trigger kernel writeback
> after a configurable number of writes", committed in February 2016?

No direct links that I know of, but the work which has been done there
could be beneficial for checkpoints which need to handle double-write
streams.

--
Michael
On Thu, Apr 19, 2018 at 06:28:01PM -0400, Robert Haas wrote:
> On Wed, Apr 18, 2018 at 2:22 AM, Michael Paquier <michael@paquier.xyz> wrote:
>> I was thinking about this problem, and it looks like one approach for
>> double writes would be to introduce it as a secondary WAL stream,
>> independent from the main one:
>> - Once a dirty buffer is evicted from shared buffers, write it to the
>> double-write stream and to the data file, and only sync the
>> double-write stream.
>> - At recovery, replay the WAL stream for double writes first.
>
> I don't really think that this can work.  If we're in archive recovery
> (i.e. recovery of *indefinite* duration), what does it mean to replay
> the double-writes "first"?

Indeed. I really meant crash recovery in this description. The former
double-write patch suffers from the same limitation.

> What I think probably needs to happen instead is that the secondary
> WAL stream contains a bunch of records of the form <LSN, block ID,
> page image>.  When recovery replays the WAL record for an LSN, it
> also restores any double-write images for that LSN.  So in effect
> that WAL format stays the way it is now, but the full page images are
> moved out of line.
>
> If this is all done right, the standby should be able to regenerate
> the double-write stream without receiving it from the master.  That
> would be good, because then the volume of WAL from master to standby
> would drop by a large amount.

Agreed. Actually, you would need the same kind of logic for a base
backup, where both streams are received in parallel using two WAL
receivers. After that, a new class of fun problems can come up:
- Parallel redo using multiple streams.
- Parallel redo using one WAL stream.

> However, it's hard to see how this would perform well.  The
> double-write stream would have to obey the WAL-before-data rule; that
> is, every eviction from shared buffers would have to flush the WAL
> *and the double-write buffer*.  Unless we're running on hardware where
> fsync() is very cheap, such as NVRAM, that increase in the total
> number of fsyncs is probably going to pinch.  You'd probably want to
> have a dwbuf_writer process like wal_writer so that the fsyncs can be
> issued concurrently, but I suspect that the filesystem will execute
> them sequentially anyway, hence the pinch.
>
> I think this is an interesting topic, but I don't plan to work on it
> because I have no confidence that I could do it well enough to come
> out ahead vs. the status quo.

Actually, I was thinking about all that, and it could actually be easy
enough to come up with a prototype patch if you just focus on the
following things and apply some restrictions:
- No support for replication and rewind. Backups dynamically switch
full-page writes to on, which is what happens now.
- Support for compression of double-write pages works the same way as
in the current WAL: skip the hole in the page if necessary, allow
wal_compression.
- Tweak the XLogInsert interface so that it is able to direct a
generated WAL record to a wanted stream at insertion time; in this
case use a specific double-write record, built using the same
interface as current WAL records, and insert it into either the "main"
stream or the "double-write" stream (see the sketch below).

That would be enough to see whether this approach has value, as we
could run a battery of tests first.
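To illustrate that last point, here is roughly what I have in mind,
assuming the usual xloginsert.h machinery. XLogStream,
XLogInsertToStream and XLOG_DW_FPI are invented names; the record
construction itself stays exactly as today:

#include "postgres.h"

#include "access/xloginsert.h"

/* Hypothetical target streams for an insertion. */
typedef enum XLogStream
{
    XLOG_STREAM_MAIN,           /* the WAL stream we have today */
    XLOG_STREAM_DOUBLE_WRITE    /* full-page images of evicted buffers */
} XLogStream;

/* Hypothetical variant of XLogInsert(rmid, info) taking a target stream. */
extern XLogRecPtr XLogInsertToStream(RmgrId rmid, uint8 info,
                                     XLogStream stream);

/* Hypothetical info code for full-page records of the double-write stream. */
#define XLOG_DW_FPI     0x00

/*
 * Eviction path: the record is built with the usual xloginsert.c calls,
 * so hole-skipping and wal_compression apply as for any full-page
 * image; only the target stream changes.
 */
XLogRecPtr
log_evicted_page(Buffer buffer)
{
    XLogBeginInsert();
    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | REGBUF_FORCE_IMAGE);

    /* RM_XLOG_ID is what standalone full-page images use today. */
    return XLogInsertToStream(RM_XLOG_ID, XLOG_DW_FPI,
                              XLOG_STREAM_DOUBLE_WRITE);
}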
It could even be possible to come up with a patch worth presenting:
there are a bunch of embedded PostgreSQL boxes which do not use
replication by default but enable it later if the user decides to, and
where the backup frequency does not justify having full-page writes
always on.

--
Michael