Double-writes, take two? - Mailing list pgsql-hackers
From: Michael Paquier
Subject: Double-writes, take two?
Msg-id: 20180418062240.GJ18178@paquier.xyz
List: pgsql-hackers
Hi all,

Back in 2012, Dan Scales, who was working on VMware Postgres, posted a
patch aimed at removing the need for full-page writes by introducing
the concept of double writes, using a double-write buffer approach, in
order to fix torn-page problems:
https://www.postgresql.org/message-id/1962493974.656458.1327703514780.JavaMail.root%40zimbra-prod-mbox-4.vmware.com

The patch published on that thread has roughly the following
characteristics:
- Double writes happen when a dirty buffer is evicted.
- Two double-write buffers are used, one for the checkpointer and one
for the other processes.
- LWLocks are used, with a bunch of memcpy calls, to maintain the
batches of double writes in a consistent state, and that's heavy.
- The double-write buffers use a pre-decided number of pages (32 for
the checkpointer, 128 divided into 4 buckets for the backends), which
are synced to disk once each batch is full.
- The double-write file of the checkpointer orders pages by block
number and file so as to minimize the number of syncs, using a custom
sequential I/O algorithm. This is aimed at improving performance.
- Processes wanting to write a page to the double-write file actually
push pages to the buffer first, which forces processes doing
smgrread() or similar activity to look at the double-write buffer as
well.
- A custom page-level checksum is used to make sure that a page in the
double-write file is not torn. Page checksums are normally optional,
and they were not yet implemented in Postgres at the time.
- The implementation relies heavily on LWLocks, which kind of sucks
for concurrency.
- If one looks at the patch, the amount of fsyncs done is actually
pretty high, and the patch uses an approach close to what WAL does...
More on that downthread.
- In order to identify each block in the double-write file, a 4kB
header is used to store each page's metadata, limiting the number of
pages which can be stored in a single double-write file.
- There is a performance hit on smgrread() and smgrsync(), as double
writes could be on their way to the DW file, so it is necessary to
look at the active batches and see if a wanted page is still there.
- O_DIRECT is used for the double-write files, which is normally not
mandatory. Peter G has actually reminded me that the fork of Postgres
which VMware had was using O_DIRECT, but this was dropped when a
switch to pure upstream happened. There is also a trace on the mailing
lists about that matter:
https://www.postgresql.org/message-id/529EEC1C.2040207@vmware.com
- At recovery, the files are replayed and truncated. There is one file
per batch of pages, written in a dedicated folder. If a page's
checksum is inconsistent in the double-write file, the page is
discarded. If the page is consistent but the original page of the
data file is not, the block from the double-write file is copied back
in place. (A rough sketch of this per-page check follows this list.)
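For illustration, here is a minimal sketch in C of that per-page
recovery check, under the assumption (matching the patch's custom
page-level checksum) that every page image carries its own checksum,
stored here in the last four bytes of the page. The names and the toy
rotate-and-xor checksum are mine, not the patch's:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/*
 * Toy stand-in for the patch's custom page-level checksum: the last
 * four bytes of each page hold a checksum computed over the preceding
 * BLCKSZ - 4 bytes, so any page image can be verified on its own.
 */
uint32_t
dw_compute_checksum(const char *page)
{
    uint32_t    sum = 0;

    for (int i = 0; i < BLCKSZ - 4; i += 4)
    {
        uint32_t    word;

        memcpy(&word, page + i, sizeof(word));
        sum = ((sum << 1) | (sum >> 31)) ^ word;  /* rotate-and-xor */
    }
    return sum;
}

bool
dw_page_is_consistent(const char *page)
{
    uint32_t    stored;

    memcpy(&stored, page + BLCKSZ - 4, sizeof(stored));
    return stored == dw_compute_checksum(page);
}

/*
 * Recovery-time decision for one page found in a double-write batch
 * file: returns true when the double-write copy should be copied back
 * over the matching block of the data file.
 */
bool
dw_needs_restore(const char *dw_page,   /* page image from the DW file */
                 const char *disk_page) /* same block from the data file */
{
    /* Torn copy in the DW file: discard it. */
    if (!dw_page_is_consistent(dw_page))
        return false;

    /* The data file block is intact: leave it alone. */
    if (dw_page_is_consistent(disk_page))
        return false;

    /* Good DW copy, torn data file block: restore from the DW file. */
    return true;
}

The first branch is the usual double-write argument: the double-write
file is synced before the data file is touched, so a torn copy there
implies the data file write never started and the block on disk is
still intact.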
I have spent some time studying the patch, and I am getting pretty
sure that the proposed approach has a lot of downsides and still
performs rather badly for cases with a large number of dirty page
evictions. OLTP loads would be less prone to that, but analytics
workloads would take a hit, like large scans with aggregates. One case
which I imagine would be rather bad is a post-checkpoint SELECT where
hint bits need to be set.

We already have wal_log_hints, which has a similar performance impact,
but from my reading of the code the proposed approach handles double
writes in a way that is far from optimal, and we already have
battle-proven facilities that could be reused.

One database system known for tackling torn-page problems using double
writes is InnoDB, a storage engine for MySQL/MariaDB. In this case,
the main portion of the code is here:
storage/innobase/buf/buf0dblwr.c
storage/innobase/include/buf0dblwr.h
And here are the docs:
https://mariadb.com/kb/en/library/xtradbinnodb-doublewrite-buffer/
The approach used by those folks is a single-file approach, whose
concurrency is controlled by a set of mutex locks.

I was thinking about this problem, and it looks like one approach for
double writes would be to introduce them as a secondary WAL stream,
independent from the main one:
- Once a dirty buffer is evicted from shared buffers, write it to the
double-write stream and to the data file, and only sync it to the
double-write stream (see the sketch after this list).
- The low-level WAL APIs need some refactoring, as the basic idea
would be to (ideally?) allow initialization of a wanted WAL facility
using an API layer similar to what has been introduced for SLRUs,
which is used for many facilities in the backend code.
- Compression of evicted pages can be supported the same way as we do
now for full-page writes with wal_compression.
- At recovery, replay the WAL stream for double writes first.
Truncation and/or recycling of those files happens in a way similar to
the normal WAL stream and is controlled by checkpoints.
- At checkpoint, truncate the double-write files which are not needed
anymore, as the corresponding data files' contents have been sync'ed.
- Backups are a problem, so a first, clean approach to make sure that
backups are consistent is to still enforce full-page writes while a
backup is taken, which is what currently happens internally in
Postgres, and then resume double writes once the backup is done.
Rewind is a second problem, as a rewind would need proper tracking of
the blocks modified since the last checkpoint where the WAL has
forked, so the operation would be unsupported. Actually, that is not
completely true either: it seems to me that both operations could be
supported with a double-write WAL stream by making sure that the
stream is consistent with what's taken for backups.
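To make the write path of the first point above concrete, here is a
minimal, file-backed sketch in C. The DWStream type and both function
names are hypothetical stand-ins for the refactored WAL-like facility
mentioned in the second point, not anything that exists in the tree:

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Minimal file-backed stand-in for the proposed double-write stream:
 * an append-only file which is fsync'ed before the data file is
 * touched.  Per-page metadata identifying the target relation and
 * block, which the stream would need for replay, is omitted here.
 */
typedef struct
{
    int         fd;             /* append-only double-write segment */
} DWStream;

int
dw_stream_open(DWStream *dw, const char *path)
{
    dw->fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    return (dw->fd < 0) ? -1 : 0;
}

/*
 * Eviction path for one dirty buffer under the proposed scheme:
 * 1. Append the full page image to the double-write stream.
 * 2. fsync() the stream, the only sync on this path; it hits an
 *    append-only file, so it stays sequential and can be batched
 *    across many evictions.
 * 3. Write the page into the data file at its block offset, with no
 *    immediate fsync: if this write is torn, recovery replays the
 *    double-write stream first and restores a consistent image before
 *    normal WAL redo runs.
 */
int
evict_dirty_buffer(DWStream *dw, int datafd,
                   uint32_t blocknum, const char *page)
{
    if (write(dw->fd, page, BLCKSZ) != BLCKSZ)
        return -1;
    if (fsync(dw->fd) != 0)
        return -1;
    if (pwrite(datafd, page, BLCKSZ, (off_t) blocknum * BLCKSZ) != BLCKSZ)
        return -1;
    return 0;
}

Compression of the page image before the write() would slot in exactly
where wal_compression handles full-page writes today.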
I understand that this set of ideas is sort of crazy, but I wanted to
brainstorm a bit on the -hackers list, and I have been sitting on this
set of ideas for some time now, as there are many loads, particularly
OLTP-like ones, where full-page writes are a large portion of the WAL
stream traffic. (I am still participating in the war effort to
stabilize and test v11 of course, don't worry about that.)

Thanks,
--
Michael