Re: Double-writes, take two? - Mailing list pgsql-hackers
From:           Michael Paquier
Subject:        Re: Double-writes, take two?
Msg-id:         20180419235635.GB2024@paquier.xyz
In response to: Re: Double-writes, take two? (Robert Haas <robertmhaas@gmail.com>)
List:           pgsql-hackers
On Thu, Apr 19, 2018 at 06:28:01PM -0400, Robert Haas wrote:
> On Wed, Apr 18, 2018 at 2:22 AM, Michael Paquier <michael@paquier.xyz> wrote:
>> I was thinking about this problem, and it looks like one approach for
>> double-writes would be to introduce them as a secondary WAL stream
>> independent from the main one:
>> - Once a buffer evicted from shared buffers is dirty, write it to the
>> double-write stream and to the data file, and only sync it to the
>> double-write stream.
>> - At recovery, replay the WAL stream for double-writes first.
>
> I don't really think that this can work.  If we're in archive recovery
> (i.e. recovery of *indefinite* duration), what does it mean to replay
> the double-writes "first"?

Ditto.  I really meant crash recovery in this description.  The former
double-write patch suffers from the same limitation.

> What I think probably needs to happen instead is that the secondary
> WAL stream contains a bunch of records of the form <LSN, block ID,
> page image>.  When recovery replays the WAL record for an LSN, it
> also restores any double-write images for that LSN.  So in effect the
> WAL format stays the way it is now, but the full page images are
> moved out of line.
>
> If this is all done right, the standby should be able to regenerate
> the double-write stream without receiving it from the master.  That
> would be good, because then the volume of WAL from master to standby
> would drop by a large amount.

Agreed.  Actually, you would need the same kind of logic for a base
backup, where both streams are received in parallel using two WAL
receivers.  After that, a new class of fun problems comes up:
- Parallel redo using multiple streams.
- Parallel redo using one WAL stream.

> However, it's hard to see how this would perform well.  The
> double-write stream would have to obey the WAL-before-data rule; that
> is, every eviction from shared buffers would have to flush the WAL
> *and the double-write buffer*.  Unless we're running on hardware where
> fsync() is very cheap, such as NVRAM, that increase in the total
> number of fsyncs is probably going to pinch.  You'd probably want to
> have a dwbuf_writer process like wal_writer so that the fsyncs can be
> issued concurrently, but I suspect that the filesystem will execute
> them sequentially anyway, hence the pinch.
>
> I think this is an interesting topic, but I don't plan to work on it
> because I have no confidence that I could do it well enough to come
> out ahead vs. the status quo.

Actually, I was thinking about all that, and it could be easy enough
to come up with a prototype patch if you just focus on the following
things and apply some restrictions:
- No support for replication and rewind.  Backups dynamically switch
full_page_writes to on, which is what happens now.
- Compression of double-write pages works the same way as in the
current WAL: skip the hole in the page if necessary, and allow
wal_compression.
- Tweak the XLogInsert interface so that it can direct a generated WAL
record to a chosen stream at insertion time; in this case, use a
specific double-write record, built with the same interface as current
WAL records, and insert it into either the "main" stream or the
"double-write" stream (a rough sketch of what this could look like is
below).
That would be enough to prove whether this approach has value, as we
could run a battery of tests first and see if something like that is
worthwhile.
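To make the last two points more concrete, here is a minimal sketch in
C of what the out-of-line record format and a stream-aware insertion
interface could look like.  All the names below (DoubleWriteRecord,
XLogStream, XLogInsertStream) are hypothetical and exist nowhere in
the tree; only the referenced types (XLogRecPtr, RelFileNode, and
friends) are real:

#include "postgres.h"
#include "access/rmgr.h"
#include "access/xlogdefs.h"
#include "common/relpath.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

/*
 * A record in the double-write stream, following the <LSN, block ID,
 * page image> shape discussed upthread.  As with FPWs in the current
 * WAL, the hole between pd_lower and pd_upper can be elided and the
 * image optionally compressed.
 */
typedef struct DoubleWriteRecord
{
    XLogRecPtr  main_lsn;       /* LSN of the main-stream record whose
                                 * full-page image was moved out of line */
    RelFileNode rnode;          /* relation of the block */
    ForkNumber  forknum;        /* fork of the block */
    BlockNumber blkno;          /* block number of the image */
    uint16      hole_offset;    /* start of elided hole, 0 if none */
    uint16      hole_length;    /* length of elided hole */
    bool        is_compressed;  /* image compressed per wal_compression? */
    /* (possibly compressed) page image data follows */
} DoubleWriteRecord;

/* Target stream for a record assembled via the XLogInsert machinery. */
typedef enum XLogStream
{
    XLOG_STREAM_MAIN,           /* existing WAL stream */
    XLOG_STREAM_DOUBLE_WRITE    /* secondary, double-write stream */
} XLogStream;

/*
 * Variant of XLogInsert() taking the target stream, so that a record
 * built with the existing XLogBeginInsert()/XLogRegister*() calls can
 * land in either stream at insertion time.
 */
extern XLogRecPtr XLogInsertStream(RmgrId rmid, uint8 info,
                                   XLogStream stream);

How much of the existing record-assembly machinery could be reused
as-is for the second stream is of course the crux of the prototype.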
It could even be possible to come up with a patch worth presenting:
there are a bunch of embedded PostgreSQL boxes which do not use
replication by default but enable it later if the user decides to, and
where the backup frequency does not justify having full page writes
always on.
--
Michael