Re: Double-writes, take two? - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Double-writes, take two?
Msg-id 20180419235635.GB2024@paquier.xyz
In response to Re: Double-writes, take two?  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Apr 19, 2018 at 06:28:01PM -0400, Robert Haas wrote:
> On Wed, Apr 18, 2018 at 2:22 AM, Michael Paquier <michael@paquier.xyz> wrote:
>> I was thinking about this problem, and it looks that one approach for
>> double-writes would be to introduce it as a secondary WAL stream
>> independent from the main one:
>> - Once a buffer is evicted from shared buffers and is dirty, write it to
>> double-write stream and to the data file, and only sync it to the
>> double-write stream.
>> - At recovery, replay the WAL stream for double-writes first.
>
> I don't really think that this can work.  If we're in archive recovery
> (i.e. recovery of *indefinite* duration), what does it mean to replay
> the double-writes "first"?

Right.  I really meant crash recovery in that description.  The
earlier double-write patch suffers from the same limitation.

> What I think probably needs to happen instead is that the secondary
> WAL stream contains a bunch of records of the form < LSN, block ID,
> page image >.  When recovery replays the WAL record for an LSN, it
> also restores any double-write images for that LSN.  So in effect that
> WAL format stays the way it is now, but the full page images are moved
> out of line.
>
> If this is all done right, the standby should be able to regenerate
> the double-write stream without receiving it from the master.  That
> would be good, because then the volume of WAL from master to standby
> would drop by a large amount.

Agreed.  Actually you would need the same kind of logic for a base
backup, where both streams are received in parallel using two WAL
receivers.  After that comes a new class of fun problems:
- Parallel redo using multiple streams.
- Parallel redo using one WAL stream.
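
To make the out-of-line format quoted above a bit more concrete, here is
a minimal sketch of a < LSN, block ID, page image > record and of the
lookup redo would do for each main-stream record.  Every name here
(DwRecord, DwBlockId, dw_images_for_lsn) is made up for illustration and
none of it exists in PostgreSQL.  It also assumes the double-write
stream is sorted by LSN and fits in an array, which a real
implementation would not rely on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DW_PAGE_SIZE 8192   /* assumed page size, matching the default BLCKSZ */

/* Hypothetical block identifier: which file, and which block in it. */
typedef struct DwBlockId {
    uint32_t relfile;
    uint32_t blkno;
} DwBlockId;

/* Hypothetical double-write record: < LSN, block ID, page image >. */
typedef struct DwRecord {
    uint64_t  lsn;                  /* LSN of the main-stream record it belongs to */
    DwBlockId blk;
    uint8_t   page[DW_PAGE_SIZE];
} DwRecord;

/*
 * Find the page images that belong to the main-stream record at `lsn`.
 * `stream` is assumed sorted by LSN; `*cursor` remembers how far redo has
 * already advanced.  On return, `*first` is the index of the first matching
 * image and the return value is the number of consecutive matches.
 */
size_t dw_images_for_lsn(const DwRecord *stream, size_t nrecords,
                         size_t *cursor, uint64_t lsn, size_t *first)
{
    size_t count = 0;

    assert(stream != NULL && cursor != NULL && first != NULL);

    /* Skip images tagged with LSNs that redo has already passed. */
    while (*cursor < nrecords && stream[*cursor].lsn < lsn)
        (*cursor)++;

    *first = *cursor;

    /* Collect every image tagged with exactly this LSN. */
    while (*cursor < nrecords && stream[*cursor].lsn == lsn) {
        (*cursor)++;
        count++;
    }
    return count;
}
```

Redo would call this once per main-stream record and restore the
returned images before applying the record, which is what keeps the
full page images out of line without changing the main WAL format.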

> However, it's hard to see how this would perform well.  The
> double-write stream would have to obey the WAL-before-data rule; that
> is, every eviction from shared buffers would have to flush the WAL
> *and the double-write buffer*.  Unless we're running on hardware where
> fsync() is very cheap, such as NVRAM, that increase in the total
> number of fsyncs is probably going to pinch.  You'd probably want to
> have a dwbuf_writer process like wal_writer so that the fsyncs can be
> issued concurrently, but I suspect that the filesystem will execute
> them sequentially anyway, hence the pinch.
>
> I think this is an interesting topic, but I don't plan to work on it
> because I have no confidence that I could do it well enough to come
> out ahead vs. the status quo.
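
For reference, the ordering described in the quoted text can be
sketched as a toy model that only counts fsyncs.  This is not
PostgreSQL code: SketchLog, SketchPage and sketch_evict are hypothetical
names, and positions are plain byte counters rather than real LSNs.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 8192u   /* assumed page size */

/* Toy log stream: how much has been appended, how much is durable. */
typedef struct SketchLog {
    uint64_t insert_pos;
    uint64_t flush_pos;
    unsigned fsyncs;             /* number of flushes that did real work */
} SketchLog;

typedef struct SketchPage {
    uint64_t lsn;                /* LSN of the last WAL record touching the page */
    int      dirty;
} SketchPage;

/* Make the log durable up to `upto`; a no-op if it already is. */
static void sketch_flush(SketchLog *log, uint64_t upto)
{
    assert(upto <= log->insert_pos);
    if (log->flush_pos >= upto)
        return;
    log->flush_pos = upto;
    log->fsyncs++;
}

/* Evict one dirty page while obeying WAL-before-data on both streams. */
void sketch_evict(SketchLog *wal, SketchLog *dw, SketchPage *page)
{
    /* 1. The main WAL must be durable up to the page's LSN. */
    sketch_flush(wal, page->lsn);

    /* 2. Append the page image to the double-write stream and flush it. */
    dw->insert_pos += SKETCH_PAGE_SIZE;
    sketch_flush(dw, dw->insert_pos);

    /* 3. Only now write the page to the data file (not synced here). */
    page->dirty = 0;
}
```

In this model the main WAL flush is often already satisfied by an
earlier one, while the double-write stream needs a fresh flush on every
eviction, which is where the extra fsyncs come from.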

Actually, I was thinking about all that, and it could be easy enough to
come up with a prototype patch if you focus on the following things and
apply some restrictions:
- No support for replication and rewind.  Backups dynamically switch
full page writes to on, which is what happens now.
- Support for compression of double-write pages works the same way as in
the current WAL: skip the hole in the page if necessary, allow
wal_compression.
- Tweak the XLogInsert interface so that it can direct a generated WAL
record to a chosen stream at insertion.  In this case, use a specific
double-write record, built using the same interface as current WAL
records, and insert it in either the "main" stream or the
"double-write" stream.

That would be enough to show whether this approach has value: we could
run a battery of tests first and decide from the results.
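
As a rough illustration of the XLogInsert tweak in the list above, an
insertion routine that takes a target stream could look like the sketch
below.  It is deliberately not the real XLogInsert API: sketch_insert,
SketchStreamId, SketchRecordHeader and the fixed-size in-memory buffers
are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_STREAM_CAPACITY 4096

/* Hypothetical stream selector passed at insertion time. */
typedef enum {
    SKETCH_STREAM_MAIN = 0,
    SKETCH_STREAM_DOUBLE_WRITE = 1,
    SKETCH_NUM_STREAMS
} SketchStreamId;

/* Both streams share one record header layout. */
typedef struct {
    uint32_t total_len;   /* header plus payload, in bytes */
    uint8_t  rmid;        /* resource manager id */
    uint8_t  info;        /* record-specific flags */
} SketchRecordHeader;

typedef struct {
    uint8_t  buf[SKETCH_STREAM_CAPACITY];
    uint64_t insert_pos;  /* next free byte; stands in for an LSN */
} SketchStream;

static SketchStream sketch_streams[SKETCH_NUM_STREAMS];

/*
 * Append one record to the chosen stream.  Returns the end position of the
 * record within that stream, or 0 if the record does not fit.
 */
uint64_t sketch_insert(SketchStreamId target, uint8_t rmid, uint8_t info,
                       const void *payload, uint32_t len)
{
    SketchStream *s;
    SketchRecordHeader hdr;
    uint64_t need = (uint64_t) sizeof(hdr) + len;

    assert(target == SKETCH_STREAM_MAIN || target == SKETCH_STREAM_DOUBLE_WRITE);
    s = &sketch_streams[target];

    if (s->insert_pos + need > SKETCH_STREAM_CAPACITY)
        return 0;

    memset(&hdr, 0, sizeof(hdr));   /* also zeroes any padding bytes */
    hdr.total_len = (uint32_t) need;
    hdr.rmid = rmid;
    hdr.info = info;

    memcpy(s->buf + s->insert_pos, &hdr, sizeof(hdr));
    if (len > 0)
        memcpy(s->buf + s->insert_pos + sizeof(hdr), payload, len);

    s->insert_pos += need;
    return s->insert_pos;
}
```

The point of the sketch is only that record construction is shared and
the stream is chosen by the caller, so the "main" stream does not move
when a page image goes to the "double-write" stream.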

It might even be possible to come up with a patch which could be
presented: there are a bunch of embedded PostgreSQL boxes which do not
use replication by default but enable it later if the user decides to do
so, and where backup frequency does not justify having full page writes
always on.
--
Michael
