Re: [PoC] Non-volatile WAL buffer - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [PoC] Non-volatile WAL buffer
Date
Msg-id CA+TgmoZWvm36GyYNDn3gksVAkuPrc86G9W4of8AgYR=SSU7Lmw@mail.gmail.com
In response to RE: [PoC] Non-volatile WAL buffer  (Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>)
Responses RE: [PoC] Non-volatile WAL buffer  (Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>)
Re: [PoC] Non-volatile WAL buffer  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:
> It sounds reasonable, but I'm sorry that I haven't tested such a program
> yet.  I'll try it to compare with my non-volatile WAL buffer.  For now, I'm
> a little worried about the overhead of mmap()/munmap() for each WAL segment
> file.

I guess the question here is how the cost of one mmap() and munmap()
pair per WAL segment (normally 16MB) compares to the cost of one
write() per block (normally 8kB). It could be that mmap() is a more
expensive call than write(), but by a small enough margin that the
vastly reduced number of system calls makes it a winner. But that's
just speculation, because I don't know how heavy mmap() actually is.
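
To put rough numbers on that comparison, here is a minimal sketch of the
arithmetic, assuming the default sizes mentioned above and exactly one
write() per block. The function names are made up for illustration, and
the break-even figure counts only system calls, not page faults or
anything else mmap() might cost.

```c
#include <assert.h>
#include <stddef.h>

/* Default sizes mentioned above; real installations may differ. */
#define WAL_SEGMENT_SIZE ((size_t) 16 * 1024 * 1024)   /* 16MB */
#define WAL_BLOCK_SIZE   ((size_t) 8 * 1024)           /* 8kB */

/* Number of write() calls needed to fill one segment block by block. */
static size_t
writes_per_segment(size_t segment_size, size_t block_size)
{
    return segment_size / block_size;
}

/*
 * How many times more expensive than one write() each call of the
 * mmap()/munmap() pair could be while still breaking even on system
 * call overhead alone: N writes versus 2 calls.
 */
static double
break_even_cost_ratio(size_t segment_size, size_t block_size)
{
    return (double) writes_per_segment(segment_size, block_size) / 2.0;
}
```

With the defaults, writes_per_segment() gives 2048 and
break_even_cost_ratio() gives 1024, so on call count alone mmap() would
have to be very heavy indeed to lose.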

I have a different concern. I think that, right now, when we reuse a
WAL segment, we write entire blocks at a time, so the old contents of
the WAL segment are overwritten without ever being read. But that
behavior might not be maintained when using mmap(). It might be that
as soon as we write the first byte to a mapped page, the old contents
have to be faulted into memory. Indeed, it's unclear how it could be
otherwise, since the VM page must be made read-write at that point and
the system cannot know that we will overwrite the whole page. But
reading in the old contents of a recycled WAL file just to overwrite
them seems like it would be disastrously expensive.
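
The two overwrite paths being contrasted can be sketched as follows. This
is only an illustration with hypothetical helper names, not PostgreSQL
code, and it assumes a POSIX system. Whether the mapped path really reads
the old contents from disk depends on the kernel and on what is already
in the page cache; the sketch shows where the fault would occur, not how
much it costs.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 8192

/*
 * write()-style path: the kernel is handed a whole block at once, so it
 * can replace the old contents without looking at them.
 */
static int
overwrite_block_pwrite(int fd, off_t offset, const char *block)
{
    ssize_t n = pwrite(fd, block, BLOCK_SIZE, offset);

    return n == BLOCK_SIZE ? 0 : -1;
}

/*
 * mmap() path: the first store into each mapped page takes a page fault.
 * At that moment the kernel cannot know the rest of the page will be
 * overwritten too, so it may have to bring in the old contents first.
 * The offset must be a multiple of the system page size.
 */
static int
overwrite_block_mmap(int fd, off_t offset, const char *block)
{
    char *p = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, offset);

    if (p == MAP_FAILED)
        return -1;
    memcpy(p, block, BLOCK_SIZE);   /* faults happen on these stores */
    return munmap(p, BLOCK_SIZE);
}
```

Both helpers leave the same bytes in the file; the open question is only
what the kernel has to do behind the scenes to get there.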

A related, but more minor, concern is whether there are any
differences in the write-back behavior when modifying a mapped
region vs. using write(). Either way, the same pages of the same file
will get dirtied, but the kernel might not have the same idea in
either case about when the changed pages should be written back down
to disk, and that could make a big difference to performance.
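
For reference, the explicit flush calls differ between the two paths as
well. The sketch below uses hypothetical helper names and assumes a POSIX
system that provides fdatasync(). It shows only the forced flush; the
kernel's background write-back policy, which is the real unknown here,
is not something the code can show.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * write() path: dirty pages through the file descriptor, then force the
 * file's data to stable storage.
 */
static int
write_and_flush(int fd, const char *buf, size_t len, off_t offset)
{
    if (pwrite(fd, buf, len, offset) != (ssize_t) len)
        return -1;
    return fdatasync(fd);
}

/*
 * mmap() path: dirty pages with plain stores, then force write-back of
 * just that range. "mapped" must be page-aligned, e.g. the address
 * returned by mmap() for a MAP_SHARED mapping of the file.
 */
static int
store_and_flush(char *mapped, const char *buf, size_t len)
{
    memcpy(mapped, buf, len);
    return msync(mapped, len, MS_SYNC);
}
```

Between such explicit flushes, when the dirty pages actually go to disk
is up to the kernel in both cases.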

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


