Re: [PoC] Non-volatile WAL buffer - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: [PoC] Non-volatile WAL buffer
Msg-id: 20200203155911.blji2hpwfngndbkd@alap3.anarazel.de
In response to: Re: [PoC] Non-volatile WAL buffer (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2020-01-27 13:54:38 -0500, Robert Haas wrote:
> On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
> <takashi.menjou.vg@hco.ntt.co.jp> wrote:
> > It sounds reasonable, but I'm sorry that I haven't tested such a program
> > yet.  I'll try it to compare with my non-volatile WAL buffer.  For now, I'm
> > a little worried about the overhead of mmap()/munmap() for each WAL segment
> > file.
> 
> I guess the question here is how the cost of one mmap() and munmap()
> pair per WAL segment (normally 16MB) compares to the cost of one
> write() per block (normally 8kB). It could be that mmap() is a more
> expensive call than read(), but by a small enough margin that the
> vastly reduced number of system calls makes it a winner. But that's
> just speculation, because I don't know how heavy mmap() actually is.

Doing mmap()/munmap() on a regular basis does have pretty bad scalability
impacts. I don't think they'd fully hit us, however, because we're not in
a threaded world.


My issue with the proposal to go towards mmap()/munmap() is that I think
doing so forecloses a lot of improvements. Even today, on fast storage,
using open_datasync is faster (at least when somehow hitting the
O_DIRECT path, which isn't that easy these days) - and that's despite it
being really unoptimized.  I think our WAL scalability is a serious
issue. There's a fair bit that we can improve with straightforward fixes,
without really changing the way we do IO:

- Split WALWriteLock into one lock for writing and one for flushing the
  WAL. Right now we prevent other sessions from writing out WAL - even
  to other segments - when one session is doing a WAL flush. But there's
  absolutely no need for that.
- Stop increasing the size of the flush request to the max when flushing
  WAL (cf "try to write/flush later additions to XLOG as well" in
  XLogFlush()) - that currently reduces throughput in OLTP workloads
  quite noticeably. It made some sense in the spinning-disk days, but I
  don't think it does for a halfway decent SSD. By writing the maximum
  amount that is ready to be written, we hold the lock for longer,
  increasing latency for the committing transaction *and* preventing more
  WAL from being written.
- We should immediately ask the OS to start writing full XLOG pages back
  to storage (see the sync_file_range() sketch after this list). Right
  now the IO for that will never be started before the commit comes
  around in an OLTP workload, which means that we just waste the time
  between the XLogWrite() and the commit.

That'll gain us 2-3x, I think. But after that I think we're going to
have to actually change more fundamentally how we do IO for WAL
writes. Using async IO I can do something like 18k individual durable
8kB writes (using O_DSYNC) per second, at a queue depth of 32. On my
laptop. If I make them 4kB writes, it's 22k.
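
Roughly the shape of that test, using liburing - not the exact program I
used; file name, write count and the absence of error handling are purely
for illustration (build with -luring):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

#define BLOCKSZ 8192
#define QD      32
#define NWRITES 100000

int
main(void)
{
    struct io_uring ring;
    void       *buf;
    int         fd;
    int         inflight = 0;
    long        submitted = 0;
    long        completed = 0;

    /* O_DSYNC (+ O_DIRECT) makes each completed write durable */
    fd = open("io_uring_test_file",
              O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0600);
    posix_memalign(&buf, 4096, BLOCKSZ);
    memset(buf, 'x', BLOCKSZ);
    io_uring_queue_init(QD, &ring, 0);

    while (completed < NWRITES)
    {
        /* top the queue up to QD outstanding writes, each to its own block */
        while (inflight < QD && submitted < NWRITES)
        {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

            io_uring_prep_write(sqe, fd, buf, BLOCKSZ,
                                (__u64) submitted * BLOCKSZ);
            submitted++;
            inflight++;
        }
        io_uring_submit(&ring);

        /* reap one completion, then loop around to top up again */
        {
            struct io_uring_cqe *cqe;

            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
            completed++;
            inflight--;
        }
    }

    io_uring_queue_exit(&ring);
    close(fd);
    free(buf);
    return 0;
}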

That's not directly comparable with postgres WAL flushes, of course, as
it's all separate blocks, whereas WAL will often end up overwriting the
last block. But it doesn't at all account for group commits either,
which we *constantly* end up doing.

Postgres manages somewhere between ~450 (multiple users) and ~800 (single
user) individually durable WAL writes / sec on the same hardware. Yes,
that's more than an order of magnitude less. Of course some of that is
just that postgres does more than just IO - but that doesn't explain an
order-of-magnitude difference.

So, why am I bringing this up in this thread? Only because I do not see
a way to actually utilize non-pmem hardware to a much higher degree than
we do now while using mmap(). Getting there requires direct IO, which is
fundamentally incompatible with using mmap().



> I have a different concern. I think that, right now, when we reuse a
> WAL segment, we write entire blocks at a time, so the old contents of
> the WAL segment are overwritten without ever being read. But that
> behavior might not be maintained when using mmap(). It might be that
> as soon as we write the first byte to a mapped page, the old contents
> have to be faulted into memory. Indeed, it's unclear how it could be
> otherwise, since the VM page must be made read-write at that point and
> the system cannot know that we will overwrite the whole page. But
> reading in the old contents of a recycled WAL file just to overwrite
> them seems like it would be disastrously expensive.

Yea, that's a serious concern.
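
To spell the difference out - a sketch, not backend code; XLOG_BLCKSZ is
defined here just to make it standalone:

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define XLOG_BLCKSZ 8192        /* standalone stand-in for the real define */

/*
 * write() path: a full, aligned block is overwritten in one syscall, so
 * the kernel does not need to read the recycled segment's old contents
 * from disk to set up the page cache pages.
 */
static void
overwrite_block_write(int wal_fd, const char *page, off_t offset)
{
    (void) pwrite(wal_fd, page, XLOG_BLCKSZ, offset);
}

/*
 * mmap() path: the first store into a not-yet-resident page faults, and
 * the fault handler reads the old page from disk before the store can
 * proceed - even though we're about to overwrite all of it.
 */
static void
overwrite_block_mmap(char *mapped_segment, const char *page, off_t offset)
{
    memcpy(mapped_segment + offset, page, XLOG_BLCKSZ);
}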


> A related, but more minor, concern is whether there are any
> differences in the write-back behavior when modifying a mapped
> region vs. using write(). Either way, the same pages of the same file
> will get dirtied, but the kernel might not have the same idea in
> either case about when the changed pages should be written back down
> to disk, and that could make a big difference to performance.

I don't think there's a significant difference in the case of linux - no
idea about others. And either way we probably should force the kernel's
hand to start flushing much sooner.
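
FWIW, the nominal way to do that for a mapping would be msync() with
MS_ASYNC, but on Linux that has been a no-op since 2.6.19, precisely
because dirty mapped pages already go through the same writeback
machinery as write()-dirtied ones; a sync_file_range() call on the fd,
as sketched further up, covers both cases. Roughly:

#include <sys/mman.h>

/*
 * Request early writeback of a dirty range of a mapped WAL segment.
 * addr must be page aligned.  On Linux MS_ASYNC is a no-op these days;
 * the range is picked up by normal writeback (or by sync_file_range()
 * on the underlying fd).  Not a durability guarantee either way.
 */
static void
kick_mapped_writeback(void *addr, size_t len)
{
    (void) msync(addr, len, MS_ASYNC);
}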

Greetings,

Andres Freund


