Re: [PoC] Non-volatile WAL buffer - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: [PoC] Non-volatile WAL buffer
Date
Msg-id 451659d8-10a5-f5a4-6003-bea8dfcda9e6@iki.fi
In response to Re: [PoC] Non-volatile WAL buffer  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses Re: [PoC] Non-volatile WAL buffer  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On 26/11/2020 21:27, Tomas Vondra wrote:
> Hi,
> 
> Here's the "simple patch" that I'm currently experimenting with. It
> essentially replaces open/close/write/fsync with pmem calls
> (map/unmap/memcpy/persist variants), and it's by no means committable.
> But it works well enough for experiments / measurements, etc.
> 
> The numbers (5-minute pgbench runs on scale 500) look like this:
> 
>           master/btt    master/dax           ntt        simple
>     -----------------------------------------------------------
>       1         5469          7402          7977          6746
>      16        48222         80869        107025         82343
>      32        73974        158189        214718        158348
>      64        85921        154540        225715        164248
>      96       150602        221159        237008        217253
> 
> A chart illustrating these results is attached. The four columns are
> showing unpatched master with WAL on a pmem device, in BTT or DAX modes,
> "ntt" is the patch submitted to this thread, and "simple" is the patch
> I've hacked together.
> 
> As expected, the BTT case performs poorly (compared to the rest).
> 
> The "master/dax" and "simple" perform about the same. There are some
> differences, but those may be attributed to noise. The NTT patch does
> outperform these cases by ~20-40% in some cases.
> 
> The question is why. I recall suggestions this is due to page faults
> when writing data into the WAL, but I did experiment with various
> settings that I think should prevent that (e.g. disabling WAL reuse
> and/or disabling zeroing the segments) but that made no measurable
> difference.

The page faults are only a problem when mmap() is used *without* DAX.

Takashi tried a patch earlier that mmap()s the WAL segments and inserts
WAL records into them directly. See
0002-Use-WAL-segments-as-WAL-buffers.patch at
https://www.postgresql.org/message-id/000001d5dff4%24995ed180%24cc1c7480%24%40hco.ntt.co.jp_1.
Could you test that patch too, please? Using your nomenclature, that
patch skips wal_buffers and does:

   clients -> wal segments (PMEM DAX)

He got good results with that patch on DAX, but without DAX it performed
worse than the current code. We then discussed why that might be, and the
page fault hypothesis was brought up.
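
To make the two data paths concrete, here is a rough sketch in plain POSIX C.
It is not taken from either patch: the function names and DEMO_SEG_SIZE are
made up for illustration, and msync() stands in for whatever persist call a
pmem-aware implementation would use. The first function is the conventional
write()+fsync() path; the other two map a segment and copy a record straight
into the mapping.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical segment size for this sketch only. */
#define DEMO_SEG_SIZE (64 * 1024)

/* Conventional path: copy the record into the file, then flush it. */
int
wal_insert_write(int fd, off_t off, const void *rec, size_t len)
{
    if (pwrite(fd, rec, len, off) != (ssize_t) len)
        return -1;
    return fsync(fd);
}

/* Create (or open) a segment file and map it shared and writable. */
char *
wal_segment_map(const char *path, int *fd_out)
{
    int   fd = open(path, O_RDWR | O_CREAT, 0600);
    char *base;

    if (fd < 0)
        return NULL;
    if (ftruncate(fd, DEMO_SEG_SIZE) != 0)
    {
        close(fd);
        return NULL;
    }
    base = mmap(NULL, DEMO_SEG_SIZE, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
    {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return base;
}

/* Mapped path: memcpy into the segment, then flush the touched pages. */
int
wal_insert_mmap(char *base, size_t off, const void *rec, size_t len)
{
    size_t pagesz = (size_t) sysconf(_SC_PAGESIZE);
    size_t start = off & ~(pagesz - 1);   /* msync() wants a page-aligned address */

    assert(off <= DEMO_SEG_SIZE && len <= DEMO_SEG_SIZE - off);
    memcpy(base + off, rec, len);
    return msync(base + start, (off - start) + len, MS_SYNC);
}
```

Without DAX, the first store to each page of the mapping goes through the
page cache and takes a page fault, which is the cost the hypothesis above is
about.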

I think 0002-Use-WAL-segments-as-WAL-buffers.patch is the most promising
approach here. But because it's slower without DAX, we need to keep the
current code for non-DAX systems. Unfortunately, that means maintaining
both implementations, selectable with a GUC or some DAX detection magic.
The question then is whether the code complexity is worth the
performance gain on DAX-enabled systems.
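
As one possible shape for the "DAX detection magic", here is a Linux-only
sketch that probes a file with mmap(MAP_SHARED_VALIDATE | MAP_SYNC). On
Linux that combination fails unless the file supports DAX. This is an
assumption about how detection could be done, not something either patch
does; WalMethod, file_supports_dax(), choose_wal_method() and the
force_mmap override (standing in for a GUC) are all hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_SYNC / MAP_SHARED_VALIDATE on glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical selector between the two WAL write paths. */
typedef enum { WAL_VIA_WRITE, WAL_VIA_MMAP } WalMethod;

/*
 * Return 1 if the file can be mapped with MAP_SYNC (i.e. it is on a DAX
 * mount), 0 otherwise. Where the flags are not defined at build time,
 * report "no DAX" so the caller falls back to the write() path.
 */
int
file_supports_dax(int fd, size_t len)
{
    assert(len > 0);
#if defined(MAP_SYNC) && defined(MAP_SHARED_VALIDATE)
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

    if (p == MAP_FAILED)
        return 0;               /* typically EOPNOTSUPP on non-DAX files */
    munmap(p, len);
    return 1;
#else
    (void) fd;
    return 0;
#endif
}

/* force_mmap plays the role of a GUC override; otherwise auto-detect. */
WalMethod
choose_wal_method(int fd, size_t len, int force_mmap)
{
    if (force_mmap)
        return WAL_VIA_MMAP;
    return file_supports_dax(fd, len) ? WAL_VIA_MMAP : WAL_VIA_WRITE;
}
```

Either way, both code paths still have to exist; the probe only decides
which one runs.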

Andres was not excited about mmapping the WAL segments, for performance
reasons. I'm not sure how much of his critique applies if we keep
supporting both methods and only use mmap() if so configured.

- Heikki


