
From: Tomas Vondra
Subject: Re: [PoC] Non-volatile WAL buffer
Date:
Msg-id: 545d5f28-9094-02ae-34f1-39edfb447ee4@enterprisedb.com
In response to: Re: [PoC] Non-volatile WAL buffer  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Responses: RE: [PoC] Non-volatile WAL buffer  ("tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com>)
List: pgsql-hackers
Hi,

On 11/23/20 3:01 AM, Tomas Vondra wrote:
> Hi,
> 
> On 10/30/20 6:57 AM, Takashi Menjo wrote:
>> Hi Heikki,
>>
>>> I had a new look at this thread today, trying to figure out where
>>> we are. I'm a bit confused.
>>>
>>> One thing we have established: mmap()ing WAL files performs worse 
>>> than the current method, if pg_wal is not on a persistent memory 
>>> device. This is because the kernel faults in existing content of 
>>> each page, even though we're overwriting everything.
>>
>> Yes. In addition, after a certain page (an OS page, that is) is
>> msync()ed, another page fault occurs when something is stored into
>> that page again.
>>
>>> That's unfortunate. I was hoping that mmap() would be a good option
>>> even without persistent memory hardware. I wish we could tell the
>>> kernel to zero the pages instead of reading them from the file.
>>> Maybe clear the file with ftruncate() before mmapping it?
>>
>> The area extended by ftruncate() appears as if it were zero-filled
>> [1]. Please note that it merely "appears as if": it might not be
>> actually zero-filled with data blocks on the device, so pre-allocating
>> files should improve transaction performance. At least on Linux 5.7
>> and ext4, it takes more time to store into a mapped file that was just
>> open(O_CREAT)ed and ftruncate()d than into one that was actually
>> filled beforehand.
>>
> 
> Does it really matter that it only appears zero-filled? I think Heikki's
> point was that maybe ftruncate() would prevent the kernel from faulting
> in the existing page content when we're overwriting it.
> 
> Not sure I understand what the benchmark with ext4 was doing, exactly.
> How was that measured? Might be interesting to have some simple
> benchmarking tool to demonstrate this (I believe a small standalone tool
> written in C should do the trick).
> 

One more thought about this - if ftruncate() is not enough to convince
mmap() not to load existing data from the file, what about not reusing
the WAL segments at all? I haven't tried that, though.
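
To make the comparison concrete, here's a minimal sketch of the kind of
standalone benchmark I had in mind: it overwrites one segment-sized
mapping through mmap(), once backed by a file whose blocks were actually
written out (like a recycled segment), once backed by a file that was
merely ftruncate()d. The file names, the memset() payload and the single
msync() at the end are made up for illustration, nothing from the patch
set:

    /*
     * Hypothetical standalone benchmark: overwrite a 16MB file through
     * mmap(), once pre-filled and once merely ftruncate()d.
     * Build with: cc -O2 bench.c
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    #define SEG_SIZE (16 * 1024 * 1024)

    static double overwrite(const char *path, int prefill)
    {
        int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
        char *buf, *p;
        struct timespec t0, t1;

        if (fd < 0) { perror("open"); exit(1); }
        if (prefill)
        {
            /* physically allocate the blocks, as a recycled segment has */
            buf = calloc(1, SEG_SIZE);
            if (write(fd, buf, SEG_SIZE) != SEG_SIZE) { perror("write"); exit(1); }
            free(buf);
        }
        else if (ftruncate(fd, SEG_SIZE) != 0)  /* sparse, "appears" zeroed */
        { perror("ftruncate"); exit(1); }

        p = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); exit(1); }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        memset(p, 'x', SEG_SIZE);               /* overwrite every page */
        if (msync(p, SEG_SIZE, MS_SYNC) != 0) { perror("msync"); exit(1); }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        munmap(p, SEG_SIZE);
        close(fd);
        unlink(path);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("pre-filled: %.3f s\n", overwrite("bench_prefilled", 1));
        printf("ftruncated: %.3f s\n", overwrite("bench_ftruncated", 0));
        return 0;
    }

Not a substitute for numbers on the real hardware, of course - and to see
the "kernel reads back existing content" effect one would also have to
drop the page cache between the setup and the overwrite - but it would at
least show whether ftruncate() avoids the read-before-write faults.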

>>> That should not be a problem with a real persistent memory device,
>>> however (or when emulating it with DRAM). With DAX, the storage is
>>> memory-mapped directly and there is no page cache, and no
>>> pre-faulting.
>>
>> Yes, with filesystem DAX, there is no page cache for file data. A
>> page fault still occurs, but only once per 2MiB DAX hugepage, so the
>> overhead is lower than with 4KiB page faults. Such DAX hugepage
>> faults apply only to DAX-mapped files and are different from general
>> transparent hugepage faults.
>>
> 
> I don't follow - if there are page faults even when overwriting all the
> data, I'd say it's still an issue even with 2MB DAX pages. How big is
> the difference between 4kB and 2MB pages?
> 
> Not sure I understand how this is different from a general THP fault?
> 
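(A rough back-of-envelope, just to frame the question: per 16MB of WAL
written we'd touch 16MB / 4kB = 4096 pages but only 16MB / 2MB = 8 DAX
hugepages, i.e. 512x fewer faults - so the open question is how much each
individual fault costs.)
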
>>> Because of that, I'm baffled by what the 
>>> v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it 
>>> correctly, it puts the WAL buffers in a separate file, which is 
>>> stored on the NVRAM. Why? I realize that this is just a Proof of 
>>> Concept, but I'm very much not interested in anything that requires
>>> the DBA to manage a second WAL location. Did you test the mmap()
>>> patches with persistent memory hardware? Did you compare that with
>>> the pmem patchset, on the same hardware? If there's a meaningful
>>> performance difference between the two, what's causing it?
> 
>> Yes, this patchset puts the WAL buffers into the file specified by 
>> "nvwal_path" in postgresql.conf.
>>
>> The reason this patchset puts the buffers into a separate file, rather
>> than into the existing segment files in PGDATA/pg_wal, is that it
>> reduces the overhead of system calls such as open(), mmap(), munmap(),
>> and close(). It open()s and mmap()s the file at "nvwal_path" once and
>> keeps it mapped while running. With the patchset that mmap()s the
>> segment files, on the other hand, a backend process has to munmap()
>> and close() the currently mapped file and open() and mmap() the next
>> one every time its insert location crosses a segment boundary. This
>> causes the performance difference between the two.
>>
> 
> I kinda agree with Heikki here - having to manage yet another location
> for WAL data is rather inconvenient. We should aim not to make the life
> of DBAs unnecessarily difficult, IMO.
> 
> I wonder how significant the syscall overhead is - can you share some
> numbers? I don't see any such results in this thread, so I'm not sure
> if it means losing 1% or 10% of throughput.
> 
> Also, maybe there are alternative ways to reduce the overhead? For
> example, we could increase the size of the WAL segment - with 1GB
> segments we'd do 1/64 of the syscalls. Or maybe we could do some of
> this asynchronously - map the next segment ahead of time and let
> another process do the actual work, so that the inserting process does
> not have to wait.
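
Just to illustrate that second idea, here's a rough sketch of what
"map the next segment ahead of time" could look like. All the names
(premap_worker, the seg000N paths, the fixed 16MB size) are made up for
the example, and a real implementation would live in the existing
background-process machinery rather than a raw pthread:

    /*
     * Sketch only: while backends write into the current mmap()ed
     * segment, a helper thread opens and maps the next one, so the
     * segment switch is just a pointer swap instead of
     * munmap+close+open+mmap on the critical path.
     */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SEG_SIZE (16 * 1024 * 1024)

    static void *map_segment(const char *path)
    {
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        void *p;

        if (fd < 0 || ftruncate(fd, SEG_SIZE) != 0) { perror(path); exit(1); }
        p = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping survives the close() */
        if (p == MAP_FAILED) { perror("mmap"); exit(1); }
        return p;
    }

    static void *premap_worker(void *arg)
    {
        return map_segment((const char *) arg);   /* off the critical path */
    }

    int main(void)
    {
        pthread_t th;
        void *cur = map_segment("seg0001");
        void *next;

        pthread_create(&th, NULL, premap_worker, "seg0002");

        /* ... backends memcpy() WAL records into "cur" meanwhile ... */

        pthread_join(th, &next);        /* segment switch: swap pointers */
        munmap(cur, SEG_SIZE);
        cur = next;

        munmap(cur, SEG_SIZE);
        return 0;
    }

That would keep the number of syscalls per segment the same, but move
them off the path the inserting backends have to wait on.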
> 
> 
> Do I understand correctly that the patch removes "regular" WAL buffers
> and instead writes the data into the non-volatile PMEM buffer, without
> writing that to the WAL segments at all (unless in archiving mode)?
> 
> Firstly, I guess many (most?) instances will have to write the WAL
> segments anyway because of PITR/backups, so I'm not sure we can save
> much here.
> 
> But more importantly - doesn't that mean the nvwal_size value is
> essentially a hard limit? With max_wal_size, it's a soft limit i.e.
> we're allowed to temporarily use more WAL when needed. But with a
> pre-allocated file, that's clearly not possible. So what would happen in
> those cases?
> 
> Also, is it possible to change nvwal_size? I haven't tried, but I wonder
> what happens with the current contents of the file.
> 

I've been thinking about the current design (which essentially places
the WAL buffers on PMEM) a bit more. I wonder whether that's actually
the right design ...

The way I understand the current design is that we're essentially
switching from this architecture:

    clients -> wal buffers (DRAM) -> wal segments (storage)

to this

    clients -> wal buffers (PMEM)

(Assuming we don't have to write the segments because of archiving.)

The first thing to consider is that PMEM is actually somewhat slower
than DRAM - the latency difference is roughly 100ns vs. 300ns (see [1]
and [2]). From this POV it's a bit strange that we're moving the WAL
buffer to a slower medium.

Of course, PMEM is significantly faster than other storage types (e.g.
an order of magnitude faster than flash), and in some cases we're
eliminating the need to write the WAL out of the buffers at all, which
may help.

The second thing I notice is that PMEM does not seem to handle many
clients particularly well - if you look at Figure 2 in [2], you'll see
that there's a clear drop-off in write bandwidth after only a few
clients. For DRAM there's no such issue. (The total PMEM bandwidth seems
much worse than for DRAM too.)

So I wonder if using PMEM for the WAL buffer is the right way forward.
AFAIK the WAL buffer is quite concurrent (multiple clients writing
data), which is exactly where the PMEM vs. DRAM trade-off looks worst.

The design I originally expected would look more like this:

    clients -> wal buffers (DRAM) -> wal segments (PMEM DAX)

i.e. mostly what we have now, but instead of writing the WAL segments
"the usual way" we'd write them using mmap/memcpy, without fsync.

I suppose that's what Heikki meant too, but I'm not sure.
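
For what it's worth, the write path in that last step could look roughly
like the sketch below. It assumes pg_wal sits on a DAX-mounted filesystem
and that PMDK's libpmem is available; the segment path and the single
8kB page are made up for the example, and none of this is the patch's
actual code:

    /*
     * Sketch: flush one page of WAL from a DRAM buffer into a DAX-mapped
     * segment with user-space cache flushes instead of write()+fsync().
     * Build with: cc walpmem.c -lpmem
     */
    #include <libpmem.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SEG_SIZE    (16 * 1024 * 1024)
    #define XLOG_BLCKSZ 8192

    int main(void)
    {
        size_t mapped_len;
        int    is_pmem;
        char   page[XLOG_BLCKSZ];

        /* map (creating it if needed) one segment on the DAX filesystem;
         * the path is just an example */
        char *seg = pmem_map_file("/mnt/pmem/pg_wal/000000010000000000000001",
                                  SEG_SIZE, PMEM_FILE_CREATE, 0600,
                                  &mapped_len, &is_pmem);
        if (seg == NULL) { perror("pmem_map_file"); exit(1); }

        /* pretend this page was filled in the (DRAM) WAL buffers */
        memset(page, 'w', sizeof(page));

        if (is_pmem)
            /* memcpy + cache-line flush + fence; no fsync() involved */
            pmem_memcpy_persist(seg, page, sizeof(page));
        else
        {
            /* not real pmem after all - fall back to msync() */
            memcpy(seg, page, sizeof(page));
            pmem_msync(seg, sizeof(page));
        }

        pmem_unmap(seg, mapped_len);
        return 0;
    }

That would keep the concurrent, DRAM-friendly buffer part as it is today
and only change how the buffers get flushed to the segments.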


regards


[1] https://pmem.io/2019/12/19/performance.html
[2] https://arxiv.org/pdf/1904.01614.pdf

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


