Thread: WIP: WAL prefetch (another approach)
Hello hackers,

Based on ideas from earlier discussions[1][2], here is an experimental WIP patch to improve recovery speed by prefetching blocks. If you set wal_prefetch_distance to a positive distance, measured in bytes, then the recovery loop will look ahead in the WAL and call PrefetchBuffer() for referenced blocks. This can speed things up with cold caches (example: after a server reboot) and working sets that don't fit in memory (example: large scale pgbench).

Results vary, but in contrived larger-than-memory pgbench crash recovery experiments on a Linux development system, I've seen recovery running as much as 20x faster with full_page_writes=off and wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as discussed in the other thread.

Some notes:

* PrefetchBuffer() is only beneficial if your kernel and filesystem have a working POSIX_FADV_WILLNEED implementation. That includes Linux ext4 and xfs, but excludes macOS and Windows. In future we might use asynchronous I/O to bring data all the way into our own buffer pool; hopefully the PrefetchBuffer() interface wouldn't change much and this code would automatically benefit.

* For now, for proof-of-concept purposes, the patch uses a second XLogReader to read ahead in the WAL. I am thinking about how to write a two-cursor XLogReader that reads and decodes each record just once.

* It can handle simple crash recovery and streaming replication scenarios, but doesn't yet deal with complications like timeline changes (the way to do that might depend on how the previous point works out). The integration with WAL receiver probably needs some work, I've been testing pretty narrow cases so far, and the way I hijacked read_local_xlog_page() probably isn't right.

* On filesystems with block size <= BLCKSZ, it's a waste of a syscall to try to prefetch a block that we have a FPW for, but otherwise it can avoid a later stall due to a read-before-write at pwrite() time, so I added a second GUC wal_prefetch_fpw to make that optional.

Earlier work, and how this patch compares:

* Sean Chittenden wrote pg_prefaulter[1], an external process that uses worker threads to pread() referenced pages some time before recovery does, and demonstrated very good speed-up, triggering a lot of discussion of this topic. My WIP patch differs mainly in that it's integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather than synchronous I/O from worker threads/processes. Sean wouldn't have liked my patch much because he was working on ZFS and that doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it works pretty well, and I'll try to get that upstreamed.

* Konstantin Knizhnik proposed a dedicated PostgreSQL process that would do approximately the same thing[2]. My WIP patch differs mainly in that it does the prefetching work in the recovery loop itself, and uses PrefetchBuffer() rather than FilePrefetch() directly. This avoids a bunch of communication and complications, but admittedly does introduce new system calls into a hot loop (for now); perhaps I could pay for that by removing more lseek(SEEK_END) noise. It also deals with various edge cases relating to created, dropped and truncated relations a bit differently. It also tries to avoid generating sequential WILLNEED advice, based on experimental evidence[3] that that affects Linux's readahead heuristics negatively, though I don't understand the exact mechanism there.
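For illustration only, here is a minimal, self-contained sketch of the system call that the POSIX_FADV_WILLNEED note above boils down to. The path, block number and error handling are hypothetical; the patch itself goes through PrefetchBuffer()/smgrprefetch() rather than calling posix_fadvise() directly like this.

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#define BLCKSZ 8192

int
main(void)
{
    const char *path = "base/13593/16384";  /* hypothetical relation segment */
    unsigned int blockno = 42;              /* block referenced by a WAL record */
    int          fd = open(path, O_RDONLY);
    int          rc;

    if (fd < 0)
    {
        perror("open");
        return EXIT_FAILURE;
    }

    /*
     * Hint that this block will be read soon, so the kernel can start the
     * I/O asynchronously; recovery's later real read then (hopefully) finds
     * the data already in the page cache instead of stalling.
     */
    rc = posix_fadvise(fd, (off_t) blockno * BLCKSZ, BLCKSZ,
                       POSIX_FADV_WILLNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise: error %d\n", rc);

    /* ... keep decoding WAL while the read happens in the background ... */
    return EXIT_SUCCESS;
}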
Here are some cases where I expect this patch to perform badly:

* Your WAL has multiple intermixed sequential access streams (ie sequential access to N different relations), so that sequential access is not detected, and then all the WILLNEED advice prevents Linux's automagic readahead from working well. Perhaps that could be mitigated by having a system that can detect up to N concurrent streams, where N is more than the current 1, or by flagging buffers in the WAL as part of a sequential stream. I haven't looked into this.

* The data is always found in our buffer pool, so PrefetchBuffer() is doing nothing useful and you might as well not be calling it or doing the extra work that leads up to that. Perhaps that could be mitigated with an adaptive approach: too many PrefetchBuffer() hits and we stop trying to prefetch, too many XLogReadBufferForRedo() misses and we start trying to prefetch. That might work nicely for systems that start out with cold caches but eventually warm up. I haven't looked into this.

* The data is actually always in the kernel's cache, so the advice is a waste of a syscall. That might imply that you should probably be running with a larger shared_buffers (?). It's technically possible to ask the operating system if a region is cached on many systems, which could in theory be used for some kind of adaptive heuristic that would disable pointless prefetching, but I'm not proposing that. Ultimately this problem would be avoided by moving to true async I/O, where we'd be initiating the read all the way into our buffers (ie it replaces the later pread() so it's a wash, at worst).

* The prefetch distance is set too low so that pread() waits are not avoided, or your storage subsystem can't actually perform enough concurrent I/O to get ahead of the random access pattern you're generating, so no distance would be far enough ahead. To help with the former case, perhaps we could invent something smarter than a user-supplied distance (something like "N cold block references ahead", possibly using effective_io_concurrency, rather than "N bytes ahead").

[1] https://www.pgcon.org/2018/schedule/track/Case%20Studies/1204.en.html
[2] https://www.postgresql.org/message-id/flat/49df9cd2-7086-02d0-3f8d-535a32d44c82%40postgrespro.ru
[3] https://github.com/macdice/some-io-tests
Attachment
On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote:
>Hello hackers,
>
>Based on ideas from earlier discussions[1][2], here is an experimental WIP patch to improve recovery speed by prefetching blocks. If you set wal_prefetch_distance to a positive distance, measured in bytes, then the recovery loop will look ahead in the WAL and call PrefetchBuffer() for referenced blocks. This can speed things up with cold caches (example: after a server reboot) and working sets that don't fit in memory (example: large scale pgbench).

Thanks, I only did a very quick review so far, but the patch looks fine.

In general, I find it somewhat non-intuitive to configure prefetching by specifying WAL distance. I mean, how would you know what's a good value? If you know the storage hardware, you probably know the optimal queue depth, i.e. you know the number of requests needed to get the best throughput. But how do you deduce the WAL distance from that? I don't know.

Could we instead specify the number of blocks to prefetch? We'd probably need to track additional details needed to determine the number of blocks to prefetch (essentially the LSN for all prefetch requests).

Another thing to consider might be skipping recently prefetched blocks. Consider you have a loop that does DML, where each statement creates a separate WAL record, but it can easily touch the same block over and over (say inserting to the same page). That means the prefetches are not really needed, but I'm not sure how expensive it really is.

>Results vary, but in contrived larger-than-memory pgbench crash recovery experiments on a Linux development system, I've seen recovery running as much as 20x faster with full_page_writes=off and wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as discussed in the other thread.

OK, so how did you test that? I'll do some tests with a traditional streaming replication setup, multiple sessions on the primary (and maybe a weaker storage system on the replica). I suppose that's another setup that should benefit from this.

> ...
>
>Earlier work, and how this patch compares:
>
>* Sean Chittenden wrote pg_prefaulter[1], an external process that uses worker threads to pread() referenced pages some time before recovery does, and demonstrated very good speed-up, triggering a lot of discussion of this topic. My WIP patch differs mainly in that it's integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather than synchronous I/O from worker threads/processes. Sean wouldn't have liked my patch much because he was working on ZFS and that doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it works pretty well, and I'll try to get that upstreamed.

How long would it take to get POSIX_FADV_WILLNEED onto ZFS systems, if everything goes fine? I'm not sure what the usual life-cycle is, but I assume it may take a couple of years to get it on most production systems.

What other common filesystems are missing support for this?

Presumably we could do what Sean's extension does, i.e. use a couple of bgworkers, each doing simple pread() calls. Of course, that's unnecessarily complicated on systems that have FADV_WILLNEED.

> ...
>
>Here are some cases where I expect this patch to perform badly:
>
>* Your WAL has multiple intermixed sequential access streams (ie sequential access to N different relations), so that sequential access is not detected, and then all the WILLNEED advice prevents Linux's automagic readahead from working well.
>Perhaps that could be mitigated by having a system that can detect up to N concurrent streams, where N is more than the current 1, or by flagging buffers in the WAL as part of a sequential stream. I haven't looked into this.

Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not one by one), and do some sort of sorting? That should allow readahead to kick in.

>* The data is always found in our buffer pool, so PrefetchBuffer() is doing nothing useful and you might as well not be calling it or doing the extra work that leads up to that. Perhaps that could be mitigated with an adaptive approach: too many PrefetchBuffer() hits and we stop trying to prefetch, too many XLogReadBufferForRedo() misses and we start trying to prefetch. That might work nicely for systems that start out with cold caches but eventually warm up. I haven't looked into this.

I think the question is what's the cost of doing such unnecessary prefetch. Presumably it's fairly cheap, especially compared to the opposite case (not prefetching a block not in shared buffers). I wonder how expensive would the adaptive logic be on cases that never need a prefetch (i.e. datasets smaller than shared_buffers).

>* The data is actually always in the kernel's cache, so the advice is a waste of a syscall. That might imply that you should probably be running with a larger shared_buffers (?). It's technically possible to ask the operating system if a region is cached on many systems, which could in theory be used for some kind of adaptive heuristic that would disable pointless prefetching, but I'm not proposing that. Ultimately this problem would be avoided by moving to true async I/O, where we'd be initiating the read all the way into our buffers (ie it replaces the later pread() so it's a wash, at worst).

Makes sense.

>* The prefetch distance is set too low so that pread() waits are not avoided, or your storage subsystem can't actually perform enough concurrent I/O to get ahead of the random access pattern you're generating, so no distance would be far enough ahead. To help with the former case, perhaps we could invent something smarter than a user-supplied distance (something like "N cold block references ahead", possibly using effective_io_concurrency, rather than "N bytes ahead").

In general, I find it quite non-intuitive to configure prefetching by specifying WAL distance. I mean, how would you know what's a good value? If you know the storage hardware, you probably know the optimal queue depth, i.e. you know the number of requests to get best throughput. But how do you deduce the WAL distance from that? I don't know. Plus right after the checkpoint the WAL contains FPWs, reducing the number of blocks in a given amount of WAL (compared to right before a checkpoint). So I expect users might pick an unnecessarily high WAL distance. OTOH with FPWs we don't quite need aggressive prefetching, right?

Could we instead specify the number of blocks to prefetch? We'd probably need to track additional details needed to determine the number of blocks to prefetch (essentially the LSN for all prefetch requests).

Another thing to consider might be skipping recently prefetched blocks. Consider you have a loop that does DML, where each statement creates a separate WAL record, but it can easily touch the same block over and over (say inserting to the same page). That means the prefetches are not really needed, but I'm not sure how expensive it really is.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Thu, Jan 02, 2020 at 02:39:04AM +1300, Thomas Munro wrote: > >Based on ideas from earlier discussions[1][2], here is an experimental > >WIP patch to improve recovery speed by prefetching blocks. If you set > >wal_prefetch_distance to a positive distance, measured in bytes, then > >the recovery loop will look ahead in the WAL and call PrefetchBuffer() > >for referenced blocks. This can speed things up with cold caches > >(example: after a server reboot) and working sets that don't fit in > >memory (example: large scale pgbench). > > > > Thanks, I only did a very quick review so far, but the patch looks fine. Thanks for looking! > >Results vary, but in contrived larger-than-memory pgbench crash > >recovery experiments on a Linux development system, I've seen recovery > >running as much as 20x faster with full_page_writes=off and > >wal_prefetch_distance=8kB. FPWs reduce the potential speed-up as > >discussed in the other thread. > > OK, so how did you test that? I'll do some tests with a traditional > streaming replication setup, multiple sessions on the primary (and maybe > a weaker storage system on the replica). I suppose that's another setup > that should benefit from this. Using a 4GB RAM 16 thread virtual machine running Linux debian10 4.19.0-6-amd64 with an ext4 filesystem on NVMe storage: postgres -D pgdata \ -c full_page_writes=off \ -c checkpoint_timeout=60min \ -c max_wal_size=10GB \ -c synchronous_commit=off # in another shell pgbench -i -s300 postgres psql postgres -c checkpoint pgbench -T60 -Mprepared -c4 -j4 postgres killall -9 postgres # save the crashed pgdata dir for repeated experiments mv pgdata pgdata-save # repeat this with values like wal_prefetch_distance=-1, 1kB, 8kB, 64kB, ... rm -fr pgdata cp -r pgdata-save pgdata postgres -D pgdata -c wal_prefetch_distance=-1 What I see on my desktop machine is around 10x speed-up: wal_prefetch_distance=-1 -> 62s (same number for unpatched) wal_prefetch_distance=8kb -> 6s wal_prefetch_distance=64kB -> 5s On another dev machine I managed to get a 20x speedup, using a much longer test. It's probably more interesting to try out some more realistic workloads rather than this cache-destroying uniform random stuff, though. It might be interesting to test on systems with high random read latency, but high concurrency; I can think of a bunch of network storage environments where that's the case, but I haven't looked into them, beyond some toy testing with (non-Linux) NFS over a slow network (results were promising). > >Earlier work, and how this patch compares: > > > >* Sean Chittenden wrote pg_prefaulter[1], an external process that > >uses worker threads to pread() referenced pages some time before > >recovery does, and demonstrated very good speed-up, triggering a lot > >of discussion of this topic. My WIP patch differs mainly in that it's > >integrated with PostgreSQL, and it uses POSIX_FADV_WILLNEED rather > >than synchronous I/O from worker threads/processes. Sean wouldn't > >have liked my patch much because he was working on ZFS and that > >doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS it > >works pretty well, and I'll try to get that upstreamed. > > > > How long would it take to get the POSIX_FADV_WILLNEED to ZFS systems, if > everything goes fine? I'm not sure what's the usual life-cycle, but I > assume it may take a couple years to get it on most production systems. 
Assuming they like it enough to commit it (and initial informal feedback on the general concept has been positive -- it's not messing with their code at all, it's just boilerplate code to connect the relevant Linux and FreeBSD VFS callbacks), it could indeed be quite a while before it appears in conservative package repos, but I don't know, it depends where you get your OpenZFS/ZoL module from.

> What other common filesystems are missing support for this?

Using our build farm as a way to know which operating systems we care about as a community, in no particular order:

* I don't know for exotic or network filesystems on Linux.
* AIX 7.2's manual says "Valid option, but this value does not perform any action" for every kind of advice except POSIX_FADV_NOWRITEBEHIND (huh, nonstandard advice).
* Solaris's posix_fadvise() was a dummy libc function, as of 10 years ago when they closed the source; who knows after that.
* FreeBSD's UFS and NFS support other advice through a default handler but unfortunately ignore WILLNEED (I have patches for those too, not good enough to send anywhere yet).
* OpenBSD has no such syscall.
* NetBSD has the syscall, and I can see that it's hooked up to readahead code, so that's probably the only unqualified yes in this list.
* Windows has no equivalent syscall; the closest thing might be to use ReadFileEx() to initiate an async read into a dummy buffer; maybe you can use a zero event so it doesn't even try to tell you when the I/O completes, if you don't care?
* macOS has no such syscall, but you could in theory do an aio_read() into a dummy buffer. On the other hand I don't think that interface is a general solution for POSIX systems, because on at least Linux and Solaris, aio_read() is emulated by libc with a whole bunch of threads and we are allergic to those things (and even if we weren't, we wouldn't want a whole threadpool in every PostgreSQL process, so you'd need to hand off to a worker process, and then why bother?).
* HPUX: I don't know.

We could test any of those with a simple test I wrote[1], but I'm not likely to test any non-open-source OS myself due to lack of access.

Amazingly, HPUX's posix_fadvise() doesn't appear to conform to POSIX: it sets errno and returns -1, while POSIX says that it should return an error number. Checking our source tree, I see that in pg_flush_data(), we also screwed that up and expect errno to be set, though we got it right in FilePrefetch().

In any case, Linux must be at the very least 90% of PostgreSQL installations. Incidentally, sync_file_range() without wait is a sort of opposite of WILLNEED (it means something like "POSIX_FADV_WILLSYNC"), and no one seems terribly upset that we really only have that on Linux (the emulations are pretty poor AFAICS).

> Presumably we could do what Sean's extension does, i.e. use a couple of bgworkers, each doing simple pread() calls. Of course, that's unnecessarily complicated on systems that have FADV_WILLNEED.

That is a good idea, and I agree. I have a patch set that does exactly that. It's nearly independent of the WAL prefetch work; it just changes how PrefetchBuffer() is implemented, affecting bitmap index scans, vacuum and any future user of PrefetchBuffer. If you apply these patches too then WAL prefetch will use it (just set max_background_readers = 4 or whatever):

https://github.com/postgres/postgres/compare/master...macdice:bgreader

That's simplified from an abandoned patch I had lying around because I was experimenting with prefetching all the way into shared buffers this way.
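Since the HPUX quirk above is easy to get wrong (as the pg_flush_data() example shows), here is a small sketch of the difference between the POSIX convention (return an error number) and the errno-setting convention. The wrapper name is invented and this is not code from the tree.

#define _XOPEN_SOURCE 600
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical wrapper: per POSIX, posix_fadvise() returns 0 or an error
 * number and does NOT set errno; the reported HPUX behaviour (return -1 and
 * set errno) is the exception.  Checking both covers either convention.
 */
static int
prefetch_advice(int fd, off_t offset, off_t len)
{
    int     rc;

    errno = 0;
    rc = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);

    if (rc == 0)
        return 0;           /* success */
    if (rc > 0)
        return rc;          /* POSIX-conforming: rc is the error number */
    return errno;           /* nonconforming: -1 with errno set */
}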
The simplified version just does pread() into a dummy buffer, for the side effect of warming the kernel's cache, pretty much like pg_prefaulter. There are some tricky questions around whether it's better to wait or not when the request queue is full; the way I have that is far too naive, and that question is probably related to your point about being cleverer about how many prefetch blocks you should try to have in flight. A future version of PrefetchBuffer() might lock the buffer then tell the worker (or some kernel async I/O facility) to write the data into the buffer. If I understand correctly, to make that work we need Robert's IO lock/condition variable transplant[2], and Andres's scheme for a suitable interlocking protocol, and no doubt some bulletproof cleanup machinery. I'm not working on any of that myself right now because I don't want to step on Andres's toes. > >Here are some cases where I expect this patch to perform badly: > > > >* Your WAL has multiple intermixed sequential access streams (ie > >sequential access to N different relations), so that sequential access > >is not detected, and then all the WILLNEED advice prevents Linux's > >automagic readahead from working well. Perhaps that could be > >mitigated by having a system that can detect up to N concurrent > >streams, where N is more than the current 1, or by flagging buffers in > >the WAL as part of a sequential stream. I haven't looked into this. > > > > Hmmm, wouldn't it be enough to prefetch blocks in larger batches (not > one by one), and doing some sort of sorting? That should allow readahead > to kick in. Yeah, but I don't want to do too much work in the startup process, or get too opinionated about how the underlying I/O stack works. I think we'd need to do things like that in a direct I/O future, but we'd probably offload it (?). I figured the best approach for early work in this space would be to just get out of the way if we detect sequential access. > >* The data is always found in our buffer pool, so PrefetchBuffer() is > >doing nothing useful and you might as well not be calling it or doing > >the extra work that leads up to that. Perhaps that could be mitigated > >with an adaptive approach: too many PrefetchBuffer() hits and we stop > >trying to prefetch, too many XLogReadBufferForRedo() misses and we > >start trying to prefetch. That might work nicely for systems that > >start out with cold caches but eventually warm up. I haven't looked > >into this. > > > > I think the question is what's the cost of doing such unnecessary > prefetch. Presumably it's fairly cheap, especially compared to the > opposite case (not prefetching a block not in shared buffers). I wonder > how expensive would the adaptive logic be on cases that never need a > prefetch (i.e. datasets smaller than shared_buffers). Hmm. It's basically a buffer map probe. I think the adaptive logic would probably be some kind of periodically resetting counter scheme, but you're probably right to suspect that it might not even be worth bothering with, especially if a single XLogReader can be made to do the readahead with no real extra cost. Perhaps we should work on making the cost of all prefetching overheads as low as possible first, before trying to figure out whether it's worth building a system for avoiding it. 
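For concreteness, here is a rough sketch of the "pread() into a dummy buffer" idea described at the start of this message; the function name and surrounding details are invented, and this is not the code from the linked bgreader branch.

#define _XOPEN_SOURCE 700
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Read the block into a throwaway buffer purely for the side effect of
 * pulling it into the kernel's page cache, so that recovery's later real
 * read doesn't stall.
 */
static void
warm_kernel_cache(int fd, unsigned int blockno)
{
    char        discard[BLCKSZ];    /* data is thrown away */
    ssize_t     nread;

    nread = pread(fd, discard, sizeof(discard), (off_t) blockno * BLCKSZ);

    /*
     * Short reads and errors are deliberately ignored: this is only a hint,
     * and recovery will perform (and properly error-check) the real read.
     */
    (void) nread;
}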
> >* The prefetch distance is set too low so that pread() waits are not > >avoided, or your storage subsystem can't actually perform enough > >concurrent I/O to get ahead of the random access pattern you're > >generating, so no distance would be far enough ahead. To help with > >the former case, perhaps we could invent something smarter than a > >user-supplied distance (something like "N cold block references > >ahead", possibly using effective_io_concurrency, rather than "N bytes > >ahead"). > > > > In general, I find it quite non-intuitive to configure prefetching by > specifying WAL distance. I mean, how would you know what's a good value? > If you know the storage hardware, you probably know the optimal queue > depth i.e. you know you the number of requests to get best throughput. FWIW, on pgbench tests on flash storage I've found that 1KB only helps a bit, 8KB is great, and more than that doesn't get any better. Of course, this is meaningless in general; a zipfian workload might need to look a lot further head than a uniform one to find anything worth prefetching, and that's exactly what you're complaining about, and I agree. > But how do you deduce the WAL distance from that? I don't know. Plus > right after the checkpoint the WAL contains FPW, reducing the number of > blocks in a given amount of WAL (compared to right before a checkpoint). > So I expect users might pick unnecessarily high WAL distance. OTOH with > FPW we don't quite need agressive prefetching, right? Yeah, so you need to be touching blocks more than once between checkpoints, if you want to see speed-up on a system with blocks <= BLCKSZ and FPW on. If checkpoints are far enough apart you'll eventually run out of FPWs and start replaying non-FPW stuff. Or you could be on a filesystem with larger blocks than PostgreSQL. > Could we instead specify the number of blocks to prefetch? We'd probably > need to track additional details needed to determine number of blocks to > prefetch (essentially LSN for all prefetch requests). Yeah, I think you're right, we should probably try to make a little queue to track LSNs and count prefetch requests in and out. I think you'd also want PrefetchBuffer() to tell you if the block was already in the buffer pool, so that you don't count blocks that it decided not to prefetch. I guess PrefetchBuffer() needs to return an enum (I already had it returning a bool for another purpose relating to an edge case in crash recovery, when relations have been dropped by a later WAL record). I will think about that. > Another thing to consider might be skipping recently prefetched blocks. > Consider you have a loop that does DML, where each statement creates a > separate WAL record, but it can easily touch the same block over and > over (say inserting to the same page). That means the prefetches are > not really needed, but I'm not sure how expensive it really is. There are two levels of defence against repeatedly prefetching the same block: PrefetchBuffer() checks for blocks that are already in our cache, and before that, PrefetchState remembers the last block so that we can avoid fetching that block (or the following block). [1] https://github.com/macdice/some-io-tests [2] https://www.postgresql.org/message-id/CA%2BTgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr%3DC56Xng%40mail.gmail.com
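To make the "little queue to track LSNs" bookkeeping discussed above concrete, here is a hedged sketch with invented names and a fixed size standing in for an effective_io_concurrency-derived limit; it illustrates the idea only and is not code from the patch.

#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* stand-in for the real typedef */

#define MAX_IN_FLIGHT 32

typedef struct PrefetchQueue
{
    XLogRecPtr  lsns[MAX_IN_FLIGHT];    /* LSN that triggered each prefetch */
    int         head;                   /* next slot to fill */
    int         tail;                   /* oldest outstanding entry */
    int         depth;                  /* current queue depth */
} PrefetchQueue;

/* Record that a prefetch was initiated for a block referenced at 'lsn'. */
static void
prefetch_queue_push(PrefetchQueue *q, XLogRecPtr lsn)
{
    q->lsns[q->head] = lsn;
    q->head = (q->head + 1) % MAX_IN_FLIGHT;
    q->depth++;
}

/* Retire prefetches whose WAL records have now been replayed. */
static void
prefetch_queue_retire(PrefetchQueue *q, XLogRecPtr replayed_upto)
{
    while (q->depth > 0 && q->lsns[q->tail] <= replayed_upto)
    {
        q->tail = (q->tail + 1) % MAX_IN_FLIGHT;
        q->depth--;
    }
}

The prefetcher would only look further ahead while depth is below the concurrency limit, and it would not count blocks that PrefetchBuffer() reported as already cached.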
On Fri, Jan 3, 2020 at 5:57 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > Could we instead specify the number of blocks to prefetch? We'd probably > > need to track additional details needed to determine number of blocks to > > prefetch (essentially LSN for all prefetch requests). Here is a new WIP version of the patch set that does that. Changes: 1. It now uses effective_io_concurrency to control how many concurrent prefetches to allow. It's possible that we should have a different GUC to control "maintenance" users of concurrency I/O as discussed elsewhere[1], but I'm staying out of that for now; if we agree to do that for VACUUM etc, we can change it easily here. Note that the value is percolated through the ComputeIoConcurrency() function which I think we should discuss, but again that's off topic, I just want to use the standard infrastructure here. 2. You can now change the relevant GUCs (wal_prefetch_distance, wal_prefetch_fpw, effective_io_concurrency) at runtime and reload for them to take immediate effect. For example, you can enable the feature on a running replica by setting wal_prefetch_distance=8kB (from the default of -1, which means off), and something like effective_io_concurrency=10, and telling the postmaster to reload. 3. The new code is moved out to a new file src/backend/access/transam/xlogprefetcher.c, to minimise new bloat in the mighty xlog.c file. Functions were renamed to make their purpose clearer, and a lot of comments were added. 4. The WAL receiver now exposes the current 'write' position via an atomic value in shared memory, so we don't need to hammer the WAL receiver's spinlock. 5. There is some rudimentary user documentation of the GUCs. [1] https://www.postgresql.org/message-id/13619.1557935593%40sss.pgh.pa.us
Attachment
- 0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela-v2.patch
- 0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP-v2.patch
- 0003-Add-WalRcvGetWriteRecPtr-new-definition-v2.patch
- 0004-Allow-PrefetchBuffer-to-report-missing-file-in-re-v2.patch
- 0005-Prefetch-referenced-blocks-during-recovery-v2.patch
On Wed, Feb 12, 2020 at 7:52 PM Thomas Munro <thomas.munro@gmail.com> wrote: > 1. It now uses effective_io_concurrency to control how many > concurrent prefetches to allow. It's possible that we should have a > different GUC to control "maintenance" users of concurrency I/O as > discussed elsewhere[1], but I'm staying out of that for now; if we > agree to do that for VACUUM etc, we can change it easily here. Note > that the value is percolated through the ComputeIoConcurrency() > function which I think we should discuss, but again that's off topic, > I just want to use the standard infrastructure here. I started a separate thread[1] to discuss that GUC, because it's basically an independent question. Meanwhile, here's a new version of the WAL prefetch patch, with the following changes: 1. A monitoring view: postgres=# select * from pg_stat_wal_prefetcher ; prefetch | skip_hit | skip_new | skip_fpw | skip_seq | distance | queue_depth ----------+----------+----------+----------+----------+----------+------------- 95854 | 291458 | 435 | 0 | 26245 | 261800 | 10 (1 row) That shows a bunch of counters for blocks prefetched and skipped for various reasons. It also shows the current read-ahead distance (in bytes of WAL) and queue depth (an approximation of how many I/Os might be in flight, used for rate limiting; I'm struggling to come up with a better short name for this). This can be used to see the effects of experiments with different settings, eg: alter system set effective_io_concurrency = 20; alter system set wal_prefetch_distance = '256kB'; select pg_reload_conf(); 2. A log message when WAL prefetching begins and ends, so you can see what it did during crash recovery: LOG: WAL prefetch finished at 0/C5E98758; prefetch = 1112628, skip_hit = 3607540, skip_new = 45592, skip_fpw = 0, skip_seq = 177049, avg_distance = 247907.942532, avg_queue_depth = 22.261352 3. A bit of general user documentation. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJUw08dPs_3EUcdO6M90GnjofPYrWp4YSLaBkgYwS-AqA%40mail.gmail.com
Attachment
I tried my luck at a quick read of this patchset. I didn't manage to go over 0005 though, but I agree with Tomas that having this be configurable in terms of bytes of WAL is not very user-friendly. First of all, let me join the crowd chanting that this is badly needed; I don't need to repeat what Chittenden's talk showed. "WAL recovery is now 10x-20x times faster" would be a good item for pg13 press release, I think. > From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001 > From: Thomas Munro <thomas.munro@gmail.com> > Date: Tue, 3 Dec 2019 17:13:40 +1300 > Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation. > > Previously a Relation was required, but it's annoying to have > to create a "fake" one in recovery. LGTM. It's a pity to have to include smgr.h in bufmgr.h. Maybe it'd be sane to use a forward struct declaration and "struct SMgrRelation *" instead. > From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001 > From: Thomas Munro <tmunro@postgresql.org> > Date: Mon, 9 Dec 2019 17:10:17 +1300 > Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr(). > > The new name better reflects the fact that the value it returns > is updated only when received data has been flushed to disk. > > An upcoming patch will make use of the latest data that was > written without waiting for it to be flushed, so use more > precise function names. Ugh. (Not for your patch -- I mean for the existing naming convention). It would make sense to rename WalRcvData->receivedUpto in this commit, maybe to flushedUpto. > From d7fa7d82c5f68d0cccf441ce9e8dfa40f64d3e0d Mon Sep 17 00:00:00 2001 > From: Thomas Munro <tmunro@postgresql.org> > Date: Mon, 9 Dec 2019 17:22:07 +1300 > Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition). > > A later patch will read received WAL to prefetch referenced blocks, > without waiting for the data to be flushed to disk. To do that, > it needs to be able to see the write pointer advancing in shared > memory. > > The function formerly bearing name was recently renamed to > WalRcvGetFlushRecPtr(), which better described what it does. > + pg_atomic_init_u64(&WalRcv->writtenUpto, 0); Umm, how come you're using WalRcv here instead of walrcv? I would flag this patch for sneaky nastiness if this weren't mostly harmless. (I think we should do away with local walrcv pointers altogether. But that should be a separate patch, I think.) > + pg_atomic_uint64 writtenUpto; Are we already using uint64s for XLogRecPtrs anywhere? This seems novel. Given this, I wonder if the comment near "mutex" needs an update ("except where atomics are used"), or perhaps just move the member to after the line with mutex. I didn't understand the purpose of inc_counter() as written. Why not just pg_atomic_fetch_add_u64(..., 1)? > /* > * smgrprefetch() -- Initiate asynchronous read of the specified block of a relation. > + * > + * In recovery only, this can return false to indicate that a file > + * doesn't exist (presumably it has been dropped by a later WAL > + * record). > */ > -void > +bool > smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) I think this API, where the behavior of a low-level module changes depending on InRecovery, is confusingly crazy. I'd rather have the callers specifying whether they're OK with a file that doesn't exist. 
> +extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln, > + ForkNumber forkNum, > + BlockNumber blockNum); > extern void PrefetchBuffer(Relation reln, ForkNumber forkNum, > BlockNumber blockNum); Umm, I would keep the return values of both these functions in sync. It's really strange that PrefetchBuffer does not return PrefetchBufferResult, don't you think? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Alvaro, On Sat, Mar 14, 2020 at 10:15 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I tried my luck at a quick read of this patchset. Thanks! Here's a new patch set, and some inline responses to your feedback: > I didn't manage to go over 0005 though, but I agree with Tomas that > having this be configurable in terms of bytes of WAL is not very > user-friendly. The primary control is now maintenance_io_concurrency, which is basically what Tomas suggested. The byte-based control is just a cap to prevent it reading a crazy distance ahead, that also functions as the on/off switch for the feature. In this version I've added "max" to the name, to make that clearer. > First of all, let me join the crowd chanting that this is badly needed; > I don't need to repeat what Chittenden's talk showed. "WAL recovery is > now 10x-20x times faster" would be a good item for pg13 press release, > I think. We should be careful about over-promising here: Sean basically had a best case scenario for this type of techology, partly due to his 16kB filesystem blocks. Common results may be a lot more pedestrian, though it could get more interesting if we figure out how to get rid of FPWs... > > From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001 > > From: Thomas Munro <thomas.munro@gmail.com> > > Date: Tue, 3 Dec 2019 17:13:40 +1300 > > Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation. > > > > Previously a Relation was required, but it's annoying to have > > to create a "fake" one in recovery. > > LGTM. > > It's a pity to have to include smgr.h in bufmgr.h. Maybe it'd be sane > to use a forward struct declaration and "struct SMgrRelation *" instead. OK, done. While staring at this, I decided that SharedPrefetchBuffer() was a weird word order, so I changed it to PrefetchSharedBuffer(). Then, by analogy, I figured I should also change the pre-existing function LocalPrefetchBuffer() to PrefetchLocalBuffer(). Do you think this is an improvement? > > From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001 > > From: Thomas Munro <tmunro@postgresql.org> > > Date: Mon, 9 Dec 2019 17:10:17 +1300 > > Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr(). > > > > The new name better reflects the fact that the value it returns > > is updated only when received data has been flushed to disk. > > > > An upcoming patch will make use of the latest data that was > > written without waiting for it to be flushed, so use more > > precise function names. > > Ugh. (Not for your patch -- I mean for the existing naming convention). > It would make sense to rename WalRcvData->receivedUpto in this commit, > maybe to flushedUpto. Ok, I renamed that variable and a related one. There are more things you could rename if you pull on that thread some more, including pg_stat_wal_receiver's received_lsn column, but I didn't do that in this patch. > > From d7fa7d82c5f68d0cccf441ce9e8dfa40f64d3e0d Mon Sep 17 00:00:00 2001 > > From: Thomas Munro <tmunro@postgresql.org> > > Date: Mon, 9 Dec 2019 17:22:07 +1300 > > Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition). > > > > A later patch will read received WAL to prefetch referenced blocks, > > without waiting for the data to be flushed to disk. To do that, > > it needs to be able to see the write pointer advancing in shared > > memory. > > > > The function formerly bearing name was recently renamed to > > WalRcvGetFlushRecPtr(), which better described what it does. 
> > > + pg_atomic_init_u64(&WalRcv->writtenUpto, 0); > > Umm, how come you're using WalRcv here instead of walrcv? I would flag > this patch for sneaky nastiness if this weren't mostly harmless. (I > think we should do away with local walrcv pointers altogether. But that > should be a separate patch, I think.) OK, done. > > + pg_atomic_uint64 writtenUpto; > > Are we already using uint64s for XLogRecPtrs anywhere? This seems > novel. Given this, I wonder if the comment near "mutex" needs an > update ("except where atomics are used"), or perhaps just move the > member to after the line with mutex. Moved. We use [u]int64 in various places in the replication code. Ideally I'd have a magic way to say atomic<XLogRecPtr> so I didn't have to assume that pg_atomic_uint64 is the right atomic integer width and signedness, but here we are. In dsa.h I made a special typedef for the atomic version of something else, but that's because the size of that thing varied depending on the build, whereas our LSNs are of a fixed width that ought to be en... <trails off>. > I didn't understand the purpose of inc_counter() as written. Why not > just pg_atomic_fetch_add_u64(..., 1)? I didn't want counters that wrap at ~4 billion, but I did want to be able to read and write concurrently without tearing. Instructions like "lock xadd" would provide more guarantees that I don't need, since only one thread is doing all the writing and there's no ordering requirement. It's basically just counter++, but some platforms need a spinlock to perform atomic read and write of 64 bit wide numbers, so more hoop jumping is required. > > /* > > * smgrprefetch() -- Initiate asynchronous read of the specified block of a relation. > > + * > > + * In recovery only, this can return false to indicate that a file > > + * doesn't exist (presumably it has been dropped by a later WAL > > + * record). > > */ > > -void > > +bool > > smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) > > I think this API, where the behavior of a low-level module changes > depending on InRecovery, is confusingly crazy. I'd rather have the > callers specifying whether they're OK with a file that doesn't exist. Hmm. But... md.c has other code like that. It's true that I'm adding InRecovery awareness to a function that didn't previously have it, but that's just because we previously had no reason to prefetch stuff in recovery. > > +extern PrefetchBufferResult SharedPrefetchBuffer(SMgrRelation smgr_reln, > > + ForkNumber forkNum, > > + BlockNumber blockNum); > > extern void PrefetchBuffer(Relation reln, ForkNumber forkNum, > > BlockNumber blockNum); > > Umm, I would keep the return values of both these functions in sync. > It's really strange that PrefetchBuffer does not return > PrefetchBufferResult, don't you think? Agreed, and changed. I suspect that other users of the main PrefetchBuffer() call will eventually want that, to do a better job of keeping the request queue full, for example bitmap heap scan and (hypothetical) btree scan with prefetch.
Attachment
On 2020-Mar-17, Thomas Munro wrote: Hi Thomas > On Sat, Mar 14, 2020 at 10:15 AM Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > I didn't manage to go over 0005 though, but I agree with Tomas that > > having this be configurable in terms of bytes of WAL is not very > > user-friendly. > > The primary control is now maintenance_io_concurrency, which is > basically what Tomas suggested. > The byte-based control is just a cap to prevent it reading a crazy > distance ahead, that also functions as the on/off switch for the > feature. In this version I've added "max" to the name, to make that > clearer. Mumble. I guess I should wait to comment on this after reading 0005 more in depth. > > First of all, let me join the crowd chanting that this is badly needed; > > I don't need to repeat what Chittenden's talk showed. "WAL recovery is > > now 10x-20x times faster" would be a good item for pg13 press release, > > I think. > > We should be careful about over-promising here: Sean basically had a > best case scenario for this type of techology, partly due to his 16kB > filesystem blocks. Common results may be a lot more pedestrian, > though it could get more interesting if we figure out how to get rid > of FPWs... Well, in my mind it's an established fact that our WAL replay uses far too little of the available I/O speed. I guess if the system is generating little WAL, then this change will show no benefit, but that's not the kind of system that cares about this anyway -- for the others, the parallelisation gains will be substantial, I'm sure. > > > From a61b4e00c42ace5db1608e02165f89094bf86391 Mon Sep 17 00:00:00 2001 > > > From: Thomas Munro <thomas.munro@gmail.com> > > > Date: Tue, 3 Dec 2019 17:13:40 +1300 > > > Subject: [PATCH 1/5] Allow PrefetchBuffer() to be called with a SMgrRelation. > > > > > > Previously a Relation was required, but it's annoying to have > > > to create a "fake" one in recovery. > While staring at this, I decided that SharedPrefetchBuffer() was a > weird word order, so I changed it to PrefetchSharedBuffer(). Then, by > analogy, I figured I should also change the pre-existing function > LocalPrefetchBuffer() to PrefetchLocalBuffer(). Do you think this is > an improvement? Looks good. I doubt you'll break anything by renaming that routine. > > > From acbff1444d0acce71b0218ce083df03992af1581 Mon Sep 17 00:00:00 2001 > > > From: Thomas Munro <tmunro@postgresql.org> > > > Date: Mon, 9 Dec 2019 17:10:17 +1300 > > > Subject: [PATCH 2/5] Rename GetWalRcvWriteRecPtr() to GetWalRcvFlushRecPtr(). > > > > > > The new name better reflects the fact that the value it returns > > > is updated only when received data has been flushed to disk. > > > > > > An upcoming patch will make use of the latest data that was > > > written without waiting for it to be flushed, so use more > > > precise function names. > > > > Ugh. (Not for your patch -- I mean for the existing naming convention). > > It would make sense to rename WalRcvData->receivedUpto in this commit, > > maybe to flushedUpto. > > Ok, I renamed that variable and a related one. There are more things > you could rename if you pull on that thread some more, including > pg_stat_wal_receiver's received_lsn column, but I didn't do that in > this patch. +1 for that approach. Maybe we'll want to rename the SQL-visible name, but I wouldn't burden this patch with that, lest we lose the entire series to that :-) > > > + pg_atomic_uint64 writtenUpto; > > > > Are we already using uint64s for XLogRecPtrs anywhere? This seems > > novel. 
Given this, I wonder if the comment near "mutex" needs an > > update ("except where atomics are used"), or perhaps just move the > > member to after the line with mutex. > > Moved. LGTM. > We use [u]int64 in various places in the replication code. Ideally > I'd have a magic way to say atomic<XLogRecPtr> so I didn't have to > assume that pg_atomic_uint64 is the right atomic integer width and > signedness, but here we are. In dsa.h I made a special typedef for > the atomic version of something else, but that's because the size of > that thing varied depending on the build, whereas our LSNs are of a > fixed width that ought to be en... <trails off>. Let's rewrite Postgres in Rust ... > > I didn't understand the purpose of inc_counter() as written. Why not > > just pg_atomic_fetch_add_u64(..., 1)? > > I didn't want counters that wrap at ~4 billion, but I did want to be > able to read and write concurrently without tearing. Instructions > like "lock xadd" would provide more guarantees that I don't need, > since only one thread is doing all the writing and there's no ordering > requirement. It's basically just counter++, but some platforms need a > spinlock to perform atomic read and write of 64 bit wide numbers, so > more hoop jumping is required. Ah, I see, you don't want lock xadd ... That's non-obvious. I suppose the function could use more commentary on *why* you're doing it that way then. > > > /* > > > * smgrprefetch() -- Initiate asynchronous read of the specified block of a relation. > > > + * > > > + * In recovery only, this can return false to indicate that a file > > > + * doesn't exist (presumably it has been dropped by a later WAL > > > + * record). > > > */ > > > -void > > > +bool > > > smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum) > > > > I think this API, where the behavior of a low-level module changes > > depending on InRecovery, is confusingly crazy. I'd rather have the > > callers specifying whether they're OK with a file that doesn't exist. > > Hmm. But... md.c has other code like that. It's true that I'm adding > InRecovery awareness to a function that didn't previously have it, but > that's just because we previously had no reason to prefetch stuff in > recovery. True. I'm uncomfortable about it anyway. I also noticed that _mdfd_getseg() already has InRecovery-specific behavior flags. Clearly that ship has sailed. Consider my objection^W comment withdrawn. > > Umm, I would keep the return values of both these functions in sync. > > It's really strange that PrefetchBuffer does not return > > PrefetchBufferResult, don't you think? > > Agreed, and changed. I suspect that other users of the main > PrefetchBuffer() call will eventually want that, to do a better job of > keeping the request queue full, for example bitmap heap scan and > (hypothetical) btree scan with prefetch. LGTM. As before, I didn't get to reading 0005 in depth. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 18, 2020 at 2:47 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2020-Mar-17, Thomas Munro wrote: > > I didn't want counters that wrap at ~4 billion, but I did want to be > > able to read and write concurrently without tearing. Instructions > > like "lock xadd" would provide more guarantees that I don't need, > > since only one thread is doing all the writing and there's no ordering > > requirement. It's basically just counter++, but some platforms need a > > spinlock to perform atomic read and write of 64 bit wide numbers, so > > more hoop jumping is required. > > Ah, I see, you don't want lock xadd ... That's non-obvious. I suppose > the function could use more commentary on *why* you're doing it that way > then. I updated the comment: +/* + * On modern systems this is really just *counter++. On some older systems + * there might be more to it, due to inability to read and write 64 bit values + * atomically. The counters will only be written to by one process, and there + * is no ordering requirement, so there's no point in using higher overhead + * pg_atomic_fetch_add_u64(). + */ +static inline void inc_counter(pg_atomic_uint64 *counter) > > > Umm, I would keep the return values of both these functions in sync. > > > It's really strange that PrefetchBuffer does not return > > > PrefetchBufferResult, don't you think? > > > > Agreed, and changed. I suspect that other users of the main > > PrefetchBuffer() call will eventually want that, to do a better job of > > keeping the request queue full, for example bitmap heap scan and > > (hypothetical) btree scan with prefetch. > > LGTM. Here's a new version that changes that part just a bit more, after a brief chat with Andres about his async I/O plans. It seems clear that returning an enum isn't very extensible, so I decided to try making PrefetchBufferResult a struct whose contents can be extended in the future. In this patch set it's still just used to distinguish 3 cases (hit, miss, no file), but it's now expressed as a buffer and a flag to indicate whether I/O was initiated. You could imagine that the second thing might be replaced by a pointer to an async I/O handle you can wait on or some other magical thing from the future. The concept here is that eventually we'll have just one XLogReader for both read ahead and recovery, and we could attach the prefetch results to the decoded records, and then recovery would try to use already looked up buffers to avoid a bit of work (and then recheck). In other words, the WAL would be decoded only once, and the buffers would hopefully be looked up only once, so you'd claw back all of the overheads of this patch. For now that's not done, and the buffer in the result is only compared with InvalidBuffer to check if there was a hit or not. Similar things could be done for bitmap heap scan and btree prefetch with this interface: their prefetch machinery could hold onto these results in their block arrays and try to avoid a more expensive ReadBuffer() call if they already have a buffer (though as before, there's a small chance it turns out to be the wrong one and they need to fall back to ReadBuffer()). > As before, I didn't get to reading 0005 in depth. Updated to account for the above-mentioned change, and with a couple of elog() calls changed to ereport().
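For illustration, here is a hedged sketch of how a caller might consume such a result. The stand-in typedefs and the helper are invented so the fragment is self-contained; only the two fields discussed here (a buffer hint and an initiated-I/O flag) are assumed.

#include <stdbool.h>

typedef int Buffer;                     /* stand-ins for the real definitions */
#define InvalidBuffer 0
#define BufferIsValid(b) ((b) != InvalidBuffer)

typedef struct PrefetchBufferResult
{
    Buffer      buffer;                 /* if valid, a hit (must be rechecked) */
    bool        initiated_io;           /* true if a miss started an async read */
} PrefetchBufferResult;

/*
 * Example caller logic: count an I/O against the concurrency budget only when
 * one was actually initiated, and remember a hit as an unpinned hint that must
 * be rechecked (it may be evicted before it is used).
 */
static void
account_for_prefetch(PrefetchBufferResult result,
                     int *ios_in_flight, Buffer *hinted_buffer)
{
    if (result.initiated_io)
        (*ios_in_flight)++;             /* miss: async read started */
    else if (BufferIsValid(result.buffer))
        *hinted_buffer = result.buffer; /* hit: recheck before relying on it */
    /* else: neither hit nor miss, e.g. the file is missing during recovery */
}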
Attachment
Hi, On 2020-03-18 18:18:44 +1300, Thomas Munro wrote: > From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001 > From: Thomas Munro <tmunro@postgresql.org> > Date: Mon, 9 Dec 2019 17:22:07 +1300 > Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition). > > A later patch will read received WAL to prefetch referenced blocks, > without waiting for the data to be flushed to disk. To do that, it > needs to be able to see the write pointer advancing in shared memory. > > The function formerly bearing name was recently renamed to > WalRcvGetFlushRecPtr(), which better described what it does. Hm. I'm a bit weary of reusing the name with a different meaning. If there's any external references, this'll hide that they need to adapt. Perhaps, even if it's a bit clunky, name it GetUnflushedRecPtr? > From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001 > From: Thomas Munro <thomas.munro@gmail.com> > Date: Tue, 17 Mar 2020 17:26:41 +1300 > Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened. > > Report whether a prefetch was actually initiated due to a cache miss, so > that callers can limit the number of concurrent I/Os they try to issue, > without counting the prefetch calls that did nothing because the page > was already in our buffers. > > If the requested block was already cached, return a valid buffer. This > might enable future code to avoid a buffer mapping lookup, though it > will need to recheck the buffer before using it because it's not pinned > so could be reclaimed at any time. > > Report neither hit nor miss when a relation's backing file is missing, > to prepare for use during recovery. This will be used to handle cases > of relations that are referenced in the WAL but have been unlinked > already due to actions covered by WAL records that haven't been replayed > yet, after a crash. We probably should take this into account in nodeBitmapHeapscan.c > diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c > index d30aed6fd9..4ceb40a856 100644 > --- a/src/backend/storage/buffer/bufmgr.c > +++ b/src/backend/storage/buffer/bufmgr.c > @@ -469,11 +469,13 @@ static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg); > /* > * Implementation of PrefetchBuffer() for shared buffers. > */ > -void > +PrefetchBufferResult > PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln, > ForkNumber forkNum, > BlockNumber blockNum) > { > + PrefetchBufferResult result = { InvalidBuffer, false }; > + > #ifdef USE_PREFETCH > BufferTag newTag; /* identity of requested block */ > uint32 newHash; /* hash value for newTag */ > @@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln, > > /* If not in buffers, initiate prefetch */ > if (buf_id < 0) > - smgrprefetch(smgr_reln, forkNum, blockNum); > + { > + /* > + * Try to initiate an asynchronous read. This returns false in > + * recovery if the relation file doesn't exist. > + */ > + if (smgrprefetch(smgr_reln, forkNum, blockNum)) > + result.initiated_io = true; > + } > + else > + { > + /* > + * Report the buffer it was in at that time. The caller may be able > + * to avoid a buffer table lookup, but it's not pinned and it must be > + * rechecked! > + */ > + result.buffer = buf_id + 1; Perhaps it'd be better to name this "last_buffer" or such, to make it clearer that it may be outdated? 
> -void > +PrefetchBufferResult > PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum) > { > #ifdef USE_PREFETCH > @@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum) > errmsg("cannot access temporary tables of other sessions"))); > > /* pass it off to localbuf.c */ > - PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum); > + return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum); > } > else > { > /* pass it to the shared buffer version */ > - PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum); > + return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum); > } > +#else > + PrefetchBuffer result = { InvalidBuffer, false }; > + > + return result; > #endif /* USE_PREFETCH */ > } Hm. Now that results are returned indicating whether the buffer is in s_b - shouldn't the return value be accurate regardless of USE_PREFETCH? > +/* > + * Type returned by PrefetchBuffer(). > + */ > +typedef struct PrefetchBufferResult > +{ > + Buffer buffer; /* If valid, a hit (recheck needed!) */ I assume there's no user of this yet? Even if there's not, I wonder if it still is worth adding and referencing a helper to do so correctly? > From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001 > From: Thomas Munro <thomas.munro@gmail.com> > Date: Wed, 18 Mar 2020 16:35:27 +1300 > Subject: [PATCH 5/5] Prefetch referenced blocks during recovery. > > Introduce a new GUC max_wal_prefetch_distance. If it is set to a > positive number of bytes, then read ahead in the WAL at most that > distance, and initiate asynchronous reading of referenced blocks. The > goal is to avoid I/O stalls and benefit from concurrent I/O. The number > of concurrency asynchronous reads is capped by the existing > maintenance_io_concurrency GUC. The feature is disabled by default. > > Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> > Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> > Discussion: > https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com Why is it disabled by default? Just for "risk management"? > + <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance"> > + <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>) > + <indexterm> > + <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary> > + </indexterm> > + </term> > + <listitem> > + <para> > + The maximum distance to look ahead in the WAL during recovery, to find > + blocks to prefetch. Prefetching blocks that will soon be needed can > + reduce I/O wait times. The number of concurrent prefetches is limited > + by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>. > + If this value is specified without units, it is taken as bytes. > + The default is -1, meaning that WAL prefetching is disabled. > + </para> > + </listitem> > + </varlistentry> Is it worth noting that a too large distance could hurt, because the buffers might get evicted again? > + <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw"> > + <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>) > + <indexterm> > + <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary> > + </indexterm> > + </term> > + <listitem> > + <para> > + Whether to prefetch blocks with full page images during recovery. > + Usually this doesn't help, since such blocks will not be read. 
However, > + on file systems with a block size larger than > + <productname>PostgreSQL</productname>'s, prefetching can avoid a costly > + read-before-write when a blocks are later written. > + This setting has no effect unless > + <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number. > + The default is off. > + </para> > + </listitem> > + </varlistentry> Hm. I think this needs more details - it's not clear enough what this actually controls. I assume it's about prefetching for WAL records that contain the FPW, but it also could be read to be about not prefetching any pages that had FPWs before, or such? > </variablelist> > </sect2> > <sect2 id="runtime-config-wal-archiving"> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml > index 987580d6df..df4291092b 100644 > --- a/doc/src/sgml/monitoring.sgml > +++ b/doc/src/sgml/monitoring.sgml > @@ -320,6 +320,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser > </entry> > </row> > > + <row> > + <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry> > + <entry>Only one row, showing statistics about blocks prefetched during recovery. > + See <xref linkend="pg-stat-wal-prefetcher-view"/> for details. > + </entry> > + </row> > + 'prefetcher' somehow sounds odd to me. I also suspect that we'll want to have additional prefetching stat tables going forward. Perhaps 'pg_stat_prefetch_wal'? > + <row> > + <entry><structfield>distance</structfield></entry> > + <entry><type>integer</type></entry> > + <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry> > + </row> > + <row> > + <entry><structfield>queue_depth</structfield></entry> > + <entry><type>integer</type></entry> > + <entry>How many prefetches have been initiated but are not yet known to have completed</entry> > + </row> > + </tbody> > + </tgroup> > + </table> Is there a way we could have a "historical" version of at least some of these? An average queue depth, or such? It'd be useful to somewhere track the time spent initiating prefetch requests. Otherwise it's quite hard to evaluate whether the queue is too deep (and just blocks in the OS). I think it'd be good to have a 'reset time' column. > + <para> > + The <structname>pg_stat_wal_prefetcher</structname> view will contain only > + one row. It is filled with nulls if recovery is not running or WAL > + prefetching is not enabled. See <xref linkend="guc-max-wal-prefetch-distance"/> > + for more information. The counters in this view are reset whenever the > + <xref linkend="guc-max-wal-prefetch-distance"/>, > + <xref linkend="guc-wal-prefetch-fpw"/> or > + <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and > + the server configuration is reloaded. > + </para> > + So pg_stat_reset_shared() cannot be used? If so, why? It sounds like the counters aren't persisted via the stats system - if so, why? > @@ -7105,6 +7114,31 @@ StartupXLOG(void) > /* Handle interrupt signals of startup process */ > HandleStartupProcInterrupts(); > > + /* > + * The first time through, or if any relevant settings or the > + * WAL source changes, we'll restart the prefetching machinery > + * as appropriate. This is simpler than trying to handle > + * various complicated state changes. > + */ > + if (unlikely(reset_wal_prefetcher)) > + { > + /* If we had one already, destroy it. 
*/ > + if (prefetcher) > + { > + XLogPrefetcherFree(prefetcher); > + prefetcher = NULL; > + } > + /* If we want one, create it. */ > + if (max_wal_prefetch_distance > 0) > + prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr, > + currentSource == XLOG_FROM_STREAM); > + reset_wal_prefetcher = false; > + } Do we really need all of this code in StartupXLOG() itself? Could it be in HandleStartupProcInterrupts() or at least a helper routine called here? > + /* Peform WAL prefetching, if enabled. */ > + if (prefetcher) > + XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr); > + > /* > * Pause WAL replay, if requested by a hot-standby session via > * SetRecoveryPause(). Personally, I'd rather have the if () be in XLogPrefetcherReadAhead(). With an inline wrapper doing the check, if the call bothers you (but I don't think it needs to). > +/*------------------------------------------------------------------------- > + * > + * xlogprefetcher.c > + * Prefetching support for PostgreSQL write-ahead log manager > + * An architectural overview here would be good. > +struct XLogPrefetcher > +{ > + /* Reader and current reading state. */ > + XLogReaderState *reader; > + XLogReadLocalOptions options; > + bool have_record; > + bool shutdown; > + int next_block_id; > + > + /* Book-keeping required to avoid accessing non-existing blocks. */ > + HTAB *filter_table; > + dlist_head filter_queue; > + > + /* Book-keeping required to limit concurrent prefetches. */ > + XLogRecPtr *prefetch_queue; > + int prefetch_queue_size; > + int prefetch_head; > + int prefetch_tail; > + > + /* Details of last prefetch to skip repeats and seq scans. */ > + SMgrRelation last_reln; > + RelFileNode last_rnode; > + BlockNumber last_blkno; Do you have a comment somewhere explaining why you want to avoid seqscans (I assume it's about avoiding regressions in linux, but only because I recall chatting with you about it). > +/* > + * On modern systems this is really just *counter++. On some older systems > + * there might be more to it, due to inability to read and write 64 bit values > + * atomically. The counters will only be written to by one process, and there > + * is no ordering requirement, so there's no point in using higher overhead > + * pg_atomic_fetch_add_u64(). > + */ > +static inline void inc_counter(pg_atomic_uint64 *counter) > +{ > + pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1); > +} Could be worthwhile to add to the atomics infrastructure itself - on the platforms where this needs spinlocks this will lead to two acquisitions, rather than one. > +/* > + * Create a prefetcher that is ready to begin prefetching blocks referenced by > + * WAL that is ahead of the given lsn. > + */ > +XLogPrefetcher * > +XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming) > +{ > + static HASHCTL hash_table_ctl = { > + .keysize = sizeof(RelFileNode), > + .entrysize = sizeof(XLogPrefetcherFilter) > + }; > + XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher)); > + > + prefetcher->options.nowait = true; > + if (streaming) > + { > + /* > + * We're only allowed to read as far as the WAL receiver has written. > + * We don't have to wait for it to be flushed, though, as recovery > + * does, so that gives us a chance to get a bit further ahead. > + */ > + prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN; > + } > + else > + { > + /* We're allowed to read as far as we can. 
*/ > + prefetcher->options.read_upto_policy = XLRO_LSN; > + prefetcher->options.lsn = (XLogRecPtr) -1; > + } > + prefetcher->reader = XLogReaderAllocate(wal_segment_size, > + NULL, > + read_local_xlog_page, > + &prefetcher->options); > + prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024, > + &hash_table_ctl, > + HASH_ELEM | HASH_BLOBS); > + dlist_init(&prefetcher->filter_queue); > + > + /* > + * The size of the queue is based on the maintenance_io_concurrency > + * setting. In theory we might have a separate queue for each tablespace, > + * but it's not clear how that should work, so for now we'll just use the > + * general GUC to rate-limit all prefetching. > + */ > + prefetcher->prefetch_queue_size = maintenance_io_concurrency; > + prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size); > + prefetcher->prefetch_head = prefetcher->prefetch_tail = 0; > + > + /* Prepare to read at the given LSN. */ > + ereport(LOG, > + (errmsg("WAL prefetch started at %X/%X", > + (uint32) (lsn << 32), (uint32) lsn))); > + XLogBeginRead(prefetcher->reader, lsn); > + > + XLogPrefetcherResetMonitoringStats(); > + > + return prefetcher; > +} > + > +/* > + * Destroy a prefetcher and release all resources. > + */ > +void > +XLogPrefetcherFree(XLogPrefetcher *prefetcher) > +{ > + double avg_distance = 0; > + double avg_queue_depth = 0; > + > + /* Log final statistics. */ > + if (prefetcher->samples > 0) > + { > + avg_distance = prefetcher->distance_sum / prefetcher->samples; > + avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples; > + } > + ereport(LOG, > + (errmsg("WAL prefetch finished at %X/%X; " > + "prefetch = " UINT64_FORMAT ", " > + "skip_hit = " UINT64_FORMAT ", " > + "skip_new = " UINT64_FORMAT ", " > + "skip_fpw = " UINT64_FORMAT ", " > + "skip_seq = " UINT64_FORMAT ", " > + "avg_distance = %f, " > + "avg_queue_depth = %f", > + (uint32) (prefetcher->reader->EndRecPtr << 32), > + (uint32) (prefetcher->reader->EndRecPtr), > + pg_atomic_read_u64(&MonitoringStats->prefetch), > + pg_atomic_read_u64(&MonitoringStats->skip_hit), > + pg_atomic_read_u64(&MonitoringStats->skip_new), > + pg_atomic_read_u64(&MonitoringStats->skip_fpw), > + pg_atomic_read_u64(&MonitoringStats->skip_seq), > + avg_distance, > + avg_queue_depth))); > + XLogReaderFree(prefetcher->reader); > + hash_destroy(prefetcher->filter_table); > + pfree(prefetcher->prefetch_queue); > + pfree(prefetcher); > + > + XLogPrefetcherResetMonitoringStats(); > +} It's possibly overkill, but I think it'd be a good idea to do all the allocations within a prefetch specific memory context. That makes detecting potential leaks or such easier. > + /* Can we drop any filters yet, due to problem records begin replayed? */ Odd grammar. > + XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn); Hm, why isn't this part of the loop below? > + /* Main prefetch loop. */ > + for (;;) > + { This kind of looks like a separate process' main loop. The name indicates similar. And there's no architecture documentation disinclining one from that view... The loop body is quite long. I think it should be split into a number of helper functions. Perhaps one to ensure a block is read, one to maintain stats, and then one to process block references? > + /* > + * Scan the record for block references. We might already have been > + * partway through processing this record when we hit maximum I/O > + * concurrency, so start where we left off. 
> + */ > + for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i) > + { Super pointless nitpickery: For a loop-body this big I'd rather name 'i' 'blockid' or such. Greetings, Andres Freund
Hi, Thanks for all that feedback. It's been a strange couple of weeks, but I finally have a new version that addresses most of that feedback (but punts on a couple of suggestions for later development, due to lack of time). It also fixes a couple of other problems I found with the previous version: 1. While streaming, whenever it hit the end of available data (ie LSN written by WAL receiver), it would close and then reopen the WAL segment. Fixed by the machinery in 0007 which allows for "would block" as distinct from other errors. 2. During crash recovery, there were some edge cases where it would try to read the next WAL segment when there isn't one. Also fixed by 0007. 3. It was maxing out at maintenance_io_concurrency - 1 due to a silly circular buffer fence post bug. Note that 0006 is just for illustration, it's not proposed for commit. On Wed, Mar 25, 2020 at 11:31 AM Andres Freund <andres@anarazel.de> wrote: > On 2020-03-18 18:18:44 +1300, Thomas Munro wrote: > > From 1b03eb5ada24c3b23ab8ca6db50e0c5d90d38259 Mon Sep 17 00:00:00 2001 > > From: Thomas Munro <tmunro@postgresql.org> > > Date: Mon, 9 Dec 2019 17:22:07 +1300 > > Subject: [PATCH 3/5] Add WalRcvGetWriteRecPtr() (new definition). > > > > A later patch will read received WAL to prefetch referenced blocks, > > without waiting for the data to be flushed to disk. To do that, it > > needs to be able to see the write pointer advancing in shared memory. > > > > The function formerly bearing name was recently renamed to > > WalRcvGetFlushRecPtr(), which better described what it does. > > Hm. I'm a bit weary of reusing the name with a different meaning. If > there's any external references, this'll hide that they need to > adapt. Perhaps, even if it's a bit clunky, name it GetUnflushedRecPtr? Well, at least external code won't compile due to the change in arguments: extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI); extern XLogRecPtr GetWalRcvWriteRecPtr(void); Anyone who is using that for some kind of data integrity purposes should hopefully be triggered to investigate, no? I tried to think of a better naming scheme but... > > From c62fde23f70ff06833d743a1c85716e15f3c813c Mon Sep 17 00:00:00 2001 > > From: Thomas Munro <thomas.munro@gmail.com> > > Date: Tue, 17 Mar 2020 17:26:41 +1300 > > Subject: [PATCH 4/5] Allow PrefetchBuffer() to report what happened. > > > > Report whether a prefetch was actually initiated due to a cache miss, so > > that callers can limit the number of concurrent I/Os they try to issue, > > without counting the prefetch calls that did nothing because the page > > was already in our buffers. > > > > If the requested block was already cached, return a valid buffer. This > > might enable future code to avoid a buffer mapping lookup, though it > > will need to recheck the buffer before using it because it's not pinned > > so could be reclaimed at any time. > > > > Report neither hit nor miss when a relation's backing file is missing, > > to prepare for use during recovery. This will be used to handle cases > > of relations that are referenced in the WAL but have been unlinked > > already due to actions covered by WAL records that haven't been replayed > > yet, after a crash. > > We probably should take this into account in nodeBitmapHeapscan.c Indeed. 
The naive version would be something like: diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c index 726d3a2d9a..3cd644d0ac 100644 --- a/src/backend/executor/nodeBitmapHeapscan.c +++ b/src/backend/executor/nodeBitmapHeapscan.c @@ -484,13 +484,11 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan) node->prefetch_iterator = NULL; break; } - node->prefetch_pages++; /* * If we expect not to have to actually read this heap page, * skip this prefetch call, but continue to run the prefetch - * logic normally. (Would it be better not to increment - * prefetch_pages?) + * logic normally. * * This depends on the assumption that the index AM will * report the same recheck flag for this future heap page as @@ -504,7 +502,13 @@ BitmapPrefetch(BitmapHeapScanState *node, TableScanDesc scan) &node->pvmbuffer)); if (!skip_fetch) - PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno); + { + PrefetchBufferResult prefetch; + + prefetch = PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno); + if (prefetch.initiated_io) + node->prefetch_pages++; + } } } ... but that might get arbitrarily far ahead, so it probably needs some kind of cap, and the parallel version is a bit more complicated. Something for later, along with more prefetching opportunities. > > diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c > > index d30aed6fd9..4ceb40a856 100644 > > --- a/src/backend/storage/buffer/bufmgr.c > > +++ b/src/backend/storage/buffer/bufmgr.c > > @@ -469,11 +469,13 @@ static int ts_ckpt_progress_comparator(Datum a, Datum b, void *arg); > > /* > > * Implementation of PrefetchBuffer() for shared buffers. > > */ > > -void > > +PrefetchBufferResult > > PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln, > > ForkNumber forkNum, > > BlockNumber blockNum) > > { > > + PrefetchBufferResult result = { InvalidBuffer, false }; > > + > > #ifdef USE_PREFETCH > > BufferTag newTag; /* identity of requested block */ > > uint32 newHash; /* hash value for newTag */ > > @@ -497,7 +499,23 @@ PrefetchSharedBuffer(struct SMgrRelationData *smgr_reln, > > > > /* If not in buffers, initiate prefetch */ > > if (buf_id < 0) > > - smgrprefetch(smgr_reln, forkNum, blockNum); > > + { > > + /* > > + * Try to initiate an asynchronous read. This returns false in > > + * recovery if the relation file doesn't exist. > > + */ > > + if (smgrprefetch(smgr_reln, forkNum, blockNum)) > > + result.initiated_io = true; > > + } > > + else > > + { > > + /* > > + * Report the buffer it was in at that time. The caller may be able > > + * to avoid a buffer table lookup, but it's not pinned and it must be > > + * rechecked! > > + */ > > + result.buffer = buf_id + 1; > > Perhaps it'd be better to name this "last_buffer" or such, to make it > clearer that it may be outdated? OK. Renamed to "recent_buffer". 
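To make the recheck requirement concrete, here is a rough sketch of how a later reader might try to take advantage of recent_buffer. This is not from the patch set; TryPinAndRecheckBuffer() is a made-up placeholder for the pin-and-verify step that real code would need.

/*
 * Hypothetical sketch only: reusing PrefetchBufferResult.recent_buffer.
 * The buffer was not pinned at prefetch time, so it must be pinned and then
 * rechecked against the expected relation/fork/block before it can be used;
 * TryPinAndRecheckBuffer() is an invented name for that step.
 */
static Buffer
ReadBufferWithHint(Relation rel, ForkNumber forknum, BlockNumber blkno,
                   PrefetchBufferResult hint)
{
    if (BufferIsValid(hint.recent_buffer) &&
        TryPinAndRecheckBuffer(hint.recent_buffer, rel, forknum, blkno))
        return hint.recent_buffer;   /* hint still valid: mapping lookup avoided */

    /* Hint missing or stale: fall back to the normal path. */
    return ReadBufferExtended(rel, forknum, blkno, RBM_NORMAL, NULL);
}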
> > -void > > +PrefetchBufferResult > > PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum) > > { > > #ifdef USE_PREFETCH > > @@ -540,13 +564,17 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum) > > errmsg("cannot access temporary tables of other sessions"))); > > > > /* pass it off to localbuf.c */ > > - PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum); > > + return PrefetchLocalBuffer(reln->rd_smgr, forkNum, blockNum); > > } > > else > > { > > /* pass it to the shared buffer version */ > > - PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum); > > + return PrefetchSharedBuffer(reln->rd_smgr, forkNum, blockNum); > > } > > +#else > > + PrefetchBuffer result = { InvalidBuffer, false }; > > + > > + return result; > > #endif /* USE_PREFETCH */ > > } > > Hm. Now that results are returned indicating whether the buffer is in > s_b - shouldn't the return value be accurate regardless of USE_PREFETCH? Yeah. Done. > > +/* > > + * Type returned by PrefetchBuffer(). > > + */ > > +typedef struct PrefetchBufferResult > > +{ > > + Buffer buffer; /* If valid, a hit (recheck needed!) */ > > I assume there's no user of this yet? Even if there's not, I wonder if > it still is worth adding and referencing a helper to do so correctly? It *is* used, but only to see if it's valid. 0006 is a not-for-commit patch to show how you might use it later to read a buffer. To actually use this for something like bitmap heap scan, you'd first need to fix the modularity violations in that code (I mean we have PrefetchBuffer() in nodeBitmapHeapscan.c, but the corresponding [ReleaseAnd]ReadBuffer() in heapam.c, and you'd need to get these into the same module and/or to communicate in some graceful way). > > From 42ba0a89260d46230ac0df791fae18bfdca0092f Mon Sep 17 00:00:00 2001 > > From: Thomas Munro <thomas.munro@gmail.com> > > Date: Wed, 18 Mar 2020 16:35:27 +1300 > > Subject: [PATCH 5/5] Prefetch referenced blocks during recovery. > > > > Introduce a new GUC max_wal_prefetch_distance. If it is set to a > > positive number of bytes, then read ahead in the WAL at most that > > distance, and initiate asynchronous reading of referenced blocks. The > > goal is to avoid I/O stalls and benefit from concurrent I/O. The number > > of concurrency asynchronous reads is capped by the existing > > maintenance_io_concurrency GUC. The feature is disabled by default. > > > > Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> > > Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> > > Discussion: > > https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com > > Why is it disabled by default? Just for "risk management"? Well, it's not free, and might not help you, so not everyone would want it on. I think the overheads can be mostly removed with more work in a later release. Perhaps we could commit it enabled by default, and then discuss it before release after looking at some more data? On that basis I have now made it default to on, with max_wal_prefetch_distance = 256kB, if your build has USE_PREFETCH. Obviously this number can be discussed. 
> > + <varlistentry id="guc-max-wal-prefetch-distance" xreflabel="max_wal_prefetch_distance"> > > + <term><varname>max_wal_prefetch_distance</varname> (<type>integer</type>) > > + <indexterm> > > + <primary><varname>max_wal_prefetch_distance</varname> configuration parameter</primary> > > + </indexterm> > > + </term> > > + <listitem> > > + <para> > > + The maximum distance to look ahead in the WAL during recovery, to find > > + blocks to prefetch. Prefetching blocks that will soon be needed can > > + reduce I/O wait times. The number of concurrent prefetches is limited > > + by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>. > > + If this value is specified without units, it is taken as bytes. > > + The default is -1, meaning that WAL prefetching is disabled. > > + </para> > > + </listitem> > > + </varlistentry> > > Is it worth noting that a too large distance could hurt, because the > buffers might get evicted again? OK, I tried to explain that. > > + <varlistentry id="guc-wal-prefetch-fpw" xreflabel="wal_prefetch_fpw"> > > + <term><varname>wal_prefetch_fpw</varname> (<type>boolean</type>) > > + <indexterm> > > + <primary><varname>wal_prefetch_fpw</varname> configuration parameter</primary> > > + </indexterm> > > + </term> > > + <listitem> > > + <para> > > + Whether to prefetch blocks with full page images during recovery. > > + Usually this doesn't help, since such blocks will not be read. However, > > + on file systems with a block size larger than > > + <productname>PostgreSQL</productname>'s, prefetching can avoid a costly > > + read-before-write when a blocks are later written. > > + This setting has no effect unless > > + <xref linkend="guc-max-wal-prefetch-distance"/> is set to a positive number. > > + The default is off. > > + </para> > > + </listitem> > > + </varlistentry> > > Hm. I think this needs more details - it's not clear enough what this > actually controls. I assume it's about prefetching for WAL records that > contain the FPW, but it also could be read to be about not prefetching > any pages that had FPWs before, or such? Ok, I have elaborated. > > </variablelist> > > </sect2> > > <sect2 id="runtime-config-wal-archiving"> > > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml > > index 987580d6df..df4291092b 100644 > > --- a/doc/src/sgml/monitoring.sgml > > +++ b/doc/src/sgml/monitoring.sgml > > @@ -320,6 +320,13 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser > > </entry> > > </row> > > > > + <row> > > + <entry><structname>pg_stat_wal_prefetcher</structname><indexterm><primary>pg_stat_wal_prefetcher</primary></indexterm></entry> > > + <entry>Only one row, showing statistics about blocks prefetched during recovery. > > + See <xref linkend="pg-stat-wal-prefetcher-view"/> for details. > > + </entry> > > + </row> > > + > > 'prefetcher' somehow sounds odd to me. I also suspect that we'll want to > have additional prefetching stat tables going forward. Perhaps > 'pg_stat_prefetch_wal'? Works for me, though while thinking about this I realised that the "WAL" part was bothering me. It sounds like we're prefetching WAL itself, which would be a different thing. So I renamed this view to pg_stat_prefetch_recovery. 
Then I renamed the main GUCs that control this thing to: max_recovery_prefetch_distance recovery_prefetch_fpw > > + <row> > > + <entry><structfield>distance</structfield></entry> > > + <entry><type>integer</type></entry> > > + <entry>How far ahead of recovery the WAL prefetcher is currently reading, in bytes</entry> > > + </row> > > + <row> > > + <entry><structfield>queue_depth</structfield></entry> > > + <entry><type>integer</type></entry> > > + <entry>How many prefetches have been initiated but are not yet known to have completed</entry> > > + </row> > > + </tbody> > > + </tgroup> > > + </table> > > Is there a way we could have a "historical" version of at least some of > these? An average queue depth, or such? Ok, I added simple online averages for distance and queue depth that take a sample every time recovery advances by 256kB. > It'd be useful to somewhere track the time spent initiating prefetch > requests. Otherwise it's quite hard to evaluate whether the queue is too > deep (and just blocks in the OS). I agree that that sounds useful, and I thought about various ways to do that that involved new views, until I eventually found myself wondering: why isn't recovery's I/O already tracked via the existing stats views? For example, why can't I see blks_read, blks_hit, blk_read_time etc moving in pg_stat_database due to recovery activity? I seems like if you made that work first, or created a new view pgstatio view for that, then you could add prefetching counters and timing (if track_io_timing is on) to the existing machinery so that bufmgr.c would automatically capture it, and then not only recovery but also stuff like bitmap heap scan could also be measured the same way. However, time is short, so I'm not attempting to do anything like that now. You can measure the posix_fadvise() times with OS facilities in the meantime. > I think it'd be good to have a 'reset time' column. Done, as stats_reset following other examples. > > + <para> > > + The <structname>pg_stat_wal_prefetcher</structname> view will contain only > > + one row. It is filled with nulls if recovery is not running or WAL > > + prefetching is not enabled. See <xref linkend="guc-max-wal-prefetch-distance"/> > > + for more information. The counters in this view are reset whenever the > > + <xref linkend="guc-max-wal-prefetch-distance"/>, > > + <xref linkend="guc-wal-prefetch-fpw"/> or > > + <xref linkend="guc-maintenance-io-concurrency"/> setting is changed and > > + the server configuration is reloaded. > > + </para> > > + > > So pg_stat_reset_shared() cannot be used? If so, why? Hmm. OK, I made pg_stat_reset_shared('prefetch_recovery') work. > It sounds like the counters aren't persisted via the stats system - if > so, why? Ok, I made it persist the simple counters by sending to the to stats collector periodically. The view still shows data straight out of shmem though, not out of the stats file. Now I'm wondering if I should have the view show it from the stats file, more like other things, now that I understand that a bit better... hmm. > > @@ -7105,6 +7114,31 @@ StartupXLOG(void) > > /* Handle interrupt signals of startup process */ > > HandleStartupProcInterrupts(); > > > > + /* > > + * The first time through, or if any relevant settings or the > > + * WAL source changes, we'll restart the prefetching machinery > > + * as appropriate. This is simpler than trying to handle > > + * various complicated state changes. > > + */ > > + if (unlikely(reset_wal_prefetcher)) > > + { > > + /* If we had one already, destroy it. 
*/ > > + if (prefetcher) > > + { > > + XLogPrefetcherFree(prefetcher); > > + prefetcher = NULL; > > + } > > + /* If we want one, create it. */ > > + if (max_wal_prefetch_distance > 0) > > + prefetcher = XLogPrefetcherAllocate(xlogreader->ReadRecPtr, > > + currentSource == XLOG_FROM_STREAM); > > + reset_wal_prefetcher = false; > > + } > > Do we really need all of this code in StartupXLOG() itself? Could it be > in HandleStartupProcInterrupts() or at least a helper routine called > here? It's now done differently, so that StartupXLOG() only has three new lines: XLogPrefetchBegin() before the loop, XLogPrefetch() in the loop, and XLogPrefetchEnd() after the loop. > > + /* Peform WAL prefetching, if enabled. */ > > + if (prefetcher) > > + XLogPrefetcherReadAhead(prefetcher, xlogreader->ReadRecPtr); > > + > > /* > > * Pause WAL replay, if requested by a hot-standby session via > > * SetRecoveryPause(). > > Personally, I'd rather have the if () be in > XLogPrefetcherReadAhead(). With an inline wrapper doing the check, if > the call bothers you (but I don't think it needs to). Done. > > +/*------------------------------------------------------------------------- > > + * > > + * xlogprefetcher.c > > + * Prefetching support for PostgreSQL write-ahead log manager > > + * > > An architectural overview here would be good. OK, added. > > +struct XLogPrefetcher > > +{ > > + /* Reader and current reading state. */ > > + XLogReaderState *reader; > > + XLogReadLocalOptions options; > > + bool have_record; > > + bool shutdown; > > + int next_block_id; > > + > > + /* Book-keeping required to avoid accessing non-existing blocks. */ > > + HTAB *filter_table; > > + dlist_head filter_queue; > > + > > + /* Book-keeping required to limit concurrent prefetches. */ > > + XLogRecPtr *prefetch_queue; > > + int prefetch_queue_size; > > + int prefetch_head; > > + int prefetch_tail; > > + > > + /* Details of last prefetch to skip repeats and seq scans. */ > > + SMgrRelation last_reln; > > + RelFileNode last_rnode; > > + BlockNumber last_blkno; > > Do you have a comment somewhere explaining why you want to avoid > seqscans (I assume it's about avoiding regressions in linux, but only > because I recall chatting with you about it). I've added a note to the new architectural comments. > > +/* > > + * On modern systems this is really just *counter++. On some older systems > > + * there might be more to it, due to inability to read and write 64 bit values > > + * atomically. The counters will only be written to by one process, and there > > + * is no ordering requirement, so there's no point in using higher overhead > > + * pg_atomic_fetch_add_u64(). > > + */ > > +static inline void inc_counter(pg_atomic_uint64 *counter) > > +{ > > + pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1); > > +} > > Could be worthwhile to add to the atomics infrastructure itself - on the > platforms where this needs spinlocks this will lead to two acquisitions, > rather than one. Ok, I added pg_atomic_unlocked_add_fetch_XXX(). (Could also be "fetch_add", I don't care, I don't use the result). > > +/* > > + * Create a prefetcher that is ready to begin prefetching blocks referenced by > > + * WAL that is ahead of the given lsn. 
> > + */ > > +XLogPrefetcher * > > +XLogPrefetcherAllocate(XLogRecPtr lsn, bool streaming) > > +{ > > + static HASHCTL hash_table_ctl = { > > + .keysize = sizeof(RelFileNode), > > + .entrysize = sizeof(XLogPrefetcherFilter) > > + }; > > + XLogPrefetcher *prefetcher = palloc0(sizeof(*prefetcher)); > > + > > + prefetcher->options.nowait = true; > > + if (streaming) > > + { > > + /* > > + * We're only allowed to read as far as the WAL receiver has written. > > + * We don't have to wait for it to be flushed, though, as recovery > > + * does, so that gives us a chance to get a bit further ahead. > > + */ > > + prefetcher->options.read_upto_policy = XLRO_WALRCV_WRITTEN; > > + } > > + else > > + { > > + /* We're allowed to read as far as we can. */ > > + prefetcher->options.read_upto_policy = XLRO_LSN; > > + prefetcher->options.lsn = (XLogRecPtr) -1; > > + } > > + prefetcher->reader = XLogReaderAllocate(wal_segment_size, > > + NULL, > > + read_local_xlog_page, > > + &prefetcher->options); > > + prefetcher->filter_table = hash_create("PrefetchFilterTable", 1024, > > + &hash_table_ctl, > > + HASH_ELEM | HASH_BLOBS); > > + dlist_init(&prefetcher->filter_queue); > > + > > + /* > > + * The size of the queue is based on the maintenance_io_concurrency > > + * setting. In theory we might have a separate queue for each tablespace, > > + * but it's not clear how that should work, so for now we'll just use the > > + * general GUC to rate-limit all prefetching. > > + */ > > + prefetcher->prefetch_queue_size = maintenance_io_concurrency; > > + prefetcher->prefetch_queue = palloc0(sizeof(XLogRecPtr) * prefetcher->prefetch_queue_size); > > + prefetcher->prefetch_head = prefetcher->prefetch_tail = 0; > > + > > + /* Prepare to read at the given LSN. */ > > + ereport(LOG, > > + (errmsg("WAL prefetch started at %X/%X", > > + (uint32) (lsn << 32), (uint32) lsn))); > > + XLogBeginRead(prefetcher->reader, lsn); > > + > > + XLogPrefetcherResetMonitoringStats(); > > + > > + return prefetcher; > > +} > > + > > +/* > > + * Destroy a prefetcher and release all resources. > > + */ > > +void > > +XLogPrefetcherFree(XLogPrefetcher *prefetcher) > > +{ > > + double avg_distance = 0; > > + double avg_queue_depth = 0; > > + > > + /* Log final statistics. */ > > + if (prefetcher->samples > 0) > > + { > > + avg_distance = prefetcher->distance_sum / prefetcher->samples; > > + avg_queue_depth = prefetcher->queue_depth_sum / prefetcher->samples; > > + } > > + ereport(LOG, > > + (errmsg("WAL prefetch finished at %X/%X; " > > + "prefetch = " UINT64_FORMAT ", " > > + "skip_hit = " UINT64_FORMAT ", " > > + "skip_new = " UINT64_FORMAT ", " > > + "skip_fpw = " UINT64_FORMAT ", " > > + "skip_seq = " UINT64_FORMAT ", " > > + "avg_distance = %f, " > > + "avg_queue_depth = %f", > > + (uint32) (prefetcher->reader->EndRecPtr << 32), > > + (uint32) (prefetcher->reader->EndRecPtr), > > + pg_atomic_read_u64(&MonitoringStats->prefetch), > > + pg_atomic_read_u64(&MonitoringStats->skip_hit), > > + pg_atomic_read_u64(&MonitoringStats->skip_new), > > + pg_atomic_read_u64(&MonitoringStats->skip_fpw), > > + pg_atomic_read_u64(&MonitoringStats->skip_seq), > > + avg_distance, > > + avg_queue_depth))); > > + XLogReaderFree(prefetcher->reader); > > + hash_destroy(prefetcher->filter_table); > > + pfree(prefetcher->prefetch_queue); > > + pfree(prefetcher); > > + > > + XLogPrefetcherResetMonitoringStats(); > > +} > > It's possibly overkill, but I think it'd be a good idea to do all the > allocations within a prefetch specific memory context. 
That makes > detecting potential leaks or such easier. I looked into that, but in fact it's already pretty clear how much memory this thing is using, if you call MemoryContextStats(TopMemoryContext), because it's almost all in a named hash table: TopMemoryContext: 155776 total in 6 blocks; 18552 free (8 chunks); 137224 used XLogPrefetcherFilterTable: 16384 total in 2 blocks; 4520 free (3 chunks); 11864 used SP-GiST temporary context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used GiST temporary context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used GIN recovery temporary context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used Btree recovery temporary context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used RecoveryLockLists: 8192 total in 1 blocks; 2584 free (0 chunks); 5608 used PrivateRefCount: 8192 total in 1 blocks; 2584 free (0 chunks); 5608 used MdSmgr: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used Pending ops context: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used LOCALLOCK hash: 8192 total in 1 blocks; 512 free (0 chunks); 7680 used Timezones: 104128 total in 2 blocks; 2584 free (0 chunks); 101544 used ErrorContext: 8192 total in 1 blocks; 7928 free (4 chunks); 264 used Grand total: 358208 bytes in 20 blocks; 86832 free (15 chunks); 271376 used The XLogPrefetcher struct itself is not measured seperately, but I don't think that's a problem, it's small and there's only ever one at a time. It's that XLogPrefetcherFilterTable that is of variable size (though it's often empty). While thinking about this, I made prefetch_queue into a flexible array rather than a pointer to palloc'd memory, which seemed a bit tidier. > > + /* Can we drop any filters yet, due to problem records begin replayed? */ > > Odd grammar. Rewritten. > > + XLogPrefetcherCompleteFilters(prefetcher, replaying_lsn); > > Hm, why isn't this part of the loop below? It only needs to run when replaying_lsn has advanced (ie when records have been replayed). I hope the new comment makes that clearer. > > + /* Main prefetch loop. */ > > + for (;;) > > + { > > This kind of looks like a separate process' main loop. The name > indicates similar. And there's no architecture documentation > disinclining one from that view... OK, I have updated the comment. > The loop body is quite long. I think it should be split into a number of > helper functions. Perhaps one to ensure a block is read, one to maintain > stats, and then one to process block references? I've broken the function up. It's now: StartupXLOG() -> XLogPrefetch() -> XLogPrefetcherReadAhead() -> XLogPrefetcherScanRecords() -> XLogPrefetcherScanBlocks() > > + /* > > + * Scan the record for block references. We might already have been > > + * partway through processing this record when we hit maximum I/O > > + * concurrency, so start where we left off. > > + */ > > + for (int i = prefetcher->next_block_id; i <= reader->max_block_id; ++i) > > + { > > Super pointless nitpickery: For a loop-body this big I'd rather name 'i' > 'blockid' or such. Done.
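To show how those pieces fit together, the recovery loop now just calls a trivial wrapper along these lines (a paraphrase of the attached patch for illustration, not its exact code):

/*
 * Paraphrased sketch, not the patch's exact code: StartupXLOG() calls this
 * unconditionally for each record, and the "is prefetching active?" test
 * lives here instead of cluttering the redo loop.
 */
static inline void
XLogPrefetch(XLogPrefetcher *prefetcher, XLogRecPtr replaying_lsn)
{
    if (prefetcher != NULL)
        XLogPrefetcherReadAhead(prefetcher, replaying_lsn);
}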
Attachment
- v6-0001-Allow-PrefetchBuffer-to-be-called-with-a-SMgrRela.patch
- v6-0002-Rename-GetWalRcvWriteRecPtr-to-GetWalRcvFlushRecP.patch
- v6-0003-Add-GetWalRcvWriteRecPtr-new-definition.patch
- v6-0004-Add-pg_atomic_unlocked_add_fetch_XXX.patch
- v6-0005-Allow-PrefetchBuffer-to-report-what-happened.patch
- v6-0006-Add-ReadBufferPrefetched-POC-only.patch
- v6-0007-Allow-XLogReadRecord-to-be-non-blocking.patch
- v6-0008-Prefetch-referenced-blocks-during-recovery.patch
On Wed, Apr 8, 2020 at 4:24 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Thanks for all that feedback. It's been a strange couple of weeks,
> but I finally have a new version that addresses most of that feedback
> (but punts on a couple of suggestions for later development, due to
> lack of time).

Here's an executive summary of an off-list chat with Andres:

* he withdrew his objection to the new definition of GetWalRcvWriteRecPtr() based on my argument that any external code will fail to compile anyway

* he doesn't like the naive code that detects sequential access and skips prefetching; I agreed to rip it out for now and revisit if/when we have better evidence that that's worth bothering with; the code path that does that and the pg_stat_recovery_prefetch.skip_seq counter will remain, but be used only to skip prefetching of repeated access to the *same* block for now

* he gave some feedback on the read_local_xlog_page() modifications: I probably need to reconsider the change to logical.c that passes NULL instead of cxt to the read_page callback; and the switch statement in read_local_xlog_page() probably should have a case for the preexisting mode

* he +1s the plan to commit with the feature enabled, and revisit before release

* he thinks the idea of a variant of ReadBuffer() that takes a PrefetchBufferResult (as sketched by the v6 0006 patch) broadly makes sense as a stepping stone towards his asynchronous I/O proposal, but there's no point in committing something like 0006 without a user

I'm going to go and commit the first few patches in this series, and come back in a bit with a new version of the main patch to fix the above and a compiler warning reported by cfbot.
On Wed, Apr 8, 2020 at 12:52 PM Thomas Munro <thomas.munro@gmail.com> wrote: > * he gave some feedback on the read_local_xlog_page() modifications: I > probably need to reconsider the change to logical.c that passes NULL > instead of cxt to the read_page callback; and the switch statement in > read_local_xlog_page() probably should have a case for the preexisting > mode So... logical.c wants to give its LogicalDecodingContext to any XLogPageReadCB you give it, via "private_data"; that is, it really only accepts XLogPageReadCB implementations that understand that (or ignore it). What I want to do is give every XLogPageReadCB the chance to have its own state that it is control of (to receive settings specific to the implementation, or whatever), that you supply along with it. We can't do both kinds of things with private_data, so I have added a second member read_page_data to XLogReaderState. If you pass in read_local_xlog_page as read_page, then you can optionally install a pointer to XLogReadLocalOptions as reader->read_page_data, to activate the new behaviours I added for prefetching purposes. While working on that, I realised the readahead XLogReader was breaking a rule expressed in XLogReadDetermineTimeLine(). Timelines are really confusing and there were probably several subtle or not to subtle bugs there. So I added an option to skip all of that logic, and just say "I command you to read only from TLI X". It reads the same TLI as recovery is reading, until it hits the end of readable data and that causes prefetching to shut down. Then the main recovery loop resets the prefetching module when it sees a TLI switch, so then it starts up again. This seems to work reliably, but I've obviously had limited time to test. Does this scheme sound sane? I think this is basically committable (though of course I wish I had more time to test and review). Ugh. Feature freeze in half an hour.
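For anyone following along at home, the read_page_data arrangement described above amounts to something like the following sketch. It is based on the description in this email and the earlier patch excerpts, not copied from the new patch, so treat names and signatures as approximate.

/*
 * Sketch only: attach per-callback options to the reader via the new
 * read_page_data member, leaving private_data free for other uses.
 */
static XLogReaderState *
allocate_prefetch_reader(XLogReadLocalOptions *options)
{
    XLogReaderState *reader;

    options->nowait = true;             /* report "would block" instead of waiting */
    options->read_upto_policy = XLRO_WALRCV_WRITTEN;    /* stop at WAL receiver write pointer */

    reader = XLogReaderAllocate(wal_segment_size, NULL,
                                read_local_xlog_page,
                                NULL);  /* private_data not needed here */
    reader->read_page_data = options;   /* new: callback-specific state */

    return reader;
}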
Attachment
On Wed, Apr 8, 2020 at 11:27 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Apr 8, 2020 at 12:52 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > * he gave some feedback on the read_local_xlog_page() modifications: I > > probably need to reconsider the change to logical.c that passes NULL > > instead of cxt to the read_page callback; and the switch statement in > > read_local_xlog_page() probably should have a case for the preexisting > > mode > > So... logical.c wants to give its LogicalDecodingContext to any > XLogPageReadCB you give it, via "private_data"; that is, it really > only accepts XLogPageReadCB implementations that understand that (or > ignore it). What I want to do is give every XLogPageReadCB the chance > to have its own state that it is control of (to receive settings > specific to the implementation, or whatever), that you supply along > with it. We can't do both kinds of things with private_data, so I > have added a second member read_page_data to XLogReaderState. If you > pass in read_local_xlog_page as read_page, then you can optionally > install a pointer to XLogReadLocalOptions as reader->read_page_data, > to activate the new behaviours I added for prefetching purposes. > > While working on that, I realised the readahead XLogReader was > breaking a rule expressed in XLogReadDetermineTimeLine(). Timelines > are really confusing and there were probably several subtle or not to > subtle bugs there. So I added an option to skip all of that logic, > and just say "I command you to read only from TLI X". It reads the > same TLI as recovery is reading, until it hits the end of readable > data and that causes prefetching to shut down. Then the main recovery > loop resets the prefetching module when it sees a TLI switch, so then > it starts up again. This seems to work reliably, but I've obviously > had limited time to test. Does this scheme sound sane? > > I think this is basically committable (though of course I wish I had > more time to test and review). Ugh. Feature freeze in half an hour. Ok, so the following parts of this work have been committed: b09ff536: Simplify the effective_io_concurrency setting. fc34b0d9: Introduce a maintenance_io_concurrency setting. 3985b600: Support PrefetchBuffer() in recovery. d140f2f3: Rationalize GetWalRcv{Write,Flush}RecPtr(). However, I didn't want to push the main patch into the tree at (literally) the last minute after doing such much work on it in the last few days, without more review from recovery code experts and some independent testing. Judging by the comments made in this thread and elsewhere, I think the feature is in demand so I hope there is a way we could get it into 13 in the next couple of days, but I totally accept the release management team's prerogative on that.
On 4/8/20 8:12 AM, Thomas Munro wrote: > > Ok, so the following parts of this work have been committed: > > b09ff536: Simplify the effective_io_concurrency setting. > fc34b0d9: Introduce a maintenance_io_concurrency setting. > 3985b600: Support PrefetchBuffer() in recovery. > d140f2f3: Rationalize GetWalRcv{Write,Flush}RecPtr(). > > However, I didn't want to push the main patch into the tree at > (literally) the last minute after doing such much work on it in the > last few days, without more review from recovery code experts and some > independent testing. I definitely think that was the right call. > Judging by the comments made in this thread and > elsewhere, I think the feature is in demand so I hope there is a way > we could get it into 13 in the next couple of days, but I totally > accept the release management team's prerogative on that. That's up to the RMT, of course, but we did already have an extra week. Might be best to just get this in at the beginning of the PG14 cycle. FWIW, I do think the feature is really valuable. Looks like you'll need to rebase, so I'll move this to the next CF in WoA state. Regards, -- -David david@pgmasters.net
On Thu, Apr 9, 2020 at 12:27 AM David Steele <david@pgmasters.net> wrote: > On 4/8/20 8:12 AM, Thomas Munro wrote: > > Judging by the comments made in this thread and > > elsewhere, I think the feature is in demand so I hope there is a way > > we could get it into 13 in the next couple of days, but I totally > > accept the release management team's prerogative on that. > > That's up to the RMT, of course, but we did already have an extra week. > Might be best to just get this in at the beginning of the PG14 cycle. > FWIW, I do think the feature is really valuable. > > Looks like you'll need to rebase, so I'll move this to the next CF in > WoA state. Thanks. Here's a rebase.
Attachment
> On Thu, Apr 09, 2020 at 09:55:25AM +1200, Thomas Munro wrote: > Thanks. Here's a rebase. Thanks for working on this patch, it seems like a great feature. I'm probably a bit late to the party, but still want to make couple of commentaries. The patch indeed looks good, I couldn't find any significant issues so far and almost all my questions I had while reading it were actually answered in this thread. I'm still busy with benchmarking, mostly to see how prefetching would work with different workload distributions and how much the kernel will actually prefetch. In the meantime I have a few questions: > On Wed, Feb 12, 2020 at 07:52:42PM +1300, Thomas Munro wrote: > > On Fri, Jan 3, 2020 at 7:10 AM Tomas Vondra > > <tomas.vondra@2ndquadrant.com> wrote: > > > Could we instead specify the number of blocks to prefetch? We'd probably > > > need to track additional details needed to determine number of blocks to > > > prefetch (essentially LSN for all prefetch requests). > > Here is a new WIP version of the patch set that does that. Changes: > > 1. It now uses effective_io_concurrency to control how many > concurrent prefetches to allow. It's possible that we should have a > different GUC to control "maintenance" users of concurrency I/O as > discussed elsewhere[1], but I'm staying out of that for now; if we > agree to do that for VACUUM etc, we can change it easily here. Note > that the value is percolated through the ComputeIoConcurrency() > function which I think we should discuss, but again that's off topic, > I just want to use the standard infrastructure here. This totally makes sense, I believe the question "how much to prefetch" eventually depends equally on a type of workload (correlates with how far in WAL to read) and how much resources are available for prefetching (correlates with queue depth). But in the documentation it looks like maintenance-io-concurrency is just an "unimportant" option, and I'm almost sure will be overlooked by many readers: The maximum distance to look ahead in the WAL during recovery, to find blocks to prefetch. Prefetching blocks that will soon be needed can reduce I/O wait times. The number of concurrent prefetches is limited by this setting as well as <xref linkend="guc-maintenance-io-concurrency"/>. Setting it too high might be counterproductive, if it means that data falls out of the kernel cache before it is needed. If this value is specified without units, it is taken as bytes. A setting of -1 disables prefetching during recovery. Maybe it makes also sense to emphasize that maintenance-io-concurrency directly affects resource consumption and it's a "primary control"? > On Wed, Mar 18, 2020 at 06:18:44PM +1300, Thomas Munro wrote: > > Here's a new version that changes that part just a bit more, after a > brief chat with Andres about his async I/O plans. It seems clear that > returning an enum isn't very extensible, so I decided to try making > PrefetchBufferResult a struct whose contents can be extended in the > future. In this patch set it's still just used to distinguish 3 cases > (hit, miss, no file), but it's now expressed as a buffer and a flag to > indicate whether I/O was initiated. You could imagine that the second > thing might be replaced by a pointer to an async I/O handle you can > wait on or some other magical thing from the future. I like the idea of extensible PrefetchBufferResult. Just one commentary, if I understand correctly the way how it is being used together with prefetch_queue assumes one IO operation at a time. 
This limits potential extension of the underlying code, e.g. one can't implement some sort of buffering of requests and submitting an iovec to a sycall, then prefetch_queue will no longer correctly represent inflight IO. Also, taking into account that "we don't have any awareness of when I/O really completes", maybe in the future it makes to reconsider having queue in the prefetcher itself and rather ask for this information from the underlying code? > On Wed, Apr 08, 2020 at 04:24:21AM +1200, Thomas Munro wrote: > > Is there a way we could have a "historical" version of at least some of > > these? An average queue depth, or such? > > Ok, I added simple online averages for distance and queue depth that > take a sample every time recovery advances by 256kB. Maybe it was discussed in the past in other threads. But if I understand correctly, this implementation weights all the samples. Since at the moment it depends directly on replaying speed (so a lot of IO involved), couldn't it lead to a single outlier at the beginning skewing this value and make it less useful? Does it make sense to decay old values?
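To make that concrete, here is a small standalone illustration (not from the patch) of the difference between an unweighted running mean, which is how I read the current sampling, and a decayed average; the struct and the alpha factor are made up for the example:

#include <stdint.h>

/*
 * Illustration only: an unweighted mean gives every sample the same weight
 * forever, so an early outlier never fades; an exponential moving average
 * with smoothing factor "alpha" gradually forgets old samples.
 */
typedef struct AvgSample
{
    double      sum;        /* for the unweighted mean */
    uint64_t    samples;
    double      ema;        /* decayed average */
} AvgSample;

static void
avg_add_sample(AvgSample *s, double value, double alpha)
{
    s->sum += value;
    s->samples++;

    if (s->samples == 1)
        s->ema = value;
    else
        s->ema = alpha * value + (1.0 - alpha) * s->ema;
}

/* The unweighted mean, as reported today (as I understand the patch). */
static double
avg_mean(const AvgSample *s)
{
    return (s->samples > 0) ? s->sum / (double) s->samples : 0.0;
}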
On Sun, Apr 19, 2020 at 11:46 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > Thanks for working on this patch, it seems like a great feature. I'm > probably a bit late to the party, but still want to make couple of > commentaries. Hi Dmitry, Thanks for your feedback and your interest in this work! > The patch indeed looks good, I couldn't find any significant issues so > far and almost all my questions I had while reading it were actually > answered in this thread. I'm still busy with benchmarking, mostly to see > how prefetching would work with different workload distributions and how > much the kernel will actually prefetch. Cool. One report I heard recently said that if you get rid of I/O stalls, pread() becomes cheap enough that the much higher frequency lseek() calls I've complained about elsewhere[1] become the main thing recovery is doing, at least on some systems, but I haven't pieced together the conditions required yet. I'd be interested to know if you see that. > In the meantime I have a few questions: > > > 1. It now uses effective_io_concurrency to control how many > > concurrent prefetches to allow. It's possible that we should have a > > different GUC to control "maintenance" users of concurrency I/O as > > discussed elsewhere[1], but I'm staying out of that for now; if we > > agree to do that for VACUUM etc, we can change it easily here. Note > > that the value is percolated through the ComputeIoConcurrency() > > function which I think we should discuss, but again that's off topic, > > I just want to use the standard infrastructure here. > > This totally makes sense, I believe the question "how much to prefetch" > eventually depends equally on a type of workload (correlates with how > far in WAL to read) and how much resources are available for prefetching > (correlates with queue depth). But in the documentation it looks like > maintenance-io-concurrency is just an "unimportant" option, and I'm > almost sure will be overlooked by many readers: > > The maximum distance to look ahead in the WAL during recovery, to find > blocks to prefetch. Prefetching blocks that will soon be needed can > reduce I/O wait times. The number of concurrent prefetches is limited > by this setting as well as > <xref linkend="guc-maintenance-io-concurrency"/>. Setting it too high > might be counterproductive, if it means that data falls out of the > kernel cache before it is needed. If this value is specified without > units, it is taken as bytes. A setting of -1 disables prefetching > during recovery. > > Maybe it makes also sense to emphasize that maintenance-io-concurrency > directly affects resource consumption and it's a "primary control"? You're right. I will add something in the next version to emphasise that. > > On Wed, Mar 18, 2020 at 06:18:44PM +1300, Thomas Munro wrote: > > > > Here's a new version that changes that part just a bit more, after a > > brief chat with Andres about his async I/O plans. It seems clear that > > returning an enum isn't very extensible, so I decided to try making > > PrefetchBufferResult a struct whose contents can be extended in the > > future. In this patch set it's still just used to distinguish 3 cases > > (hit, miss, no file), but it's now expressed as a buffer and a flag to > > indicate whether I/O was initiated. You could imagine that the second > > thing might be replaced by a pointer to an async I/O handle you can > > wait on or some other magical thing from the future. > > I like the idea of extensible PrefetchBufferResult. 
Just one commentary, > if I understand correctly the way how it is being used together with > prefetch_queue assumes one IO operation at a time. This limits potential > extension of the underlying code, e.g. one can't implement some sort of > buffering of requests and submitting an iovec to a sycall, then > prefetch_queue will no longer correctly represent inflight IO. Also, > taking into account that "we don't have any awareness of when I/O really > completes", maybe in the future it makes to reconsider having queue in > the prefetcher itself and rather ask for this information from the > underlying code? Yeah, you're right that it'd be good to be able to do some kind of batching up of these requests to reduce system calls. Of course posix_fadvise() doesn't support that, but clearly in the AIO future[2] it would indeed make sense to buffer up a few of these and then make a single call to io_uring_enter() on Linux[3] or lio_listio() on a hypothetical POSIX AIO implementation[4]. (I'm not sure if there is a thing like that on Windows; at a glance, ReadFileScatter() is asynchronous ("overlapped") but works only on a single handle so it's like a hypothetical POSIX aio_readv(), not like POSIX lio_list()). Perhaps there could be an extra call PrefetchBufferSubmit() that you'd call at appropriate times, but you obviously can't call it too infrequently. As for how to make the prefetch queue a reusable component, rather than having a custom thing like that for each part of our system that wants to support prefetching: that's a really good question. I didn't see how to do it, but maybe I didn't try hard enough. I looked at the three users I'm aware of, namely this patch, a btree prefetching patch I haven't shared yet, and the existing bitmap heap scan code, and they all needed to have their own custom book keeping for this, and I couldn't figure out how to share more infrastructure. In the case of this patch, you currently need to do LSN based book keeping to simulate "completion", and that doesn't make sense for other users. Maybe it'll become clearer when we have support for completion notification? Some related questions are why all these parts of our system that know how to prefetch are allowed to do so independently without any kind of shared accounting, and why we don't give each tablespace (= our model of a device?) its own separate queue. I think it's OK to put these questions off a bit longer until we have more infrastructure and experience. Our current non-answer is at least consistent with our lack of an approach to system-wide memory and CPU accounting... I personally think that a better XLogReader that can be used for prefetching AND recovery would be a higher priority than that. > > On Wed, Apr 08, 2020 at 04:24:21AM +1200, Thomas Munro wrote: > > > Is there a way we could have a "historical" version of at least some of > > > these? An average queue depth, or such? > > > > Ok, I added simple online averages for distance and queue depth that > > take a sample every time recovery advances by 256kB. > > Maybe it was discussed in the past in other threads. But if I understand > correctly, this implementation weights all the samples. Since at the > moment it depends directly on replaying speed (so a lot of IO involved), > couldn't it lead to a single outlier at the beginning skewing this value > and make it less useful? Does it make sense to decay old values? Hmm. 
I wondered about reporting one or perhaps three exponential moving averages (like Unix 1/5/15 minute load averages), but I didn't propose it because: (1) In crash recovery, you can't query it, you just get the log message at the end, and an unweighted mean seems OK in that case, no? (you are not more interested in the I/O saturation at the end of the recovery compared to the start of recovery, are you?), and (2) on a streaming replica, if you want to sample the instantaneous depth and compute an exponential moving average or some more exotic statistical concoction in your monitoring tool, you're free to do so.

I suppose (2) is an argument for removing the existing average completely from the stat view; I put it in there at Andres's suggestion, but I'm not sure I really believe in it. Where is our average replication lag, and why don't we compute the stddev of X, Y or Z? I think we should provide primary measurements and let people compute derived statistics from those.

I suppose the reason for this request was the analogy with Linux iostat -x's "aqu-sz", which is the primary way that people understand device queue depth on that OS. This number is actually computed by iostat, not the kernel, so by analogy I could argue that a hypothetical pg_iostat program should compute that for you from raw ingredients. AFAIK iostat computes the *unweighted* average queue depth during the time between output lines, by observing changes in the "aveq" ("the sum of how long all requests have spent in flight, in milliseconds") and "use" ("how many milliseconds there has been at least one IO in flight") fields of /proc/diskstats. But it's OK that it's unweighted, because it computes a new value for every line it outputs (ie every 5 seconds or whatever you asked for). It's not too clear how to do something like that here, but all suggestions are welcome. Or maybe we'll have something more general that makes this more specific thing irrelevant, in future AIO infrastructure work.

On a more superficial note, one thing I don't like about the last version of the patch is the difference in the ordering of the words in the GUC recovery_prefetch_distance and the view pg_stat_prefetch_recovery. Hrmph.

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2BNPZeEdLXAcNr%2Bw0YOZVb0Un0_MwTBpgmmVDh7No2jbg%40mail.gmail.com
[2] https://anarazel.de/talks/2020-01-31-fosdem-aio/aio.pdf
[3] https://kernel.dk/io_uring.pdf
[4] https://pubs.opengroup.org/onlinepubs/009695399/functions/lio_listio.html
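To make the PrefetchBufferSubmit() idea mentioned above a bit more concrete, here is a minimal standalone sketch of what such a batching layer could look like. Everything in it (the queue, the names, the batch size) is hypothetical rather than the patch's actual code; with posix_fadvise() the flush still degenerates into one syscall per block, and only a future AIO path could turn the same array into a single submission (e.g. one io_uring_enter() covering many entries, or one lio_listio() call).

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <sys/types.h>

/* One queued hint: "I will soon read 'length' bytes at 'offset' of 'fd'". */
typedef struct PrefetchRequest
{
    int   fd;
    off_t offset;
    off_t length;
} PrefetchRequest;

#define PREFETCH_BATCH_SIZE 16

static PrefetchRequest pending[PREFETCH_BATCH_SIZE];
static int npending = 0;

/*
 * Hand all queued hints to the kernel.  With posix_fadvise() this is still
 * one syscall per hint; an AIO implementation could submit the whole array
 * with a single io_uring_enter() or lio_listio() call instead.
 */
static void
PrefetchBufferSubmit(void)
{
    for (int i = 0; i < npending; i++)
        (void) posix_fadvise(pending[i].fd, pending[i].offset,
                             pending[i].length, POSIX_FADV_WILLNEED);
    npending = 0;
}

/* Queue one hint; flush automatically when the batch fills up. */
static void
PrefetchBufferQueue(int fd, off_t offset, off_t length)
{
    pending[npending].fd = fd;
    pending[npending].offset = offset;
    pending[npending].length = length;
    if (++npending == PREFETCH_BATCH_SIZE)
        PrefetchBufferSubmit();
}

The interesting design question is when to call PrefetchBufferSubmit(): flush too rarely and the advice arrives after the recovery loop already needs the block, which is the "can't call it too infrequently" constraint mentioned above.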
> On Tue, Apr 21, 2020 at 05:26:52PM +1200, Thomas Munro wrote: > > One report I heard recently said that if you get rid of I/O stalls, > pread() becomes cheap enough that the much higher frequency lseek() > calls I've complained about elsewhere[1] become the main thing > recovery is doing, at least on some systems, but I haven't pieced > together the conditions required yet. I'd be interested to know if > you see that.

At the moment I've performed a couple of tests for replication in the case when almost everything is in memory (mostly by mistake: I was expecting that a postgres replica inside a badly memory-limited cgroup would cause more IO, but it looks like the kernel does not evict pages anyway). Not sure if that's what you mean by getting rid of IO stalls, but in these tests profiling shows lseek & pread appearing in a similar number of samples.

If I understand correctly, eventually one can measure the influence of prefetching by looking at redo function execution times (assuming that the data they operate on is already prefetched, they should be faster). I still have to clarify the exact reason, but even in the situation described above (in memory) there is some visible difference, e.g.

# with prefetch
Function = b'heap2_redo' [8064]
     nsecs               : count     distribution
      4096 -> 8191       : 1213     |                                        |
      8192 -> 16383      : 66639    |****************************************|
     16384 -> 32767      : 27846    |****************                        |
     32768 -> 65535      : 873      |                                        |

# without prefetch
Function = b'heap2_redo' [17980]
     nsecs               : count     distribution
      4096 -> 8191       : 1        |                                        |
      8192 -> 16383      : 66997    |****************************************|
     16384 -> 32767      : 30966    |******************                      |
     32768 -> 65535      : 1602     |                                        |

# with prefetch
Function = b'btree_redo' [8064]
     nsecs               : count     distribution
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 246      |****************************************|
      8192 -> 16383      : 5        |                                        |
     16384 -> 32767      : 2        |                                        |

# without prefetch
Function = b'btree_redo' [17980]
     nsecs               : count     distribution
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 82       |********************                    |
      8192 -> 16383      : 19       |****                                    |
     16384 -> 32767      : 160      |****************************************|

Of course it doesn't take into account the time we spend doing extra syscalls for prefetching, but it can still give some interesting information.

> > I like the idea of extensible PrefetchBufferResult. Just one commentary, > > if I understand correctly the way how it is being used together with > > prefetch_queue assumes one IO operation at a time. This limits potential > > extension of the underlying code, e.g. one can't implement some sort of > > buffering of requests and submitting an iovec to a sycall, then > > prefetch_queue will no longer correctly represent inflight IO. Also, > > taking into account that "we don't have any awareness of when I/O really > > completes", maybe in the future it makes to reconsider having queue in > > the prefetcher itself and rather ask for this information from the > > underlying code? > > Yeah, you're right that it'd be good to be able to do some kind of > batching up of these requests to reduce system calls. Of course > posix_fadvise() doesn't support that, but clearly in the AIO future[2] > it would indeed make sense to buffer up a few of these and then make a > single call to io_uring_enter() on Linux[3] or lio_listio() on a > hypothetical POSIX AIO implementation[4]. 
(I'm not sure if there is a > thing like that on Windows; at a glance, ReadFileScatter() is > asynchronous ("overlapped") but works only on a single handle so it's > like a hypothetical POSIX aio_readv(), not like POSIX lio_list()). > > Perhaps there could be an extra call PrefetchBufferSubmit() that you'd > call at appropriate times, but you obviously can't call it too > infrequently. > > As for how to make the prefetch queue a reusable component, rather > than having a custom thing like that for each part of our system that > wants to support prefetching: that's a really good question. I didn't > see how to do it, but maybe I didn't try hard enough. I looked at the > three users I'm aware of, namely this patch, a btree prefetching patch > I haven't shared yet, and the existing bitmap heap scan code, and they > all needed to have their own custom book keeping for this, and I > couldn't figure out how to share more infrastructure. In the case of > this patch, you currently need to do LSN based book keeping to > simulate "completion", and that doesn't make sense for other users. > Maybe it'll become clearer when we have support for completion > notification? Yes, definitely. > Some related questions are why all these parts of our system that know > how to prefetch are allowed to do so independently without any kind of > shared accounting, and why we don't give each tablespace (= our model > of a device?) its own separate queue. I think it's OK to put these > questions off a bit longer until we have more infrastructure and > experience. Our current non-answer is at least consistent with our > lack of an approach to system-wide memory and CPU accounting... I > personally think that a better XLogReader that can be used for > prefetching AND recovery would be a higher priority than that. Sure, this patch is quite valuable as it is, and those questions I've mentioned are targeting mostly future development. > > Maybe it was discussed in the past in other threads. But if I understand > > correctly, this implementation weights all the samples. Since at the > > moment it depends directly on replaying speed (so a lot of IO involved), > > couldn't it lead to a single outlier at the beginning skewing this value > > and make it less useful? Does it make sense to decay old values? > > Hmm. > > I wondered about a reporting one or perhaps three exponential moving > averages (like Unix 1/5/15 minute load averages), but I didn't propose > it because: (1) In crash recovery, you can't query it, you just get > the log message at the end, and mean unweighted seems OK in that case, > no? (you are not more interested in the I/O saturation at the end of > the recovery compared to the start of recovery are you?), and (2) on a > streaming replica, if you want to sample the instantaneous depth and > compute an exponential moving average or some more exotic statistical > concoction in your monitoring tool, you're free to do so. I suppose > (2) is an argument for removing the existing average completely from > the stat view; I put it in there at Andres's suggestion, but I'm not > sure I really believe in it. Where is our average replication lag, > and why don't we compute the stddev of X, Y or Z? I think we should > provide primary measurements and let people compute derived statistics > from those. For once I disagree, since I believe this very approach, widely applied, leads to a slightly chaotic situation with monitoring. But of course you're right, it has nothing to do with the patch itself. 
I also would be in favour of removing the existing averages, unless Andres has more arguments to keep it.
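As an aside on the aqu-sz arithmetic described a couple of messages up, the same computation can be done outside the server in a few lines of C: sample a device's row in /proc/diskstats twice and divide the change in the weighted time-in-flight counter by the interval. This little program is purely illustrative and assumes the classic 14-field diskstats layout (newer kernels append extra fields, which are simply ignored here).

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Fetch the "use" (ms spent doing I/O) and "aveq" (weighted ms spent doing
 * I/O) counters for one block device from /proc/diskstats. */
static int
read_diskstats(const char *dev, uint64_t *use_ms, uint64_t *aveq_ms)
{
    FILE   *f = fopen("/proc/diskstats", "r");
    char    line[1024];
    int     found = 0;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL)
    {
        char        name[64];
        uint64_t    v[11];

        /* major minor name, then 11 counters; "use" is the 10th counter
         * and "aveq" the 11th */
        if (sscanf(line, "%*u %*u %63s"
                   " %" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64
                   " %" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64
                   " %" SCNu64 " %" SCNu64 " %" SCNu64,
                   name, &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
                   &v[6], &v[7], &v[8], &v[9], &v[10]) == 12 &&
            strcmp(name, dev) == 0)
        {
            *use_ms = v[9];
            *aveq_ms = v[10];
            found = 1;
            break;
        }
    }
    fclose(f);
    return found ? 0 : -1;
}

int
main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "sda";
    unsigned    interval = argc > 2 ? (unsigned) atoi(argv[2]) : 5;
    uint64_t    use0, aveq0, use1, aveq1;

    if (read_diskstats(dev, &use0, &aveq0) != 0)
        return 1;
    sleep(interval);
    if (read_diskstats(dev, &use1, &aveq1) != 0)
        return 1;

    /* iostat-style average queue depth over the interval, plus the
     * fraction of the interval with at least one I/O in flight */
    printf("aqu-sz ~ %.2f, util ~ %.1f%%\n",
           (double) (aveq1 - aveq0) / (interval * 1000.0),
           100.0 * (double) (use1 - use0) / (interval * 1000.0));
    return 0;
}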
> On Sat, Apr 25, 2020 at 09:19:35PM +0200, Dmitry Dolgov wrote: > > On Tue, Apr 21, 2020 at 05:26:52PM +1200, Thomas Munro wrote: > > > > One report I heard recently said that if you get rid of I/O stalls, > > pread() becomes cheap enough that the much higher frequency lseek() > > calls I've complained about elsewhere[1] become the main thing > > recovery is doing, at least on some systems, but I haven't pieced > > together the conditions required yet. I'd be interested to know if > > you see that. > > At the moment I've performed couple of tests for the replication in case > when almost everything is in memory (mostly by mistake, I was expecting > that a postgres replica within a badly memory limited cgroup will cause > more IO, but looks like kernel do not evict pages anyway). Not sure if > that's what you mean by getting rid of IO stalls, but in these tests > profiling shows lseek & pread appear in similar amount of samples. > > If I understand correctly, eventually one can measure prefetching > influence by looking at different redo function execution time (assuming > that data they operate with is already prefetched they should be > faster). I still have to clarify what is the exact reason, but even in > the situation described above (in memory) there is some visible > difference, e.g.

I've finally performed a couple of tests involving more IO: a not-that-big dataset of 1.5 GB for the replica, with memory allowing ~1/6 of it to fit, default prefetching parameters and an update workload with uniform distribution. Rather a small setup, but it causes stable reading into the page cache on the replica and makes a visible influence of the patch observable (more measurement samples tend to happen at lower latencies):

# with patch
Function = b'heap_redo' [206]
     nsecs               : count     distribution
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 32833    |**********************                  |
      4096 -> 8191       : 59476    |****************************************|
      8192 -> 16383      : 18617    |************                            |
     16384 -> 32767      : 3992     |**                                      |
     32768 -> 65535      : 425      |                                        |
     65536 -> 131071     : 5        |                                        |
    131072 -> 262143     : 326      |                                        |
    262144 -> 524287     : 6        |                                        |

# without patch
Function = b'heap_redo' [130]
     nsecs               : count     distribution
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 20062    |***********                             |
      4096 -> 8191       : 70662    |****************************************|
      8192 -> 16383      : 12895    |*******                                 |
     16384 -> 32767      : 9123     |*****                                   |
     32768 -> 65535      : 560      |                                        |
     65536 -> 131071     : 1        |                                        |
    131072 -> 262143     : 460      |                                        |
    262144 -> 524287     : 3        |                                        |

Not that there were any doubts, but at the same time it was surprising to me how well linux readahead works in this situation. 
The results above are shown with disabled readahead for filesystem and device, and without that there was almost no difference, since a lot of IO was avoided by readahead (which was in fact the majority of all reads):

# with patch
flags = Read
     usecs               : count     distribution
        16 -> 31         : 0        |                                        |
        32 -> 63         : 1        |********                                |
        64 -> 127        : 5        |****************************************|

flags = ReadAhead-Read
     usecs               : count     distribution
        32 -> 63         : 0        |                                        |
        64 -> 127        : 131      |****************************************|
       128 -> 255        : 12       |***                                     |
       256 -> 511        : 6        |*                                       |

# without patch
flags = Read
     usecs               : count     distribution
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 4        |****************************************|

flags = ReadAhead-Read
     usecs               : count     distribution
        32 -> 63         : 0        |                                        |
        64 -> 127        : 143      |****************************************|
       128 -> 255        : 20       |*****                                   |

Numbers of reads in this case were similar with and without patch, which means it couldn't be attributed to the situation when a page was read too early, then evicted and read again later.
On Sun, May 3, 2020 at 3:12 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > I've finally performed couple of tests involving more IO. The > not-that-big dataset of 1.5 GB for the replica with the memory allowing > fitting ~ 1/6 of it, default prefetching parameters and an update > workload with uniform distribution. Rather a small setup, but causes > stable reading into the page cache on the replica and allows to see a > visible influence of the patch (more measurement samples tend to happen > at lower latencies): Thanks for these tests Dmitry. You didn't mention the details of the workload, but one thing I'd recommend for a uniform/random workload that's generating a lot of misses on the primary server using N backends is to make sure that maintenance_io_concurrency is set to a number like N*2 or higher, and to look at the queue depth on both systems with iostat -x 1. Then you can experiment with ALTER SYSTEM SET maintenance_io_concurrency = X; SELECT pg_reload_conf(); to try to understand the way it works; there is a point where you've set it high enough and the replica is able to handle the same rate of concurrent I/Os as the primary. The default of 10 is actually pretty low unless you've only got ~4 backends generating random updates on the primary. That's with full_page_writes=off; if you leave it on, it takes a while to get into a scenario where it has much effect. Here's a rebase, after the recent XLogReader refactoring.
Attachment
Thomas Munro wrote:

> @@ -1094,8 +1103,16 @@ WALRead(XLogReaderState *state,
>  		XLByteToSeg(recptr, nextSegNo, state->segcxt.ws_segsize);
>  		state->routine.segment_open(state, nextSegNo, &tli);
>
> -		/* This shouldn't happen -- indicates a bug in segment_open */
> -		Assert(state->seg.ws_file >= 0);
> +		/* callback reported that there was no such file */
> +		if (state->seg.ws_file < 0)
> +		{
> +			errinfo->wre_errno = errno;
> +			errinfo->wre_req = 0;
> +			errinfo->wre_read = 0;
> +			errinfo->wre_off = startoff;
> +			errinfo->wre_seg = state->seg;
> +			return false;
> +		}

Ah, this is what Michael was saying ... we need to fix WALRead so that it doesn't depend on segment_open always returning a good FD. This needs a fix everywhere, not just here, and an improved error report interface. Maybe it does make sense to get it fixed in pg13 and avoid a break later.

-- 
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

I've spent some time testing this, mostly from the performance point of view. I've done a very simple thing, in order to have a reproducible test:

1) I've initialized pgbench with scale 8000 (so ~120GB on a machine with only 64GB of RAM)
2) created a physical backup, enabled WAL archiving
3) did a 1h pgbench run with 32 clients
4) disabled full-page writes and did another 1h pgbench run

Once I had this, I did a recovery using the physical backup and WAL archive, measuring how long it took to apply each WAL segment. First without any prefetching (current master), then twice with prefetching: first with default values (m_io_c=10, distance=256kB) and then with higher values (100 + 2MB).

I did this on two storage systems I have in the system - NVME SSD and SATA RAID (3 x 7.2k drives). So, a fast one and a slow one.

1) NVME

On the NVME, this generates ~26k WAL segments (~400GB), and each of the pgbench runs generates ~120M transactions (~33k tps). Of course, the vast majority of the WAL segments (~16k) comes from the first run, because there's a lot of FPI due to the random nature of the workload.

I did not expect a significant improvement from the prefetching, as the NVME is pretty good at handling random I/O. The total duration looks like this:

   no prefetch    prefetch    prefetch2
         10618       10385         9403

So the default is a tiny bit faster, and the more aggressive config makes it about 10% faster. Not bad, considering the expectations.

Attached is a chart comparing the three runs. There are three clearly visible parts - first the 1h run with f_p_w=on, with two checkpoints. That's the first ~16k segments. Then there's a bit of a gap before the second pgbench run was started - I think it's mostly autovacuum etc. And then at segment ~23k the second pgbench (f_p_w=off) starts. I think this shows the prefetching starts to help as the number of FPIs decreases. It's subtle, but it's there.

2) SATA

On SATA it's just ~550 segments (~8.5GB), and the pgbench runs generate only about 1M transactions. Again, the vast majority of the segments comes from the first run, due to FPI. In this case, I don't have complete results, but after processing 542 segments (out of the ~550) it looks like this:

   no prefetch    prefetch    prefetch2
          6644        6635         8282

So the no prefetch and "default" prefetch are roughly on par, but the "aggressive" prefetch is way slower. I'll get back to this shortly, but I'd like to point out this is entirely due to the "no FPI" pgbench, because after the first ~525 initial segments it looks like this:

   no prefetch    prefetch    prefetch2
            58          65           57

So it goes very fast through the initial segments with plenty of FPIs, and then we get to the "no FPI" segments and the prefetch either does not help or makes it slower. Looking at how long it takes to apply the last few segments, it looks like this:

   no prefetch    prefetch    prefetch2
           280         298          478

which is not particularly great, I guess. There however seems to be something wrong, because with the prefetching I see this in the log:

prefetch:
2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG: recovery no longer prefetching: unexpected pageaddr 108/E8000000 in log segment 0000000100000108000000FF, offset 0

prefetch2:
2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG: recovery no longer prefetching: unexpected pageaddr 108/E8000000 in log segment 000000010000010900000001, offset 0

Which seems pretty suspicious, but I have no idea what's wrong. 
I admit the archive/restore commands are a bit hacky, but I've only seen this with prefetching on the SATA storage, while all other cases seem to be just fine. I haven't seen it on NVME (which processes much more WAL). And the SATA baseline (no prefetching) also worked fine. Moreover, the pageaddr value is the same in both cases, but the WAL segments are different (but just one segment apart). Seems strange. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Fri, Jun 05, 2020 at 05:20:52PM +0200, Tomas Vondra wrote: > > ... > >which is not particularly great, I guess. There however seems to be >something wrong, because with the prefetching I see this in the log: > >prefetch: >2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG: recovery no >longer prefetching: unexpected pageaddr 108/E8000000 in log segment >0000000100000108000000FF, offset 0 > >prefetch2: >2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG: recovery no >longer prefetching: unexpected pageaddr 108/E8000000 in log segment >000000010000010900000001, offset 0 > >Which seems pretty suspicious, but I have no idea what's wrong. I admit >the archive/restore commands are a bit hacky, but I've only seen this >with prefetching on the SATA storage, while all other cases seem to be >just fine. I haven't seen in on NVME (which processes much more WAL). >And the SATA baseline (no prefetching) also worked fine. > >Moreover, the pageaddr value is the same in both cases, but the WAL >segments are different (but just one segment apart). Seems strange. > I suspected it might be due to a somewhat hackish restore_command that prefetches some of the WAL segments, so I tried again with a much simpler restore_command - essentially just: restore_command = 'cp /archive/%f %p.tmp && mv %p.tmp %p' which I think should be fine for testing purposes. And I got this: LOG: recovery no longer prefetching: unexpected pageaddr 108/57000000 in log segment 0000000100000108000000FF, offset 0 LOG: restored log file "0000000100000108000000FF" from archive which is the same segment as in the earlier examples, but with a different pageaddr value. Of course, there's no such pageaddr in the WAL segment (and recovery of that segment succeeds). So I think there's something broken ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 05, 2020 at 10:04:14PM +0200, Tomas Vondra wrote: >On Fri, Jun 05, 2020 at 05:20:52PM +0200, Tomas Vondra wrote: >> >>... >> >>which is not particularly great, I guess. There however seems to be >>something wrong, because with the prefetching I see this in the log: >> >>prefetch: >>2020-06-05 02:47:25.970 CEST 1591318045.970 [22961] LOG: recovery no >>longer prefetching: unexpected pageaddr 108/E8000000 in log segment >>0000000100000108000000FF, offset 0 >> >>prefetch2: >>2020-06-05 15:29:23.895 CEST 1591363763.895 [26676] LOG: recovery no >>longer prefetching: unexpected pageaddr 108/E8000000 in log segment >>000000010000010900000001, offset 0 >> >>Which seems pretty suspicious, but I have no idea what's wrong. I admit >>the archive/restore commands are a bit hacky, but I've only seen this >>with prefetching on the SATA storage, while all other cases seem to be >>just fine. I haven't seen in on NVME (which processes much more WAL). >>And the SATA baseline (no prefetching) also worked fine. >> >>Moreover, the pageaddr value is the same in both cases, but the WAL >>segments are different (but just one segment apart). Seems strange. >> > >I suspected it might be due to a somewhat hackish restore_command that >prefetches some of the WAL segments, so I tried again with a much >simpler restore_command - essentially just: > > restore_command = 'cp /archive/%f %p.tmp && mv %p.tmp %p' > >which I think should be fine for testing purposes. And I got this: > > LOG: recovery no longer prefetching: unexpected pageaddr 108/57000000 > in log segment 0000000100000108000000FF, offset 0 > LOG: restored log file "0000000100000108000000FF" from archive > >which is the same segment as in the earlier examples, but with a >different pageaddr value. Of course, there's no such pageaddr in the WAL >segment (and recovery of that segment succeeds). > >So I think there's something broken ... > BTW in all three cases it happens right after the first restart point in the WAL stream: LOG: restored log file "0000000100000108000000FD" from archive LOG: restartpoint starting: time LOG: restored log file "0000000100000108000000FE" from archive LOG: restartpoint complete: wrote 236092 buffers (22.5%); 0 WAL ... LOG: recovery restart point at 108/FC000028 DETAIL: Last completed transaction was at log time 2020-06-04 15:27:00.95139+02. LOG: recovery no longer prefetching: unexpected pageaddr 108/57000000 in log segment 0000000100000108000000FF, offset 0 LOG: restored log file "0000000100000108000000FF" from archive It looks exactly like this in case of all 3 failures ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
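For readers trying to place the "unexpected pageaddr" message: it is produced by the WAL page header validation in PostgreSQL's XLogReader. The sketch below is a heavy paraphrase of that check (most header fields and all error plumbing omitted); a page whose header still carries an address from a recycled segment's previous life, or any page that is not the one the reader asked for, fails it and is treated as the end of valid WAL.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Abbreviated page header; the real XLogPageHeaderData has more fields. */
typedef struct XLogPageHeaderData
{
    uint16_t    xlp_magic;
    uint16_t    xlp_info;
    uint32_t    xlp_tli;
    XLogRecPtr  xlp_pageaddr;   /* WAL address this page claims to hold */
} XLogPageHeaderData;

/*
 * Paraphrase of the pageaddr test in XLogReaderValidatePageHeader(): the
 * caller knows which WAL address it asked for, and the page header must
 * agree; otherwise we are looking at stale or foreign data, for example
 * the old contents of a recycled segment.
 */
static bool
page_addr_matches(XLogRecPtr expected_pageaddr,
                  const XLogPageHeaderData *hdr)
{
    if (hdr->xlp_pageaddr != expected_pageaddr)
    {
        /* would report: "unexpected pageaddr %X/%X in log segment %s,
         * offset %u" and stop treating the data as valid WAL */
        return false;
    }
    return true;
}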
On Sat, Jun 6, 2020 at 8:41 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > BTW in all three cases it happens right after the first restart point in > the WAL stream: > > LOG: restored log file "0000000100000108000000FD" from archive > LOG: restartpoint starting: time > LOG: restored log file "0000000100000108000000FE" from archive > LOG: restartpoint complete: wrote 236092 buffers (22.5%); 0 WAL ... > LOG: recovery restart point at 108/FC000028 > DETAIL: Last completed transaction was at log time 2020-06-04 > 15:27:00.95139+02. > LOG: recovery no longer prefetching: unexpected pageaddr > 108/57000000 in log segment 0000000100000108000000FF, offset 0 > LOG: restored log file "0000000100000108000000FF" from archive > > It looks exactly like this in case of all 3 failures ... Huh. Thanks! I'll try to reproduce this here.
Hi, I wonder if we can collect some stats to measure how effective the prefetching actually is. Ultimately we want something like cache hit ratio, but we're only preloading into page cache, so we can't easily measure that. Perhaps we could measure I/O timings in redo, though? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
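A minimal sketch of the kind of measurement suggested here, kept deliberately generic: bracket whatever call may block on a read during redo with a monotonic clock and accumulate the waits. The wrapper and counters are hypothetical; inside the server one would presumably use the existing instr_time machinery and expose the totals through a statistics view instead.

#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical accumulators for read waits during redo. */
static uint64_t redo_read_wait_ns;  /* total nanoseconds spent waiting */
static uint64_t redo_read_calls;    /* number of timed reads */

static uint64_t
clock_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000u + (uint64_t) ts.tv_nsec;
}

/*
 * Wrap the call that may block on I/O (for recovery, the buffer read a
 * redo routine triggers) and accumulate how long it took.  Reads that hit
 * the page cache show up as near-zero waits, which is exactly the signal
 * needed to judge how well prefetching is working.
 */
static void
timed_redo_read(void (*do_read)(void *), void *arg)
{
    uint64_t    start = clock_ns();

    do_read(arg);

    redo_read_wait_ns += clock_ns() - start;
    redo_read_calls++;
}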
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > I wonder if we can collect some stats to measure how effective the > prefetching actually is. Ultimately we want something like cache hit > ratio, but we're only preloading into page cache, so we can't easily > measure that. Perhaps we could measure I/O timings in redo, though? That would certainly be interesting, particularly as this optimization seems likely to be useful on some platforms (eg, zfs, where the filesystem block size is larger than ours..) and less on others (traditional systems which have a smaller block size). Thanks, Stephen
Attachment
On Sat, Jun 6, 2020 at 12:36 PM Stephen Frost <sfrost@snowman.net> wrote: > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > > I wonder if we can collect some stats to measure how effective the > > prefetching actually is. Ultimately we want something like cache hit > > ratio, but we're only preloading into page cache, so we can't easily > > measure that. Perhaps we could measure I/O timings in redo, though? > > That would certainly be interesting, particularly as this optimization > seems likely to be useful on some platforms (eg, zfs, where the > filesystem block size is larger than ours..) and less on others > (traditional systems which have a smaller block size). I know one way to get information about cache hit ratios without the page cache fuzz factor: if you combine this patch with Andres's still-in-development AIO prototype and tell it to use direct IO, you get the undiluted truth about hits and misses by looking at the "prefetch" and "skip_hit" columns of the stats view. I'm hoping to have a bit more to say about how this patch works as a client of that new magic soon, but I also don't want to make this dependent on that (it's mostly orthogonal, apart from the "how deep is the queue" part which will improve with better information). FYI I am still trying to reproduce and understand the problem Tomas reported; more soon.
On Thu, Jul 02, 2020 at 03:09:29PM +1200, Thomas Munro wrote: >On Sat, Jun 6, 2020 at 12:36 PM Stephen Frost <sfrost@snowman.net> wrote: >> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >> > I wonder if we can collect some stats to measure how effective the >> > prefetching actually is. Ultimately we want something like cache hit >> > ratio, but we're only preloading into page cache, so we can't easily >> > measure that. Perhaps we could measure I/O timings in redo, though? >> >> That would certainly be interesting, particularly as this optimization >> seems likely to be useful on some platforms (eg, zfs, where the >> filesystem block size is larger than ours..) and less on others >> (traditional systems which have a smaller block size). > >I know one way to get information about cache hit ratios without the >page cache fuzz factor: if you combine this patch with Andres's >still-in-development AIO prototype and tell it to use direct IO, you >get the undiluted truth about hits and misses by looking at the >"prefetch" and "skip_hit" columns of the stats view. I'm hoping to >have a bit more to say about how this patch works as a client of that >new magic soon, but I also don't want to make this dependent on that >(it's mostly orthogonal, apart from the "how deep is the queue" part >which will improve with better information). > >FYI I am still trying to reproduce and understand the problem Tomas >reported; more soon. Any luck trying to reproduce thigs? Should I try again and collect some additional debug info? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Thu, Jul 02, 2020 at 03:09:29PM +1200, Thomas Munro wrote: > >FYI I am still trying to reproduce and understand the problem Tomas > >reported; more soon. > > Any luck trying to reproduce thigs? Should I try again and collect some > additional debug info? No luck. I'm working on it now, and also trying to reduce the overheads so that we're not doing extra work when it doesn't help. By the way, I also looked into recovery I/O stalls *other* than relation buffer cache misses, and created https://commitfest.postgresql.org/29/2669/ to fix what I found. If you avoid both kinds of stalls then crash recovery is finally CPU bound (to go faster after that we'll need parallel replay).
On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote: >On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> On Thu, Jul 02, 2020 at 03:09:29PM +1200, Thomas Munro wrote: >> >FYI I am still trying to reproduce and understand the problem Tomas >> >reported; more soon. >> >> Any luck trying to reproduce thigs? Should I try again and collect some >> additional debug info? > >No luck. I'm working on it now, and also trying to reduce the >overheads so that we're not doing extra work when it doesn't help. > OK, I'll see if I can still reproduce it. >By the way, I also looked into recovery I/O stalls *other* than >relation buffer cache misses, and created >https://commitfest.postgresql.org/29/2669/ to fix what I found. If >you avoid both kinds of stalls then crash recovery is finally CPU >bound (to go faster after that we'll need parallel replay). Yeah, I noticed. I'll take a look and do some testing in the next CF. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 6, 2020 at 10:47 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote: > >On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra > >> Any luck trying to reproduce thigs? Should I try again and collect some > >> additional debug info? > > > >No luck. I'm working on it now, and also trying to reduce the > >overheads so that we're not doing extra work when it doesn't help. > > OK, I'll see if I can still reproduce it. Since someone else ask me off-list, here's a rebase, with no functional changes. Soon I'll post a new improved version, but this version just fixes the bitrot and hopefully turns cfbot green.
Attachment
I have run some benchmarks for this patch. Overall it seems that there is a good improvement with the patch on recovery times:

The VMs I used have 32GB RAM, pgbench is initialized with a scale factor 3000 (so it doesn’t fit in memory, ~45GB).

In order to avoid checkpoints during the benchmark, max_wal_size (200GB) and checkpoint_timeout (200 mins) are set to high values.

The run is cancelled when there is a reasonable amount of WAL (> 25GB). The recovery times are measured from the REDO logs.

I have tried combinations of SSD, HDD, full_page_writes = on/off and max_io_concurrency = 10/50; the recovery times are as follows (in seconds):

                              No prefetch | Default prefetch values | Default + max_io_concurrency = 50
SSD, full_page_writes = on            852 |                     301 |                                197
SSD, full_page_writes = off          1642 |                    1359 |                               1391
HDD, full_page_writes = on           6027 |                    6345 |                               6390
HDD, full_page_writes = off           738 |                     275 |                                192

Default prefetch values:
- Max_recovery_prefetch_distance = 256KB
- Max_io_concurrency = 10

It probably makes sense to compare each row separately as the size of WAL can be different.

Talha.

-----Original Message-----
From: Thomas Munro <thomas.munro@gmail.com>
Sent: Thursday, August 13, 2020 9:57 AM
To: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Cc: Stephen Frost <sfrost@snowman.net>; Dmitry Dolgov <9erthalion6@gmail.com>; David Steele <david@pgmasters.net>; Andres Freund <andres@anarazel.de>; Alvaro Herrera <alvherre@2ndquadrant.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
Subject: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

On Thu, Aug 6, 2020 at 10:47 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote: > >On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra > >> Any luck trying to reproduce thigs? Should I try again and collect > >> some additional debug info? > > > >No luck. I'm working on it now, and also trying to reduce the > >overheads so that we're not doing extra work when it doesn't help. > > OK, I'll see if I can still reproduce it. Since someone else ask me off-list, here's a rebase, with no functional changes. Soon I'll post a new improved version, but this version just fixes the bitrot and hopefully turns cfbot green.
On Wed, Aug 26, 2020 at 9:42 AM Sait Talha Nisanci <Sait.Nisanci@microsoft.com> wrote:
> I have tried combinations of SSD, HDD, full_page_writes = on/off and max_io_concurrency = 10/50; the recovery times are as follows (in seconds):
>
>                               No prefetch | Default prefetch values | Default + max_io_concurrency = 50
> SSD, full_page_writes = on            852 |                     301 |                                197
> SSD, full_page_writes = off          1642 |                    1359 |                               1391
> HDD, full_page_writes = on           6027 |                    6345 |                               6390
> HDD, full_page_writes = off           738 |                     275 |                                192

The regression on HDD with full_page_writes=on is interesting. I don't know why that should happen, and I wonder if there is anything that can be done to mitigate it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings, * Sait Talha Nisanci (Sait.Nisanci@microsoft.com) wrote: > I have run some benchmarks for this patch. Overall it seems that there is a good improvement with the patch on recoverytimes: Maybe I missed it somewhere, but what's the OS/filesystem being used here..? What's the filesystem block size..? Thanks, Stephen
Attachment
Hi Stephen,

OS version is Ubuntu 18.04.5 LTS.
Filesystem is ext4 and block size is 4KB.

Talha.

-----Original Message-----
From: Stephen Frost <sfrost@snowman.net>
Sent: Thursday, August 27, 2020 4:56 PM
To: Sait Talha Nisanci <Sait.Nisanci@microsoft.com>
Cc: Thomas Munro <thomas.munro@gmail.com>; Tomas Vondra <tomas.vondra@2ndquadrant.com>; Dmitry Dolgov <9erthalion6@gmail.com>; David Steele <david@pgmasters.net>; Andres Freund <andres@anarazel.de>; Alvaro Herrera <alvherre@2ndquadrant.com>; pgsql-hackers <pgsql-hackers@postgresql.org>
Subject: Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Sait Talha Nisanci (Sait.Nisanci@microsoft.com) wrote:
> I have run some benchmarks for this patch. Overall it seems that there is a good improvement with the patch on recovery times:

Maybe I missed it somewhere, but what's the OS/filesystem being used here..? What's the filesystem block size..?

Thanks,

Stephen
Greetings,

* Sait Talha Nisanci (Sait.Nisanci@microsoft.com) wrote:
> OS version is Ubuntu 18.04.5 LTS.
> Filesystem is ext4 and block size is 4KB.

[...]

* Sait Talha Nisanci (Sait.Nisanci@microsoft.com) wrote:
> I have run some benchmarks for this patch. Overall it seems that there is a good improvement with the patch on recovery times:
>
> The VMs I used have 32GB RAM, pgbench is initialized with a scale factor 3000 (so it doesn’t fit in memory, ~45GB).
>
> In order to avoid checkpoints during the benchmark, max_wal_size (200GB) and checkpoint_timeout (200 mins) are set to high values.
>
> The run is cancelled when there is a reasonable amount of WAL (> 25GB). The recovery times are measured from the REDO logs.
>
> I have tried combinations of SSD, HDD, full_page_writes = on/off and max_io_concurrency = 10/50; the recovery times are as follows (in seconds):
>
>                               No prefetch | Default prefetch values | Default + max_io_concurrency = 50
> SSD, full_page_writes = on            852 |                     301 |                                197
> SSD, full_page_writes = off          1642 |                    1359 |                               1391
> HDD, full_page_writes = on           6027 |                    6345 |                               6390
> HDD, full_page_writes = off           738 |                     275 |                                192
>
> Default prefetch values:
> - Max_recovery_prefetch_distance = 256KB
> - Max_io_concurrency = 10
>
> It probably makes sense to compare each row separately as the size of WAL can be different.

Is WAL FPW compression enabled..? I'm trying to figure out how, given what's been shared here, that replaying 25GB of WAL is being helped out by 2.5x thanks to prefetch in the SSD case.

That prefetch is hurting in the HDD case entirely makes sense to me- we're spending time reading pages from the HDD, which is entirely pointless work given that we're just going to write over those pages entirely with FPWs.

Further, if there's 32GB of RAM, and WAL compression isn't enabled and the WAL is only 25GB, then it's very likely that every page touched by the WAL ends up in memory (shared buffers or fs cache), and with FPWs we shouldn't ever need to actually read from the storage to get those pages, right? So how is prefetch helping so much..?

I'm not sure that the 'full_page_writes = off' tests are very interesting in this case, since you're going to get torn pages and therefore corruption and hopefully no one is running with that configuration with this OS/filesystem.

Thanks,

Stephen
Attachment
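Some background for this exchange: when redo has a full-page image to apply, PostgreSQL grabs the buffer in RBM_ZERO_AND_LOCK mode and copies the image over it, so the old page contents are never read from disk. The snippet below is a compressed paraphrase of that path in xlogutils.c, not the actual source; it only compiles inside the server tree, and error handling, cleanup locks and the non-FPW branch are all omitted.

#include "postgres.h"

#include "access/xlogreader.h"
#include "access/xlogutils.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/* Paraphrase of the full-page-image branch of
 * XLogReadBufferForRedoExtended(): the buffer is zero-filled, not read,
 * before the image stored in the WAL record overwrites it. */
static Buffer
restore_fpw_sketch(XLogReaderState *record, uint8 block_id)
{
    RelFileNode rnode;
    ForkNumber  forknum;
    BlockNumber blkno;
    Buffer      buf;
    Page        page;

    XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno);

    /* Zero-fill and lock the buffer; no physical read is issued. */
    buf = XLogReadBufferExtended(rnode, forknum, blkno, RBM_ZERO_AND_LOCK);
    page = BufferGetPage(buf);

    /* Overwrite the whole page from the record's stored image. */
    if (!RestoreBlockImage(record, block_id, page))
        elog(ERROR, "failed to restore block image");

    PageSetLSN(page, record->EndRecPtr);
    MarkBufferDirty(buf);

    return buf;
}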
Hi,

On August 27, 2020 11:26:42 AM PDT, Stephen Frost <sfrost@snowman.net> wrote:
>Is WAL FPW compression enabled..? I'm trying to figure out how, given
>what's been shared here, that replaying 25GB of WAL is being helped out
>by 2.5x thanks to prefetch in the SSD case. That prefetch is hurting in
>the HDD case entirely makes sense to me- we're spending time reading
>pages from the HDD, which is entirely pointless work given that we're
>just going to write over those pages entirely with FPWs.

Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same or if not in s_b anymore.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
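The behaviour Andres describes boils down to a test like the following in the prefetcher's look-ahead loop. XLogRecHasBlockImage() is a real macro; the function, the flag and the counters are stand-ins for whatever the patch actually uses, named here after the skip_fpw column that shows up in the stats view later in the thread.

#include "postgres.h"

#include "access/xlogreader.h"

/* Hypothetical counters, named after the stats columns seen later. */
static uint64 prefetch_count;
static uint64 skip_fpw_count;

/* Stand-in for whatever call actually issues the prefetch advice. */
static void
issue_prefetch(XLogReaderState *record, uint8 block_id)
{
    /* imagine the posix_fadvise()/PrefetchBuffer() call here */
    (void) record;
    (void) block_id;
}

/*
 * Decide whether to prefetch one block reference in a record we are
 * looking ahead at.  If redo will restore the page from a full-page image
 * anyway, reading it ahead of time is usually a wasted syscall, so it is
 * skipped unless the (hypothetical) flag says otherwise.
 */
static void
consider_prefetch(XLogReaderState *record, uint8 block_id,
                  bool prefetch_fpw)
{
    if (XLogRecHasBlockImage(record, block_id) && !prefetch_fpw)
    {
        skip_fpw_count++;
        return;
    }

    issue_prefetch(record, block_id);
    prefetch_count++;
}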
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On August 27, 2020 11:26:42 AM PDT, Stephen Frost <sfrost@snowman.net> wrote: > >Is WAL FPW compression enabled..? I'm trying to figure out how, given > >what's been shared here, that replaying 25GB of WAL is being helped out > >by 2.5x thanks to prefetch in the SSD case. That prefetch is hurting > >in > >the HDD case entirely makes sense to me- we're spending time reading > >pages from the HDD, which is entirely pointless work given that we're > >just going to write over those pages entirely with FPWs. > > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affectingthe same or if not in s_b anymore. We don't actually read the page when we're replaying an FPW though..? If we don't read it, and we entirely write the page from the FPW, how is pre-fetching helping..? I understood how it could be helpful for filesystems which have a larger block size than ours (eg: zfs w/ 16kb block sizes where the kernel needs to get the whole 16kb block when we only write 8kb to it), but that's apparently not the case here. So- what is it that pre-fetching is doing to result in such an improvement? Is there something lower level where the SSD physical block size is coming into play, which is typically larger..? I wouldn't have thought so, but perhaps that's the case.. Thanks, Stephen
Attachment
On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost <sfrost@snowman.net> wrote: > > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affectingthe same or if not in s_b anymore. > > We don't actually read the page when we're replaying an FPW though..? > If we don't read it, and we entirely write the page from the FPW, how is > pre-fetching helping..? Suppose there is a checkpoint. Then we replay a record with an FPW, pre-fetching nothing. Then the buffer gets evicted from shared_buffers, and maybe the OS cache too. Then, before the next checkpoint, we again replay a record for the same page. At this point, pre-fetching should be helpful. Admittedly, I don't quite understand whether that is what is happening in this test case, or why SDD vs. HDD should make any difference. But there doesn't seem to be any reason why it doesn't make sense in theory. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost <sfrost@snowman.net> wrote: > > > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affectingthe same or if not in s_b anymore. > > > > We don't actually read the page when we're replaying an FPW though..? > > If we don't read it, and we entirely write the page from the FPW, how is > > pre-fetching helping..? > > Suppose there is a checkpoint. Then we replay a record with an FPW, > pre-fetching nothing. Then the buffer gets evicted from > shared_buffers, and maybe the OS cache too. Then, before the next > checkpoint, we again replay a record for the same page. At this point, > pre-fetching should be helpful. Sure- but if we're talking about 25GB of WAL, on a server that's got 32GB, then why would those pages end up getting evicted from memory entirely? Particularly, enough of them to end up with such a huge difference in replay time.. I do agree that if we've got more outstanding WAL between checkpoints than the system's got memory then that certainly changes things, but that wasn't what I understood the case to be here. > Admittedly, I don't quite understand whether that is what is happening > in this test case, or why SDD vs. HDD should make any difference. But > there doesn't seem to be any reason why it doesn't make sense in > theory. I agree that this could be a reason, but it doesn't seem to quite fit in this particular case given the amount of memory and WAL. I'm suspecting that it's something else and I'd very much like to know if it's a general "this applies to all (most? a lot of?) SSDs because the hardware has a larger than 8KB page size and therefore the kernel has to read it", or if it's something odd about this particular system and doesn't apply generally. Thanks, Stephen
Attachment
On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote: >Greetings, > >* Robert Haas (robertmhaas@gmail.com) wrote: >> On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost <sfrost@snowman.net> wrote: >> > > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affectingthe same or if not in s_b anymore. >> > >> > We don't actually read the page when we're replaying an FPW though..? >> > If we don't read it, and we entirely write the page from the FPW, how is >> > pre-fetching helping..? >> >> Suppose there is a checkpoint. Then we replay a record with an FPW, >> pre-fetching nothing. Then the buffer gets evicted from >> shared_buffers, and maybe the OS cache too. Then, before the next >> checkpoint, we again replay a record for the same page. At this point, >> pre-fetching should be helpful. > >Sure- but if we're talking about 25GB of WAL, on a server that's got >32GB, then why would those pages end up getting evicted from memory >entirely? Particularly, enough of them to end up with such a huge >difference in replay time.. > >I do agree that if we've got more outstanding WAL between checkpoints >than the system's got memory then that certainly changes things, but >that wasn't what I understood the case to be here. > I don't think it's very clear how much WAL there actually was in each case - the message only said there was more than 25GB, but who knows how many checkpoints that covers? In the cases with FPW=on this may easily be much less than one checkpoint (because with scale 45GB an update to every page will log 45GB of full-page images). It'd be interesting to see some stats from pg_waldump etc. >> Admittedly, I don't quite understand whether that is what is happening >> in this test case, or why SDD vs. HDD should make any difference. But >> there doesn't seem to be any reason why it doesn't make sense in >> theory. > >I agree that this could be a reason, but it doesn't seem to quite fit in >this particular case given the amount of memory and WAL. I'm suspecting >that it's something else and I'd very much like to know if it's a >general "this applies to all (most? a lot of?) SSDs because the >hardware has a larger than 8KB page size and therefore the kernel has to >read it", or if it's something odd about this particular system and >doesn't apply generally. > Not sure. I doubt it has anything to do with the hardware page size, that's mostly transparent to the kernel anyway. But it might be that the prefetching on a particular SSD has more overhead than what it saves. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote: > >* Robert Haas (robertmhaas@gmail.com) wrote: > >>On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost <sfrost@snowman.net> wrote: > >>> > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affectingthe same or if not in s_b anymore. > >>> > >>> We don't actually read the page when we're replaying an FPW though..? > >>> If we don't read it, and we entirely write the page from the FPW, how is > >>> pre-fetching helping..? > >> > >>Suppose there is a checkpoint. Then we replay a record with an FPW, > >>pre-fetching nothing. Then the buffer gets evicted from > >>shared_buffers, and maybe the OS cache too. Then, before the next > >>checkpoint, we again replay a record for the same page. At this point, > >>pre-fetching should be helpful. > > > >Sure- but if we're talking about 25GB of WAL, on a server that's got > >32GB, then why would those pages end up getting evicted from memory > >entirely? Particularly, enough of them to end up with such a huge > >difference in replay time.. > > > >I do agree that if we've got more outstanding WAL between checkpoints > >than the system's got memory then that certainly changes things, but > >that wasn't what I understood the case to be here. > > I don't think it's very clear how much WAL there actually was in each > case - the message only said there was more than 25GB, but who knows how > many checkpoints that covers? In the cases with FPW=on this may easily > be much less than one checkpoint (because with scale 45GB an update to > every page will log 45GB of full-page images). It'd be interesting to > see some stats from pg_waldump etc. Also in the message was this: -- In order to avoid checkpoints during benchmark, max_wal_size(200GB) and checkpoint_timeout(200 mins) are set to a high value. -- Which lead me to suspect, at least, that this was much less than a checkpoint, as you suggest. Also, given that the comment was 'run is cancelled when there is a reasonable amount of WAL (>25GB), seems likely that it's at least *around* there. Ultimately though, there just isn't enough information provided to really be able to understand what's going on. I agree, pg_waldump stats would be useful. > >>Admittedly, I don't quite understand whether that is what is happening > >>in this test case, or why SDD vs. HDD should make any difference. But > >>there doesn't seem to be any reason why it doesn't make sense in > >>theory. > > > >I agree that this could be a reason, but it doesn't seem to quite fit in > >this particular case given the amount of memory and WAL. I'm suspecting > >that it's something else and I'd very much like to know if it's a > >general "this applies to all (most? a lot of?) SSDs because the > >hardware has a larger than 8KB page size and therefore the kernel has to > >read it", or if it's something odd about this particular system and > >doesn't apply generally. > > Not sure. I doubt it has anything to do with the hardware page size, > that's mostly transparent to the kernel anyway. But it might be that the > prefetching on a particular SSD has more overhead than what it saves. Right- I wouldn't have thought the hardware page size would matter either, but it's entirely possible that assumption is wrong and that it does matter for some reason- perhaps with just some SSDs, or maybe with a lot of them, or maybe there's something else entirely going on. 
About all I feel like I can say at the moment is that I'm very interested in ways to make WAL replay go faster and it'd be great to get more information about what's going on here to see if there's something we can do to generally improve WAL replay. Thanks, Stephen
Attachment
Hi, The WAL size for "SSD, full_page_writes=on" was 36GB. I currently don't have the exact size for the other rows because mytest VMs got auto-deleted. I can possibly redo the benchmark to get pg_waldump stats for each row. Best, Talha. -----Original Message----- From: Stephen Frost <sfrost@snowman.net> Sent: Sunday, August 30, 2020 3:24 PM To: Tomas Vondra <tomas.vondra@2ndquadrant.com> Cc: Robert Haas <robertmhaas@gmail.com>; Andres Freund <andres@anarazel.de>; Sait Talha Nisanci <Sait.Nisanci@microsoft.com>;Thomas Munro <thomas.munro@gmail.com>; Dmitry Dolgov <9erthalion6@gmail.com>; David Steele <david@pgmasters.net>;Alvaro Herrera <alvherre@2ndquadrant.com>; pgsql-hackers <pgsql-hackers@postgresql.org> Subject: Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach) Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote: > >* Robert Haas (robertmhaas@gmail.com) wrote: > >>On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost <sfrost@snowman.net> wrote: > >>> > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affectingthe same or if not in s_b anymore. > >>> > >>> We don't actually read the page when we're replaying an FPW though..? > >>> If we don't read it, and we entirely write the page from the FPW, > >>> how is pre-fetching helping..? > >> > >>Suppose there is a checkpoint. Then we replay a record with an FPW, > >>pre-fetching nothing. Then the buffer gets evicted from > >>shared_buffers, and maybe the OS cache too. Then, before the next > >>checkpoint, we again replay a record for the same page. At this > >>point, pre-fetching should be helpful. > > > >Sure- but if we're talking about 25GB of WAL, on a server that's got > >32GB, then why would those pages end up getting evicted from memory > >entirely? Particularly, enough of them to end up with such a huge > >difference in replay time.. > > > >I do agree that if we've got more outstanding WAL between checkpoints > >than the system's got memory then that certainly changes things, but > >that wasn't what I understood the case to be here. > > I don't think it's very clear how much WAL there actually was in each > case - the message only said there was more than 25GB, but who knows > how many checkpoints that covers? In the cases with FPW=on this may > easily be much less than one checkpoint (because with scale 45GB an > update to every page will log 45GB of full-page images). It'd be > interesting to see some stats from pg_waldump etc. Also in the message was this: -- In order to avoid checkpoints during benchmark, max_wal_size(200GB) and checkpoint_timeout(200 mins) are set to a high value. -- Which lead me to suspect, at least, that this was much less than a checkpoint, as you suggest. Also, given that the commentwas 'run is cancelled when there is a reasonable amount of WAL (>25GB), seems likely that it's at least *around* there. Ultimately though, there just isn't enough information provided to really be able to understand what's going on. I agree,pg_waldump stats would be useful. > >>Admittedly, I don't quite understand whether that is what is > >>happening in this test case, or why SDD vs. HDD should make any > >>difference. But there doesn't seem to be any reason why it doesn't > >>make sense in theory. > > > >I agree that this could be a reason, but it doesn't seem to quite fit > >in this particular case given the amount of memory and WAL. 
I'm > >suspecting that it's something else and I'd very much like to know if > >it's a general "this applies to all (most? a lot of?) SSDs because > >the hardware has a larger than 8KB page size and therefore the kernel > >has to read it", or if it's something odd about this particular > >system and doesn't apply generally. > > Not sure. I doubt it has anything to do with the hardware page size, > that's mostly transparent to the kernel anyway. But it might be that > the prefetching on a particular SSD has more overhead than what it saves. Right- I wouldn't have thought the hardware page size would matter either, but it's entirely possible that assumption is wrong and that it does matter for some reason- perhaps with just some SSDs, or maybe with a lot of them, or maybe there's something else entirely going on. About all I feel like I can say at the moment is that I'm very interested in ways to make WAL replay go faster and it'd be great to get more information about what's going on here to see if there's something we can do to generally improve WAL replay. Thanks, Stephen
On Thu, Aug 13, 2020 at 06:57:20PM +1200, Thomas Munro wrote: >On Thu, Aug 6, 2020 at 10:47 PM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote: >> >On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra >> >> Any luck trying to reproduce thigs? Should I try again and collect some >> >> additional debug info? >> > >> >No luck. I'm working on it now, and also trying to reduce the >> >overheads so that we're not doing extra work when it doesn't help. >> >> OK, I'll see if I can still reproduce it. > >Since someone else ask me off-list, here's a rebase, with no >functional changes. Soon I'll post a new improved version, but this >version just fixes the bitrot and hopefully turns cfbot green.

I've decided to do some tests with this patch version, but I immediately ran into issues. What I did was initialize a 32GB pgbench database, back it up (shutdown + tar) and then run a 2h pgbench with archiving. And then I restored the backed-up data directory and instructed it to replay WAL from the archive. There are about 16k WAL segments, so about 256GB of WAL.

Unfortunately, the very first thing that happens after starting the recovery is this:

LOG: starting archive recovery
LOG: restored log file "000000010000001600000080" from archive
LOG: consistent recovery state reached at 16/800000A0
LOG: redo starts at 16/800000A0
LOG: database system is ready to accept read only connections
LOG: recovery started prefetching on timeline 1 at 0/800000A0
LOG: recovery no longer prefetching: unexpected pageaddr 8/84000000 in log segment 000000010000001600000081, offset 0
LOG: restored log file "000000010000001600000081" from archive
LOG: restored log file "000000010000001600000082" from archive

So we start applying 000000010000001600000081 and it fails almost immediately on the first segment. This is confirmed by prefetch stats, which look like this:

-[ RECORD 1 ]---+-----------------------------
stats_reset     | 2020-09-01 15:02:31.18766+02
prefetch        | 1044
skip_hit        | 1995
skip_new        | 87
skip_fpw        | 2108
skip_seq        | 27
distance        | 0
queue_depth     | 0
avg_distance    | 135838.95
avg_queue_depth | 8.852459

So we do a little bit of prefetching and then it gets disabled :-( The segment looks perfectly fine when inspected using pg_waldump, see the attached file.

I've tested this applied on 6ca547cf75ef6e922476c51a3fb5e253eef5f1b6, and the failure seems fairly similar to what I reported before, except that now it happened right at the very beginning. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Wed, Sep 2, 2020 at 1:14 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > from the archive Ahh, so perhaps that's the key. > I've tested this applied on 6ca547cf75ef6e922476c51a3fb5e253eef5f1b6, > and the failure seems fairly similar to what I reported before, except > that now it happened right at the very beginning. Thanks, will see if I can work out why. My newer version probably has the same problem.
On Wed, Sep 02, 2020 at 02:05:10AM +1200, Thomas Munro wrote: >On Wed, Sep 2, 2020 at 1:14 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> from the archive > >Ahh, so perhaps that's the key. > Maybe. For the record, the commands look like this: archive_command = 'gzip -1 -c %p > /mnt/raid/wal-archive/%f.gz' restore_command = 'gunzip -c /mnt/raid/wal-archive/%f.gz > %p.tmp && mv %p.tmp %p' >> I've tested this applied on 6ca547cf75ef6e922476c51a3fb5e253eef5f1b6, >> and the failure seems fairly similar to what I reported before, except >> that now it happened right at the very beginning. > >Thanks, will see if I can work out why. My newer version probably has >the same problem. OK. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Sep 2, 2020 at 2:18 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Wed, Sep 02, 2020 at 02:05:10AM +1200, Thomas Munro wrote: > >On Wed, Sep 2, 2020 at 1:14 AM Tomas Vondra > ><tomas.vondra@2ndquadrant.com> wrote: > >> from the archive > > > >Ahh, so perhaps that's the key. > > Maybe. For the record, the commands look like this: > > archive_command = 'gzip -1 -c %p > /mnt/raid/wal-archive/%f.gz' > > restore_command = 'gunzip -c /mnt/raid/wal-archive/%f.gz > %p.tmp && mv %p.tmp %p' Yeah, sorry, I goofed here by not considering archive recovery properly. I have special handling for crash recovery from files in pg_wal (XLRO_END, means read until you run out of files) and streaming replication (XLRO_WALRCV_WRITTEN, means read only as far as the wal receiver has advertised as written in shared memory), as a way to control the ultimate limit on how far ahead to read when maintenance_io_concurrency and max_recovery_prefetch_distance don't limit you first. But if you recover from a base backup with a WAL archive, it uses the XLRO_END policy which can run out of files just because a new file hasn't been restored yet, so it gives up prefetching too soon, as you're seeing. That doesn't cause any damage, but it stops doing anything useful because the prefetcher thinks its job is finished. It'd be possible to fix this somehow in the two-XLogReader design, but since I'm testing a new version that has a unified XLogReader-with-read-ahead I'm not going to try to do that. I've added a basebackup-with-archive recovery to my arsenal of test workloads to make sure I don't forget about archive recovery mode again, but I think it's actually harder to get this wrong in the new design. In the meantime, if you are still interested in studying the potential speed-up from WAL prefetching using the most recently shared two-XLogReader patch, you'll need to unpack all your archived WAL files into pg_wal manually beforehand.
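To make the two read-ahead limit policies described above concrete, here is a rough sketch in C. Only the names XLRO_END and XLRO_WALRCV_WRITTEN come from the patch as described; the enum layout and the helper function are hypothetical illustrations, not the patch's actual code.

  typedef enum XLogReadAheadPolicy
  {
      XLRO_END,               /* crash recovery: read ahead until we run out of WAL files */
      XLRO_WALRCV_WRITTEN     /* streaming: read ahead only up to the WAL receiver's written LSN */
  } XLogReadAheadPolicy;

  /* Hypothetical illustration of the limit check described above. */
  static bool
  read_ahead_allowed(XLogReadAheadPolicy policy, XLogRecPtr candidate,
                     XLogRecPtr walrcv_written)
  {
      if (policy == XLRO_WALRCV_WRITTEN)
          return candidate < walrcv_written;

      /*
       * XLRO_END: keep reading until a read fails because there is no next
       * file.  In archive recovery the next file may simply not have been
       * restored yet, which is why prefetching gives up too soon there.
       */
      return true;
  }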
On Sat, Sep 05, 2020 at 12:05:52PM +1200, Thomas Munro wrote: >On Wed, Sep 2, 2020 at 2:18 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> On Wed, Sep 02, 2020 at 02:05:10AM +1200, Thomas Munro wrote: >> >On Wed, Sep 2, 2020 at 1:14 AM Tomas Vondra >> ><tomas.vondra@2ndquadrant.com> wrote: >> >> from the archive >> > >> >Ahh, so perhaps that's the key. >> >> Maybe. For the record, the commands look like this: >> >> archive_command = 'gzip -1 -c %p > /mnt/raid/wal-archive/%f.gz' >> >> restore_command = 'gunzip -c /mnt/raid/wal-archive/%f.gz > %p.tmp && mv %p.tmp %p' > >Yeah, sorry, I goofed here by not considering archive recovery >properly. I have special handling for crash recovery from files in >pg_wal (XLRO_END, means read until you run out of files) and streaming >replication (XLRO_WALRCV_WRITTEN, means read only as far as the wal >receiver has advertised as written in shared memory), as a way to >control the ultimate limit on how far ahead to read when >maintenance_io_concurrency and max_recovery_prefetch_distance don't >limit you first. But if you recover from a base backup with a WAL >archive, it uses the XLRO_END policy which can run out of files just >because a new file hasn't been restored yet, so it gives up >prefetching too soon, as you're seeing. That doesn't cause any >damage, but it stops doing anything useful because the prefetcher >thinks its job is finished. > >It'd be possible to fix this somehow in the two-XLogReader design, but >since I'm testing a new version that has a unified >XLogReader-with-read-ahead I'm not going to try to do that. I've >added a basebackup-with-archive recovery to my arsenal of test >workloads to make sure I don't forget about archive recovery mode >again, but I think it's actually harder to get this wrong in the new >design. In the meantime, if you are still interested in studying the >potential speed-up from WAL prefetching using the most recently shared >two-XLogReader patch, you'll need to unpack all your archived WAL >files into pg_wal manually beforehand. OK, thanks for looking into this. I guess I'll wait for an updated patch before testing this further. The storage has limited capacity so I'd have to either reduce the amount of data/WAL or juggle with the WAL segments somehow. Doesn't seem worth it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Sep 9, 2020 at 11:16 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > OK, thanks for looking into this. I guess I'll wait for an updated patch > before testing this further. The storage has limited capacity so I'd > have to either reduce the amount of data/WAL or juggle with the WAL > segments somehow. Doesn't seem worth it. Here's a new WIP version that works for archive-based recovery in my tests. The main change I have been working on is that there is now just a single XLogReaderState, so no more double-reading and double-decoding of the WAL. It provides XLogReadRecord(), as before, but now you can also read further ahead with XLogReadAhead(). The user interface is much like before, except that the GUCs changed a bit. They are now:

  recovery_prefetch=on
  recovery_prefetch_fpw=off
  wal_decode_buffer_size=256kB
  maintenance_io_concurrency=10

I recommend setting maintenance_io_concurrency and wal_decode_buffer_size much higher than those defaults. There are a few TODOs and questions remaining. One issue I'm wondering about is whether it is OK that bulky FPI data is now memcpy'd into the decode buffer, whereas before we avoided that sometimes, when it didn't happen to cross a page boundary; I have some ideas on how to do better (basically two levels of ring buffer) but I haven't looked into that yet. Another issue is the new 'nowait' API for the page-read callback; I'm trying to figure out if that is sufficient, or whether something more sophisticated, perhaps including a different return value, is required. Another thing I'm wondering about is whether I have timeline changes adequately handled. This design opens up a lot of possibilities for future performance improvements. Some examples:

1. By adding some workspace to decoded records, the prefetcher can leave breadcrumbs for XLogReadBufferForRedoExtended(), so that it usually avoids the need for a second buffer mapping table lookup. Incidentally this also skips the hot smgropen() calls that Jakub complained about. I have added an experimental patch like that, but I need to look into the interlocking some more.

2. By inspecting future records in the record->next chain, a redo function could merge work in various ways in quite a simple and localised way. A couple of examples:

2.1. If there is a sequence of records of the same type touching the same page, you could process all of them while you have the page lock.

2.2. If there is a sequence of relation extensions (say, a sequence of multi-tuple inserts to the end of a relation, as commonly seen in bulk data loads) then instead of generating many pwrite(8KB of zeroes) syscalls record-by-record to extend the relation, a single posix_fallocate(1MB) could extend the file in one shot (see the illustrative sketch below). Assuming the bgwriter is running and doing a good job, this would remove most of the system calls from bulk-load recovery.

3. More sophisticated analysis could find records to merge that are a bit further apart, under carefully controlled conditions; for example if you have a sequence like heap-insert, btree-insert, heap-insert, btree-insert, ... then a simple next-record system like 2 won't see the opportunities, but something a teensy bit smarter could.

4. Since the decoding buffer can be placed in shared memory (decoded records contain pointers, but they don't point to any other memory region, with the exception of clearly marked oversized records), we could begin to contemplate handing work off to other processes, given a clever dependency analysis scheme and some more infrastructure.
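To make idea 2.2 concrete, here is a minimal sketch. This is not code from the patch: the function and its caller are hypothetical, and it assumes PostgreSQL's usual headers for BlockNumber, BLCKSZ and elog.

  #include <fcntl.h>
  #include <string.h>

  /*
   * Hypothetical helper, not from the patch: extend a relation segment by
   * nblocks in a single call instead of one pwrite() of zeroes per block.
   */
  static void
  extend_relation_in_one_shot(int fd, BlockNumber first_new_block, int nblocks)
  {
      off_t   offset = (off_t) first_new_block * BLCKSZ;
      off_t   len = (off_t) nblocks * BLCKSZ;
      int     rc;

      /* posix_fallocate() returns an errno value directly, not via errno. */
      rc = posix_fallocate(fd, offset, len);
      if (rc != 0)
          elog(ERROR, "could not extend file: %s", strerror(rc));
  }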
Attachment
On Thu, Sep 24, 2020 at 11:38 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Sep 9, 2020 at 11:16 AM Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: > > OK, thanks for looking into this. I guess I'll wait for an updated patch > > before testing this further. The storage has limited capacity so I'd > > have to either reduce the amount of data/WAL or juggle with the WAL > > segments somehow. Doesn't seem worth it. > > Here's a new WIP version that works for archive-based recovery in my tests. Rebased over recent merge conflicts in xlog.c. I also removed a stray debugging message. One problem the current patch has is that if you use something like pg_standby, that is, a restore command that waits for more data, then it'll block waiting for WAL when it's trying to prefetch, which means that replay is delayed. I'm not sure what to think about that yet.
Attachment
On Thu, Sep 24, 2020 at 11:38:45AM +1200, Thomas Munro wrote: >On Wed, Sep 9, 2020 at 11:16 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> OK, thanks for looking into this. I guess I'll wait for an updated patch >> before testing this further. The storage has limited capacity so I'd >> have to either reduce the amount of data/WAL or juggle with the WAL >> segments somehow. Doesn't seem worth it. > >Here's a new WIP version that works for archive-based recovery in my tests. > >The main change I have been working on is that there is now just a >single XLogReaderState, so no more double-reading and double-decoding >of the WAL. It provides XLogReadRecord(), as before, but now you can >also read further ahead with XLogReadAhead(). The user interface is >much like before, except that the GUCs changed a bit. They are now: > > recovery_prefetch=on > recovery_prefetch_fpw=off > wal_decode_buffer_size=256kB > maintenance_io_concurrency=10 > >I recommend setting maintenance_io_concurrency and >wal_decode_buffer_size much higher than those defaults. > I think you've left the original GUC (replaced by the buffer size) in the postgresql.conf.sample file. Confused me for a bit ;-) I've done a bit of testing and so far it seems to work with WAL archive, so I'll do more testing and benchmarking over the next couple days. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, I repeated the same testing I did before - I started with a 32GB pgbench database with archiving, ran a pgbench for 1h to generate plenty of WAL, and then performed recovery from a snapshot + archived WAL on different storage types. The instance was running on NVMe SSD, allowing it to generate ~200GB of WAL in 1h. The recovery was done on two storage types - SATA RAID0 with 3 x 7.2k spinning drives and NVMe SSD. On each storage I tested three configs - disabled prefetching, defaults and increased values:

  wal_decode_buffer_size = 4MB (so 8x the default)
  maintenance_io_concurrency = 100 (so 10x the default)

FWIW there's a bunch of issues with the GUCs - the .conf.sample file does not include e.g. recovery_prefetch, and instead includes #max_recovery_prefetch_distance which was however replaced by wal_decode_buffer_size. Another thing is that the actual default value differs from the docs - e.g. the docs say that wal_decode_buffer_size is 256kB by default, when in fact it's 512kB. Now, some results ...

1) NVMe

For the fast storage, there's a modest improvement. The times it took to recover the ~13k WAL segments are these:

  no prefetch: 5532s
  default:     4613s
  increased:   4549s

So the speedup from enabled prefetch is ~20%, but increasing the values to make it more aggressive has little effect. Fair enough, the NVMe is probably fast enough to not benefit from longer I/O queues here. This is a bit misleading though, because the effectiveness of prefetching very much depends on the fraction of FPIs in the WAL stream - and right after a checkpoint that's most of the WAL, which makes the prefetching less efficient. We still have to parse the WAL etc. without actually prefetching anything, so it's pure overhead. So I've also generated a chart showing the time (in milliseconds) needed to apply individual WAL segments. It clearly shows that there are 3 checkpoints, and that for each checkpoint it's initially very cheap (thanks to FPIs) and as the fraction of FPIs drops the redo gets more expensive. At which point the prefetch actually helps, by up to 30% in some cases (so a bit more than the overall speedup). All of this is expected, of course.

2) 3 x 7.2k SATA RAID0

For the spinning rust, I had to make some compromises. It's not feasible to apply all the 200GB of WAL - it would take way too long. I only applied ~2600 segments for each configuration (so not even one whole checkpoint), and even that took ~20h in each case. The durations look like this:

  no prefetch: 72446s
  default:     73653s
  increased:   55409s

So in this case the default setting is way too low - it actually makes the recovery a bit slower, while with increased values there's ~25% speedup, which is nice. I assume that if a larger number of WAL segments was applied (e.g. the whole checkpoint), the prefetch numbers would be a bit better - the initial FPI part would play a smaller role. From the attached "average per segment" chart you can see that the basic behavior is about the same as for NVMe - initially it's slower due to FPIs in the WAL stream, and then it gets ~30% faster. Overall I think it looks good. I haven't looked at the code very much, and I can't comment on the potential optimizations mentioned a couple days ago yet. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sun, Oct 11, 2020 at 12:29 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I repeated the same testing I did before - I started with a 32GB pgbench > database with archiving, run a pgbench for 1h to generate plenty of WAL, > and then performed recovery from a snapshot + archived WAL on different > storage types. The instance was running on NVMe SSD, allowing it ro > generate ~200GB of WAL in 1h. Thanks for running these tests! And sorry for the delay in replying. > The recovery was done on two storage types - SATA RAID0 with 3 x 7.2k > spinning drives and NVMe SSD. On each storage I tested three configs - > disabled prefetching, defaults and increased values: > > wal_decode_buffer_size = 4MB (so 8x the default) > maintenance_io_concurrency = 100 (so 10x the default) > > FWIW there's a bunch of issues with the GUCs - the .conf.sample file > does not include e.g. recovery_prefetch, and instead includes > #max_recovery_prefetch_distance which was however replaced by > wal_decode_buffer_size. Another thing is that the actual default value > differ from the docs - e.g. the docs say that wal_decode_buffer_size is > 256kB by default, when in fact it's 512kB. Oops. Fixed, and rebased. > Now, some results ... > > 1) NVMe > > Fro the fast storage, there's a modest improvement. The time it took to > recover the ~13k WAL segments are these > > no prefetch: 5532s > default: 4613s > increased: 4549s > > So the speedup from enabled prefetch is ~20% but increasing the values > to make it more aggressive has little effect. Fair enough, the NVMe > is probably fast enough to not benefig from longer I/O queues here. > > This is a bit misleading though, because the effectivity of prfetching > very much depends on the fraction of FPI in the WAL stream - and right > after checkpoint that's most of the WAL, which makes the prefetching > less efficient. We still have to parse the WAL etc. without actually > prefetching anything, so it's pure overhead. Yeah. I've tried to reduce that overhead as much as possible, decoding once and looking up the buffer only once. The extra overhead caused by making posix_fadvise() calls is unfortunate (especially if they aren't helping due to small shared buffers but huge page cache), but should be fixed by switching to proper AIO, independently of this patch, which will batch those and remove the pread(). > So I've also generated a chart showing time (in milliseconds) needed to > apply individual WAL segments. It clearly shows that there are 3 > checkpoints, and that for each checkpoint it's initially very cheap > (thanks to FPI) and as the fraction of FPIs drops the redo gets more > expensive. At which point the prefetch actually helps, by up to 30% in > some cases (so a bit more than the overall speedup). All of this is > expected, of course. That is a nice way to see the effect of FPI on recovery. > 2) 3 x 7.2k SATA RAID0 > > For the spinning rust, I had to make some compromises. It's not feasible > to apply all the 200GB of WAL - it would take way too long. I only > applied ~2600 segments for each configuration (so not even one whole > checkpoint), and even that took ~20h in each case. > > The durations look like this: > > no prefetch: 72446s > default: 73653s > increased: 55409s > > So in this case the default settings is way too low - it actually makes > the recovery a bit slower, while with increased values there's ~25% > speedup, which is nice. I assume that if larger number of WAL segments > was applied (e.g. 
the whole checkpoint), the prefetch numbers would be > a bit better - the initial FPI part would play smaller role. Huh. Interesting. > From the attached "average per segment" chart you can see that the basic > behavior is about the same as for NVMe - initially it's slower due to > FPIs in the WAL stream, and then it gets ~30% faster. Yeah. I expect that one day not too far away we'll figure out how to get rid of FPIs (through a good enough double-write log or O_ATOMIC)... > Overall I think it looks good. I haven't looked at the code very much, > and I can't comment on the potential optimizations mentioned a couple > days ago yet. Thanks! I'm not really sure what to do about archive restore scripts that block. That seems to be fundamentally incompatible with what I'm doing here.
Attachment
On 11/13/20 3:20 AM, Thomas Munro wrote: > > ... > > I'm not really sure what to do about achive restore scripts that > block. That seems to be fundamentally incompatible with what I'm > doing here. > IMHO we can't do much about that, except for documenting it - if the prefetch can't work because of blocking restore script, someone has to fix/improve the script. No way around that, I'm afraid. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 11/13/20 3:20 AM, Thomas Munro wrote: > > I'm not really sure what to do about achive restore scripts that > > block. That seems to be fundamentally incompatible with what I'm > > doing here. > > IMHO we can't do much about that, except for documenting it - if the > prefetch can't work because of blocking restore script, someone has to > fix/improve the script. No way around that, I'm afraid. I'm a bit confused about what the issue here is- is the concern that a restore_command is specified that isn't allowed to run concurrently but this patch is intending to run more than one concurrently..? There's another patch that I was looking at for doing pre-fetching of WAL segments, so if this is also doing that we should figure out which patch we want.. I don't know that it's needed, but it feels likely that we could provide a better result if we consider making changes to the restore_command API (eg: have a way to say "please fetch this many segments ahead, and you can put them in this directory with these filenames" or something). I would think we'd be able to continue supporting the existing API and accept that it might not be as performant. Thanks, Stephen
Attachment
On Sat, Nov 14, 2020 at 4:13 AM Stephen Frost <sfrost@snowman.net> wrote: > * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > > On 11/13/20 3:20 AM, Thomas Munro wrote: > > > I'm not really sure what to do about achive restore scripts that > > > block. That seems to be fundamentally incompatible with what I'm > > > doing here. > > > > IMHO we can't do much about that, except for documenting it - if the > > prefetch can't work because of blocking restore script, someone has to > > fix/improve the script. No way around that, I'm afraid. > > I'm a bit confused about what the issue here is- is the concern that a > restore_command is specified that isn't allowed to run concurrently but > this patch is intending to run more than one concurrently..? There's > another patch that I was looking at for doing pre-fetching of WAL > segments, so if this is also doing that we should figure out which > patch we want.. The problem is that the recovery loop tries to look further ahead in between applying individual records, which causes the restore script to run, and if that blocks, we won't apply records that we already have, because we're waiting for the next WAL file to appear. This behaviour is on by default with my patch, so pg_standby will introduce weird replay delays. We could think of some ways to fix that, with meaningful return codes and periodic polling or something, I suppose, but something feels a bit weird about it. > I don't know that it's needed, but it feels likely that we could provide > a better result if we consider making changes to the restore_command API > (eg: have a way to say "please fetch this many segments ahead, and you > can put them in this directory with these filenames" or something). I > would think we'd be able to continue supporting the existing API and > accept that it might not be as performant. Hmm. Every time I try to think of a protocol change for the restore_command API that would be acceptable, I go around the same circle of thoughts about event flow and realise that what we really need for this is ... a WAL receiver... Here's a rebase over the recent commit "Get rid of the dedicated latch for signaling the startup process." just to fix cfbot; no other changes.
Attachment
Greetings, * Thomas Munro (thomas.munro@gmail.com) wrote: > On Sat, Nov 14, 2020 at 4:13 AM Stephen Frost <sfrost@snowman.net> wrote: > > * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > > > On 11/13/20 3:20 AM, Thomas Munro wrote: > > > > I'm not really sure what to do about achive restore scripts that > > > > block. That seems to be fundamentally incompatible with what I'm > > > > doing here. > > > > > > IMHO we can't do much about that, except for documenting it - if the > > > prefetch can't work because of blocking restore script, someone has to > > > fix/improve the script. No way around that, I'm afraid. > > > > I'm a bit confused about what the issue here is- is the concern that a > > restore_command is specified that isn't allowed to run concurrently but > > this patch is intending to run more than one concurrently..? There's > > another patch that I was looking at for doing pre-fetching of WAL > > segments, so if this is also doing that we should figure out which > > patch we want.. > > The problem is that the recovery loop tries to look further ahead in > between applying individual records, which causes the restore script > to run, and if that blocks, we won't apply records that we already > have, because we're waiting for the next WAL file to appear. This > behaviour is on by default with my patch, so pg_standby will introduce > a weird replay delays. We could think of some ways to fix that, with > meaningful return codes and periodic polling or something, I suppose, > but something feels a bit weird about it. Ah, yeah, that's clearly an issue that should be addressed. There's a nearby thread which is talking about doing exactly that, so, perhaps this doesn't need to be worried about here..? > > I don't know that it's needed, but it feels likely that we could provide > > a better result if we consider making changes to the restore_command API > > (eg: have a way to say "please fetch this many segments ahead, and you > > can put them in this directory with these filenames" or something). I > > would think we'd be able to continue supporting the existing API and > > accept that it might not be as performant. > > Hmm. Every time I try to think of a protocol change for the > restore_command API that would be acceptable, I go around the same > circle of thoughts about event flow and realise that what we really > need for this is ... a WAL receiver... A WAL receiver, or an independent process which goes out ahead and fetches WAL..? Still, I wonder about having a way to inform the command that's run by the restore_command of what it is we really want, eg: restore_command = 'somecommand --async=%a --target=%t --target-name=%n --target-xid=%x --target-lsn=%l --target-timeline=%i --dest-dir=%d' Such that '%a' is either yes, or no, indicating if the restore command should operate asynchronously and pre-fetch WAL, %t is either empty (or maybe 'unset') or 'immediate', %n/%x/%l are similar to %t, %i is either a specific timeline or 'immediate' (somecommand should be understanding of timelines and know how to parse history files to figure out the right timeline to fetch along, based on the destination requested), and %d is a directory for somecommand to place WAL files into (perhaps with an alternative naming scheme, if we feel we need one). The amount of pre-fetching which 'somecommand' would do, and how many processes it would use to do so, could either be configured as part of the options passed to 'somecommand', which we would just pass through, or through its own configuration file.
A restore_command which is set but doesn't include a %a or %d or such would be assumed to work in the same manner as today. For my part, at least, I don't think this is really that much of a stretch, to expect a restore_command to be able to pre-populate a directory with WAL files- certainly there's at least one that already does this, even though it doesn't have all the information directly passed to it.. Would be nice if it did. :) Thanks, Stephen
Attachment
On Thu, Nov 19, 2020 at 10:00 AM Stephen Frost <sfrost@snowman.net> wrote: > * Thomas Munro (thomas.munro@gmail.com) wrote: > > Hmm. Every time I try to think of a protocol change for the > > restore_command API that would be acceptable, I go around the same > > circle of thoughts about event flow and realise that what we really > > need for this is ... a WAL receiver... > > A WAL receiver, or an independent process which goes out ahead and > fetches WAL..? What I really meant was: why would you want this over streaming rep? I just noticed this thread proposing to retire pg_standby on that basis: https://www.postgresql.org/message-id/flat/20201029024412.GP5380%40telsasoft.com I'd be happy to see that land, to fix this problem with my plan. But are there other people writing restore scripts that block that would expect them to work on PG14?
Greetings, * Thomas Munro (thomas.munro@gmail.com) wrote: > On Thu, Nov 19, 2020 at 10:00 AM Stephen Frost <sfrost@snowman.net> wrote: > > * Thomas Munro (thomas.munro@gmail.com) wrote: > > > Hmm. Every time I try to think of a protocol change for the > > > restore_command API that would be acceptable, I go around the same > > > circle of thoughts about event flow and realise that what we really > > > need for this is ... a WAL receiver... > > > > A WAL receiver, or an independent process which goes out ahead and > > fetches WAL..? > > What I really meant was: why would you want this over streaming rep? I have to admit to being pretty confused as to this question and maybe I'm just not understanding. Why wouldn't change patch be helpful for streaming replication too..? If I follow correctly, this patch will scan ahead in the WAL and let the kernel know that certain blocks will be needed soon. Ideally, though I don't think it does yet, we'd only do that for blocks that aren't already in shared buffers, and only for non-FPIs (even better if we can skip past pages for which we already, recently, passed an FPI). The biggest caveat here, it seems to me anyway, is that for this to actually help you need to be running with checkpoints that are larger than shared buffers, as otherwise all the pages we need will be in shared buffers already, thanks to FPIs bringing them in, except when running with hot standby, right? In the hot standby case, other random pages could be getting pulled in to answer user queries and therefore this would be quite helpful to minimize the amount of time required to replay WAL, I would think. Naturally, this isn't very interesting if we're just always able to keep up with the primary, but that's certainly not always the case. > I just noticed this thread proposing to retire pg_standby on that > basis: > > https://www.postgresql.org/message-id/flat/20201029024412.GP5380%40telsasoft.com > > I'd be happy to see that land, to fix this problem with my plan. But > are there other people writing restore scripts that block that would > expect them to work on PG14? Ok, I think I finally get the concern that you're raising here- basically that if a restore command was written to sit around and wait for WAL segments to arrive, instead of just returning to PG and saying "WAL segment not found", that this would be a problem if we are running out ahead of the applying process and asking for WAL. The thing is- that's an outright broken restore command script in the first place. If PG is in standby mode, we'll ask again if we get an error result indicating that the WAL file wasn't found. The restore command documentation is quite clear on this point: The command will be asked for file names that are not present in the archive; it must return nonzero when so asked. There's no "it can wait around for the next file to show up if it wants to" in there- it *must* return nonzero when asked for files that don't exist. So, I don't think that we really need to stress over this. The fact that pg_standby offers options to have it wait instead of just returning a non-zero error-code and letting the loop that we already do in the core code seems like it's really just a legacy thing from before we were doing that and probably should have been ripped out long ago... Even more reason to get rid of pg_standby tho, imv, we haven't been properly adjusting it when we've been making changes to the core code, it seems. Thanks, Stephen
Attachment
Hi, On 2020-12-04 13:27:38 -0500, Stephen Frost wrote: > If I follow correctly, this patch will scan ahead in the WAL and let > the kernel know that certain blocks will be needed soon. Ideally, > though I don't think it does yet, we'd only do that for blocks that > aren't already in shared buffers, and only for non-FPIs (even better if > we can skip past pages for which we already, recently, passed an FPI). The patch uses PrefetchSharedBuffer(), which only initiates a prefetch if the page isn't already in s_b. And once we have AIO, it can actually initiate IO into s_b at that point, rather than fetching it just into the kernel page cache. Greetings, Andres Freund
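To illustrate the behaviour Andres describes, here is a simplified sketch. It is not the actual bufmgr.c code: already_in_shared_buffers() is a hypothetical stand-in for the buffer mapping lookup, and the real PrefetchSharedBuffer() also deals with partition locks and returns a result describing what it did.

  /*
   * Simplified sketch: only issue a prefetch hint when the block is not
   * already present in shared buffers.
   */
  static void
  prefetch_unless_cached(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno)
  {
      if (already_in_shared_buffers(reln, forknum, blkno))
          return;                 /* nothing to do; no syscall issued */

      /* Not in shared buffers: hint the kernel (posix_fadvise WILLNEED). */
      smgrprefetch(reln, forknum, blkno);
  }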
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2020-12-04 13:27:38 -0500, Stephen Frost wrote: > > If I follow correctly, this patch will scan ahead in the WAL and let > > the kernel know that certain blocks will be needed soon. Ideally, > > though I don't think it does yet, we'd only do that for blocks that > > aren't already in shared buffers, and only for non-FPIs (even better if > > we can skip past pages for which we already, recently, passed an FPI). > > The patch uses PrefetchSharedBuffer(), which only initiates a prefetch > if the page isn't already in s_b. Great, glad that's already been addressed in this, that's certainly good. I think I knew that and forgot it while composing that response over the past rather busy week. :) > And once we have AIO, it can actually initiate IO into s_b at that > point, rather than fetching it just into the kernel page cache. Sure. Thanks, Stephen
Attachment
Thomas wrote: > Here's a rebase over the recent commit "Get rid of the dedicated latch for > signaling the startup process." just to fix cfbot; no other changes. I wanted to contribute my findings - after dozens of various lengthy runs here - so far with WAL (asynchronous) recovery performance in the hot-standby case. TL;DR: this patch is awesome even on NVMe 😉 This email covers a little bit larger topic than the prefetching patch itself, but I did not want to lose context. Maybe it'll help somebody in operations or just add to the general pool of knowledge amongst hackers here; maybe all of this stuff was already known to you. My plan is to leave it here like that, as I'm probably lacking the understanding, time, energy and ideas for how to tweak it more.

SETUP AND TEST:
---------------

There might be many different workloads, however I've only concentrated on a single one, namely INSERT .. SELECT of 100 rows - one that was predictable enough for me, quite generic, and allows uncovering some deterministic hotspots. The result is that in such a workload it is possible to replicate ~750Mbit/s of small-rows traffic in stable conditions (catching up is a different matter).

- two i3.4xlarge AWS VMs with 14devel, see [0] for specs. 14devel already contains major optimizations of reducing lseeks() and SLRU CLOG flushing [1]
- WIP WAL prefetching [2] by Thomas Munro applied, v14_000[12345] patches; especially v14_0005 is important here as it reduces dynahash calls.
- FPWs were disabled to avoid hitting >2.5Gbps traffic spikes
- hash_search_with_hash_value_memcmpopt() is my very poor man's copycat optimization of dynahash.c's hash_search_with_hash_value() to avoid indirect function calls to match() [3]
- VDSO clock_gettime() just-in-case fix on AWS, tsc for clocksource0 instead of "xen", OR one could use track_io_timing=off to reduce syscalls

Primary tuning: in order to reliably measure standby WAL recovery performance, one needs to set up a *STABLE* generator over time/size on the primary. In my case it was 2 indexes and 1 table: pgbench -n -f inserts.pgb -P 1 -T 86400 -c 16 -j 2 -R 4000 --latency-limit=50 db.

VFS-CACHE-FITTING WORKLOAD @ 4k TPS:
------------------------------------

create sequence s1;
create table tid (id bigint primary key, j bigint not null, blah text not null) partition by hash (id);
create index j_tid on tid (j); -- to put some more realistic stress
create table tid_h1 partition of tid FOR VALUES WITH (MODULUS 16, REMAINDER 0);
[..]
create table tid_h16 partition of tid FOR VALUES WITH (MODULUS 16, REMAINDER 15);

The clients (-c 16) need to be aligned with the hash partitioning to avoid LWLock/BufferContent. inserts.pgb was looking like: insert into tid select nextval('s1'), g, 'some garbage text' from generate_series(1,100) g. The sequence is of key importance here. "g" is more or less randomly hitting here (the j_tid btree might grow quite a bit on the standby too). Additionally, due to drops on the primary, I've disabled fsync as a stopgap measure because, at least to my understanding, I was affected by global freezes of my insertion workload due to Lock/extend, as one of the sessions was always in: mdextend() -> register_dirty_segment() -> RegisterSyncRequest() (fsync pg_usleep 0.01s), which caused frequent dips of performance even at the beginning (visible thanks to pgbench -P 1) and I wanted something completely linear. The fsync=off was simply a shortcut just in order to measure stuff properly on the standby (I needed this deterministic "producer").
The WAL recovery is not really single threaded, thanks to prefetches with posix_fadvise() - performed by other (?) CPUs/kernel threads I suppose - CLOG flushing by the checkpointer, and the bgwriter itself. The walsender/walreceiver were not the bottlenecks, but the bgwriter and checkpointer need to be really tuned on the *standby* side too. So, the above workload is CPU bound on the standby side for a long time. I would classify it as "standby-recovery-friendly" as the IO-working-set of the main redo loop does NOT degrade over time/dbsize that much, so there is no lag till a certain point. In order to classify the startup/recovery process one could use the recent pidstat(1) -d "iodelay" metric. If one gets a stable >= 10 centiseconds over more than a few seconds, then one probably has an I/O-driven bottleneck. If iodelay==0 then it is a completely VFS-cached I/O workload. In such a setup, the primary can generate - without hiccups - 6000-6500 TPS (insert 100 rows) @ ~25% CPU util using 16 DB sessions. Of course it could push more, but we are using pgbench throttling. The standby can follow up to @ ~4000 TPS on the primary without lag (@ 4500 TPS it was having some lag even at the start). The startup/recovering process gets into 95% CPU utilization territory with ~300k (?) hash_search_with_hash_value_memcmpopt() executions per second (measured using perf-probe). The shorter the WAL record, the more CPU-bound the WAL recovery performance is going to be. In my case ~220k WAL records @ 16MB WAL segment, and I was running at a stable 750Mbit/s. What is important - at least on my HW - is that due to dynahash there's a hard limit of ~300..400k WAL records/s (perf probe/stat reports that I'm getting 300k hash_search_with_hash_value_memcmpopt()/s, while my workload is 4k [rate] * 100 [rows] * 3 [table + 2 indexes] = 400k/s and no lag, a discrepancy that I admit I do not understand; maybe it's Thomas's recent_buffer_fastpath from the v14_0005 prefetcher). On some other OLTP production systems I've seen 10k..120k WAL records/16MB segment. The perf picture looks like the one in [4]. The "tidseq-*" graphs are about this scenario. One could say that with a smaller number of bigger rows one could push more on the network, and that's true, however unrealistic in real-world systems (again with FPW=off, I was able to push up to @ 2.5Gbit/s stable without lag, but at half the rate and with much bigger rows - ~270 WAL records/16MB segment and the primary being the bottleneck). The top#1 CPU function was, quite unexpectedly, again the BufTableLookup() -> hash_search_with_hash_value_memcmpopt(), even at such a relatively low record rate, which illustrates that even with a lot of bigger memcpy()s being done by recovery, those are not the problem as one would typically expect.

VFS-CACHE-MISSES WORKLOAD @ 1.5k TPS:
-------------------------------------

Interesting behavior is that for a very similar data-loading scheme as described above, but with a uuid PK and uuid_generate_v4() *random* UUIDs (a pretty common pattern amongst developers) instead of a bigint sequence, so something very similar to the above, like:

create table trandomuuid (id uuid primary key, j bigint not null, t text not null) partition by hash (id);

... the picture radically changes if the active-working-I/O-set doesn't fit the VFS cache and it's I/O bound on the recovery side (again, this is with prefetching already).
This can be checked via iodelay: if it goes, let's say, >= 10-20 centiseconds, or BCC's cachetop(1) shows a "relatively low" READ_HIT% for recovering (poking at it, it was ~40-45% in my case when recovery started to be really I/O heavy):

DBsize@112GB, 1s sample:
13:00:16 Buffers MB: 200 / Cached MB: 88678 / Sort: HITS / Order: descending
PID      UID       CMD        HITS     MISSES   DIRTIES  READ_HIT%  WRITE_HIT%
1849     postgres  postgres   160697   67405    65794    41.6%      1.2%    -- recovering
1853     postgres  postgres   37011    36864    24576    16.8%      16.6%   -- walreceiver
1851     postgres  postgres   15961    13968    14734    4.1%       0.0%    -- bgwriter

On 128GB RAM, when DB size gets near the ~80-90GB boundary (128 - 32 for huge pages - $binaries - $kernel - $etc =~ 90GB free page cache), SOMETIMES in my experiments it started getting lag, but at the same time even the primary cannot keep up at a rate of 1500 TPS (IO/DataFileRead|Write may happen, or still Lock/extend) and struggles; of course this is well known behavior [5]. Also, at this almost-pathological INSERT rate, pgstat_bgwriter.buffers_backend was like 90% of buffers_alloc and I couldn't do much of anything about it (small s_b on primary, tuning bgwriter settings to the max, even with the bgwriter_delay=0 hack, BM_MAX_USAGE_COUNT=1). Any suggestion on how to make such a $workload deterministic after a certain DBsize under pgbench -P1 is welcome :) So, in order to deterministically - in multiple runs - demonstrate the impact of WAL prefetching by Thomas in such a scenario (where the primary was the bottleneck itself), see the "trandomuuid-*" graphs; one of the graphs has the same commentary as here:

- the system is running with WAL prefetching disabled (maintenance_io_concurrency=0)
- once the DBsize is >85-90GB the primary cannot keep up, so there's a drop in data produced - rxNET KB/s. At this stage I did echo 3 > drop_caches to shock the system (there's a very small jump of lag, but it goes to 0 again -- good, the standby can still manage)
- once the DBsize got near ~275GB the standby couldn't follow even the choked primary (lag starts rising to >3000s, IOdelay indicates that startup/recovering is wasting like 70% of its time on synchronous preads())
- at DBsize ~315GB I set maintenance_io_concurrency=10 (enabling the WAL prefetching/posix_fadvise()); lag starts dropping, IOdelay is reduced to ~53, and %CPU (not %sys) of the process jumps from 28% -> 48% (efficiency grows)
- at DBsize ~325GB I set maintenance_io_concurrency=128 (giving the kernel more time to pre-read for us); lag starts dropping even faster, IOdelay is reduced to ~30, and the %CPU part (not %sys) of the process jumps from 48% -> 70% (its efficiency grows again, 2.5x more than baseline)

Another interesting observation is that the standby's bgwriter is much more stressed and important than the recovery itself, and several times more active than the one on the primary. I've rechecked using Tomas Vondra's sequuid extension [6] and of course the problem doesn't exist if the UUIDs are not that random (much more localized, so this small workload adjustment makes it behave like the "VFS-CACHE-fitting" scenario). Also, just in case, for the patch review process: I can confirm that data inserted on primary and standby did match on multiple occasions (sums of columns) after those tests (some of which were run up to the 3TB mark).

Random thoughts:
----------------

1) Even with all those optimizations - I/O prefetching (posix_fadvise()) or even IO_URING in the future - there's going to be the BufTableLookup()->dynahash single-threaded CPU limitation bottleneck.
It may be that with IO_URING in the future and proper HW, all workloads will start to be CPU-bound on the standby ;) I do not see a simple way to optimize such a fundamental pillar - other than parallelizing it? I hope I'm wrong.

1b) With the above patches I need to disappoint Alvaro Herrera: I was unable to reproduce the top#1 smgropen() -> hash_search_with_hash_value() in any way, as I think right now v14_0005 simply kind of solves the problem.

2) I'm kind of thinking that flushing dirty pages on the standby should be much more aggressive than on the primary, in order to unlock the startup/recovering potential. What I'm trying to say is that it might even be beneficial to spot if FlushBuffer() is happening too often from inside the main redo recovery loop, and if it is, then issue a LOG/HINT from time to time (similar to the famous "checkpoints are occurring too frequently") to tune the background writer on the standby or investigate the workload itself on the primary. Generally speaking, those "bgwriter/checkpointer" GUCs might be kind of artificial in the standby-processing scenario.

3) The WAL recovery could (?) have some protection from noisy neighboring backends. As the hot standby is often used in read offload configurations, it could be important to protect its VFS cache (active, freshly replicated data needed for WAL recovery) from being polluted by some other backends issuing random SQL SELECTs.

4) Even for scenarios with COPY/heap_multi_insert()-based statements, a lot of interleaved Btree/INSERT_LEAF records are emitted that are CPU heavy if the table is indexed.

6) I don't think walsender/walreceiver are in any danger right now, as at least in my case they had plenty of headroom (even @ 2.5Gbps the walreceiver was ~30-40% CPU) while issuing I/O writes of 8kB (but this was with fsync=off and on NVMe). Walsender was in even better shape, mainly due to sendto(128kB). YMMV.

7) As the uuid-osp extension is present in contrib and T.V.'s sequential-uuids unfortunately is NOT, developers, more often than not, might run into those pathological scenarios. The same applies to any cloud-hosted database where one cannot deploy his own extensions.

What was not tested and what are further research questions:
-----------------------------------------------------------

a) Impact of vacuum WAL records: I suspect it might be the additional vacuum-generated workload added to the mix during the VFS-cache-fitting workload that overwhelmed the recovery loop so that it started catching lag.

b) Impact of noisy-neighboring-SQL queries on hot standby:
b1) research the impact of LWLock buffer_mapping contention between readers and the recovery itself.
b2) research/experiments maybe with cgroups2 VFS-cache memory isolation for processes.

c) Impact of WAL prefetching's "maintenance_io_concurrency" VS iodelay for startup/recovering preads() is also unknown. The key question there is how far ahead to issue those posix_fadvise() calls so that pread() is nearly free. Some I/O calibration tool to set maintenance_io_concurrency would be nice.

-J.

[0] - specs: 2x AWS i3.4xlarge (1s8c16t, 128GB RAM, Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz), 2x NVMe in lvm striped VG, ext4. Tuned parameters: bgwriter_*, s_b=24GB with huge pages, checkpoint_completion_target=0.9, commit_delay=100000, commit_siblings=20, synchronous_commit=off, fsync=off, max_wal_size=40GB, recovery_prefetch=on, track_io_timing=on, wal_block_size=8192 (default), wal_decode_buffer_size=512kB (default WIP WAL prefetching), wal_buffers=256MB. Schema was always 16-way hash-partitioned to avoid LWLock/BufferContent waits.
[1] - https://www.postgresql.org/message-id/flat/CA%2BhUKGLJ%3D84YT%2BNvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ%40mail.gmail.com
[2] - https://commitfest.postgresql.org/31/2410/
[3] - hash_search_with_hash_value() spends a lot of time near "callq *%r14" in tight loop assembly in my case (indirect call to the hash comparison function). This hash_search_with_hash_value_memcmpopt() is just a copycat function that instead directly calls memcmp() where it matters (smgr.c, buf_table.c). A blind shot at gcc's -flto also didn't help to gain a lot there (I was thinking it could optimize it by building many instances of hash_search_with_hash_value, one per match() use, but no). I did not quantify the benefit; I think it's just a failed optimization experiment, as it is still top#1 in my profiles, and it could even be noise.
[4] - 10s perf image of CPU-bound 14devel with all the mentioned patches:

17.38% postgres postgres [.] hash_search_with_hash_value_memcmpopt
        ---hash_search_with_hash_value_memcmpopt
           |--11.16%--BufTableLookup
           |          |--9.44%--PrefetchSharedBuffer
           |          |         XLogPrefetcherReadAhead
           |          |         StartupXLOG
           |          --1.72%--ReadBuffer_common
           |                    ReadBufferWithoutRelcache
           |                    XLogReadBufferExtended
           |                    --1.29%--XLogReadBufferForRedoExtended
           |                              --0.64%--XLogInitBufferForRedo
           |--3.86%--smgropen
           |          |--2.79%--XLogPrefetcherReadAhead
           |          |         StartupXLOG
           |          --0.64%--XLogReadBufferExtended
           --2.15%--XLogPrefetcherReadAhead
                     StartupXLOG

10.30% postgres postgres [.] MarkBufferDirty
        ---MarkBufferDirty
           |--5.58%--btree_xlog_insert
           |         btree_redo
           |         StartupXLOG
           --4.72%--heap_xlog_insert

 6.22% postgres postgres [.] ReadPageInternal
        ---ReadPageInternal
           XLogReadRecordInternal
           XLogReadAhead
           XLogPrefetcherReadAhead
           StartupXLOG

 5.36% postgres postgres [.] hash_bytes
        ---hash_bytes
           |--3.86%--hash_search_memcmpopt

[5] - https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-page-writes/
      https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/
      https://www.2ndquadrant.com/en/blog/sequential-uuid-generators-ssd/
[6] - https://github.com/tvondra/sequential-uuids
Attachment
- tidseq-without-FPW-4kTPS_cpuOverSize.csv.png
- tidseq-without-FPW-4kTPS_iodelayOverSize.csv.png
- tidseq-without-FPW-4kTPS_lagOverSize.csv.png
- tidseq-without-FPW-4kTPS_lagOverTime.csv.png
- trandomuuids-without-FPW-1.5kTPS_cpuOverSize.csv.png
- trandomuuids-without-FPW-1.5kTPS_iodelayOverSize.csv.png
- trandomuuids-without-FPW-1.5kTPS_lagOverSize.csv.png
- trandomuuids-without-FPW-1.5kTPS_lagOverSize.csv-comments.png
- trandomuuids-without-FPW-1.5kTPS_lagOverTime.csv.png
On Sat, Dec 12, 2020 at 1:24 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote: > I wanted to contribute my findings - after dozens of various lengthy runs here - so far with WAL (asynchronous) recoveryperformance in the hot-standby case. TL;DR; this patch is awesome even on NVMe Thanks Jakub! Some interesting, and nice, results. > The startup/recovering gets into CPU 95% utilization territory with ~300k (?) hash_search_with_hash_value_memcmpopt() executionsper second (measured using perf-probe). I suppose it's possible that this is caused by memory stalls that could be improved by teaching the prefetching pipeline to prefetch the relevant cachelines of memory (but it seems like it should be a pretty microscopic concern compared to the I/O). > [3] - hash_search_with_hash_value() spends a lot of time near "callq *%r14" in tight loop assembly in my case (indirectcall to hash comparision function). This hash_search_with_hash_value_memcmpopt() is just copycat function and insteaddirectly calls memcmp() where it matters (smgr.c, buf_table.c). Blind shot at gcc's -flto also didn't help to gaina lot there (I was thinking it could optimize it by building many instances of hash_search_with_hash_value of per-match()use, but no). I did not quantify the benefit, I think it just failed optimization experiment, as it is still top#1in my profiles, it could be even noise. Nice. A related specialisation is size (key and object). Of course, simplehash.h already does that, but it also makes some other choices that make it unusable for the buffer mapping table. So I think that we should either figure out how to fix that, or consider specialising the dynahash lookup path with a similar template scheme. Rebase attached.
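For readers unfamiliar with the simplehash.h template scheme mentioned above, this is roughly how a specialised table is instantiated. It is only the general pattern - the element type, key and hash function here are made up, and this is not a proposal for the buffer mapping table itself.

  /* Illustrative only: a made-up element type for a simplehash instantiation. */
  typedef struct DemoEntry
  {
      uint32      key;            /* hash key */
      uint32      value;
      char        status;         /* required by simplehash.h */
  } DemoEntry;

  #define SH_PREFIX       demo
  #define SH_ELEMENT_TYPE DemoEntry
  #define SH_KEY_TYPE     uint32
  #define SH_KEY          key
  #define SH_HASH_KEY(tb, key)    hash_bytes_uint32(key)
  #define SH_EQUAL(tb, a, b)      ((a) == (b))
  #define SH_SCOPE        static inline
  #define SH_DECLARE
  #define SH_DEFINE
  #include "lib/simplehash.h"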
Attachment
Hi, On 2020-12-24 16:06:38 +1300, Thomas Munro wrote: > From 85187ee6a1dd4c68ba70cfbce002a8fa66c99925 Mon Sep 17 00:00:00 2001 > From: Thomas Munro <thomas.munro@gmail.com> > Date: Sat, 28 Mar 2020 11:42:59 +1300 > Subject: [PATCH v15 1/6] Add pg_atomic_unlocked_add_fetch_XXX(). > > Add a variant of pg_atomic_add_fetch_XXX with no barrier semantics, for > cases where you only want to avoid the possibility that a concurrent > pg_atomic_read_XXX() sees a torn/partial value. On modern > architectures, this is simply value++, but there is a fallback to > spinlock emulation. Wouldn't it be sufficient to implement this as one function, implemented as pg_atomic_write_u32(val, pg_atomic_read_u32(val) + 1)? Then we'd not need any ifdefs. > + * pg_atomic_unlocked_add_fetch_u32 - atomically add to variable It's really not adding "atomically"... > + * Like pg_atomic_unlocked_write_u32, guarantees only that partial values > + * cannot be observed. Maybe add a note saying that that in particular means that modifications could be lost when used concurrently? Greetings, Andres Freund
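For concreteness, a minimal sketch of that suggestion - not the patch's code, and whether these semantics are sufficient is exactly what's being discussed here:

  /*
   * Non-atomic add that only guarantees readers never see a torn value.
   * Concurrent adders could lose updates, which is the caveat raised above.
   */
  static inline uint32
  pg_atomic_unlocked_add_fetch_u32(volatile pg_atomic_uint32 *ptr, uint32 add_)
  {
      uint32      newval = pg_atomic_read_u32(ptr) + add_;

      pg_atomic_write_u32(ptr, newval);
      return newval;
  }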
On Sat, Dec 5, 2020 at 7:27 AM Stephen Frost <sfrost@snowman.net> wrote: > * Thomas Munro (thomas.munro@gmail.com) wrote: > > I just noticed this thread proposing to retire pg_standby on that > > basis: > > > > https://www.postgresql.org/message-id/flat/20201029024412.GP5380%40telsasoft.com > > > > I'd be happy to see that land, to fix this problem with my plan. But > > are there other people writing restore scripts that block that would > > expect them to work on PG14? > > Ok, I think I finally get the concern that you're raising here- > basically that if a restore command was written to sit around and wait > for WAL segments to arrive, instead of just returning to PG and saying > "WAL segment not found", that this would be a problem if we are running > out ahead of the applying process and asking for WAL. > > The thing is- that's an outright broken restore command script in the > first place. If PG is in standby mode, we'll ask again if we get an > error result indicating that the WAL file wasn't found. The restore > command documentation is quite clear on this point: > > The command will be asked for file names that are not present in the > archive; it must return nonzero when so asked. > > There's no "it can wait around for the next file to show up if it wants > to" in there- it *must* return nonzero when asked for files that don't > exist. Well the manual does actually describe how to write your own version of pg_standby, referred to as a "waiting restore script": https://www.postgresql.org/docs/13/log-shipping-alternative.html I've now poked that other thread threatening to commit the removal of pg_standby, and while I was there, also to remove the section on how to write your own (it's possible that I missed some other reference to the concept elsewhere, I'll need to take another look). > So, I don't think that we really need to stress over this. The fact > that pg_standby offers options to have it wait instead of just returning > a non-zero error-code and letting the loop that we already do in the > core code seems like it's really just a legacy thing from before we were > doing that and probably should have been ripped out long ago... Even > more reason to get rid of pg_standby tho, imv, we haven't been properly > adjusting it when we've been making changes to the core code, it seems. So far I haven't heard from anyone who thinks we should keep this old facility (as useful as it was back then when it was the only way), so I hope we can now quietly drop it. It's not strictly an obstacle to this recovery prefetching work, but it'd interact confusingly in hard to describe ways, and it seems strange to perpetuate something that many were already proposing to drop due to obsolescence. Thanks for the comments/sanity check.
Hi, I did a bunch of tests on v15, mostly to assess how much the prefetching could help. The most interesting test I did was this:

1) primary instance on a box with 16/32 cores, 64GB RAM, NVMe SSD
2) replica on a small box with 4 cores, 8GB RAM, SSD RAID
3) pause replication on the replica (pg_wal_replay_pause)
4) initialize pgbench scale 2000 (fits into RAM on the primary, while on the replica it's about 4x RAM)
5) run 1h pgbench: pgbench -N -c 16 -j 4 -T 3600 test
6) resume replication (pg_wal_replay_resume)
7) measure how long it takes to catch up, monitor lag

This is a nicely reproducible test case; it eliminates the influence of network speed and so on. Attached is a chart showing the lag with and without the prefetching. In both cases we start with ~140GB of redo lag, and the chart shows how quickly the replica applies that. The "waves" are checkpoints, where right after a checkpoint the redo gets much faster thanks to FPIs and then slows down as it gets to parts without them (having to do synchronous random reads). With master, it'd take ~16000 seconds to catch up. I don't have the exact number, because I got tired of waiting, but the estimate is likely accurate (judging by other tests and how regular the progress is). With WAL prefetching enabled (I bumped up the buffer to 2MB, and the prefetch limit to 500, but that was mostly just an arbitrary choice), it finishes in ~3200 seconds. This includes replication of the pgbench initialization, which took ~200 seconds and where prefetching is mostly useless. That's a damn pretty improvement, I guess! In a way, this means the tiny replica would be able to keep up with a much larger machine, where everything is in memory. One comment about the patch - the postgresql.conf.sample change says:

#recovery_prefetch = on # whether to prefetch pages logged with FPW
#recovery_prefetch_fpw = off # whether to prefetch pages logged with FPW

but clearly that comment is only for recovery_prefetch_fpw; the first GUC enables prefetching in general. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Thu, Feb 4, 2021 at 1:40 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > With master, it'd take ~16000 seconds to catch up. I don't have the > exact number, because I got tired of waiting, but the estimate is likely > accurate (judging by other tests and how regular the progress is). > > With WAL prefetching enabled (I bumped up the buffer to 2MB, and > prefetch limit to 500, but that was mostly just arbitrary choice), it > finishes in ~3200 seconds. This includes replication of the pgbench > initialization, which took ~200 seconds and where prefetching is mostly > useless. That's a damn pretty improvement, I guess! Hi Tomas, Sorry for my slow response -- I've been catching up after some vacation time. Thanks very much for doing all this testing work! Those results are very good, and it's nice to see such compelling cases even with FPI enabled. I'm hoping to commit this in the next few weeks. There are a few little todos to tidy up, and I need to do some more review/testing of the error handling and edge cases. Any ideas on how to battle test it are very welcome. I'm also currently testing how it interacts with some other patches that are floating around. More soon. > #recovery_prefetch = on # whether to prefetch pages logged with FPW > #recovery_prefetch_fpw = off # whether to prefetch pages logged with FPW > > but clearly that comment is only for recovery_prefetch_fpw, the first > GUC enables prefetching in general. Ack, thanks.
Greetings, * Thomas Munro (thomas.munro@gmail.com) wrote: > Rebase attached. > Subject: [PATCH v15 4/6] Prefetch referenced blocks during recovery. > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml > index 4b60382778..ac27392053 100644 > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -3366,6 +3366,64 @@ include_dir 'conf.d' [...] > + <varlistentry id="guc-recovery-prefetch-fpw" xreflabel="recovery_prefetch_fpw"> > + <term><varname>recovery_prefetch_fpw</varname> (<type>boolean</type>) > + <indexterm> > + <primary><varname>recovery_prefetch_fpw</varname> configuration parameter</primary> > + </indexterm> > + </term> > + <listitem> > + <para> > + Whether to prefetch blocks that were logged with full page images, > + during recovery. Often this doesn't help, since such blocks will not > + be read the first time they are needed and might remain in the buffer The "might" above seems slightly confusing- such blocks will remain in shared buffers until/unless they're forced out, right? > + pool after that. However, on file systems with a block size larger > + than > + <productname>PostgreSQL</productname>'s, prefetching can avoid a > + costly read-before-write when a blocks are later written. > + The default is off. "when a blocks" above doesn't sound quite right, maybe reword this as: "prefetching can avoid a costly read-before-write when WAL replay reaches the block that needs to be written." > diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml > index d1c3893b14..c51c431398 100644 > --- a/doc/src/sgml/wal.sgml > +++ b/doc/src/sgml/wal.sgml > @@ -720,6 +720,23 @@ > <acronym>WAL</acronym> call being logged to the server log. This > option might be replaced by a more general mechanism in the future. > </para> > + > + <para> > + The <xref linkend="guc-recovery-prefetch"/> parameter can > + be used to improve I/O performance during recovery by instructing > + <productname>PostgreSQL</productname> to initiate reads > + of disk blocks that will soon be needed but are not currently in > + <productname>PostgreSQL</productname>'s buffer pool. > + The <xref linkend="guc-maintenance-io-concurrency"/> and > + <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching > + concurrency and distance, respectively. The > + prefetching mechanism is most likely to be effective on systems > + with <varname>full_page_writes</varname> set to > + <varname>off</varname> (where that is safe), and where the working > + set is larger than RAM. By default, prefetching in recovery is enabled > + on operating systems that have <function>posix_fadvise</function> > + support. > + </para> > </sect1> > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c > @@ -3697,7 +3699,6 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli, > snprintf(activitymsg, sizeof(activitymsg), "waiting for %s", > xlogfname); > set_ps_display(activitymsg); > - > restoredFromArchive = RestoreArchivedFile(path, xlogfname, > "RECOVERYXLOG", > wal_segment_size, > @@ -12566,6 +12585,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, > else > havedata = false; > } > + > if (havedata) > { > /* Random whitespace change hunks..? > diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c > + * The size of the queue is based on the maintenance_io_concurrency > + * setting. 
In theory we might have a separate queue for each tablespace, > + * but it's not clear how that should work, so for now we'll just use the > + * general GUC to rate-limit all prefetching. The queue has space for up > + * the highest possible value of the GUC + 1, because our circular buffer > + * has a gap between head and tail when full. Seems like "to" is missing- "The queue has space for up *to* the highest possible value of the GUC + 1" ? Maybe also "between the head and the tail when full". > +/* > + * Scan the current record for block references, and consider prefetching. > + * > + * Return true if we processed the current record to completion and still have > + * queue space to process a new record, and false if we saturated the I/O > + * queue and need to wait for recovery to advance before we continue. > + */ > +static bool > +XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher) > +{ > + DecodedXLogRecord *record = prefetcher->record; > + > + Assert(!XLogPrefetcherSaturated(prefetcher)); > + > + /* > + * We might already have been partway through processing this record when > + * our queue became saturated, so we need to start where we left off. > + */ > + for (int block_id = prefetcher->next_block_id; > + block_id <= record->max_block_id; > + ++block_id) > + { > + DecodedBkpBlock *block = &record->blocks[block_id]; > + PrefetchBufferResult prefetch; > + SMgrRelation reln; > + > + /* Ignore everything but the main fork for now. */ > + if (block->forknum != MAIN_FORKNUM) > + continue; > + > + /* > + * If there is a full page image attached, we won't be reading the > + * page, so you might think we should skip it. However, if the > + * underlying filesystem uses larger logical blocks than us, it > + * might still need to perform a read-before-write some time later. > + * Therefore, only prefetch if configured to do so. > + */ > + if (block->has_image && !recovery_prefetch_fpw) > + { > + pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1); > + continue; > + } FPIs in the stream aren't going to just avoid reads when the filesystem's block size matches PG's- they're also going to avoid subsequent modifications to the block, provided we don't end up pushing that block out of shared buffers, rights? That is, if you have an empty shared buffers and see: Block 5 FPI Block 6 FPI Block 5 Update Block 6 Update it seems like, with this patch, we're going to Prefetch Block 5 & 6, even though we almost certainly won't actually need them. > + /* Fast path for repeated references to the same relation. */ > + if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode)) > + { > + /* > + * If this is a repeat access to the same block, then skip it. > + * > + * XXX We could also check for last_blkno + 1 too, and also update > + * last_blkno; it's not clear if the kernel would do a better job > + * of sequential prefetching. > + */ > + if (block->blkno == prefetcher->last_blkno) > + { > + pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1); > + continue; > + } I'm sure this will help with some cases, but it wouldn't help with the case that I mention above, as I understand it. > + {"recovery_prefetch", PGC_SIGHUP, WAL_SETTINGS, > + gettext_noop("Prefetch referenced blocks during recovery"), > + gettext_noop("Read ahead of the currenty replay position to find uncached blocks.") extra 'y' at the end of 'current', and "find uncached blocks" might be misleading, maybe: "Read out ahead of the current replay position and prefetch blocks." 
> diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample > index b7fb2ec1fe..4288f2f37f 100644 > --- a/src/backend/utils/misc/postgresql.conf.sample > +++ b/src/backend/utils/misc/postgresql.conf.sample > @@ -234,6 +234,12 @@ > #checkpoint_flush_after = 0 # measured in pages, 0 disables > #checkpoint_warning = 30s # 0 disables > > +# - Prefetching during recovery - > + > +#wal_decode_buffer_size = 512kB # lookahead window used for prefetching > +#recovery_prefetch = on # whether to prefetch pages logged with FPW > +#recovery_prefetch_fpw = off # whether to prefetch pages logged with FPW Think this was already mentioned, but the above comments shouldn't be the same. :) > From 2f6d690cefc0cad8cbd8b88dbed4d688399c6916 Mon Sep 17 00:00:00 2001 > From: Thomas Munro <thomas.munro@gmail.com> > Date: Mon, 14 Sep 2020 23:20:55 +1200 > Subject: [PATCH v15 5/6] WIP: Avoid extra buffer lookup when prefetching WAL > blocks. > > Provide a some workspace in decoded WAL records, so that we can remember > which buffer recently contained we found a block cached in, for later > use when replaying the record. Provide a new way to look up a > recently-known buffer and check if it's still valid and has the right > tag. "Provide a place in decoded WAL records to remember which buffer we found a block cached in, to hopefully avoid having to look it up again when we replay the record. Provide a way to look up a recently-known buffer and check if it's still valid and has the right tag." > XXX Needs review to figure out if it's safe or steamrolling over subtleties ... that's a great question. :) Not sure that I can really answer it conclusively, but I can't think of any reason, given the buffer tag check that's included, that it would be an issue. I'm glad to see this though since it addresses some of the concern about this patch slowing down replay in cases where there are FPIs and checkpoints are less than the size of shared buffers, which seems much more common than cases where FPIs have been disabled and/or checkpoints are larger than SB. Further effort to avoid having likely-unnecessary prefetching done for blocks which recently had an FPI would further reduce the risk of this change slowing down replay for common deployments, though I'm not sure how much of an impact that likely has or what the cost would be to avoid the prefetching (and it's complicated by hot standby, I imagine...). Thanks, Stephen
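To make the quoted commit message a bit more concrete, here is a rough sketch of how redo might use such a remembered buffer. This is not the patch's code: RedoReadBufferWithHint() and the recent_buffer field are invented for illustration, and the exact ReadRecentBuffer() signature is assumed from its description in this thread (re-pin a recently-known buffer if it still has the right tag, otherwise report failure).

    /*
     * Illustrative sketch only; the helper name and the recent_buffer field
     * are hypothetical.  The general shape (try the remembered buffer first,
     * fall back to a normal lookup) is what the quoted commit message
     * describes.
     */
    static Buffer
    RedoReadBufferWithHint(DecodedBkpBlock *block)
    {
        /* Did an earlier lookup or prefetch leave us a hint? */
        if (BufferIsValid(block->recent_buffer) &&
            ReadRecentBuffer(block->rnode, block->forknum, block->blkno,
                             block->recent_buffer))
            return block->recent_buffer;    /* tag still matches, now pinned */

        /* Hint missing or stale: do the usual buffer mapping lookup. */
        return XLogReadBufferExtended(block->rnode, block->forknum,
                                      block->blkno, RBM_NORMAL);
    }

The buffer tag re-check is what makes the hint purely an optimization: if the buffer was recycled for some other page in the meantime, the slow path still produces the correct result.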
On 2/10/21 10:50 PM, Stephen Frost wrote: > > ... > >> +/* >> + * Scan the current record for block references, and consider prefetching. >> + * >> + * Return true if we processed the current record to completion and still have >> + * queue space to process a new record, and false if we saturated the I/O >> + * queue and need to wait for recovery to advance before we continue. >> + */ >> +static bool >> +XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher) >> +{ >> + DecodedXLogRecord *record = prefetcher->record; >> + >> + Assert(!XLogPrefetcherSaturated(prefetcher)); >> + >> + /* >> + * We might already have been partway through processing this record when >> + * our queue became saturated, so we need to start where we left off. >> + */ >> + for (int block_id = prefetcher->next_block_id; >> + block_id <= record->max_block_id; >> + ++block_id) >> + { >> + DecodedBkpBlock *block = &record->blocks[block_id]; >> + PrefetchBufferResult prefetch; >> + SMgrRelation reln; >> + >> + /* Ignore everything but the main fork for now. */ >> + if (block->forknum != MAIN_FORKNUM) >> + continue; >> + >> + /* >> + * If there is a full page image attached, we won't be reading the >> + * page, so you might think we should skip it. However, if the >> + * underlying filesystem uses larger logical blocks than us, it >> + * might still need to perform a read-before-write some time later. >> + * Therefore, only prefetch if configured to do so. >> + */ >> + if (block->has_image && !recovery_prefetch_fpw) >> + { >> + pg_atomic_unlocked_add_fetch_u64(&Stats->skip_fpw, 1); >> + continue; >> + } > > FPIs in the stream aren't going to just avoid reads when the > filesystem's block size matches PG's- they're also going to avoid > subsequent modifications to the block, provided we don't end up pushing > that block out of shared buffers, rights? > > That is, if you have an empty shared buffers and see: > > Block 5 FPI > Block 6 FPI > Block 5 Update > Block 6 Update > > it seems like, with this patch, we're going to Prefetch Block 5 & 6, > even though we almost certainly won't actually need them. > Yeah, that's a good point. I think it'd make sense to keep track of recent FPIs and skip prefetching such blocks. But how exactly should we implement that, how many blocks do we need to track? If you get an FPI, how long should we skip prefetching of that block? I don't think the history needs to be very long, for two reasons. Firstly, the usual pattern is that we have FPI + several changes for that block shortly after it. Secondly, maintenance_io_concurrency limits this naturally - after crossing that, redo should place the FPI into shared buffers, allowing us to skip the prefetch. So I think using maintenance_io_concurrency is sufficient. We might track more buffers to allow skipping prefetches of blocks that were evicted from shared buffers, but that seems like an overkill. However, maintenance_io_concurrency can be quite high, so just a simple queue is not very suitable - searching it linearly for each block would be too expensive. But I think we can use a simple hash table, tracking (relfilenode, block, LSN), over-sized to minimize collisions. Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, and whenever we prefetch a block or find an FPI, we simply add the block to the array as determined by hash(relfilenode, block) hashtable[hash(...)] = {relfilenode, block, LSN} and then when deciding whether to prefetch a block, we look at that one position. 
If the (relfilenode, block) match, we check the LSN and skip the prefetch if it's sufficiently recent. Otherwise we prefetch. We may issue some extra prefetches due to collisions, but that's fine, I think. There should not be very many of them, thanks to having the hash table oversized. The good thing is this is quite a simple, fixed-size data structure; there's no need for allocations etc. >> + /* Fast path for repeated references to the same relation. */ >> + if (RelFileNodeEquals(block->rnode, prefetcher->last_rnode)) >> + { >> + /* >> + * If this is a repeat access to the same block, then skip it. >> + * >> + * XXX We could also check for last_blkno + 1 too, and also update >> + * last_blkno; it's not clear if the kernel would do a better job >> + * of sequential prefetching. >> + */ >> + if (block->blkno == prefetcher->last_blkno) >> + { >> + pg_atomic_unlocked_add_fetch_u64(&Stats->skip_seq, 1); >> + continue; >> + } > > I'm sure this will help with some cases, but it wouldn't help with the > case that I mention above, as I understand it. > It won't, but it's a pretty effective check. I've done some experiments recently, and with random pgbench this eliminates ~15% of prefetches. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
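To make that proposal a bit more concrete, here is a minimal C sketch of such an over-sized, fixed-size table. None of this is from the posted patches: the type and function names are invented, the hash function is an arbitrary stand-in, and the "sufficiently recent" test is reduced to a single LSN comparison against a caller-supplied cutoff, since the exact staleness policy is left open above.

    /*
     * Illustrative sketch, not patch code: a fixed-size array of
     * (2 * maintenance_io_concurrency) slots, indexed by a hash of the block
     * reference, remembering the LSN at which we last prefetched the block
     * or saw an FPI for it.
     */
    typedef struct XLogPrefetcherRecentBlock
    {
        RelFileNode rnode;
        BlockNumber blkno;
        XLogRecPtr  lsn;
    } XLogPrefetcherRecentBlock;

    /* allocated once at startup, e.g. palloc0(recent_size * sizeof(...)) */
    static XLogPrefetcherRecentBlock *recent_blocks;
    static int  recent_size;        /* 2 * maintenance_io_concurrency */

    static inline int
    recent_slot(RelFileNode rnode, BlockNumber blkno)
    {
        /* any cheap mixing function will do for a sketch */
        uint32      h = rnode.dbNode ^ (rnode.relNode * 7919) ^ blkno;

        return (int) (h % recent_size);
    }

    /* Remember that we prefetched this block, or saw an FPI for it, at lsn. */
    static void
    recent_remember(RelFileNode rnode, BlockNumber blkno, XLogRecPtr lsn)
    {
        XLogPrefetcherRecentBlock *slot = &recent_blocks[recent_slot(rnode, blkno)];

        slot->rnode = rnode;
        slot->blkno = blkno;
        slot->lsn = lsn;
    }

    /*
     * Decide whether to skip prefetching a block reference.  "min_lsn" is
     * whatever cutoff counts as "sufficiently recent", for example the LSN
     * of the oldest record still in the lookahead window.
     */
    static bool
    recent_skip(RelFileNode rnode, BlockNumber blkno, XLogRecPtr min_lsn)
    {
        XLogPrefetcherRecentBlock *slot = &recent_blocks[recent_slot(rnode, blkno)];

        return RelFileNodeEquals(slot->rnode, rnode) &&
            slot->blkno == blkno &&
            slot->lsn >= min_lsn;
    }

Collisions simply overwrite the slot, which at worst costs an extra prefetch, consistent with the reasoning above that a few spurious prefetches are acceptable.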
Hi, On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote: > Yeah, that's a good point. I think it'd make sense to keep track of recent > FPIs and skip prefetching such blocks. But how exactly should we implement > that, how many blocks do we need to track? If you get an FPI, how long > should we skip prefetching of that block? > > I don't think the history needs to be very long, for two reasons. Firstly, > the usual pattern is that we have FPI + several changes for that block > shortly after it. Secondly, maintenance_io_concurrency limits this naturally > - after crossing that, redo should place the FPI into shared buffers, > allowing us to skip the prefetch. > > So I think using maintenance_io_concurrency is sufficient. We might track > more buffers to allow skipping prefetches of blocks that were evicted from > shared buffers, but that seems like an overkill. > > However, maintenance_io_concurrency can be quite high, so just a simple > queue is not very suitable - searching it linearly for each block would be > too expensive. But I think we can use a simple hash table, tracking > (relfilenode, block, LSN), over-sized to minimize collisions. > > Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, > and whenever we prefetch a block or find an FPI, we simply add the block to > the array as determined by hash(relfilenode, block) > > hashtable[hash(...)] = {relfilenode, block, LSN} > > and then when deciding whether to prefetch a block, we look at that one > position. If the (relfilenode, block) match, we check the LSN and skip the > prefetch if it's sufficiently recent. Otherwise we prefetch. I'm a bit doubtful this is really needed at this point. Yes, the prefetching will do a buffer table lookup - but it's a lookup that already happens today. And the patch already avoids doing a second lookup after prefetching (by optimistically caching the last Buffer id, and re-checking). I think there's potential for some significant optimization going forward, but I think it's basically optimization over what we're doing today. As this is already a nontrivial patch, I'd argue for doing so separately. Regards, Andres
On 2/12/21 5:46 AM, Andres Freund wrote: > Hi, > > On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote: >> Yeah, that's a good point. I think it'd make sense to keep track of recent >> FPIs and skip prefetching such blocks. But how exactly should we implement >> that, how many blocks do we need to track? If you get an FPI, how long >> should we skip prefetching of that block? >> >> I don't think the history needs to be very long, for two reasons. Firstly, >> the usual pattern is that we have FPI + several changes for that block >> shortly after it. Secondly, maintenance_io_concurrency limits this naturally >> - after crossing that, redo should place the FPI into shared buffers, >> allowing us to skip the prefetch. >> >> So I think using maintenance_io_concurrency is sufficient. We might track >> more buffers to allow skipping prefetches of blocks that were evicted from >> shared buffers, but that seems like an overkill. >> >> However, maintenance_io_concurrency can be quite high, so just a simple >> queue is not very suitable - searching it linearly for each block would be >> too expensive. But I think we can use a simple hash table, tracking >> (relfilenode, block, LSN), over-sized to minimize collisions. >> >> Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, >> and whenever we prefetch a block or find an FPI, we simply add the block to >> the array as determined by hash(relfilenode, block) >> >> hashtable[hash(...)] = {relfilenode, block, LSN} >> >> and then when deciding whether to prefetch a block, we look at that one >> position. If the (relfilenode, block) match, we check the LSN and skip the >> prefetch if it's sufficiently recent. Otherwise we prefetch. > > I'm a bit doubtful this is really needed at this point. Yes, the > prefetching will do a buffer table lookup - but it's a lookup that > already happens today. And the patch already avoids doing a second > lookup after prefetching (by optimistically caching the last Buffer id, > and re-checking). > > I think there's potential for some significant optimization going > forward, but I think it's basically optimization over what we're doing > today. As this is already a nontrivial patch, I'd argue for doing so > separately. > I agree with treating this as an improvement - it's not something that needs to be solved in the first version. OTOH I think Stephen has a point that just skipping FPIs like we do now has limited effect, because the WAL usually contains additional changes to the same block. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote: > > Yeah, that's a good point. I think it'd make sense to keep track of recent > > FPIs and skip prefetching such blocks. But how exactly should we implement > > that, how many blocks do we need to track? If you get an FPI, how long > > should we skip prefetching of that block? > > > > I don't think the history needs to be very long, for two reasons. Firstly, > > the usual pattern is that we have FPI + several changes for that block > > shortly after it. Secondly, maintenance_io_concurrency limits this naturally > > - after crossing that, redo should place the FPI into shared buffers, > > allowing us to skip the prefetch. > > > > So I think using maintenance_io_concurrency is sufficient. We might track > > more buffers to allow skipping prefetches of blocks that were evicted from > > shared buffers, but that seems like an overkill. > > > > However, maintenance_io_concurrency can be quite high, so just a simple > > queue is not very suitable - searching it linearly for each block would be > > too expensive. But I think we can use a simple hash table, tracking > > (relfilenode, block, LSN), over-sized to minimize collisions. > > > > Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, > > and whenever we prefetch a block or find an FPI, we simply add the block to > > the array as determined by hash(relfilenode, block) > > > > hashtable[hash(...)] = {relfilenode, block, LSN} > > > > and then when deciding whether to prefetch a block, we look at that one > > position. If the (relfilenode, block) match, we check the LSN and skip the > > prefetch if it's sufficiently recent. Otherwise we prefetch. > > I'm a bit doubtful this is really needed at this point. Yes, the > prefetching will do a buffer table lookup - but it's a lookup that > already happens today. And the patch already avoids doing a second > lookup after prefetching (by optimistically caching the last Buffer id, > and re-checking). I agree that when a page is looked up, and found, in the buffer table that the subsequent cacheing of the buffer id in the WAL records does a good job of avoiding having to re-do that lookup. However, that isn't the case which was being discussed here or what Tomas's suggestion was intended to address. What I pointed out up-thread and what's being discussed here is what happens when the WAL contains a few FPIs and a few regular WAL records which are mixed up and not in ideal order. When that happens, with this patch, the FPIs will be ignored, the regular WAL records will reference blocks which aren't found in shared buffers (yet) and then we'll both issue pre-fetches for those and end up having spent effort doing a buffer lookup that we'll later re-do. To address the unnecessary syscalls we really just need to keep track of any FPIs that we've seen between where the point where the prefetching is happening and the point where the replay is being done- once replay has replayed an FPI, our buffer lookup will succeed and we'll cache the buffer that the FPI is at- in other words, only wal_decode_buffer_size amount of WAL needs to be considered. 
We could further leverage this tracking of FPIs, to skip the prefetch syscalls, by cacheing what later records address the blocks that have FPIs earlier in the queue with the FPI record and then when replay hits the FPI and loads it into shared_buffers, it could update the other WAL records in the queue with the buffer id of the page, allowing us to very likely avoid having to do another lookup later on. > I think there's potential for some significant optimization going > forward, but I think it's basically optimization over what we're doing > today. As this is already a nontrivial patch, I'd argue for doing so > separately. This seems like a great optimization, albeit a fair bit of code, for a relatively uncommon use-case, specifically where full page writes are disabled or very large checkpoints. As that's the case though, I would think it's reasonable to ask that it go out of its way to avoid slowing down the more common configurations, particularly since it's proposed to have it on by default (which I agree with, provided it ends up improving the common cases, which I think the suggestions above would certainly make it more likely to do). Perhaps this already improves the common cases and is worth the extra code on that basis, but I don't recall seeing much in the way of benchmarking in this thread for that case- that is, where FPIs are enabled and checkpoints are smaller than shared buffers. Jakub's testing was done with FPWs disabled and Tomas's testing used checkpoints which were much larger than the size of shared buffers on the system doing the replay. While it's certainly good that this patch improves those cases, we should also be looking out for the worst case and make sure that the patch doesn't degrade performance in that case. Thanks, Stephen
On 2/13/21 10:39 PM, Stephen Frost wrote: > Greetings, > > * Andres Freund (andres@anarazel.de) wrote: >> On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote: >>> Yeah, that's a good point. I think it'd make sense to keep track of recent >>> FPIs and skip prefetching such blocks. But how exactly should we implement >>> that, how many blocks do we need to track? If you get an FPI, how long >>> should we skip prefetching of that block? >>> >>> I don't think the history needs to be very long, for two reasons. Firstly, >>> the usual pattern is that we have FPI + several changes for that block >>> shortly after it. Secondly, maintenance_io_concurrency limits this naturally >>> - after crossing that, redo should place the FPI into shared buffers, >>> allowing us to skip the prefetch. >>> >>> So I think using maintenance_io_concurrency is sufficient. We might track >>> more buffers to allow skipping prefetches of blocks that were evicted from >>> shared buffers, but that seems like an overkill. >>> >>> However, maintenance_io_concurrency can be quite high, so just a simple >>> queue is not very suitable - searching it linearly for each block would be >>> too expensive. But I think we can use a simple hash table, tracking >>> (relfilenode, block, LSN), over-sized to minimize collisions. >>> >>> Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, >>> and whenever we prefetch a block or find an FPI, we simply add the block to >>> the array as determined by hash(relfilenode, block) >>> >>> hashtable[hash(...)] = {relfilenode, block, LSN} >>> >>> and then when deciding whether to prefetch a block, we look at that one >>> position. If the (relfilenode, block) match, we check the LSN and skip the >>> prefetch if it's sufficiently recent. Otherwise we prefetch. >> >> I'm a bit doubtful this is really needed at this point. Yes, the >> prefetching will do a buffer table lookup - but it's a lookup that >> already happens today. And the patch already avoids doing a second >> lookup after prefetching (by optimistically caching the last Buffer id, >> and re-checking). > > I agree that when a page is looked up, and found, in the buffer table > that the subsequent cacheing of the buffer id in the WAL records does a > good job of avoiding having to re-do that lookup. However, that isn't > the case which was being discussed here or what Tomas's suggestion was > intended to address. > > What I pointed out up-thread and what's being discussed here is what > happens when the WAL contains a few FPIs and a few regular WAL records > which are mixed up and not in ideal order. When that happens, with this > patch, the FPIs will be ignored, the regular WAL records will reference > blocks which aren't found in shared buffers (yet) and then we'll both > issue pre-fetches for those and end up having spent effort doing a > buffer lookup that we'll later re-do. > The question is how common this pattern actually is - I don't know. As noted, the non-FPI would have to be fairly close to the FPI, i.e. within the wal_decode_buffer_size, to actually cause measurable harm. > To address the unnecessary syscalls we really just need to keep track of > any FPIs that we've seen between where the point where the prefetching > is happening and the point where the replay is being done- once replay > has replayed an FPI, our buffer lookup will succeed and we'll cache the > buffer that the FPI is at- in other words, only wal_decode_buffer_size > amount of WAL needs to be considered. 
> Yeah, that's essentially what I proposed. > We could further leverage this tracking of FPIs, to skip the prefetch > syscalls, by cacheing what later records address the blocks that have > FPIs earlier in the queue with the FPI record and then when replay hits > the FPI and loads it into shared_buffers, it could update the other WAL > records in the queue with the buffer id of the page, allowing us to very > likely avoid having to do another lookup later on. > This seems like an over-engineering, at least for v1. >> I think there's potential for some significant optimization going >> forward, but I think it's basically optimization over what we're doing >> today. As this is already a nontrivial patch, I'd argue for doing so >> separately. > > This seems like a great optimization, albeit a fair bit of code, for a > relatively uncommon use-case, specifically where full page writes are > disabled or very large checkpoints. As that's the case though, I would > think it's reasonable to ask that it go out of its way to avoid slowing > down the more common configurations, particularly since it's proposed to > have it on by default (which I agree with, provided it ends up improving > the common cases, which I think the suggestions above would certainly > make it more likely to do). > I'm OK to do some benchmarking, but it's not quite clear to me why does it matter if the checkpoints are smaller than shared buffers? IMO what matters is how "localized" the updates are, i.e. how likely it is to hit the same page repeatedly (in a short amount of time). Regular pgbench is not very suitable for that, but some non-uniform distribution should do the trick, I think. > Perhaps this already improves the common cases and is worth the extra > code on that basis, but I don't recall seeing much in the way of > benchmarking in this thread for that case- that is, where FPIs are > enabled and checkpoints are smaller than shared buffers. Jakub's > testing was done with FPWs disabled and Tomas's testing used checkpoints > which were much larger than the size of shared buffers on the system > doing the replay. While it's certainly good that this patch improves > those cases, we should also be looking out for the worst case and make > sure that the patch doesn't degrade performance in that case. > I'm with Andres on this. It's fine to leave some possible optimizations on the table for the future. And even if some workloads are affected negatively, it's still possible to disable the prefetching. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > On 2/13/21 10:39 PM, Stephen Frost wrote: > >* Andres Freund (andres@anarazel.de) wrote: > >>On 2021-02-12 00:42:04 +0100, Tomas Vondra wrote: > >>>Yeah, that's a good point. I think it'd make sense to keep track of recent > >>>FPIs and skip prefetching such blocks. But how exactly should we implement > >>>that, how many blocks do we need to track? If you get an FPI, how long > >>>should we skip prefetching of that block? > >>> > >>>I don't think the history needs to be very long, for two reasons. Firstly, > >>>the usual pattern is that we have FPI + several changes for that block > >>>shortly after it. Secondly, maintenance_io_concurrency limits this naturally > >>>- after crossing that, redo should place the FPI into shared buffers, > >>>allowing us to skip the prefetch. > >>> > >>>So I think using maintenance_io_concurrency is sufficient. We might track > >>>more buffers to allow skipping prefetches of blocks that were evicted from > >>>shared buffers, but that seems like an overkill. > >>> > >>>However, maintenance_io_concurrency can be quite high, so just a simple > >>>queue is not very suitable - searching it linearly for each block would be > >>>too expensive. But I think we can use a simple hash table, tracking > >>>(relfilenode, block, LSN), over-sized to minimize collisions. > >>> > >>>Imagine it's a simple array with (2 * maintenance_io_concurrency) elements, > >>>and whenever we prefetch a block or find an FPI, we simply add the block to > >>>the array as determined by hash(relfilenode, block) > >>> > >>> hashtable[hash(...)] = {relfilenode, block, LSN} > >>> > >>>and then when deciding whether to prefetch a block, we look at that one > >>>position. If the (relfilenode, block) match, we check the LSN and skip the > >>>prefetch if it's sufficiently recent. Otherwise we prefetch. > >> > >>I'm a bit doubtful this is really needed at this point. Yes, the > >>prefetching will do a buffer table lookup - but it's a lookup that > >>already happens today. And the patch already avoids doing a second > >>lookup after prefetching (by optimistically caching the last Buffer id, > >>and re-checking). > > > >I agree that when a page is looked up, and found, in the buffer table > >that the subsequent cacheing of the buffer id in the WAL records does a > >good job of avoiding having to re-do that lookup. However, that isn't > >the case which was being discussed here or what Tomas's suggestion was > >intended to address. > > > >What I pointed out up-thread and what's being discussed here is what > >happens when the WAL contains a few FPIs and a few regular WAL records > >which are mixed up and not in ideal order. When that happens, with this > >patch, the FPIs will be ignored, the regular WAL records will reference > >blocks which aren't found in shared buffers (yet) and then we'll both > >issue pre-fetches for those and end up having spent effort doing a > >buffer lookup that we'll later re-do. > > The question is how common this pattern actually is - I don't know. As > noted, the non-FPI would have to be fairly close to the FPI, i.e. within the > wal_decode_buffer_size, to actually cause measurable harm. Yeah, so it'll depend on how big wal_decode_buffer_size is. Increasing that would certainly help to show if there ends up being a degredation with this patch due to the extra prefetching being done. 
> >To address the unnecessary syscalls we really just need to keep track of > >any FPIs that we've seen between where the point where the prefetching > >is happening and the point where the replay is being done- once replay > >has replayed an FPI, our buffer lookup will succeed and we'll cache the > >buffer that the FPI is at- in other words, only wal_decode_buffer_size > >amount of WAL needs to be considered. > > Yeah, that's essentially what I proposed. Glad I captured it correctly. > >We could further leverage this tracking of FPIs, to skip the prefetch > >syscalls, by cacheing what later records address the blocks that have > >FPIs earlier in the queue with the FPI record and then when replay hits > >the FPI and loads it into shared_buffers, it could update the other WAL > >records in the queue with the buffer id of the page, allowing us to very > >likely avoid having to do another lookup later on. > > This seems like an over-engineering, at least for v1. Perhaps, though it didn't seem like it'd be very hard to do with the already proposed changes to stash the buffer id in the WAL records. > >>I think there's potential for some significant optimization going > >>forward, but I think it's basically optimization over what we're doing > >>today. As this is already a nontrivial patch, I'd argue for doing so > >>separately. > > > >This seems like a great optimization, albeit a fair bit of code, for a > >relatively uncommon use-case, specifically where full page writes are > >disabled or very large checkpoints. As that's the case though, I would > >think it's reasonable to ask that it go out of its way to avoid slowing > >down the more common configurations, particularly since it's proposed to > >have it on by default (which I agree with, provided it ends up improving > >the common cases, which I think the suggestions above would certainly > >make it more likely to do). > > I'm OK to do some benchmarking, but it's not quite clear to me why does it > matter if the checkpoints are smaller than shared buffers? IMO what matters > is how "localized" the updates are, i.e. how likely it is to hit the same > page repeatedly (in a short amount of time). Regular pgbench is not very > suitable for that, but some non-uniform distribution should do the trick, I > think. I suppose strictly speaking it'd be Min(wal_decode_buffer_size,checkpoint_size), but yes, you're right that it's more about the wal_decode_buffer_size than the checkpoint's size. Apologies for the confusion. As suggested above, one way to benchmark this to really see if there's any issue would be to increase wal_decode_buffer_size to some pretty big size and then compare the performance vs. unpatched. I'd think that could even be done with pgbench, so you're not having to arrange for the same pages to get updated over and over. > >Perhaps this already improves the common cases and is worth the extra > >code on that basis, but I don't recall seeing much in the way of > >benchmarking in this thread for that case- that is, where FPIs are > >enabled and checkpoints are smaller than shared buffers. Jakub's > >testing was done with FPWs disabled and Tomas's testing used checkpoints > >which were much larger than the size of shared buffers on the system > >doing the replay. While it's certainly good that this patch improves > >those cases, we should also be looking out for the worst case and make > >sure that the patch doesn't degrade performance in that case. > > I'm with Andres on this. 
It's fine to leave some possible optimizations on > the table for the future. And even if some workloads are affected > negatively, it's still possible to disable the prefetching. While I'm generally in favor of this argument, that a feature is particularly important and that it's worth slowing down the common cases to enable it, I dislike that it's applied inconsistently. I'd certainly feel better about it if we had actual performance numbers to consider. I don't doubt the possibility that the extra prefetches just don't amount to enough to matter, but I have a hard time seeing them as not having some cost, and without actually measuring it, it's hard to say what that cost is. Without looking farther back than the last record, we could end up repeatedly asking for the same blocks to be prefetched too-

FPI for block 1
FPI for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
... etc.

Entirely possible my math is off, but it seems like the worst case situation right now might end up with some 4500 unnecessary prefetch syscalls even with the proposed default wal_decode_buffer_size of 512k and 56-byte WAL records ((524,288 - 16,384) / 56 / 2 = ~4534). Issuing unnecessary prefetches for blocks we've already sent a prefetch for is arguably a concern even if FPWs are off, but the benefit of doing the prefetching almost certainly will outweigh that and mean that finding a way to address it is something we could certainly do later as a future improvement. I wouldn't have any issue with that. Just doesn't seem as clear-cut to me when thinking about the FPW-enabled case. Ultimately, if you, Andres and Munro are all not concerned about it and no one else speaks up then I'm not going to pitch a fuss over it being committed, but, as you said above, it seemed like a good point to raise for everyone to consider. Thanks, Stephen
Greetings, * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: > Right, I was just going to point out the FPIs are not necessary - what > matters is the presence of long streaks of WAL records touching the same > set of blocks. But people with workloads where this is common likely > don't need the WAL prefetching at all - the replica can keep up just > fine, because it doesn't need to do much I/O anyway (and if it can't > then prefetching won't help much anyway). So just don't enable the > prefetching, and there'll be no overhead. Isn't this exactly the common case though..? Checkpoints happening every 5 minutes, the replay of the FPI happens first and then the record is updated and everything's in SB for the later changes? You mentioned elsewhere that this would improve 80% of cases but that doesn't seem to be backed up by anything and certainly doesn't seem likely to be the case if we're talking about across all PG deployments. I also disagree that asking the kernel to go do random I/O for us, even as a prefetch, is entirely free simply because we won't actually need those pages. At the least, it potentially pushes out pages that we might need shortly from the filesystem cache, no? > If it was up to me, I'd just get the patch committed as is. Delaying the > feature because of concerns that it might have some negative effect in > some cases, when that can be simply mitigated by disabling the feature, > is not really beneficial for our users. I don't know that we actually know how many cases it might have a negative effect on or what the actual amount of such negative case there might be- that's really why we should probably try to actually benchmark it and get real numbers behind it, particularly when the chances of running into such a negative effect with the default configuration (that is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more likely to occur in the field than the cases where FPWs are disabled and someone's running on ZFS. Perhaps more to the point, it'd be nice to see how this change actually improves the caes where PG is running with more-or-less the defaults on the more commonly deployed filesystems. If it doesn't then maybe it shouldn't be the default..? Surely the folks running on ZFS and running with FPWs disabled would be able to manage to enable it if they wished to and we could avoid entirely the question of if this has a negative impact on the more common cases. Guess I'm just not a fan of pushing out a change that will impact everyone by default, in a possibly negative way (or positive, though that doesn't seem terribly likely, but who knows), without actually measuring what that impact will look like in those more common cases. Showing that it's a great win when you're on ZFS or running with FPWs disabled is good and the expected best case, but we should be considering the worst case too when it comes to performance improvements. Anyhow, ultimately I don't know that there's much more to discuss on this thread with regard to this particular topic, at least. As I said before, if everyone else is on board and not worried about it then so be it; I feel that at least the concern that I raised has been heard. Thanks, Stephen
Hi, On 3/17/21 10:43 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@enterprisedb.com) wrote: >> Right, I was just going to point out the FPIs are not necessary - what >> matters is the presence of long streaks of WAL records touching the same >> set of blocks. But people with workloads where this is common likely >> don't need the WAL prefetching at all - the replica can keep up just >> fine, because it doesn't need to do much I/O anyway (and if it can't >> then prefetching won't help much anyway). So just don't enable the >> prefetching, and there'll be no overhead. > > Isn't this exactly the common case though..? Checkpoints happening > every 5 minutes, the replay of the FPI happens first and then the record > is updated and everything's in SB for the later changes? Well, as I said before, the FPIs are not very significant - you'll have mostly the same issue with any repeated changes to the same block. It does not matter much if you do

FPI for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1

or just

WAL record for block 1
WAL record for block 2
WAL record for block 1
WAL record for block 2
WAL record for block 1

In both cases some of the prefetches are probably unnecessary. But the frequency of checkpoints does not really matter; the important bit is repeated changes to the same block(s). If you have an active set much larger than RAM, this is quite unlikely. And we know from the pgbench tests that prefetching has a huge positive effect in this case. On smaller active sets, with frequent updates to the same block, we may issue unnecessary prefetches - that's true. But (a) you have not shown any numbers suggesting this is actually an issue, and (b) those cases don't really need prefetching because all the data is already either in shared buffers or in the page cache. So if it happens to be an issue, the user can simply disable it. So what exactly would a problematic workload look like?
I said that workloads where this happens a lot most likely don't need the prefetching at all, so it can be simply disabled, eliminating all negative effects. Moreover, looking at a limited number of recently prefetched blocks won't eliminate this problem anyway - imagine a random OLTP on large data set that however fits into RAM. After a while no read I/O needs to be done, but you'd need pretty much infinite list of prefetched blocks to eliminate that, and with smaller lists you'll still do 99% of the prefetches. Just disabling prefetching on such instances seems quite reasonable. >> If it was up to me, I'd just get the patch committed as is. Delaying the >> feature because of concerns that it might have some negative effect in >> some cases, when that can be simply mitigated by disabling the feature, >> is not really beneficial for our users. > > I don't know that we actually know how many cases it might have a > negative effect on or what the actual amount of such negative case there > might be- that's really why we should probably try to actually benchmark > it and get real numbers behind it, particularly when the chances of > running into such a negative effect with the default configuration (that > is, FPWs enabled) on the more typical platforms (as in, not ZFS) is more > likely to occur in the field than the cases where FPWs are disabled and > someone's running on ZFS. > > Perhaps more to the point, it'd be nice to see how this change actually > improves the caes where PG is running with more-or-less the defaults on > the more commonly deployed filesystems. If it doesn't then maybe it > shouldn't be the default..? Surely the folks running on ZFS and running > with FPWs disabled would be able to manage to enable it if they > wished to and we could avoid entirely the question of if this has a > negative impact on the more common cases. > > Guess I'm just not a fan of pushing out a change that will impact > everyone by default, in a possibly negative way (or positive, though > that doesn't seem terribly likely, but who knows), without actually > measuring what that impact will look like in those more common cases. > Showing that it's a great win when you're on ZFS or running with FPWs > disabled is good and the expected best case, but we should be > considering the worst case too when it comes to performance > improvements. > Well, maybe it'll behave differently on systems with ZFS. I don't know, and I have no such machine to test that at the moment. My argument however remains the same - if if happens to be a problem, just don't enable (or disable) the prefetching, and you get the current behavior. FWIW I'm not sure there was a discussion or argument about what should be the default setting (enabled or disabled). I'm fine with not enabling this by default, so that people have to enable it explicitly. In a way that'd be consistent with effective_io_concurrency being 1 by default, which almost disables regular prefetching. > Anyhow, ultimately I don't know that there's much more to discuss on > this thread with regard to this particular topic, at least. As I said > before, if everyone else is on board and not worried about it then so be > it; I feel that at least the concern that I raised has been heard. > OK, thanks for the discussions. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 18, 2021 at 12:00 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 3/17/21 10:43 PM, Stephen Frost wrote: > > Guess I'm just not a fan of pushing out a change that will impact > > everyone by default, in a possibly negative way (or positive, though > > that doesn't seem terribly likely, but who knows), without actually > > measuring what that impact will look like in those more common cases. > > Showing that it's a great win when you're on ZFS or running with FPWs > > disabled is good and the expected best case, but we should be > > considering the worst case too when it comes to performance > > improvements. > > > > Well, maybe it'll behave differently on systems with ZFS. I don't know, > and I have no such machine to test that at the moment. My argument > however remains the same - if if happens to be a problem, just don't > enable (or disable) the prefetching, and you get the current behavior. I see the road map for this feature being to get it working on every OS via the AIO patchset, in later work, hopefully not very far in the future (in the most portable mode, you get I/O worker processes doing pread() or preadv() calls on behalf of recovery). So I'll be glad to get this infrastructure in, even though it's maybe only useful for some people in the first release. > FWIW I'm not sure there was a discussion or argument about what should > be the default setting (enabled or disabled). I'm fine with not enabling > this by default, so that people have to enable it explicitly. > > In a way that'd be consistent with effective_io_concurrency being 1 by > default, which almost disables regular prefetching. Yeah, I'm not sure but I'd be fine with disabling it by default in the initial release. The current patch set has it enabled, but that's mostly for testing, it's not an opinion on how it should ship. I've attached a rebased patch set with a couple of small changes: 1. I abandoned the patch that proposed pg_atomic_unlocked_add_fetch_u{32,64}() and went for a simple function local to xlogprefetch.c that just does pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1), in response to complaints from Andres[1]. 2. I fixed a bug in ReadRecentBuffer(), and moved it into its own patch for separate review. I'm now looking at Horiguchi-san and Heikki's patch[2] to remove XLogReader's callbacks, to try to understand how these two patch sets are related. I don't really like the way those callbacks work, and I'm afraid had to make them more complicated. But I don't yet know very much about that other patch set. More soon. [1] https://www.postgresql.org/message-id/20201230035736.qmyrtrpeewqbidfi%40alap3.anarazel.de [2] https://www.postgresql.org/message-id/flat/20190418.210257.43726183.horiguchi.kyotaro@lab.ntt.co.jp
Attachment
- v16-0001-Provide-ReadRecentBuffer-to-re-pin-buffers-by-ID.patch
- v16-0002-Improve-information-about-received-WAL.patch
- v16-0003-Provide-XLogReadAhead-to-decode-future-WAL-recor.patch
- v16-0004-Prefetch-referenced-blocks-during-recovery.patch
- v16-0005-Avoid-extra-buffer-lookup-when-prefetching-WAL-b.patch
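For reference, item 1 in the message above boils down to something like the following. The function name here is a guess based on that description rather than a quote from the patch, but pg_atomic_read_u64()/pg_atomic_write_u64() are the regular atomics API; the trick is only valid because each counter is written by a single process.

    /*
     * Sketch of an unlocked increment for statistics counters that are only
     * ever written by one process (the startup process); readers may see a
     * slightly stale value, which is fine for stats.
     */
    static inline void
    XLogPrefetchIncrement(pg_atomic_uint64 *counter)
    {
        pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1);
    }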
On 3/18/21 1:54 AM, Thomas Munro wrote: > On Thu, Mar 18, 2021 at 12:00 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> On 3/17/21 10:43 PM, Stephen Frost wrote: >>> Guess I'm just not a fan of pushing out a change that will impact >>> everyone by default, in a possibly negative way (or positive, though >>> that doesn't seem terribly likely, but who knows), without actually >>> measuring what that impact will look like in those more common cases. >>> Showing that it's a great win when you're on ZFS or running with FPWs >>> disabled is good and the expected best case, but we should be >>> considering the worst case too when it comes to performance >>> improvements. >>> >> >> Well, maybe it'll behave differently on systems with ZFS. I don't know, >> and I have no such machine to test that at the moment. My argument >> however remains the same - if if happens to be a problem, just don't >> enable (or disable) the prefetching, and you get the current behavior. > > I see the road map for this feature being to get it working on every > OS via the AIO patchset, in later work, hopefully not very far in the > future (in the most portable mode, you get I/O worker processes doing > pread() or preadv() calls on behalf of recovery). So I'll be glad to > get this infrastructure in, even though it's maybe only useful for > some people in the first release. > +1 to that >> FWIW I'm not sure there was a discussion or argument about what should >> be the default setting (enabled or disabled). I'm fine with not enabling >> this by default, so that people have to enable it explicitly. >> >> In a way that'd be consistent with effective_io_concurrency being 1 by >> default, which almost disables regular prefetching. > > Yeah, I'm not sure but I'd be fine with disabling it by default in the > initial release. The current patch set has it enabled, but that's > mostly for testing, it's not an opinion on how it should ship. > +1 to that too. Better to have it disabled by default than not at all. > I've attached a rebased patch set with a couple of small changes: > > 1. I abandoned the patch that proposed > pg_atomic_unlocked_add_fetch_u{32,64}() and went for a simple function > local to xlogprefetch.c that just does pg_atomic_write_u64(counter, > pg_atomic_read_u64(counter) + 1), in response to complaints from > Andres[1]. > > 2. I fixed a bug in ReadRecentBuffer(), and moved it into its own > patch for separate review. > > I'm now looking at Horiguchi-san and Heikki's patch[2] to remove > XLogReader's callbacks, to try to understand how these two patch sets > are related. I don't really like the way those callbacks work, and > I'm afraid had to make them more complicated. But I don't yet know > very much about that other patch set. More soon. > OK. Do you think we should get both of those patches in, or do we need to commit them in a particular order? Or what is your concern? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Mar 19, 2021 at 2:29 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 3/18/21 1:54 AM, Thomas Munro wrote: > > I'm now looking at Horiguchi-san and Heikki's patch[2] to remove > > XLogReader's callbacks, to try to understand how these two patch sets > > are related. I don't really like the way those callbacks work, and > > I'm afraid had to make them more complicated. But I don't yet know > > very much about that other patch set. More soon. > > OK. Do you think we should get both of those patches in, or do we need > to commit them in a particular order? Or what is your concern? I would like to commit the callback-removal patch first, and then the WAL decoder and prefetcher patches become simpler and cleaner on top of that. I will post the rebase and explanation shortly.
Here's rebase, on top of Horiguchi-san's v19 patch set. My patches start at 0007. Previously, there was a "nowait" flag that was passed into all the callbacks so that XLogReader could wait for new WAL in some cases but not others. This new version uses the proposed XLREAD_NEED_DATA protocol, and the caller deals with waiting for data to arrive when appropriate. This seems tidier to me. I made one other simplifying change: previously, the prefetch module would read the WAL up to the "written" LSN (so, allowing itself to read data that had been written but not yet flushed to disk by the walreceiver), though it still waited until a record's LSN was "flushed" before replaying. That allowed prefetching to happen concurrently with the WAL flush, which was nice, but it felt a little too "special". I decided to remove that part for now, and I plan to look into making standbys work more like primary servers, using WAL buffers, the WAL writer and optionally the standard log-before-data rule.
Attachment
- v17-0001-Move-callback-call-from-ReadPageInternal-to-XLog.patch
- v17-0002-Move-page-reader-out-of-XLogReadRecord.patch
- v17-0003-Remove-globals-readOff-readLen-and-readSegNo.patch
- v17-0004-Make-XLogFindNextRecord-not-use-callback-functio.patch
- v17-0005-Split-readLen-and-reqLen-of-XLogReaderState.patch
- v17-0006-fixup.patch
- v17-0007-Add-circular-WAL-decoding-buffer.patch
- v17-0008-Prefetch-referenced-blocks-during-recovery.patch
- v17-0009-Provide-ReadRecentBuffer-to-re-pin-buffers-by-ID.patch
- v17-0010-Avoid-extra-buffer-lookup-when-prefetching-WAL-b.patch
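To make the control-flow change described above concrete: under the old interface the reader could block inside its read_page callback, while under XLREAD_NEED_DATA the caller owns the wait. Below is a rough sketch of the caller side; everything except XLREAD_NEED_DATA itself (the reader's "please give me more input" result) is an illustrative stand-in, not the patch set's actual API.

/*
 * Sketch only: a caller-driven read loop under the XLREAD_NEED_DATA
 * protocol.  read_next_record(), feed_page_from_pg_wal() and
 * wait_for_new_wal() are hypothetical helpers, not functions from the
 * patches.
 */
static XLogRecord *
read_one_record(XLogReaderState *reader)
{
    XLogRecord *record = NULL;
    char       *errormsg = NULL;

    while (read_next_record(reader, &record, &errormsg) == XLREAD_NEED_DATA)
    {
        /* The reader asked for more input; the caller decides how. */
        if (!feed_page_from_pg_wal(reader))
            wait_for_new_wal();     /* e.g. a standby waiting on the walreceiver */
    }
    return record;                  /* NULL with errormsg set on failure */
}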
On 4/7/21 1:24 PM, Thomas Munro wrote: > Here's a rebase, on top of Horiguchi-san's v19 patch set. My patches > start at 0007. Previously, there was a "nowait" flag that was passed > into all the callbacks so that XLogReader could wait for new WAL in > some cases but not others. This new version uses the proposed > XLREAD_NEED_DATA protocol, and the caller deals with waiting for data > to arrive when appropriate. This seems tidier to me. > OK, seems reasonable. > I made one other simplifying change: previously, the prefetch module > would read the WAL up to the "written" LSN (so, allowing itself to > read data that had been written but not yet flushed to disk by the > walreceiver), though it still waited until a record's LSN was > "flushed" before replaying. That allowed prefetching to happen > concurrently with the WAL flush, which was nice, but it felt a little > too "special". I decided to remove that part for now, and I plan to > look into making standbys work more like primary servers, using WAL > buffers, the WAL writer and optionally the standard log-before-data > rule. > Not sure, but the removal seems unnecessary. I'm worried that this will significantly reduce the amount of data that we'll be able to prefetch. How likely is it that we have data that is written but not flushed? Let's assume the replica is lagging and network bandwidth is not the bottleneck - how likely is it that this "has to be flushed" requirement becomes the limit for the prefetching? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Apr 8, 2021 at 3:27 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 4/7/21 1:24 PM, Thomas Munro wrote: > > I made one other simplifying change: previously, the prefetch module > > would read the WAL up to the "written" LSN (so, allowing itself to > > read data that had been written but not yet flushed to disk by the > > walreceiver), though it still waited until a record's LSN was > > "flushed" before replaying. That allowed prefetching to happen > > concurrently with the WAL flush, which was nice, but it felt a little > > too "special". I decided to remove that part for now, and I plan to > > look into making standbys work more like primary servers, using WAL > > buffers, the WAL writer and optionally the standard log-before-data > > rule. > > Not sure, but the removal seems unnecessary. I'm worried that this will > significantly reduce the amount of data that we'll be able to prefetch. > How likely it is that we have data that is written but not flushed? > Let's assume the replica is lagging and network bandwidth is not the > bottleneck - how likely is this "has to be flushed" a limit for the > prefetching? Yeah, it would have been nice to include that but it'll have to be for v15 due to lack of time to convince myself that it was correct. I do intend to look into more concurrency of that kind for v15. I have pushed these patches, updated to be disabled by default. I will look into how I can run a BF animal that has it enabled during the recovery tests for coverage. Thanks very much to everyone on this thread for all the discussion and testing so far.
On 4/8/21 1:46 PM, Thomas Munro wrote: > On Thu, Apr 8, 2021 at 3:27 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> On 4/7/21 1:24 PM, Thomas Munro wrote: >>> I made one other simplifying change: previously, the prefetch module >>> would read the WAL up to the "written" LSN (so, allowing itself to >>> read data that had been written but not yet flushed to disk by the >>> walreceiver), though it still waited until a record's LSN was >>> "flushed" before replaying. That allowed prefetching to happen >>> concurrently with the WAL flush, which was nice, but it felt a little >>> too "special". I decided to remove that part for now, and I plan to >>> look into making standbys work more like primary servers, using WAL >>> buffers, the WAL writer and optionally the standard log-before-data >>> rule. >> >> Not sure, but the removal seems unnecessary. I'm worried that this will >> significantly reduce the amount of data that we'll be able to prefetch. >> How likely it is that we have data that is written but not flushed? >> Let's assume the replica is lagging and network bandwidth is not the >> bottleneck - how likely is this "has to be flushed" a limit for the >> prefetching? > > Yeah, it would have been nice to include that but it'll have to be for > v15 due to lack of time to convince myself that it was correct. I do > intend to look into more concurrency of that kind for v15. I have > pushed these patches, updated to be disabled by default. I will look > into how I can run a BF animal that has it enabled during the recovery > tests for coverage. Thanks very much to everyone on this thread for > all the discussion and testing so far. > OK, understood. I'll rerun the benchmarks on this version, and if there's a significant negative impact we can look into that during the stabilization phase. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Here's some little language fixes. BTW, before beginning "recovery", PG syncs all the data dirs. This can be slow, and it seems like the slowness is frequently due to file metadata. For example, that's an obvious consequence of an OS crash, after which the page cache is empty. I've made a habit of running find /zfs -ls |wc to pre-warm it, which can take a little bit, but then the recovery process starts moments later. I don't have any timing measurements, but I expect that starting to stat() all data files as soon as possible would be a win. commit cc9707de333fe8242607cde9f777beadc68dbf04 Author: Justin Pryzby <pryzbyj@telsasoft.com> Date: Thu Apr 8 10:43:14 2021 -0500 WIP: doc review: Optionally prefetch referenced data in recovery. 1d257577e08d3e598011d6850fd1025858de8c8c diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index bc4a8b2279..139dee7aa2 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3621,7 +3621,7 @@ include_dir 'conf.d' pool after that. However, on file systems with a block size larger than <productname>PostgreSQL</productname>'s, prefetching can avoid a - costly read-before-write when a blocks are later written. + costly read-before-write when blocks are later written. The default is off. </para> </listitem> diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml index 24cf567ee2..36e00c92c2 100644 --- a/doc/src/sgml/wal.sgml +++ b/doc/src/sgml/wal.sgml @@ -816,9 +816,7 @@ prefetching mechanism is most likely to be effective on systems with <varname>full_page_writes</varname> set to <varname>off</varname> (where that is safe), and where the working - set is larger than RAM. By default, prefetching in recovery is enabled - on operating systems that have <function>posix_fadvise</function> - support. + set is larger than RAM. By default, prefetching in recovery is disabled. </para> </sect1> diff --git a/src/backend/access/transam/xlogprefetch.c b/src/backend/access/transam/xlogprefetch.c index 28764326bc..363c079964 100644 --- a/src/backend/access/transam/xlogprefetch.c +++ b/src/backend/access/transam/xlogprefetch.c @@ -31,7 +31,7 @@ * stall; this is counted with "skip_fpw". * * The only way we currently have to know that an I/O initiated with - * PrefetchSharedBuffer() has that recovery will eventually call ReadBuffer(), + * PrefetchSharedBuffer() has that recovery will eventually call ReadBuffer(), XXX: what ?? * and perform a synchronous read. Therefore, we track the number of * potentially in-flight I/Os by using a circular buffer of LSNs. When it's * full, we have to wait for recovery to replay records so that the queue @@ -660,7 +660,7 @@ XLogPrefetcherScanBlocks(XLogPrefetcher *prefetcher) /* * I/O has possibly been initiated (though we don't know if it was * already cached by the kernel, so we just have to assume that it - * has due to lack of better information). Record this as an I/O + * was due to lack of better information). Record this as an I/O * in progress until eventually we replay this LSN. 
*/ XLogPrefetchIncrement(&SharedStats->prefetch); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 090abdad8b..8c72ba1f1a 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -2774,7 +2774,7 @@ static struct config_int ConfigureNamesInt[] = { {"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY, gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."), - gettext_noop("This controls the maximum distance we can read ahead n the WAL to prefetch referenced blocks."), + gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."), GUC_UNIT_BYTE }, &wal_decode_buffer_size,
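The comment being touched in the xlogprefetch.c hunk above summarizes the bookkeeping: remember the LSN at which each prefetch was issued, and treat the I/O as retired once replay has passed that LSN. A stripped-down illustration of that idea follows; this is not the committed data structure, and the fixed depth and names are invented for the example.

/* Ring of LSNs for prefetches believed to be in flight. */
typedef struct PrefetchQueue
{
    XLogRecPtr  lsns[64];       /* illustrative fixed depth */
    int         head;           /* next slot to fill */
    int         tail;           /* oldest outstanding prefetch */
    int         count;
} PrefetchQueue;

/* Returns false when full: the caller must let replay catch up first. */
static bool
prefetch_queue_add(PrefetchQueue *q, XLogRecPtr prefetch_lsn)
{
    if (q->count == lengthof(q->lsns))
        return false;
    q->lsns[q->head] = prefetch_lsn;
    q->head = (q->head + 1) % lengthof(q->lsns);
    q->count++;
    return true;
}

/* Retire prefetches whose records have now been replayed. */
static void
prefetch_queue_retire(PrefetchQueue *q, XLogRecPtr replayed_upto)
{
    while (q->count > 0 && q->lsns[q->tail] <= replayed_upto)
    {
        q->tail = (q->tail + 1) % lengthof(q->lsns);
        q->count--;
    }
}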
On Fri, Apr 9, 2021 at 3:37 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > Here's some little language fixes. Thanks! Done. I rewrote the gibberish comment that made you say "XXX: what?". Pushed. > BTW, before beginning "recovery", PG syncs all the data dirs. > This can be slow, and it seems like the slowness is frequently due to file > metadata. For example, that's an obvious consequence of an OS crash, after > which the page cache is empty. I've made a habit of running find /zfs -ls |wc > to pre-warm it, which can take a little bit, but then the recovery process > starts moments later. I don't have any timing measurements, but I expect that > starting to stat() all data files as soon as possible would be a win. Did you see commit 61752afb, "Provide recovery_init_sync_method=syncfs"? Actually I believe it's safe to skip that phase completely and do a tiny bit more work during recovery, which I'd like to work on for v15[1]. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B8Wm8TSfMWPteMEHfh194RytVTBNoOkggTQT1p5NTY7Q%40mail.gmail.com
On Sat, Apr 10, 2021 at 08:27:42AM +1200, Thomas Munro wrote: > On Fri, Apr 9, 2021 at 3:37 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > Here's some little language fixes. > > Thanks! Done. I rewrote the gibberish comment that made you say > "XXX: what?". Pushed. > > > BTW, before beginning "recovery", PG syncs all the data dirs. > > This can be slow, and it seems like the slowness is frequently due to file > > metadata. For example, that's an obvious consequence of an OS crash, after > > which the page cache is empty. I've made a habit of running find /zfs -ls |wc > > to pre-warm it, which can take a little bit, but then the recovery process > > starts moments later. I don't have any timing measurements, but I expect that > > starting to stat() all data files as soon as possible would be a win. > > Did you see commit 61752afb, "Provide > recovery_init_sync_method=syncfs"? Actually I believe it's safe to > skip that phase completely and do a tiny bit more work during > recovery, which I'd like to work on for v15[1]. Yes, I have it in my list for v14 deployment. Thanks for that. Did you see this? https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM I meant to mail you so you could include it in the same commit, but forgot until now. -- Justin
On Sat, Apr 10, 2021 at 8:37 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > Did you see this? > https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM > > I meant to mail you so you could include it in the same commit, but forgot > until now. Done, thanks.
Hi, Thank you for developing a great feature. I tested this feature and checked the documentation. Currently, the documentation for the pg_stat_prefetch_recovery view is included in the description for the pg_stat_subscription view. https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION It is also not displayed in the list of "28.2. The Statistics Collector". https://www.postgresql.org/docs/devel/monitoring.html The attached patch modifies the pg_stat_prefetch_recovery view to appear as a separate view. Regards, Noriyoshi Shinoda -----Original Message----- From: Thomas Munro [mailto:thomas.munro@gmail.com] Sent: Saturday, April 10, 2021 5:46 AM To: Justin Pryzby <pryzby@telsasoft.com> Cc: Tomas Vondra <tomas.vondra@enterprisedb.com>; Stephen Frost <sfrost@snowman.net>; Andres Freund <andres@anarazel.de>; Jakub Wartak <Jakub.Wartak@tomtom.com>; Alvaro Herrera <alvherre@2ndquadrant.com>; Tomas Vondra <tomas.vondra@2ndquadrant.com>; Dmitry Dolgov <9erthalion6@gmail.com>; David Steele <david@pgmasters.net>; pgsql-hackers <pgsql-hackers@postgresql.org> Subject: Re: WIP: WAL prefetch (another approach) On Sat, Apr 10, 2021 at 8:37 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > Did you see this? > https://www.postgresql.org/message-id/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM > > I meant to mail you so you could include it in the same commit, but > forgot until now. Done, thanks.
Attachment
On Sat, Apr 10, 2021 at 2:16 AM Thomas Munro <thomas.munro@gmail.com> wrote: > In commit 1d257577e08d3e598011d6850fd1025858de8c8c, there is a change in file format for stats, won't it require bumping PGSTAT_FILE_FORMAT_ID? Actually, I came across this while working on my today's commit f5fc2f5b23 where I forgot to bump PGSTAT_FILE_FORMAT_ID. So, I thought maybe we can bump it just once if required? -- With Regards, Amit Kapila.
Thomas Munro <thomas.munro@gmail.com> writes: > Yeah, it would have been nice to include that but it'll have to be for > v15 due to lack of time to convince myself that it was correct. I do > intend to look into more concurrency of that kind for v15. I have > pushed these patches, updated to be disabled by default. I have a fairly bad feeling about these patches. I've already fixed one critical bug (see 9e4114822), but I am still seeing random, hard to reproduce failures in WAL replay testing. It looks like sometimes the "decoded" version of a WAL record doesn't match what I see in the on-disk data, which I'm having no luck tracing down. Another interesting failure I just came across is 2021-04-21 11:32:14.280 EDT [14606] LOG: incorrect resource manager data checksum in record at F/438000A4 TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, PID: 14606) 2021-04-21 11:38:23.066 EDT [14603] LOG: startup process (PID 14606) was terminated by signal 6: Abort trap with stack trace #0 0x90b669f0 in kill () #1 0x90c01bfc in abort () #2 0x0057a6a0 in ExceptionalCondition (conditionName=<value temporarily unavailable, due to optimizations>, errorType=<valuetemporarily unavailable, due to optimizations>, fileName=<value temporarily unavailable, due to optimizations>,lineNumber=<value temporarily unavailable, due to optimizations>) at assert.c:69 #3 0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 '\001') at xlogreader.c:845 #4 0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, errormsg=0xbfffba9c) at xlogreader.c:466 #5 0x000f695c in XLogReadRecord (state=<value temporarily unavailable, due to optimizations>, record=0xbfffba98, errormsg=<valuetemporarily unavailable, due to optimizations>) at xlogreader.c:352 #6 0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 '\0') at xlog.c:4398 #7 0x000ea320 in StartupXLOG () at xlog.c:7567 #8 0x00362218 in StartupProcessMain () at startup.c:244 #9 0x000fc170 in AuxiliaryProcessMain (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarilyunavailable, due to optimizations>) at bootstrap.c:447 #10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439 #11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406 #12 0x0029737c in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable,due to optimizations>) at main.c:209 I am not sure whether the checksum failure itself is real or a variant of the seeming bad-reconstruction problem, but what I'm on about right at this moment is that the error handling logic for this case seems quite broken. Why is a checksum failure only worthy of a LOG message? Why is ValidXLogRecord() issuing a log message for itself, rather than being tied into the report_invalid_record() mechanism? Why are we evidently still trying to decode records afterwards? In general, I'm not too pleased with the apparent attitude in this thread that it's okay to push a patch that only mostly works on the last day of the dev cycle and plan to stabilize it later. regards, tom lane
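For context on the report_invalid_record() point: the reader's other validation failures are funnelled through that function, which stashes a message for the caller to report however it sees fit. A sketch of that familiar shape of ValidXLogRecord() in the pre-14 reader is below (written from memory as an illustration, not quoted from a diff).

static bool
ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
{
    pg_crc32c   crc;

    /* CRC covers the payload first, then the header up to xl_crc. */
    INIT_CRC32C(crc);
    COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord,
                record->xl_tot_len - SizeOfXLogRecord);
    COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
    FIN_CRC32C(crc);

    if (!EQ_CRC32C(record->xl_crc, crc))
    {
        /* Let the caller decide how loudly to complain. */
        report_invalid_record(state,
                              "incorrect resource manager data checksum in record at %X/%X",
                              (uint32) (recptr >> 32), (uint32) recptr);
        return false;
    }
    return true;
}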
On 4/21/21 6:30 PM, Tom Lane wrote: > Thomas Munro <thomas.munro@gmail.com> writes: >> Yeah, it would have been nice to include that but it'll have to be for >> v15 due to lack of time to convince myself that it was correct. I do >> intend to look into more concurrency of that kind for v15. I have >> pushed these patches, updated to be disabled by default. > > I have a fairly bad feeling about these patches. I've already fixed > one critical bug (see 9e4114822), but I am still seeing random, hard > to reproduce failures in WAL replay testing. It looks like sometimes > the "decoded" version of a WAL record doesn't match what I see in > the on-disk data, which I'm having no luck tracing down. > > Another interesting failure I just came across is > > 2021-04-21 11:32:14.280 EDT [14606] LOG: incorrect resource manager data checksum in record at F/438000A4 > TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, PID: 14606) > 2021-04-21 11:38:23.066 EDT [14603] LOG: startup process (PID 14606) was terminated by signal 6: Abort trap > > with stack trace > > #0 0x90b669f0 in kill () > #1 0x90c01bfc in abort () > #2 0x0057a6a0 in ExceptionalCondition (conditionName=<value temporarily unavailable, due to optimizations>, errorType=<valuetemporarily unavailable, due to optimizations>, fileName=<value temporarily unavailable, due to optimizations>,lineNumber=<value temporarily unavailable, due to optimizations>) at assert.c:69 > #3 0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 '\001') at xlogreader.c:845 > #4 0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, errormsg=0xbfffba9c) at xlogreader.c:466 > #5 0x000f695c in XLogReadRecord (state=<value temporarily unavailable, due to optimizations>, record=0xbfffba98, errormsg=<valuetemporarily unavailable, due to optimizations>) at xlogreader.c:352 > #6 0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 '\0') at xlog.c:4398 > #7 0x000ea320 in StartupXLOG () at xlog.c:7567 > #8 0x00362218 in StartupProcessMain () at startup.c:244 > #9 0x000fc170 in AuxiliaryProcessMain (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarilyunavailable, due to optimizations>) at bootstrap.c:447 > #10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439 > #11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406 > #12 0x0029737c in main (argc=<value temporarily unavailable, due to optimizations>, argv=<value temporarily unavailable,due to optimizations>) at main.c:209 > > > I am not sure whether the checksum failure itself is real or a variant > of the seeming bad-reconstruction problem, but what I'm on about right > at this moment is that the error handling logic for this case seems > quite broken. Why is a checksum failure only worthy of a LOG message? > Why is ValidXLogRecord() issuing a log message for itself, rather than > being tied into the report_invalid_record() mechanism? Why are we > evidently still trying to decode records afterwards? > Yeah, that seems suspicious. > In general, I'm not too pleased with the apparent attitude in this > thread that it's okay to push a patch that only mostly works on the > last day of the dev cycle and plan to stabilize it later. > Was there such attitude? I don't think people were arguing for pushing a patch's not working correctly. The discussion was mostly about getting it committed even and leaving some optimizations for v15. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Apr 22, 2021 at 8:07 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > On 4/21/21 6:30 PM, Tom Lane wrote: > > Thomas Munro <thomas.munro@gmail.com> writes: > >> Yeah, it would have been nice to include that but it'll have to be for > >> v15 due to lack of time to convince myself that it was correct. I do > >> intend to look into more concurrency of that kind for v15. I have > >> pushed these patches, updated to be disabled by default. > > > > I have a fairly bad feeling about these patches. I've already fixed > > one critical bug (see 9e4114822), but I am still seeing random, hard > > to reproduce failures in WAL replay testing. It looks like sometimes > > the "decoded" version of a WAL record doesn't match what I see in > > the on-disk data, which I'm having no luck tracing down. Ugh. Looking into this now. Also, this week I have been researching a possible problem with eg ALTER TABLE SET TABLESPACE in the higher level patch, which I'll write about soon. > > I am not sure whether the checksum failure itself is real or a variant > > of the seeming bad-reconstruction problem, but what I'm on about right > > at this moment is that the error handling logic for this case seems > > quite broken. Why is a checksum failure only worthy of a LOG message? > > Why is ValidXLogRecord() issuing a log message for itself, rather than > > being tied into the report_invalid_record() mechanism? Why are we > > evidently still trying to decode records afterwards? > > Yeah, that seems suspicious. I may have invited trouble by deciding to rebase on the other proposal late in the cycle. That changed the interfaces around there. > > In general, I'm not too pleased with the apparent attitude in this > > thread that it's okay to push a patch that only mostly works on the > > last day of the dev cycle and plan to stabilize it later. > > Was there such attitude? I don't think people were arguing for pushing a > patch's not working correctly. The discussion was mostly about getting > it committed even and leaving some optimizations for v15. That wasn't my plan, but I admit that the timing was non-ideal. In any case, I'll dig into these failures and then consider options. More soon.
On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro <thomas.munro@gmail.com> wrote: > That wasn't my plan, but I admit that the timing was non-ideal. In > any case, I'll dig into these failures and then consider options. > More soon. Yeah, this clearly needs more work. xlogreader.c is difficult to work with and I think we need to keep trying to improve it, but I made a bad call here trying to combine this with other refactoring work up against a deadline and I made some dumb mistakes. I could of course debug it in-tree, and I know that this has been an anticipated feature. Personally I think the right thing to do now is to revert it and re-propose for 15 early in the cycle, supported with some better testing infrastructure.
Greetings,
On Wed, Apr 21, 2021 at 19:17 Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> That wasn't my plan, but I admit that the timing was non-ideal. In
> any case, I'll dig into these failures and then consider options.
> More soon.
Yeah, this clearly needs more work. xlogreader.c is difficult to work
with and I think we need to keep trying to improve it, but I made a
bad call here trying to combine this with other refactoring work up
against a deadline and I made some dumb mistakes. I could of course
debug it in-tree, and I know that this has been an anticipated
feature. Personally I think the right thing to do now is to revert it
and re-propose for 15 early in the cycle, supported with some better
testing infrastructure.
I tend to agree with the idea to revert it, perhaps a +0 on that, but if others argue it should be fixed in-place, I wouldn’t complain about it.
I very much encourage the idea of improving testing in this area and would be happy to try and help do so in the 15 cycle.
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes: > On Wed, Apr 21, 2021 at 19:17 Thomas Munro <thomas.munro@gmail.com> wrote: >> ... Personally I think the right thing to do now is to revert it >> and re-propose for 15 early in the cycle, supported with some better >> testing infrastructure. > I tend to agree with the idea to revert it, perhaps a +0 on that, but if > others argue it should be fixed in-place, I wouldn’t complain about it. FWIW, I've so far only been able to see problems on two old PPC Macs, one of which has been known to be a bit flaky in the past. So it's possible that what I'm looking at is a hardware glitch. But it's consistent enough that I rather doubt that. What I'm doing is running the core regression tests with a single standby (on the same machine) and wal_consistency_checking = all. Fairly reproducibly (more than one run in ten), what I get on the slightly-flaky machine is consistency check failures like 2021-04-21 17:42:56.324 EDT [42286] PANIC: inconsistent page found, rel 1663/354383/357033, forknum 0, blkno 9, byte offset2069: replay 0x00 primary 0x03 2021-04-21 17:42:56.324 EDT [42286] CONTEXT: WAL redo at 24/121C97B0 for Heap/INSERT: off 107 flags 0x00; blkref #0: rel1663/354383/357033, blk 9 FPW 2021-04-21 17:45:11.662 EDT [42284] LOG: startup process (PID 42286) was terminated by signal 6: Abort trap 2021-04-21 11:25:30.091 EDT [38891] PANIC: inconsistent page found, rel 1663/229880/237980, forknum 0, blkno 108, byte offset3845: replay 0x00 primary 0x99 2021-04-21 11:25:30.091 EDT [38891] CONTEXT: WAL redo at 17/A99897FC for SPGist/ADD_LEAF: add leaf to page; off 241; headoff171; parentoff 0; blkref #0: rel 1663/229880/237980, blk 108 FPW 2021-04-21 11:26:59.371 EDT [38889] LOG: startup process (PID 38891) was terminated by signal 6: Abort trap 2021-04-20 19:20:16.114 EDT [34405] PANIC: inconsistent page found, rel 1663/189216/197311, forknum 0, blkno 115, byte offset6149: replay 0x37 primary 0x03 2021-04-20 19:20:16.114 EDT [34405] CONTEXT: WAL redo at 13/3CBFED00 for SPGist/ADD_LEAF: add leaf to page; off 241; headoff171; parentoff 0; blkref #0: rel 1663/189216/197311, blk 115 FPW 2021-04-20 19:21:54.421 EDT [34403] LOG: startup process (PID 34405) was terminated by signal 6: Abort trap 2021-04-20 17:44:09.356 EDT [24106] FATAL: inconsistent page found, rel 1663/135419/143843, forknum 0, blkno 101, byte offset6152: replay 0x40 primary 0x00 2021-04-20 17:44:09.356 EDT [24106] CONTEXT: WAL redo at D/5107D8A8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/135419/143843,blk 101 FPW (Note I modified checkXLogConsistency to PANIC on failure, so I could get a core dump to analyze; and it's also printing the first-mismatch location.) I have not analyzed each one of these failures exhaustively, but on the ones I have looked at closely, the replay_image_masked version of the page appears correct while the primary_image_masked version is *not*. Moreover, the primary_image_masked version does not match the full-page image that I see in the on-disk WAL file. It did however seem to match the in-memory WAL record contents that the decoder is operating on. So unless you want to believe the buggy-hardware theory, something's occasionally messing up while loading WAL records from disk. All of the trouble cases involve records that span across WAL pages (unsurprising since they contain FPIs), so maybe there's something not quite right in there. 
In the cases that I looked at closely, it appeared that there was a block of 32 wrong bytes somewhere within the page image, with the data before and after that being correct. I'm not sure if that pattern holds in all cases though. BTW, if I restart the failed standby, it plows through the same data just fine, confirming that the on-disk WAL is not corrupt. The other PPC machine (with no known history of trouble) is the one that had the CRC failure I showed earlier. That one does seem to be actual bad data in the stored WAL, because the problem was also seen by pg_waldump, and trying to restart the standby got the same failure again. I've not been able to duplicate the consistency-check failures there. But because that machine is a laptop with a much inferior disk drive, the speeds are enough different that it's not real surprising if it doesn't hit the same problem. I've also tried to reproduce on 32-bit and 64-bit Intel, without success. So if this is real, maybe it's related to being big-endian hardware? But it's also quite sensitive to $dunno-what, maybe the history of WAL records that have already been replayed. regards, tom lane
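For anyone wanting to repeat this kind of analysis, the local tweak Tom describes amounts to something like the fragment below inside checkXLogConsistency()'s per-block comparison. It is illustrative only: the variable names follow the surrounding code, while the stock code reports FATAL without a byte offset.

/* After masking both images, find and report the first differing byte. */
for (int off = 0; off < BLCKSZ; off++)
{
    if (replay_image_masked[off] != primary_image_masked[off])
        elog(PANIC,             /* PANIC rather than FATAL, to get a core file */
             "inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u, byte offset %d: replay 0x%02x primary 0x%02x",
             rnode.spcNode, rnode.dbNode, rnode.relNode,
             forknum, blkno, off,
             (unsigned char) replay_image_masked[off],
             (unsigned char) primary_image_masked[off]);
}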
Hi, On 2021-04-21 21:21:05 -0400, Tom Lane wrote: > What I'm doing is running the core regression tests with a single > standby (on the same machine) and wal_consistency_checking = all. Do you run them over replication, or sequentially by storing data into an archive? Just curious, because it's so painful to run that scenario in the replication case due to the tablespace conflicting between primary/standby, unless one disables the tablespace tests. > The other PPC machine (with no known history of trouble) is the one > that had the CRC failure I showed earlier. That one does seem to be > actual bad data in the stored WAL, because the problem was also seen > by pg_waldump, and trying to restart the standby got the same failure > again. It seems like that could also indicate an xlogreader bug that is reliably hit? Once it gets confused about record lengths or such I'd expect CRC failures... If it were actually wrong WAL contents I don't think any of the xlogreader / prefetching changes could be responsible... Have you tried reproducing it on commits before the recent xlogreader changes? commit 1d257577e08d3e598011d6850fd1025858de8c8c Author: Thomas Munro <tmunro@postgresql.org> Date: 2021-04-08 23:03:43 +1200 Optionally prefetch referenced data in recovery. commit f003d9f8721b3249e4aec8a1946034579d40d42c Author: Thomas Munro <tmunro@postgresql.org> Date: 2021-04-08 23:03:34 +1200 Add circular WAL decoding buffer. Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com commit 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b Author: Thomas Munro <tmunro@postgresql.org> Date: 2021-04-08 23:03:23 +1200 Remove read_page callback from XLogReader. Trying 323cbe7c7ddcf18aaf24b7f6d682a45a61d4e31b^ is probably the most interesting bit. > I've not been able to duplicate the consistency-check failures > there. But because that machine is a laptop with a much inferior disk > drive, the speeds are enough different that it's not real surprising > if it doesn't hit the same problem. > > I've also tried to reproduce on 32-bit and 64-bit Intel, without > success. So if this is real, maybe it's related to being big-endian > hardware? But it's also quite sensitive to $dunno-what, maybe the > history of WAL records that have already been replayed. It might just be disk speed influencing how long the tests take, which in turn increases the number of checkpoints happening during the test, increasing the number of FPIs? Greetings, Andres Freund
On Thu, Apr 22, 2021 at 1:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I've also tried to reproduce on 32-bit and 64-bit Intel, without > success. So if this is real, maybe it's related to being big-endian > hardware? But it's also quite sensitive to $dunno-what, maybe the > history of WAL records that have already been replayed. Ah, that's interesting. There are a couple of sparc64 failures and a ppc64 failure in the build farm, but I couldn't immediately spot what was wrong with them or whether it might be related to this stuff. Thanks for the clues. I'll see what unusual systems I can find to try this on....
Andres Freund <andres@anarazel.de> writes: > On 2021-04-21 21:21:05 -0400, Tom Lane wrote: >> What I'm doing is running the core regression tests with a single >> standby (on the same machine) and wal_consistency_checking = all. > Do you run them over replication, or sequentially by storing data into > an archive? Just curious, because its so painful to run that scenario in > the replication case due to the tablespace conflicting between > primary/standby, unless one disables the tablespace tests. No, live over replication. I've been skipping the tablespace test. > Have you tried reproducing it on commits before the recent xlogreader > changes? Nope. regards, tom lane
Hi, On 2021-04-22 13:59:58 +1200, Thomas Munro wrote: > On Thu, Apr 22, 2021 at 1:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I've also tried to reproduce on 32-bit and 64-bit Intel, without > > success. So if this is real, maybe it's related to being big-endian > > hardware? But it's also quite sensitive to $dunno-what, maybe the > > history of WAL records that have already been replayed. > > Ah, that's interesting. There are a couple of sparc64 failures and a > ppc64 failure in the build farm, but I couldn't immediately spot what > was wrong with them or whether it might be related to this stuff. > > Thanks for the clues. I'll see what unusual systems I can find to try > this on.... FWIW, I've run 32 and 64 bit x86 through several hundred regression cycles, without hitting an issue. For a lot of them I set checkpoint_timeout to a lower value as I thought that might make it more likely to reproduce an issue. Tom, any chance you could check if your machine repros the issue before these commits? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Tom, any chance you could check if your machine repros the issue before > these commits? Wilco, but it'll likely take a little while to get results ... regards, tom lane
On Thu, Apr 29, 2021 at 4:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@anarazel.de> writes: > > Tom, any chance you could check if your machine repros the issue before > > these commits? > > Wilco, but it'll likely take a little while to get results ... FWIW I also chewed through many megawatts trying to reproduce this on a PowerPC system in 64 bit big endian mode, with an emulator. No cigar. However, it's so slow that I didn't make it to 10 runs...
Thomas Munro <thomas.munro@gmail.com> writes: > FWIW I also chewed through many megawatts trying to reproduce this on > a PowerPC system in 64 bit big endian mode, with an emulator. No > cigar. However, it's so slow that I didn't make it to 10 runs... Speaking of megawatts ... my G4 has now finished about ten cycles of installcheck-parallel without a failure, which isn't really enough to draw any conclusions yet. But I happened to notice the accumulated CPU time for the background processes: USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND tgl 19048 0.0 4.4 229952 92196 ?? Ss 3:19PM 19:59.19 postgres: startup recovering 000000010000001400000022 tgl 19051 0.0 0.1 229656 1696 ?? Ss 3:19PM 27:09.14 postgres: walreceiver streaming 14/227D8F14 tgl 19052 0.0 0.1 229904 2516 ?? Ss 3:19PM 17:38.17 postgres: walsender tgl [local] streaming 14/227D8F14 IOW, we've spent over twice as many CPU cycles shipping data to the standby as we did in applying the WAL on the standby. Is this expected? I've got wal_consistency_checking = all, which is bloating the WAL volume quite a bit, but still it seems like the walsender and walreceiver have little excuse for spending more cycles per byte than the startup process. (This is testing b3ee4c503, so if Thomas' WAL changes improved efficiency of the replay process at all, the discrepancy could be even worse in HEAD.) regards, tom lane
Hi, On 2021-04-28 19:24:53 -0400, Tom Lane wrote: > But I happened to notice the accumulated CPU time for the background > processes: > > USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND > tgl 19048 0.0 4.4 229952 92196 ?? Ss 3:19PM 19:59.19 postgres: startup recovering 000000010000001400000022 > tgl 19051 0.0 0.1 229656 1696 ?? Ss 3:19PM 27:09.14 postgres: walreceiver streaming 14/227D8F14 > tgl 19052 0.0 0.1 229904 2516 ?? Ss 3:19PM 17:38.17 postgres: walsender tgl [local] streaming 14/227D8F14 > > IOW, we've spent over twice as many CPU cycles shipping data to the > standby as we did in applying the WAL on the standby. Is this > expected? I've got wal_consistency_checking = all, which is bloating > the WAL volume quite a bit, but still it seems like the walsender and > walreceiver have little excuse for spending more cycles per byte > than the startup process. I don't really know how the time calculation works on mac. Is there a chance it includes time spent doing IO? On the primary the WAL IO is done by a lot of backends, but on the standby it's all going to be the walreceiver. And the walreceiver does fsyncs in a not particularly efficient manner. FWIW, on my linux workstation no such difference is visible: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND andres 2910540 9.4 0.0 2237252 126680 ? Ss 16:55 0:20 postgres: dev assert standby: startup recovering 00000001000000020000003F andres 2910544 5.2 0.0 2236724 9260 ? Ss 16:55 0:11 postgres: dev assert standby: walreceiver streaming 2/3FDCF118 andres 2910545 2.1 0.0 2237036 10672 ? Ss 16:55 0:04 postgres: dev assert: walsender andres [local] streaming2/3FDCF118 > (This is testing b3ee4c503, so if Thomas' WAL changes improved > efficiency of the replay process at all, the discrepancy could be > even worse in HEAD.) The prefetching isn't enabled by default, so I'd not expect meaningful differences... And even with the prefetching enabled, our normal regression tests largely are resident in s_b, so there shouldn't be much prefetching. Oh! I was about to ask how much shared buffers your primary / standby have. And I think I may actually have reproduce a variant of the issue! I previously had played around with different settings that I thought might increase the likelihood of reproducing the problem. But this time I set shared_buffers lower than before, and got: 2021-04-28 17:03:22.174 PDT [2913840][] LOG: database system was shut down in recovery at 2021-04-28 17:03:11 PDT 2021-04-28 17:03:22.174 PDT [2913840][] LOG: entering standby mode 2021-04-28 17:03:22.178 PDT [2913840][1/0] LOG: redo starts at 2/416C6278 2021-04-28 17:03:37.628 PDT [2913840][1/0] LOG: consistent recovery state reached at 4/7F5C3200 2021-04-28 17:03:37.628 PDT [2913840][1/0] FATAL: invalid memory alloc request size 3053455757 2021-04-28 17:03:37.628 PDT [2913839][] LOG: database system is ready to accept read only connections 2021-04-28 17:03:37.636 PDT [2913839][] LOG: startup process (PID 2913840) exited with exit code 1 This reproduces across restarts. Yay, I guess. Isn't it off that we get a "database system is ready to accept read only connections"? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2021-04-28 19:24:53 -0400, Tom Lane wrote: >> IOW, we've spent over twice as many CPU cycles shipping data to the >> standby as we did in applying the WAL on the standby. > I don't really know how the time calculation works on mac. Is there a > chance it includes time spent doing IO? I'd be pretty astonished if it did. This is basically a NetBSD system remember (in fact, this ancient macOS release is a good deal closer to those roots than modern versions). BSDen have never accounted for time that way AFAIK. Also, the "ps" man page says specifically that that column is CPU time. > Oh! I was about to ask how much shared buffers your primary / standby > have. And I think I may actually have reproduce a variant of the issue! Default configurations, so 128MB each. regards, tom lane
Hi, On 2021-04-28 20:24:43 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Oh! I was about to ask how much shared buffers your primary / standby > > have. > Default configurations, so 128MB each. I thought that possibly initdb would detect less or something... I assume this is 32bit? I did notice that a 32bit test took a lot longer than a 64bit test. But didn't investigate so far. > And I think I may actually have reproduce a variant of the issue! Unfortunately I had not set up things in a way that the primary retains the WAL, making it harder to compare whether it's the WAL that got corrupted or whether it's a decoding bug. I can however say that pg_waldump on the standby's pg_wal does also fail. The failure as part of the backend is "invalid memory alloc request size", whereas in pg_waldump I get the much more helpful: pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect prev-link 416200FF/FF000000 at 4/7F5C3200 In frontend code that allocation actually succeeds, because there is no size check. But in backend code we run into the size check, and thus don't even display a useful error. In 13 the header is validated before allocating space for the record(except if header is spread across pages) - it seems inadvisable to turn that around? Greetings, Andres Freund
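In code terms, the ordering Andres is defending looks roughly like the fragment below. It borrows the real xlogreader.c helper names, but it is a sketch of the idea only, not the actual 13 or 14 code.

/*
 * Assemble the fixed-size record header first (re-reading across a page
 * boundary if necessary), validate it, and only then trust xl_tot_len
 * enough to size the record buffer.
 */
if (!ValidXLogRecordHeader(state, RecPtr, PrevRecPtr, record, randAccess))
    return false;               /* report_invalid_record() already called */

if (record->xl_tot_len > state->readRecordBufSize &&
    !allocate_recordbuf(state, record->xl_tot_len))
    return false;               /* absurd length, or out of memory */

/* ... only now copy/decode the rest of the record ... */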
Hi, On 2021-04-28 17:59:22 -0700, Andres Freund wrote: > I can however say that pg_waldump on the standby's pg_wal does also > fail. The failure as part of the backend is "invalid memory alloc > request size", whereas in pg_waldump I get the much more helpful: > pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect prev-link 416200FF/FF000000 at 4/7F5C3200 There's definitely something broken around continuation records, in XLogFindNextRecord(). Which means that it's not the cause for the server-side issue, but obviously still not good. The conversion of XLogFindNextRecord() to be state machine based basically only works in a narrow set of circumstances. Whenever the end of the first record read is on a different page than the start of the record, we'll endlessly loop. We'll go into XLogFindNextRecord() and return (asking for more data) until we've successfully read the page header. Then we'll enter the second loop. Which will try to read until the end of the first record. But after returning, the first loop will again ask for the page header. Even if that's fixed, the second loop alone has the same problem: As XLogBeginRead() is called unconditionally, we'll start reading the start of the record, discover that it needs data on a second page, return, and do the same thing again. I think it needs something roughly like the attached. Greetings, Andres Freund
Attachment
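The shape of the fix implied by that description is roughly "position once, then keep feeding the same in-progress read". The sketch below is not the attached patch; try_read_record() and feed_next_page() are hypothetical stand-ins for whatever the reworked interface provides.

XLogRecord *record;
char       *errormsg = NULL;

XLogBeginRead(state, first);            /* position exactly once, before the loop */
for (;;)
{
    record = try_read_record(state, &errormsg);     /* hypothetical helper */
    if (record != NULL || errormsg != NULL)
        break;                          /* got a record, or hit a real error */

    /*
     * The reader needs more input: hand it the next page and let the SAME
     * in-progress read continue; do not call XLogBeginRead() again here.
     */
    feed_next_page(state);              /* hypothetical helper */
}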
Hi, On 2021-04-28 17:59:22 -0700, Andres Freund wrote: > I can however say that pg_waldump on the standby's pg_wal does also > fail. The failure as part of the backend is "invalid memory alloc > request size", whereas in pg_waldump I get the much more helpful: > pg_waldump: fatal: error in WAL record at 4/7F5C31C8: record with incorrect prev-link 416200FF/FF000000 at 4/7F5C3200 > > In frontend code that allocation actually succeeds, because there is no > size check. But in backend code we run into the size check, and thus > don't even display a useful error. > > In 13 the header is validated before allocating space for the > record(except if header is spread across pages) - it seems inadvisable > to turn that around? I was now able to reproduce the problem again, and I'm afraid that the bug I hit is likely separate from Tom's. The allocation thing above is the issue in my case: The walsender connection ended (I restarted the primary), thus the startup switches to replaying locally. For some reason the end of the WAL contains non-zero data (I think it's because walreceiver doesn't zero out pages - that's bad!). Because the allocation happens before the header is validated, we reproducibly end up in the mcxt.c ERROR path, failing recovery. To me it looks like a smaller version of the problem is present in < 14, albeit only when the page header is at a record boundary. In that case we don't validate the page header immediately, only once it's completely read. But we do believe the total size, and try to allocate that. There's a really crufty escape hatch (from 70b4f82a4b) to that: /* * Note that in much unlucky circumstances, the random data read from a * recycled segment can cause this routine to be called with a size * causing a hard failure at allocation. For a standby, this would cause * the instance to stop suddenly with a hard failure, preventing it to * retry fetching WAL from one of its sources which could allow it to move * on with replay without a manual restart. If the data comes from a past * recycled segment and is still valid, then the allocation may succeed * but record checks are going to fail so this would be short-lived. If * the allocation fails because of a memory shortage, then this is not a * hard failure either per the guarantee given by MCXT_ALLOC_NO_OOM. */ if (!AllocSizeIsValid(newSize)) return false; but it looks to me like that's pretty much the wrong fix, at least in the case where we've not yet validated the rest of the header. We don't need to allocate all that data before we've read the rest of the *fixed-size* header. It also seems to me that 70b4f82a4b should also have changed walsender to pad out the received data to an 8KB boundary? Greetings, Andres Freund
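The padding point at the end is easy to visualize: when writing a partial page of received WAL into a recycled segment, zero the remainder of that page so that stale bytes from the old segment can never be mistaken for a plausible record header. A sketch of the idea, with invented variable names (the real write path lives in walreceiver.c's XLogWalRcvWrite()):

char    page[XLOG_BLCKSZ];
int     used = bytes_of_new_wal_on_this_page;   /* invented name, < XLOG_BLCKSZ */

memcpy(page, received_bytes, used);             /* invented name */
memset(page + used, 0, XLOG_BLCKSZ - used);     /* clobber stale recycled-segment bytes */
/* then write the full XLOG_BLCKSZ page at its page-aligned offset
 * (e.g. with pg_pwrite()) instead of writing only the "used" bytes */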
Andres Freund <andres@anarazel.de> writes: > I was now able to reproduce the problem again, and I'm afraid that the > bug I hit is likely separate from Tom's. Yeah, I think so --- the symptoms seem quite distinct. My score so far today on the G4 is: 12 error-free regression test cycles on b3ee4c503 (plus one more with shared_buffers set to 16MB, on the strength of your previous hunch --- didn't fail for me though) HEAD failed on the second run with the same symptom as before: 2021-04-28 22:57:17.048 EDT [50479] FATAL: inconsistent page found, rel 1663/58183/69545, forknum 0, blkno 696 2021-04-28 22:57:17.048 EDT [50479] CONTEXT: WAL redo at 4/B72D408 for Heap/INSERT: off 77 flags 0x00; blkref #0: rel 1663/58183/69545,blk 696 FPW This seems to me to be pretty strong evidence that I'm seeing *something* real. I'm currently trying to isolate a specific commit to pin it on. A straight "git bisect" isn't going to work because so many people had broken so many different things right around that date :-(, so it may take awhile to get a good answer. regards, tom lane
On Thu, Apr 29, 2021 at 3:14 PM Andres Freund <andres@anarazel.de> wrote: > To me it looks like a smaller version of the problem is present in < 14, > albeit only when the page header is at a record boundary. In that case > we don't validate the page header immediately, only once it's completely > read. But we do believe the total size, and try to allocate > that. > > There's a really crufty escape hatch (from 70b4f82a4b) to that: Right, I made that problem worse, and that could probably be changed to be no worse than 13 by reordering those operations. PS Sorry for my intermittent/slow responses on this thread this week, as I'm mostly away from the keyboard due to personal commitments. I'll be back in the saddle next week to tidy this up, most likely by reverting. The main thought I've been having about this whole area is that, aside from the lack of general testing of recovery, which we should definitely address[1], what it really needs is a decent test harness to drive it through all interesting scenarios and states at a lower level, independently. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKpRWQ9SxdxxDmTBCJoR0YnFpMBe7kyzY8SUQk%2BHeskxg%40mail.gmail.com
Thomas Munro <thomas.munro@gmail.com> writes: > On Thu, Apr 29, 2021 at 4:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Andres Freund <andres@anarazel.de> writes: >>> Tom, any chance you could check if your machine repros the issue before >>> these commits? >> Wilco, but it'll likely take a little while to get results ... > FWIW I also chewed through many megawatts trying to reproduce this on > a PowerPC system in 64 bit big endian mode, with an emulator. No > cigar. However, it's so slow that I didn't make it to 10 runs... So I've expended a lot of kilowatt-hours over the past several days, and I've got results that are interesting but don't really get us any closer to a resolution. To recap, the test lashup is: * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive) * Standard debug build (--enable-debug --enable-cassert) * Out-of-the-box configuration, except add wal_consistency_checking = all and configure a wal-streaming standby on the same machine * Repeatedly run "make installcheck-parallel", but skip the tablespace test to avoid issues with the standby trying to use the same directory * Delay long enough after each installcheck-parallel to let the standby catch up (the run proper is ~24 min, plus 2 min for catchup) The failures I'm seeing generally look like 2021-05-01 15:33:10.968 EDT [8281] FATAL: inconsistent page found, rel 1663/58186/66338, forknum 0, blkno 19 2021-05-01 15:33:10.968 EDT [8281] CONTEXT: WAL redo at 3/4CE905B8 for Gist/PAGE_UPDATE: ; blkref #0: rel 1663/58186/66338,blk 19 FPW with a variety of WAL record types being named, so it doesn't seem to be specific to any particular record type. I've twice gotten the bogus-checksum-and-then-assertion-failure I reported before: 2021-05-01 17:07:52.992 EDT [17464] LOG: incorrect resource manager data checksum in record at 3/E0073EA4 TRAP: FailedAssertion("state->recordRemainLen > 0", File: "xlogreader.c", Line: 567, PID: 17464) In both of those cases, the WAL on disk was perfectly fine, and the same is true of most of the "inconsistent page" complaints. So the issue definitely seems to be about the startup process mis-reading data that was correctly shipped over. Anyway, the new and interesting data concerns the relative failure rates of different builds: * Recent HEAD (from 4-28 and 5-1): 4 failures in 8 test cycles * Reverting 1d257577e: 1 failure in 8 test cycles * Reverting 1d257577e and f003d9f87: 3 failures in 28 cycles * Reverting 1d257577e, f003d9f87, and 323cbe7c7: 2 failures in 93 cycles That last point means that there was some hard-to-hit problem even before any of the recent WAL-related changes. However, 323cbe7c7 (Remove read_page callback from XLogReader) increased the failure rate by at least a factor of 5, and 1d257577e (Optionally prefetch referenced data) seems to have increased it by another factor of 4. But it looks like f003d9f87 (Add circular WAL decoding buffer) didn't materially change the failure rate. Considering that 323cbe7c7 was supposed to be just refactoring, and 1d257577e is allegedly disabled-by-default, these are surely not the results I was expecting to get. It seems like it's still an open question whether all this is a real bug, or flaky hardware. I have seen occasional kernel freezeups (or so I think -- machine stops responding to keyboard or network input) over the past year or two, so I cannot in good conscience rule out the flaky-hardware theory. But it doesn't smell like that kind of problem to me. 
I think what we're looking at is a timing-sensitive bug that was there before (maybe long before?) and these commits happened to make it occur more often on this particular hardware. This hardware is enough unlike anything made in the past decade that it's not hard to credit that it'd show a timing problem that nobody else can reproduce. (I did try the time-honored ritual of reseating all the machine's RAM, partway through this. Doesn't seem to have changed anything.) Anyway, I'm not sure where to go from here. I'm for sure nowhere near being able to identify the bug --- and if there really is a bug that formerly had a one-in-fifty reproduction rate, I have zero interest in trying to identify where it started by bisecting. It'd take at least a day per bisection step, and even that might not be accurate enough. (But, if anyone has ideas of specific commits to test, I'd be willing to try a few.) regards, tom lane
On Thu, Apr 29, 2021 at 12:24 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2021-04-28 19:24:53 -0400, Tom Lane wrote: > >> IOW, we've spent over twice as many CPU cycles shipping data to the > >> standby as we did in applying the WAL on the standby. > > > I don't really know how the time calculation works on mac. Is there a > > chance it includes time spent doing IO? For comparison, on a modern Linux system I see numbers like this, while running that 025_stream_rep_regress.pl test I posted in a nearby thread: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND tmunro 2150863 22.5 0.0 55348 6752 ? Ss 12:59 0:07 postgres: standby_1: startup recovering 00000001000000020000003C tmunro 2150867 17.5 0.0 55024 6364 ? Ss 12:59 0:05 postgres: standby_1: walreceiver streaming 2/3C675D80 tmunro 2150868 11.7 0.0 55296 7192 ? Ss 12:59 0:04 postgres: primary: walsender tmunro [local] streaming 2/3C675D80 Those ratios are better but it's still hard work, and perf shows the CPU time is all in page cache schlep: 22.44% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string 20.12% postgres [kernel.kallsyms] [k] __add_to_page_cache_locked 7.30% postgres [kernel.kallsyms] [k] iomap_set_page_dirty That was with all three patches reverted, so it's nothing new. Definitely room for improvement... there have been a few discussions about not using a buffered file for high-frequency data exchange and relaxing various timing rules, which we should definitely look into, but I wouldn't be at all surprised if HFS+ was just much worse at this. Thinking more about good old HFS+... I guess it's remotely possible that there might have been coherency bugs in that could be exposed by our usage pattern, but then that doesn't fit too well with the clues I have from light reading: this is a non-SMP system, and it's said that HFS+ used to serialise pretty much everything on big filesystem locks anyway.
On Sun, May 2, 2021 at 3:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > That last point means that there was some hard-to-hit problem even > before any of the recent WAL-related changes. However, 323cbe7c7 > (Remove read_page callback from XLogReader) increased the failure > rate by at least a factor of 5, and 1d257577e (Optionally prefetch > referenced data) seems to have increased it by another factor of 4. > But it looks like f003d9f87 (Add circular WAL decoding buffer) > didn't materially change the failure rate. Oh, wow. There are several surprising results there. Thanks for running those tests for so long so that we could see the rarest failures. Even if there are somehow *two* causes of corruption, one preexisting and one added by the refactoring or decoding patches, I'm struggling to understand how the chance increases with 1d2575, since that only adds code that isn't reached when not enabled (though I'm going to re-review that). > Considering that 323cbe7c7 was supposed to be just refactoring, > and 1d257577e is allegedly disabled-by-default, these are surely > not the results I was expecting to get. +1 > It seems like it's still an open question whether all this is > a real bug, or flaky hardware. I have seen occasional kernel > freezeups (or so I think -- machine stops responding to keyboard > or network input) over the past year or two, so I cannot in good > conscience rule out the flaky-hardware theory. But it doesn't > smell like that kind of problem to me. I think what we're looking > at is a timing-sensitive bug that was there before (maybe long > before?) and these commits happened to make it occur more often > on this particular hardware. This hardware is enough unlike > anything made in the past decade that it's not hard to credit > that it'd show a timing problem that nobody else can reproduce. Hmm, yeah that does seem plausible. It would be nice to see a report from any other system though. I'm still trying, and reviewing...
On 5/3/21 7:42 AM, Thomas Munro wrote: > On Sun, May 2, 2021 at 3:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> That last point means that there was some hard-to-hit problem even >> before any of the recent WAL-related changes. However, 323cbe7c7 >> (Remove read_page callback from XLogReader) increased the failure >> rate by at least a factor of 5, and 1d257577e (Optionally prefetch >> referenced data) seems to have increased it by another factor of 4. >> But it looks like f003d9f87 (Add circular WAL decoding buffer) >> didn't materially change the failure rate. > > Oh, wow. There are several surprising results there. Thanks for > running those tests for so long so that we could see the rarest > failures. > > Even if there are somehow *two* causes of corruption, one preexisting > and one added by the refactoring or decoding patches, I'm struggling > to understand how the chance increases with 1d2575, since that only > adds code that isn't reached when not enabled (though I'm going to > re-review that). > >> Considering that 323cbe7c7 was supposed to be just refactoring, >> and 1d257577e is allegedly disabled-by-default, these are surely >> not the results I was expecting to get. > > +1 > >> It seems like it's still an open question whether all this is >> a real bug, or flaky hardware. I have seen occasional kernel >> freezeups (or so I think -- machine stops responding to keyboard >> or network input) over the past year or two, so I cannot in good >> conscience rule out the flaky-hardware theory. But it doesn't >> smell like that kind of problem to me. I think what we're looking >> at is a timing-sensitive bug that was there before (maybe long >> before?) and these commits happened to make it occur more often >> on this particular hardware. This hardware is enough unlike >> anything made in the past decade that it's not hard to credit >> that it'd show a timing problem that nobody else can reproduce. > > Hmm, yeah that does seem plausible. It would be nice to see a report > from any other system though. I'm still trying, and reviewing... > FWIW I've ran the test (make installcheck-parallel in a loop) on four different machines - two x86_64 ones, and two rpi4. The x86 boxes did ~1000 rounds each (and one of them had 5 local replicas) without any issue. The rpi4 machines did ~50 rounds each, also without failures. Obviously, it's possible there's something that neither of those (very different systems) triggers, but I'd say it might also be a hint that this really is a hw issue on the old ppc macs. Or maybe something very specific to that arch. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tomas Vondra <tomas.vondra@enterprisedb.com> writes: > On 5/3/21 7:42 AM, Thomas Munro wrote: >> Hmm, yeah that does seem plausible. It would be nice to see a report >> from any other system though. I'm still trying, and reviewing... > FWIW I've ran the test (make installcheck-parallel in a loop) on four > different machines - two x86_64 ones, and two rpi4. The x86 boxes did > ~1000 rounds each (and one of them had 5 local replicas) without any > issue. The rpi4 machines did ~50 rounds each, also without failures. Yeah, I have also spent a fair amount of time trying to reproduce it elsewhere, without success so far. Notably, I've been trying on a PPC Mac laptop that has a fairly similar CPU to what's in the G4, though a far slower disk drive. So that seems to exclude theories based on it being PPC-specific. I suppose that if we're unable to reproduce it on at least one other box, we have to write it off as hardware flakiness. I'm not entirely comfortable with that answer, but I won't push for reversion of the WAL patches without more evidence that there's a real issue. regards, tom lane
I wrote: > I suppose that if we're unable to reproduce it on at least one other box, > we have to write it off as hardware flakiness.

BTW, that conclusion shouldn't distract us from the very real bug that Andres identified. I was just scraping the buildfarm logs concerning recent failures, and I found several recent cases that match the symptom he reported:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2021-04-23%2022%3A27%3A41
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2021-04-21%2005%3A15%3A24
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2021-04-20%2002%3A03%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2021-05-04%2004%3A07%3A41
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-04-20%2021%3A08%3A59

They all show the standby in recovery/019_replslot_limit.pl failing with symptoms like

2021-05-04 07:42:00.968 UTC [24707406:1] LOG: database system was shut down in recovery at 2021-05-04 07:41:39 UTC
2021-05-04 07:42:00.968 UTC [24707406:2] LOG: entering standby mode
2021-05-04 07:42:01.050 UTC [24707406:3] LOG: redo starts at 0/1C000D8
2021-05-04 07:42:01.079 UTC [24707406:4] LOG: consistent recovery state reached at 0/1D00000
2021-05-04 07:42:01.079 UTC [24707406:5] FATAL: invalid memory alloc request size 1476397045
2021-05-04 07:42:01.080 UTC [13238274:3] LOG: database system is ready to accept read only connections
2021-05-04 07:42:01.082 UTC [13238274:4] LOG: startup process (PID 24707406) exited with exit code 1

(BTW, the behavior seen here where the failure occurs *immediately* after reporting "consistent recovery state reached" is seen in the other reports as well, including Andres' version. I wonder if that means anything.)

regards, tom lane
Hi, On 2021-05-04 15:47:41 -0400, Tom Lane wrote: > BTW, that conclusion shouldn't distract us from the very real bug > that Andres identified. I was just scraping the buildfarm logs > concerning recent failures, and I found several recent cases > that match the symptom he reported: > [...] > They all show the standby in recovery/019_replslot_limit.pl failing > with symptoms like
>
> 2021-05-04 07:42:00.968 UTC [24707406:1] LOG: database system was shut down in recovery at 2021-05-04 07:41:39 UTC
> 2021-05-04 07:42:00.968 UTC [24707406:2] LOG: entering standby mode
> 2021-05-04 07:42:01.050 UTC [24707406:3] LOG: redo starts at 0/1C000D8
> 2021-05-04 07:42:01.079 UTC [24707406:4] LOG: consistent recovery state reached at 0/1D00000
> 2021-05-04 07:42:01.079 UTC [24707406:5] FATAL: invalid memory alloc request size 1476397045
> 2021-05-04 07:42:01.080 UTC [13238274:3] LOG: database system is ready to accept read only connections
> 2021-05-04 07:42:01.082 UTC [13238274:4] LOG: startup process (PID 24707406) exited with exit code 1

Yea, that's the pre-existing end-of-log issue that got more likely as well as more consequential (by accident) in Thomas' patch. It's easy to reach parity with the state in 13; it's just changing the order in one place. But I think we need to do something for all branches here. The bandaid that was added to allocate_recordbuf() doesn't really seem sufficient to me. This is

commit 70b4f82a4b5cab5fc12ff876235835053e407155
Author: Michael Paquier <michael@paquier.xyz>
Date: 2018-06-18 10:43:27 +0900

    Prevent hard failures of standbys caused by recycled WAL segments

In <= 13 the current state is that we'll allocate effectively random bytes as long as the random number is below 1GB whenever we reach the end of the WAL with the record on a page boundary (because there we don't validate the header first). That allocation is then not freed for the lifetime of the xlogreader. And for FRONTEND uses of xlogreader we'll just happily allocate 4GB. The specific problem here is that we don't validate the record header before allocating when the record header is split across a page boundary - without much need as far as I can tell? Until we've read the entire header, we actually don't need to allocate the record buffer?

This seems like an issue that needs to be fixed to be more robust in crash recovery scenarios where obviously we could just have failed with half-written records.

But the issue that 70b4f82a4b is trying to address seems bigger to me. The reason it's so easy to hit the issue is that walreceiver does < 8KB writes into recycled WAL segments *without* zero-filling the tail end of the page - which will commonly be filled with random older contents, because we'll use a recycled segment. I think that *drastically* increases the likelihood of finding something that looks like a valid record header compared to the situation on a primary where zeroing pages before use makes that pretty unlikely.

> (BTW, the behavior seen here where the failure occurs *immediately* > after reporting "consistent recovery state reached" is seen in the > other reports as well, including Andres' version. I wonder if that > means anything.)

That's to be expected, I think. There's not a lot of data that needs to be replayed, and we'll always reach consistency before the end of the WAL unless you're dealing with starting from an in-progress base-backup that hasn't yet finished or such.
The test causes replication to fail shortly after that, so we'll always switch to doing recovery from pg_wal, which then will hit the end of the WAL, hitting this issue with, I think, ~25% likelihood (data in recycled WAL segments is probably *roughly* evenly distributed, and any 4-byte value above 1GB will hit this error in 14). Greetings, Andres Freund
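To make the ordering problem described above concrete, here is a minimal standalone sketch in plain C (the struct, the length cap and the function names are invented stand-ins, not the actual xlogreader code): if the claimed record length is sanity-checked before any allocation happens, garbage left behind in a recycled segment produces a clean end-of-WAL style failure instead of an attempt to allocate a gigantic buffer.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a WAL record header; the real layout differs. */
typedef struct FakeRecordHeader
{
    uint32_t xl_tot_len;            /* claimed total record length */
    uint32_t xl_other;              /* placeholder for the remaining fields */
} FakeRecordHeader;

#define MAX_SANE_RECORD_LEN (1024 * 1024)   /* arbitrary cap for this sketch */

/*
 * Allocate a buffer for the record, but only after the claimed length has
 * passed a sanity check.  With garbage header bytes (e.g. the tail of a
 * recycled, non-zeroed page) we fail cleanly instead of attempting a huge
 * allocation.
 */
static void *
alloc_record_checked(const FakeRecordHeader *hdr)
{
    if (hdr->xl_tot_len < sizeof(FakeRecordHeader) ||
        hdr->xl_tot_len > MAX_SANE_RECORD_LEN)
    {
        fprintf(stderr, "bogus record length %u: treating as end of WAL\n",
                hdr->xl_tot_len);
        return NULL;
    }
    return malloc(hdr->xl_tot_len);
}

int
main(void)
{
    FakeRecordHeader garbage;
    void       *buf;

    /* Pretend we read leftover bytes from a recycled, non-zeroed page. */
    memset(&garbage, 0x57, sizeof(garbage));    /* 0x57575757 is ~1.4 GB */

    buf = alloc_record_checked(&garbage);
    if (buf == NULL)
        printf("stopped at apparent end of WAL\n");
    else
        free(buf);
    return 0;
}

The point is only the ordering: validate everything that can be validated before trusting the length field enough to allocate for it.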
Hi, On 2021-05-04 09:46:12 -0400, Tom Lane wrote: > Yeah, I have also spent a fair amount of time trying to reproduce it > elsewhere, without success so far. Notably, I've been trying on a > PPC Mac laptop that has a fairly similar CPU to what's in the G4, > though a far slower disk drive. So that seems to exclude theories > based on it being PPC-specific. > > I suppose that if we're unable to reproduce it on at least one other box, > we have to write it off as hardware flakiness. I wonder if there's a chance what we're seeing is an OS memory ordering bug, or a race between walreceiver writing data and the startup process reading it. When the startup process is able to keep up, there often will be a very small time delta between the startup process reading a page that the walreceiver just wrote. And if the currently read page was the tail page written to by a 'w' message, it'll often be written to again in short order - potentially while the startup process is reading it. It'd not terribly surprise me if an old OS version on an old processor had some issues around that. Were there any cases of walsender terminating and reconnecting around the failures? It looks suspicious that XLogPageRead() does not invalidate the xlogreader state when retrying. Normally that's xlogreader's responsibility, but there is that whole XLogReaderValidatePageHeader() business. But I don't quite see how it'd actually cause problems. Greetings, Andres Freund
Hi, On 2021-05-04 18:08:35 -0700, Andres Freund wrote: > But the issue that 70b4f82a4b is trying to address seems bigger to > me. The reason it's so easy to hit the issue is that walreceiver does < > 8KB writes into recycled WAL segments *without* zero-filling the tail > end of the page - which will commonly be filled with random older > contents, because we'll use a recycled segment. I think that > *drastically* increases the likelihood of finding something that looks > like a valid record header compared to the situation on a primary where > zeroing pages before use makes that pretty unlikely.

I've written an experimental patch to deal with this and, as expected, it does make the end-of-wal detection a lot more predictable and reliable. There are only two types of possible errors outside of crashes: a record length of 0 (the end of WAL is within a page), and the page header LSN mismatching (the end of WAL is at a page boundary). This seems like a significant improvement.

However: It's nontrivial to do this nicely and in a backpatchable way in XLogWalRcvWrite(). Or at least I haven't found a good way:

- We can't extend the input buffer to XLogWalRcvWrite(), it's from libpq.
- We don't want to copy the entire buffer (commonly 128KiB) to a new buffer that we then can extend by 0-BLCKSZ of zeroes to cover the trailing part of the last page.
- In PG13+ we can do this utilizing pg_writev(), adding another IOV entry covering the trailing space to be padded.
- It's nicer to avoid increasing the number of write() calls, but it's not as crucial as the earlier points.

I'm also a bit uncomfortable with another aspect, although I can't really see a problem: When switching to receiving WAL via walreceiver, we always start at a segment boundary, even if we had received most of that segment before. Currently that won't end up with any trailing space that needs to be zeroed, because the server always will send 128KB chunks, but there's no formal guarantee for that. It seems a bit odd that we could end up zeroing trailing space that already contains valid data, just to overwrite it with valid data again. But it ought to always be fine.

The least offensive way I could come up with is for XLogWalRcvWrite() to always write partial pages in a separate pg_pwrite(). When writing a partial page, and the previous write position was not already on that same page, copy the buffer into a local XLOG_BLCKSZ sized buffer (although we'll never use more than XLOG_BLCKSZ-1 I think), and (re)zero out the trailing part.

One thing this does not yet handle is a partial write - we'd not notice that we still need to pad the end of the page.

Does anybody have a better idea? I really wish we had a version of pg_p{read,write}[v] that internally handled partial IOs, retrying as long as they see > 0 bytes written.

Greetings, Andres Freund
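As an illustration of the zero-filling idea, here is a rough standalone sketch (all names invented, plain POSIX pwrite(), no partial-write retry or error reporting; the real XLogWalRcvWrite() additionally has to deal with segment boundaries and file management): the final partial page of an incoming chunk is staged in a local page-sized buffer whose tail is zeroed before being written, so a reader can never see stale bytes from the recycled segment beyond the end of the valid data.

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define FAKE_PAGE_SIZE 8192          /* stand-in for XLOG_BLCKSZ */

/*
 * Write 'len' bytes at page-aligned offset 'off', zero-padding the tail of
 * the last page if the data ends mid-page.  Sketch only: no retry on short
 * writes, no errno handling.
 */
static int
write_with_zero_padded_tail(int fd, const char *buf, size_t len, off_t off)
{
    size_t      full = (len / FAKE_PAGE_SIZE) * FAKE_PAGE_SIZE;
    size_t      tail = len - full;

    /* Full pages can go out unchanged. */
    if (full > 0 && pwrite(fd, buf, full, off) != (ssize_t) full)
        return -1;

    if (tail > 0)
    {
        char        page[FAKE_PAGE_SIZE];

        /* Copy the partial page and zero everything after the valid data. */
        memcpy(page, buf + full, tail);
        memset(page + tail, 0, FAKE_PAGE_SIZE - tail);

        if (pwrite(fd, page, FAKE_PAGE_SIZE, off + full) != FAKE_PAGE_SIZE)
            return -1;
    }
    return 0;
}

Always writing the whole padded page keeps the logic simple, at the cost of rewriting up to one page's worth of bytes that a later chunk will overwrite again anyway.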
On Thu, Apr 22, 2021 at 11:22 AM Stephen Frost <sfrost@snowman.net> wrote: > On Wed, Apr 21, 2021 at 19:17 Thomas Munro <thomas.munro@gmail.com> wrote: >> On Thu, Apr 22, 2021 at 8:16 AM Thomas Munro <thomas.munro@gmail.com> wrote: >> ... Personally I think the right thing to do now is to revert it >> and re-propose for 15 early in the cycle, supported with some better >> testing infrastructure. > > I tend to agree with the idea to revert it, perhaps a +0 on that, but if others argue it should be fixed in-place, I wouldn't complain about it.

Reverted.

Note: eelpout may return a couple of failures because it's set up to run with recovery_prefetch=on (now an unknown GUC), and it'll be a few hours before I can access that machine to adjust that...

> I very much encourage the idea of improving testing in this area and would be happy to try and help do so in the 15 cycle.

Cool. I'm going to try out some ideas.
> On 10 May 2021, at 06:11, Thomas Munro <thomas.munro@gmail.com> wrote: > On Thu, Apr 22, 2021 at 11:22 AM Stephen Frost <sfrost@snowman.net> wrote: >> I tend to agree with the idea to revert it, perhaps a +0 on that, but if others argue it should be fixed in-place, I wouldn't complain about it. > > Reverted. > > Note: eelpout may return a couple of failures because it's set up to > run with recovery_prefetch=on (now an unknown GUC), and it'll be a few > hours before I can access that machine to adjust that... > >> I very much encourage the idea of improving testing in this area and would be happy to try and help do so in the 15 cycle. > > Cool. I'm going to try out some ideas.

Skimming this thread without all the context, it's not entirely clear which patch the CF entry relates to (I assume it's the one from April 7 based on the attached mail-id, but there is a revert from May?), and the CF app and CF bot are also in disagreement about which one is the latest. Could you post an updated version of the patch which is for review? -- Daniel Gustafsson https://vmware.com/
On Mon, Nov 15, 2021 at 11:31 PM Daniel Gustafsson <daniel@yesql.se> wrote: > Could you post an updated version of the patch which is for review? Sorry for taking so long to come back; I learned some new things that made me want to restructure this code a bit (see below). Here is an updated pair of patches that I'm currently testing. Old problems: 1. Last time around, an infinite loop was reported in pg_waldump. I believe Horiguchi-san has fixed that[1], but I'm no longer depending on that patch. I thought his patch set was a good idea, but it's complicated and there's enough going on here already... let's consider that independently. This version goes back to what I had earlier, though (I hope) it is better about how "nonblocking" states are communicated. In this version, XLogPageRead() has a way to give up part way through a record if it doesn't have enough data and there are queued up records that could be replayed right now. In that case, we'll go back to the beginning of the record (and occasionally, back a WAL page) next time we try. That's the cost of not maintaining intra-record decoding state. 2. Last time around, we could try to allocate a crazy amount of memory when reading garbage past the end of the WAL. Fixed, by validating first, like in master. New work: Since last time, I went away and worked on a "real" AIO version of this feature. That's ongoing experimental work for a future proposal, but I have a working prototype and I aim to share that soon, when that branch is rebased to catch up with recent changes. In that version, the prefetcher starts actual reads into the buffer pool, and recovery receives already pinned buffers attached to the stream of records it's replaying. That inspired a couple of refactoring changes to this non-AIO version, to minimise the difference and anticipate the future work better: 1. The logic for deciding which block to start prefetching next is moved into a new callback function in a sort of standard form (this is approximately how all/most prefetching code looks in the AIO project, ie sequential scans, bitmap heap scan, etc). 2. The logic for controlling how many IOs are running and deciding when to call the above is in a separate component. In this non-AIO version, it works using a simple ring buffer of LSNs to estimate the number of in flight I/Os, just like before. This part would be thrown away and replaced with the AIO branch's centralised "streaming read" mechanism which tracks I/O completions based on a stream of completion events from the kernel (or I/O worker processes). 3. In this version, the prefetcher still doesn't pin buffers, for simplicity. That work did force me to study places where WAL streams need prefetching "barriers", though, so in this patch you can see that it's now a little more careful than it probably needs to be. (It doesn't really matter much if you call posix_fadvise() on a non-existent file region, or the wrong file after OID wraparound and reuse, but it would matter if you actually read it into a buffer, and if an intervening record might be trying to drop something you have pinned). Some other changes: 1. I dropped the GUC recovery_prefetch_fpw. I think it was a possibly useful idea but it's a niche concern and not worth worrying about for now. 2. I simplified the stats. Coming up with a good running average system seemed like a problem for another day (the numbers before were hard to interpret). 
The new stats are super simple counters and instantaneous values:

postgres=# select * from pg_stat_prefetch_recovery ;
-[ RECORD 1 ]--+------------------------------
stats_reset    | 2021-11-10 09:02:08.590217+13
prefetch       | 13605674 <- times we called posix_fadvise()
hit            | 24185289 <- times we found pages already cached
skip_init      | 217215   <- times we did nothing because init, not read
skip_new       | 192347   <- times we skipped because relation too small
skip_fpw       | 27429    <- times we skipped because fpw, not read
wal_distance   | 10648    <- how far ahead in WAL bytes
block_distance | 134      <- how far ahead in block references
io_depth       | 50       <- fadvise() calls not yet followed by pread()

I also removed the code to save and restore the stats via the stats collector, for now. I figured that persistent stats could be a later feature, perhaps after the shared memory stats stuff?

3. I dropped the code that was caching an SMgrRelation pointer to avoid smgropen() calls that showed up in some profiles. That probably lacked invalidation that could be done with some more WAL analysis, but I decided to leave it out completely for now for simplicity.

4. I dropped the verbose logging. I think it might make sense to integrate with the new "recovery progress" system, but I think that should be a separate discussion. If you want to see the counters after crash recovery finishes, you can look at the stats view.

[1] https://commitfest.postgresql.org/34/2113/
Attachment
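The "simple ring buffer of LSNs" used in the message above to estimate the number of I/Os in flight could look roughly like this (a standalone toy with invented names, not the patch's actual data structure): each posix_fadvise() call records the LSN of the WAL record that referenced the block, entries are retired as replay advances past them, and the number of live entries approximates the io_depth counter shown in the stats view.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t FakeLSN;            /* stand-in for XLogRecPtr */

#define QUEUE_SIZE 64                /* maximum prefetches we track */

typedef struct PrefetchQueue
{
    FakeLSN lsns[QUEUE_SIZE];        /* LSNs of records whose blocks we advised */
    int     head;                    /* next slot to insert into */
    int     tail;                    /* oldest live entry */
    int     inflight;                /* current estimated I/O depth */
} PrefetchQueue;

/* Remember that we issued advice for a block referenced at 'lsn'. */
static int
prefetch_queue_push(PrefetchQueue *q, FakeLSN lsn)
{
    if (q->inflight == QUEUE_SIZE)
        return 0;                    /* full: caller should stop looking ahead */
    q->lsns[q->head] = lsn;
    q->head = (q->head + 1) % QUEUE_SIZE;
    q->inflight++;
    return 1;
}

/* Retire entries once replay has reached or passed their LSN. */
static void
prefetch_queue_complete(PrefetchQueue *q, FakeLSN replayed_upto)
{
    while (q->inflight > 0 && q->lsns[q->tail] <= replayed_upto)
    {
        q->tail = (q->tail + 1) % QUEUE_SIZE;
        q->inflight--;
    }
}

int
main(void)
{
    PrefetchQueue q = {0};

    /* Advise for blocks referenced at LSNs 100, 200 and 300... */
    for (FakeLSN lsn = 100; lsn <= 300; lsn += 100)
        prefetch_queue_push(&q, lsn);

    /* ...then replay catches up to LSN 200. */
    prefetch_queue_complete(&q, 200);
    printf("estimated I/O depth: %d\n", q.inflight);   /* prints 1 */
    return 0;
}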
Hi, It's great you posted a new version of this patch, so I took a brief look at it. The code seems in pretty good shape, I haven't found any real issues - just two minor comments:

This seems a bit strange:

#define DEFAULT_DECODE_BUFFER_SIZE 0x10000

Why not to define this as a simple decimal value? Is there something special about this particular value, or is it arbitrary? I guess it's simply the minimum for wal_decode_buffer_size GUC, but why not to use the GUC for all places decoding WAL?

FWIW I don't think we include updates to typedefs.list in patches.

I also repeated the benchmarks I did at the beginning of the year [1]. Attached is a chart with four different configurations:

1) master (f79962d826)
2) patched (with prefetching disabled)
3) patched (with default configuration)
4) patched (with I/O concurrency 256 and 2MB decode buffer)

For all configs the shared buffers were set to 64GB, checkpoints every 20 minutes, etc.

The results are pretty good / similar to previous results. Replaying the 1h worth of work on a smaller machine takes ~5:30h without prefetching (master or with prefetching disabled). With prefetching enabled this drops to ~2h (default config) and ~1h (with tuning).

regards

[1] https://www.postgresql.org/message-id/c5d52837-6256-0556-ac8c-d6d3d558820a%40enterprisedb.com

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > The results are pretty good / similar to previous results. Replaying the > 1h worth of work on a smaller machine takes ~5:30h without prefetching > (master or with prefetching disabled). With prefetching enabled this > drops to ~2h (default config) and ~1h (with tuning). Thanks for testing! Wow, that's a nice graph. This has bit-rotted already due to Robert's work on ripping out globals, so I'll post a rebase early next week, and incorporate your code feedback.
On 11/26/21 22:16, Thomas Munro wrote: > On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> The results are pretty good / similar to previous results. Replaying the >> 1h worth of work on a smaller machine takes ~5:30h without prefetching >> (master or with prefetching disabled). With prefetching enabled this >> drops to ~2h (default config) and ~1h (with tuning). > > Thanks for testing! Wow, that's a nice graph. > > This has bit-rotted already due to Robert's work on ripping out > globals, so I'll post a rebase early next week, and incorporate your > code feedback. > One thing that's not clear to me is what happened to the reasons why this feature was reverted in the PG14 cycle? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Nov 27, 2021 at 12:34 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > One thing that's not clear to me is what happened to the reasons why > this feature was reverted in the PG14 cycle? Reasons for reverting: 1. A bug in commit 323cbe7c, "Remove read_page callback from XLogReader.". I couldn't easily revert just that piece. This new version doesn't depend on that change anymore, to try to keep things simple. (That particular bug has been fixed in a newer version of that patch[1], which I still think was a good idea incidentally.) 2. A bug where allocation for large records happened before validation. Concretely, you can see that this patch does XLogReadRecordAlloc() after validating the header (usually, same as master), but commit f003d9f8 did it first. (Though Andres pointed out[2] that more work is needed on that to make that logic more robust, and I'm keen to look into that, but that's independent of this work). 3. A wild goose chase for bugs on Tom Lane's antique 32 bit PPC machine. Tom eventually reproduced it with the patches reverted, which seemed to exonerate them but didn't leave a good feeling: what was happening, and why did the patches hugely increase the likelihood of the failure mode? I have no new information on that, but I know that several people spent a huge amount of time and effort trying to reproduce it on various types of systems, as did I, so despite not reaching a conclusion of a bug, this certainly contributed to a feeling that the patch had run out of steam for the 14 cycle. This week I'll have another crack at getting that TAP test I proposed that runs the regression tests with a streaming replica to work on Windows. That does approximately what Tom was doing when he saw problem #3, which I'd like to have as standard across the build farm. [1] https://www.postgresql.org/message-id/20211007.172820.1874635561738958207.horikyota.ntt%40gmail.com [2] https://www.postgresql.org/message-id/20210505010835.umylslxgq4a6rbwg%40alap3.anarazel.de
Thomas Munro <thomas.munro@gmail.com> writes: > On Sat, Nov 27, 2021 at 12:34 PM Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> One thing that's not clear to me is what happened to the reasons why >> this feature was reverted in the PG14 cycle? > 3. A wild goose chase for bugs on Tom Lane's antique 32 bit PPC > machine. Tom eventually reproduced it with the patches reverted, > which seemed to exonerate them but didn't leave a good feeling: what > was happening, and why did the patches hugely increase the likelihood > of the failure mode? I have no new information on that, but I know > that several people spent a huge amount of time and effort trying to > reproduce it on various types of systems, as did I, so despite not > reaching a conclusion of a bug, this certainly contributed to a > feeling that the patch had run out of steam for the 14 cycle. Yeah ... on the one hand, that machine has shown signs of hard-to-reproduce flakiness, so it's easy to write off the failures I saw as hardware issues. On the other hand, the flakiness I've seen has otherwise manifested as kernel crashes, which is nothing like the consistent test failures I was seeing with the patch. Andres speculated that maybe we were seeing a kernel bug that affects consistency of concurrent reads and writes. That could be an explanation; but it's just evidence-free speculation so far, so I don't feel real convinced by that idea either. Anyway, I hope to find time to see if the issue still reproduces with Thomas' new patch set. regards, tom lane
Hi Thomas,
I am unable to apply this new set of patches on HEAD. Can you please share the rebased patch or, if you have any work branch, can you please point it out? I will refer to it for the changes.
--
With Regards,
Ashutosh sharma.
On Fri, Nov 26, 2021 at 9:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah ... on the one hand, that machine has shown signs of > hard-to-reproduce flakiness, so it's easy to write off the failures > I saw as hardware issues. On the other hand, the flakiness I've > seen has otherwise manifested as kernel crashes, which is nothing > like the consistent test failures I was seeing with the patch. > > Andres speculated that maybe we were seeing a kernel bug that > affects consistency of concurrent reads and writes. That could > be an explanation; but it's just evidence-free speculation so far, > so I don't feel real convinced by that idea either. > > Anyway, I hope to find time to see if the issue still reproduces > with Thomas' new patch set. Honestly, all the reasons that Thomas articulated for the revert seem relatively unimpressive from my point of view. Perhaps they are sufficient justification for a revert so near to the end of the development cycle, but that's just an argument for committing things a little sooner so we have time to work out the kinks. This kind of work is too valuable to get hung up for a year or three because of a couple of minor preexisting bugs and/or preexisting maybe-bugs. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, 26 Nov 2021 at 21:47, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Yeah ... on the one hand, that machine has shown signs of > hard-to-reproduce flakiness, so it's easy to write off the failures > I saw as hardware issues. On the other hand, the flakiness I've > seen has otherwise manifested as kernel crashes, which is nothing > like the consistent test failures I was seeing with the patch.

Hm. I asked around and found a machine I can use that can run PPC binaries, but it's actually, well, confusing. I think this is an x86 machine running Leopard which uses JIT to transparently run PPC binaries. I'm not sure this is really a good test. But if you're interested and can explain the tests to run I can try to get the tests running on this machine:

IBUILD:~ gsstark$ uname -a
Darwin IBUILD.MIT.EDU 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386
IBUILD:~ gsstark$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.5.8
BuildVersion: 9L31a
The actual hardware of this machine is a Mac Mini Core 2 Duo. I'm not really clear how the emulation is done and whether it makes a reasonable test environment or not.

Hardware Overview:
  Model Name: Mac mini
  Model Identifier: Macmini2,1
  Processor Name: Intel Core 2 Duo
  Processor Speed: 2 GHz
  Number Of Processors: 1
  Total Number Of Cores: 2
  L2 Cache: 4 MB
  Memory: 2 GB
  Bus Speed: 667 MHz
  Boot ROM Version: MM21.009A.B00
Greg Stark <stark@mit.edu> writes: > But if you're interested and can explain the tests to run I can try to > get the tests running on this machine:

I'm not sure that machine is close enough to prove much, but by all means give it a go if you wish. My test setup was explained in [1]:

>> To recap, the test lashup is:
>> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive)
>> * Standard debug build (--enable-debug --enable-cassert)
>> * Out-of-the-box configuration, except add wal_consistency_checking = all
>>   and configure a wal-streaming standby on the same machine
>> * Repeatedly run "make installcheck-parallel", but skip the tablespace
>>   test to avoid issues with the standby trying to use the same directory
>> * Delay long enough after each installcheck-parallel to let the
>>   standby catch up (the run proper is ~24 min, plus 2 min for catchup)

Remember also that the code in question is not in HEAD; you'd need to apply Munro's patches, or check out some commit from around 2021-04-22.

regards, tom lane

[1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us
What tools and tool versions are you using to build? Is it just GCC for PPC? There aren't any special build processes to make a fat binary involved? On Thu, 16 Dec 2021 at 23:11, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Greg Stark <stark@mit.edu> writes: > > But if you're interested and can explain the tests to run I can try to > > get the tests running on this machine: > > I'm not sure that machine is close enough to prove much, but by all > means give it a go if you wish. My test setup was explained in [1]: > > >> To recap, the test lashup is: > >> * 2003 PowerMac G4 (1.25GHz PPC 7455, 7200 rpm spinning-rust drive) > >> * Standard debug build (--enable-debug --enable-cassert) > >> * Out-of-the-box configuration, except add wal_consistency_checking = all > >> and configure a wal-streaming standby on the same machine > >> * Repeatedly run "make installcheck-parallel", but skip the tablespace > >> test to avoid issues with the standby trying to use the same directory > >> * Delay long enough after each installcheck-parallel to let the > >> standby catch up (the run proper is ~24 min, plus 2 min for catchup) > > Remember also that the code in question is not in HEAD; you'd > need to apply Munro's patches, or check out some commit from > around 2021-04-22. > > regards, tom lane > > [1] https://www.postgresql.org/message-id/3502526.1619925367%40sss.pgh.pa.us -- greg
Greg Stark <stark@mit.edu> writes: > What tools and tool versions are you using to build? Is it just GCC for PPC? > There aren't any special build processes to make a fat binary involved? Nope, just "configure; make" using that macOS version's regular gcc. regards, tom lane
I have

IBUILD:postgresql gsstark$ ls /usr/bin/*gcc*
/usr/bin/gcc
/usr/bin/gcc-4.0
/usr/bin/gcc-4.2
/usr/bin/i686-apple-darwin9-gcc-4.0.1
/usr/bin/i686-apple-darwin9-gcc-4.2.1
/usr/bin/powerpc-apple-darwin9-gcc-4.0.1
/usr/bin/powerpc-apple-darwin9-gcc-4.2.1

I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1 or maybe 4.0.1. What version is on your G4?
Greg Stark <stark@mit.edu> writes: > I'm guessing I should do CC=/usr/bin/powerpc-apple-darwin9-gcc-4.2.1 > or maybe 4.0.1. What version is on your G4?

$ gcc -v
Using built-in specs.
Target: powerpc-apple-darwin9
Configured with: /var/tmp/gcc/gcc-5493~1/src/configure --disable-checking -enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.0/ --with-gxx-include-dir=/include/c++/4.0.0 --with-slibdir=/usr/lib --build=i686-apple-darwin9 --program-prefix= --host=powerpc-apple-darwin9 --target=powerpc-apple-darwin9
Thread model: posix
gcc version 4.0.1 (Apple Inc. build 5493)

I see that gcc 4.2.1 is also present on this machine, but I've never used it.

regards, tom lane
Hm. I seem to have picked a bad checkout. I took the last one before the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some incompatibility with the emulation and the IPC stuff parallel workers use.

2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel worker" (PID 54073) was terminated by signal 10: Bus error
2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was running: SELECT variance(unique1::int4), sum(unique1::int8), regr_count(unique1::float8, unique1::float8)
    FROM (SELECT * FROM tenk1
          UNION ALL SELECT * FROM tenk1
          UNION ALL SELECT * FROM tenk1
          UNION ALL SELECT * FROM tenk1) u;
2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active server processes
2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in recovery mode
2021-12-17 17:51:51.761 EST [50955] LOG: all server processes terminated; reinitializing
On 12/17/21 23:56, Greg Stark wrote: > Hm. I seem to have picked a bad checkout. I took the last one before > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). Or there's some > incompatibility with the emulation and the IPC stuff parallel workers > use. > > > 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel > worker" (PID 54073) was terminated by signal 10: Bus error > 2021-12-17 17:51:51.688 EST [50955] DETAIL: Failed process was > running: SELECT variance(unique1::int4), sum(unique1::int8), > regr_count(unique1::float8, unique1::float8) > FROM (SELECT * FROM tenk1 > UNION ALL SELECT * FROM tenk1 > UNION ALL SELECT * FROM tenk1 > UNION ALL SELECT * FROM tenk1) u; > 2021-12-17 17:51:51.690 EST [50955] LOG: terminating any other active > server processes > 2021-12-17 17:51:51.748 EST [54078] FATAL: the database system is in > recovery mode > 2021-12-17 17:51:51.761 EST [50955] LOG: all server processes > terminated; reinitializing > Interesting. In my experience SIGBUS on PPC tends to be due to incorrect alignment, but I'm not sure how that works with the emulation. Can you get a backtrace? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greg Stark <stark@mit.edu> writes: > Hm. I seem to have picked a bad checkout. I took the last one before > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). FWIW, I think that's the first one *after* the revert. > 2021-12-17 17:51:51.688 EST [50955] LOG: background worker "parallel > worker" (PID 54073) was terminated by signal 10: Bus error I'm betting on weird emulation issue. None of my real PPC machines showed such things. regards, tom lane
On Fri, 17 Dec 2021 at 18:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Greg Stark <stark@mit.edu> writes: > > Hm. I seem to have picked a bad checkout. I took the last one before > > the revert (45aa88fe1d4028ea50ba7d26d390223b6ef78acc). > > FWIW, I think that's the first one *after* the revert.

Doh.

But the bigger question is. Are we really concerned about this flaky problem? Is it worth investing time and money on? I can get money to go buy a G4 or G5 and spend some time on it. It just seems a bit... niche. But if it's a real bug that represents something broken on other architectures that just happens to be easier to trigger here it might be worthwhile.

-- greg
Greg Stark <stark@mit.edu> writes: > But the bigger question is. Are we really concerned about this flaky > problem? Is it worth investing time and money on? I can get money to > go buy a G4 or G5 and spend some time on it. It just seems a bit... > niche. But if it's a real bug that represents something broken on > other architectures that just happens to be easier to trigger here it > might be worthwhile.

TBH, I don't know. There seem to be three plausible explanations:

1. Flaky hardware in my unit.
2. Ancient macOS bug, as Andres suggested upthread.
3. Actual PG bug.

If it's #1 or #2 then we're just wasting our time here. I'm not sure how to estimate the relative probabilities, but I suspect #3 is the least likely of the lot.

FWIW, I did just reproduce the problem on that machine with current HEAD:

2021-12-17 18:40:40.293 EST [21369] FATAL: inconsistent page found, rel 1663/167772/2673, forknum 0, blkno 26
2021-12-17 18:40:40.293 EST [21369] CONTEXT: WAL redo at C/3DE3F658 for Btree/INSERT_LEAF: off 208; blkref #0: rel 1663/167772/2673, blk 26 FPW
2021-12-17 18:40:40.522 EST [21365] LOG: startup process (PID 21369) exited with exit code 1

That was after only five loops of the regression tests, so either I got lucky or the failure probability has increased again. In any case, it seems clear that the problem exists independently of Munro's patches, so I don't really think this question should be considered a blocker for those.

regards, tom lane
[Replies to two emails]

On Fri, Dec 10, 2021 at 9:40 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > I am unable to apply this new set of patches on HEAD. Can you please share the rebased patch or, if you have any work branch, can you please point it out? I will refer to it for the changes.

Hi Ashutosh, Sorry I missed this. Rebase attached, and I also have a public working branch at https://github.com/macdice/postgres/tree/recovery-prefetch-ii .

On Fri, Nov 26, 2021 at 11:32 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > It's great you posted a new version of this patch, so I took a brief look > at it. The code seems in pretty good shape, I haven't found > any real issues - just two minor comments: > > This seems a bit strange: > > #define DEFAULT_DECODE_BUFFER_SIZE 0x10000 > > Why not to define this as a simple decimal value?

Changed to (64 * 1024).

> Is there something > special about this particular value, or is it arbitrary?

It should be large enough for most records, without being ridiculously large. This means that typical users of XLogReader (pg_waldump, ...) are unlikely to fall back to the "oversized" code path for records that don't fit in the decoding buffer. Comment added.

> I guess it's > simply the minimum for wal_decode_buffer_size GUC, but why not to use > the GUC for all places decoding WAL?

The GUC is used only by xlog.c for replay (and has a larger default since it can usefully see into the future), but frontend tools and other kinds of backend WAL decoding things (2PC, logical decoding) don't or can't respect the GUC and it didn't seem worth choosing a number for each user, so I needed to pick a default.

> FWIW I don't think we include updates to typedefs.list in patches.

Seems pretty harmless? And useful to keep around in development branches because I like to pgindent stuff...
Attachment
Thomas Munro <thomas.munro@gmail.com> writes: >> FWIW I don't think we include updates to typedefs.list in patches. > Seems pretty harmless? And useful to keep around in development > branches because I like to pgindent stuff... As far as that goes, my habit is to pull down https://buildfarm.postgresql.org/cgi-bin/typedefs.pl on a regular basis and pgindent against that. There have been some discussions about formalizing that process a bit more, but we've not come to any conclusions. regards, tom lane
Hi, On 2021-12-29 17:29:52 +1300, Thomas Munro wrote: > > FWIW I don't think we include updates to typedefs.list in patches. > > Seems pretty harmless? And useful to keep around in development > branches because I like to pgindent stuff... I think it's even helpful. As long as it's done with a bit of manual oversight, I don't see a meaningful downside of doing so. One needs to be careful to not remove platform dependant typedefs, but that's it. And especially for long-lived feature branches it's much less work to keep the typedefs.list changes in the tree, rather than coming up with them locally over and over / across multiple people working on a branch. Greetings, Andres Freund
On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro <thomas.munro@gmail.com> wrote: > https://github.com/macdice/postgres/tree/recovery-prefetch-ii Here's a rebase. This mostly involved moving hunks over to the new xlogrecovery.c file. One thing that seemed a little strange to me with the new layout is that xlogreader is now a global variable. I followed that pattern and made xlogprefetcher a global variable too, for now. There is one functional change: now I block readahead at records that might change the timeline ID. This removes the need to think about scenarios where "replay TLI" and "read TLI" might differ. I don't know of a concrete problem in that area with the previous version, but the recent introduction of the variable(s) "replayTLI" and associated comments in master made me realise I hadn't analysed the hazards here enough. Since timelines are tricky things and timeline changes are extremely infrequent, it seemed better to simplify matters by putting up a big road block there. I'm now starting to think about committing this soon.
Attachment
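A rough sketch of the "road block" logic described in the message above (all names invented; the real patch keys this off the specific WAL record types that can change the timeline ID): once the lookahead encounters a record flagged as potentially timeline-changing, prefetching stops until replay has consumed everything up to the end of that record.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t FakeLSN;            /* stand-in for XLogRecPtr */

typedef struct FakeDecodedRecord
{
    FakeLSN end_lsn;                 /* where this record ends */
    bool    may_change_timeline;     /* e.g. checkpoint / end-of-recovery */
} FakeDecodedRecord;

typedef struct Prefetcher
{
    FakeLSN barrier_lsn;             /* no lookahead until replay reaches this */
} Prefetcher;

/*
 * Decide whether it is safe to prefetch blocks referenced by 'rec' while
 * replay has only reached 'replayed_upto'.  Sketch of the barrier logic only.
 */
static bool
prefetch_allowed(Prefetcher *p, const FakeDecodedRecord *rec, FakeLSN replayed_upto)
{
    /* Still behind an earlier barrier: wait for replay to pass it. */
    if (replayed_upto < p->barrier_lsn)
        return false;

    /* This record might switch timelines: raise a new barrier after it. */
    if (rec->may_change_timeline)
    {
        p->barrier_lsn = rec->end_lsn;
        return false;
    }
    return true;
}

Because timeline changes are so infrequent, stalling the lookahead at such records costs essentially nothing, which is the trade-off described above.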
On 3/8/22 06:15, Thomas Munro wrote: > On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro <thomas.munro@gmail.com> wrote: >> https://github.com/macdice/postgres/tree/recovery-prefetch-ii > > Here's a rebase. This mostly involved moving hunks over to the new > xlogrecovery.c file. One thing that seemed a little strange to me > with the new layout is that xlogreader is now a global variable. I > followed that pattern and made xlogprefetcher a global variable too, > for now. > > There is one functional change: now I block readahead at records that > might change the timeline ID. This removes the need to think about > scenarios where "replay TLI" and "read TLI" might differ. I don't > know of a concrete problem in that area with the previous version, but > the recent introduction of the variable(s) "replayTLI" and associated > comments in master made me realise I hadn't analysed the hazards here > enough. Since timelines are tricky things and timeline changes are > extremely infrequent, it seemed better to simplify matters by putting > up a big road block there. > > I'm now starting to think about committing this soon. +1. I don't have the capacity/hardware to do more testing at the moment, but all of this looks reasonable. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2022-03-08 18:15:43 +1300, Thomas Munro wrote: > I'm now starting to think about committing this soon. +1 Are you thinking of committing both patches at once, or with a bit of distance? I think something in the regression tests ought to enable recovery_prefetch. 027_stream_regress or 001_stream_rep seem like the obvious candidates? - Andres
Hi,

On Tue, Mar 08, 2022 at 06:15:43PM +1300, Thomas Munro wrote: > On Wed, Dec 29, 2021 at 5:29 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > https://github.com/macdice/postgres/tree/recovery-prefetch-ii > > Here's a rebase. This mostly involved moving hunks over to the new > xlogrecovery.c file. One thing that seemed a little strange to me > with the new layout is that xlogreader is now a global variable. I > followed that pattern and made xlogprefetcher a global variable too, > for now.

I for now went through 0001, TL;DR the patch looks good to me. I have a few minor comments though, mostly to make things a bit clearer (at least to me).

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2340dc247b..c129df44ac 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -407,10 +407,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len)
     * add an accessor macro for this.
     */
    *fpi_len = 0;
+   for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
    {
        if (XLogRecHasBlockImage(record, block_id))
-           *fpi_len += record->blocks[block_id].bimg_len;
+           *fpi_len += record->record->blocks[block_id].bimg_len;
    }

(and similar in that file, xlogutils.c and xlogreader.c)

This could use XLogRecGetBlock? Note that this macro is for now never used.

xlogreader.c also has some similar forgotten code that could use XLogRecMaxBlockId.

+ * See if we can release the last record that was returned by
+ * XLogNextRecord(), to free up space.
+ */
+void
+XLogReleasePreviousRecord(XLogReaderState *state)

The comment seems a bit misleading, as I first understood it as it could be optional even if the record exists. Maybe something more like "Release the last record if any"?

+    * Remove it from the decoded record queue. It must be the oldest item
+    * decoded, decode_queue_tail.
+    */
+   record = state->record;
+   Assert(record == state->decode_queue_tail);
+   state->record = NULL;
+   state->decode_queue_tail = record->next;

The naming is a bit counter intuitive to me, as before reading the rest of the code I wasn't expecting the item at the tail of the queue to have a next element. Maybe just inverting tail and head would make it clearer?

+DecodedXLogRecord *
+XLogNextRecord(XLogReaderState *state, char **errormsg)
+{
[...]
+   /*
+    * state->EndRecPtr is expected to have been set by the last call to
+    * XLogBeginRead() or XLogNextRecord(), and is the location of the
+    * error.
+    */
+
+   return NULL;

The comment should refer to XLogFindNextRecord, not XLogNextRecord?

Also, is it worth an assert (likely at the top of the function) for that?

XLogRecord *
XLogReadRecord(XLogReaderState *state, char **errormsg)
+{
[...]
+   if (decoded)
+   {
+       /*
+        * XLogReadRecord() returns a pointer to the record's header, not the
+        * actual decoded record. The caller will access the decoded record
+        * through the XLogRecGetXXX() macros, which reach the decoded
+        * recorded as xlogreader->record.
+        */
+       Assert(state->record == decoded);
+       return &decoded->header;

I find it a bit weird to mention XLogReadRecord() as it's the current function.

+/*
+ * Allocate space for a decoded record. The only member of the returned
+ * object that is initialized is the 'oversized' flag, indicating that the
+ * decoded record wouldn't fit in the decode buffer and must eventually be
+ * freed explicitly.
+ *
+ * Return NULL if there is no space in the decode buffer and allow_oversized
+ * is false, or if memory allocation fails for an oversized buffer.
+ */
+static DecodedXLogRecord *
+XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized)

Is it worth clearly stating that it's the reponsability of the caller to update the decode_buffer_head (with the real size) after a successful decoding of this buffer?

+   if (unlikely(state->decode_buffer == NULL))
+   {
+       if (state->decode_buffer_size == 0)
+           state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE;
+       state->decode_buffer = palloc(state->decode_buffer_size);
+       state->decode_buffer_head = state->decode_buffer;
+       state->decode_buffer_tail = state->decode_buffer;
+       state->free_decode_buffer = true;
+   }

Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as the only caller is the recovery prefetching.

+   return decoded;
+}

I would find it a bit clearer to explicitly return NULL here.

    readOff = ReadPageInternal(state, targetPagePtr,
                               Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
-   if (readOff < 0)
+   if (readOff == XLREAD_WOULDBLOCK)
+       return XLREAD_WOULDBLOCK;
+   else if (readOff < 0)

ReadPageInternal comment should be updated to mention the new XLREAD_WOULDBLOCK possible return value. It's also not particulary obvious why XLogFindNextRecord() doesn't check for this value. AFAICS callers don't (and should never) call it with a nonblocking == true state, maybe add an assert for that?

@@ -468,7 +748,7 @@ restart:
    if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
    {
        state->overwrittenRecPtr = RecPtr;
-       ResetDecoder(state);
+       //ResetDecoder(state);

AFAICS this is indeed not necessary anymore, so it can be removed?

static void
ResetDecoder(XLogReaderState *state)
{
[...]
+   /* Reset the decoded record queue, freeing any oversized records. */
+   while ((r = state->decode_queue_tail))

nit: I think it's better to explicitly check for the assignment being != NULL, and existing code is more frequently written this way AFAICS.

+/* Return values from XLogPageReadCB. */
+typedef enum XLogPageReadResultResult

typo
On Wed, Mar 9, 2022 at 7:47 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > I for now went through 0001, TL;DR the patch looks good to me. I have a few > minor comments though, mostly to make things a bit clearer (at least to me). Hi Julien, Thanks for your review of 0001! It gave me a few things to think about and some good improvements. > diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c > index 2340dc247b..c129df44ac 100644 > --- a/src/bin/pg_waldump/pg_waldump.c > +++ b/src/bin/pg_waldump/pg_waldump.c > @@ -407,10 +407,10 @@ XLogDumpRecordLen(XLogReaderState *record, uint32 *rec_len, uint32 *fpi_len) > * add an accessor macro for this. > */ > *fpi_len = 0; > + for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++) > { > if (XLogRecHasBlockImage(record, block_id)) > - *fpi_len += record->blocks[block_id].bimg_len; > + *fpi_len += record->record->blocks[block_id].bimg_len; > } > (and similar in that file, xlogutils.c and xlogreader.c) > > This could use XLogRecGetBlock? Note that this macro is for now never used. Yeah, I think that is a good idea for pg_waldump.c and xlogutils.c. Done. > xlogreader.c also has some similar forgotten code that could use > XLogRecMaxBlockId. That is true, but I was thinking of it like this: most of the existing code that interacts with xlogreader.c is working with the old model, where the XLogReader object holds only one "current" record. For that reason the XLogRecXXX() macros continue to work as before, implicitly referring to the record that XLogReadRecord() most recently returned. For xlogreader.c code, I prefer not to use the XLogRecXXX() macros, even when referring to the "current" record, since xlogreader.c has switched to a new multi-record model. In other words, they're sort of 'old API' accessors provided for continuity. Does this make sense? > + * See if we can release the last record that was returned by > + * XLogNextRecord(), to free up space. > + */ > +void > +XLogReleasePreviousRecord(XLogReaderState *state) > > The comment seems a bit misleading, as I first understood it as it could be > optional even if the record exists. Maybe something more like "Release the > last record if any"? Done. > + * Remove it from the decoded record queue. It must be the oldest item > + * decoded, decode_queue_tail. > + */ > + record = state->record; > + Assert(record == state->decode_queue_tail); > + state->record = NULL; > + state->decode_queue_tail = record->next; > > The naming is a bit counter intuitive to me, as before reading the rest of the > code I wasn't expecting the item at the tail of the queue to have a next > element. Maybe just inverting tail and head would make it clearer? Yeah, after mulling this over for a day, I agree. I've flipped it around. Explanation: You're quite right, singly-linked lists traditionally have a 'tail' that points to null, so it makes sense for new items to be added there and older items to be consumed from the 'head' end, as you expected. But... it's also typical (I think?) in ring buffers AKA circular buffers to insert at the 'head', and remove from the 'tail'. This code has both a linked-list (the chain of decoded records with a ->next pointer), and the underlying storage, which is a circular buffer of bytes. I didn't want them to use opposite terminology, and since I started by writing the ring buffer part, that's where I finished up... I agree that it's an improvement to flip them. 
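To make the flipped convention concrete, here is a toy sketch (invented names, plain malloc; the real reader instead carves records out of a circular byte buffer and has an "oversized" escape hatch): freshly decoded records are appended at the head of the chain, and the oldest record is released from the tail once replay is finished with it, so the tail element is precisely the one whose next pointer leads towards newer records.

#include <stdint.h>
#include <stdlib.h>

/*
 * Toy decoded-record queue: newest at the head, oldest at the tail.
 */
typedef struct ToyRecord
{
    uint64_t          lsn;           /* stand-in for the record's start LSN */
    struct ToyRecord *next;          /* next (newer) record in the queue */
} ToyRecord;

typedef struct ToyQueue
{
    ToyRecord *head;                 /* newest decoded record */
    ToyRecord *tail;                 /* oldest record, next to be replayed */
} ToyQueue;

/* Queue a freshly decoded record at the head. */
static void
queue_push(ToyQueue *q, ToyRecord *rec)
{
    rec->next = NULL;
    if (q->head)
        q->head->next = rec;
    else
        q->tail = rec;
    q->head = rec;
}

/* Release the oldest record once replay is done with it. */
static void
queue_release_oldest(ToyQueue *q)
{
    ToyRecord *old = q->tail;

    if (old == NULL)
        return;
    q->tail = old->next;
    if (q->tail == NULL)
        q->head = NULL;
    free(old);
}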
> +DecodedXLogRecord * > +XLogNextRecord(XLogReaderState *state, char **errormsg) > +{ > [...] > + /* > + * state->EndRecPtr is expected to have been set by the last call to > + * XLogBeginRead() or XLogNextRecord(), and is the location of the > + * error. > + */ > + > + return NULL; > > The comment should refer to XLogFindNextRecord, not XLogNextRecord? No, it does mean to refer to the XLogNextRecord() (ie the last time you called XLogNextRecord and successfully dequeued a record, we put its end LSN there, so if there is a deferred error, that's the corresponding LSN). Make sense? > Also, is it worth an assert (likely at the top of the function) for that? How could I assert that EndRecPtr has the right value? > XLogRecord * > XLogReadRecord(XLogReaderState *state, char **errormsg) > +{ > [...] > + if (decoded) > + { > + /* > + * XLogReadRecord() returns a pointer to the record's header, not the > + * actual decoded record. The caller will access the decoded record > + * through the XLogRecGetXXX() macros, which reach the decoded > + * recorded as xlogreader->record. > + */ > + Assert(state->record == decoded); > + return &decoded->header; > > I find it a bit weird to mention XLogReadRecord() as it's the current function. Changed to "This function ...". > +/* > + * Allocate space for a decoded record. The only member of the returned > + * object that is initialized is the 'oversized' flag, indicating that the > + * decoded record wouldn't fit in the decode buffer and must eventually be > + * freed explicitly. > + * > + * Return NULL if there is no space in the decode buffer and allow_oversized > + * is false, or if memory allocation fails for an oversized buffer. > + */ > +static DecodedXLogRecord * > +XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversized) > > Is it worth clearly stating that it's the reponsability of the caller to update > the decode_buffer_head (with the real size) after a successful decoding of this > buffer? Comment added. > + if (unlikely(state->decode_buffer == NULL)) > + { > + if (state->decode_buffer_size == 0) > + state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE; > + state->decode_buffer = palloc(state->decode_buffer_size); > + state->decode_buffer_head = state->decode_buffer; > + state->decode_buffer_tail = state->decode_buffer; > + state->free_decode_buffer = true; > + } > > Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it > here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as > the only caller is the recovery prefetching. I don't think it matters much? > + return decoded; > +} > > I would find it a bit clearer to explicitly return NULL here. Done. > readOff = ReadPageInternal(state, targetPagePtr, > Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ)); > - if (readOff < 0) > + if (readOff == XLREAD_WOULDBLOCK) > + return XLREAD_WOULDBLOCK; > + else if (readOff < 0) > > ReadPageInternal comment should be updated to mention the new XLREAD_WOULDBLOCK > possible return value. Yeah. Done. > It's also not particulary obvious why XLogFindNextRecord() doesn't check for > this value. AFAICS callers don't (and should never) call it with a > nonblocking == true state, maybe add an assert for that? Fair point. I have now explicitly cleared that flag. 
(I don't much like state->nonblocking, which might be better as an argument to page_read(), but in fact I don't like the fact that page_read callbacks are blocking in the first place, which is why I liked Horiguchi-san's patch to get rid of that... but that can be a subject for later work.) > @@ -468,7 +748,7 @@ restart: > if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD) > { > state->overwrittenRecPtr = RecPtr; > - ResetDecoder(state); > + //ResetDecoder(state); > > AFAICS this is indeed not necessary anymore, so it can be removed? Oops, yeah I use C++ comments when there's something I intended to remove. Done. > static void > ResetDecoder(XLogReaderState *state) > { > [...] > + /* Reset the decoded record queue, freeing any oversized records. */ > + while ((r = state->decode_queue_tail)) > > nit: I think it's better to explicitly check for the assignment being != NULL, > and existing code is more frequently written this way AFAICS. I think it's perfectly normal idiomatic C, but if you think it's clearer that way, OK, done like that. > +/* Return values from XLogPageReadCB. */ > +typedef enum XLogPageReadResultResult > > typo Fixed. I realised that this version has broken -DWAL_DEBUG. I'll fix that shortly, but I wanted to post this update ASAP, so here's a new version. The other thing I need to change is that I should turn on recovery_prefetch for platforms that support it (ie Linux and maybe NetBSD only for now), in the tests. Right now you need to put recovery_prefetch=on in a file and then run the tests with "TEMP_CONFIG=path_to_that make -C src/test/recovery check" to excercise much of 0002.
Attachment
On Fri, Mar 11, 2022 at 6:31 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Thanks for your review of 0001! It gave me a few things to think > about and some good improvements. And just in case it's useful, here's what changed between v21 and v22..
Attachment
On March 10, 2022 9:31:13 PM PST, Thomas Munro <thomas.munro@gmail.com> wrote: > The other thing I need to change is that I should turn on > recovery_prefetch for platforms that support it (ie Linux and maybe > NetBSD only for now), in the tests. Could a setting of "try" make sense? -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Mar 11, 2022 at 06:31:13PM +1300, Thomas Munro wrote: > On Wed, Mar 9, 2022 at 7:47 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > This could use XLogRecGetBlock? Note that this macro is for now never used. > > xlogreader.c also has some similar forgotten code that could use > > XLogRecMaxBlockId. > > That is true, but I was thinking of it like this: most of the existing > code that interacts with xlogreader.c is working with the old model, > where the XLogReader object holds only one "current" record. For that > reason the XLogRecXXX() macros continue to work as before, implicitly > referring to the record that XLogReadRecord() most recently returned. > For xlogreader.c code, I prefer not to use the XLogRecXXX() macros, > even when referring to the "current" record, since xlogreader.c has > switched to a new multi-record model. In other words, they're sort of > 'old API' accessors provided for continuity. Does this make sense? Ah I see, it does make sense. I'm wondering if there should be some comment somewhere on the top of the file to mention it, as otherwise someone may be tempted to change it to avoid some record->record->xxx usage. > > +DecodedXLogRecord * > > +XLogNextRecord(XLogReaderState *state, char **errormsg) > > +{ > > [...] > > + /* > > + * state->EndRecPtr is expected to have been set by the last call to > > + * XLogBeginRead() or XLogNextRecord(), and is the location of the > > + * error. > > + */ > > + > > + return NULL; > > > > The comment should refer to XLogFindNextRecord, not XLogNextRecord? > > No, it does mean to refer to the XLogNextRecord() (ie the last time > you called XLogNextRecord and successfully dequeued a record, we put > its end LSN there, so if there is a deferred error, that's the > corresponding LSN). Make sense? It does, thanks! > > > Also, is it worth an assert (likely at the top of the function) for that? > > How could I assert that EndRecPtr has the right value? Sorry, I meant to assert that some value was assigned (!XLogRecPtrIsInvalid). It can only make sure that the first call is done after XLogBeginRead / XLogFindNextRecord, but that's better than nothing and consistent with the top comment. > > + if (unlikely(state->decode_buffer == NULL)) > > + { > > + if (state->decode_buffer_size == 0) > > + state->decode_buffer_size = DEFAULT_DECODE_BUFFER_SIZE; > > + state->decode_buffer = palloc(state->decode_buffer_size); > > + state->decode_buffer_head = state->decode_buffer; > > + state->decode_buffer_tail = state->decode_buffer; > > + state->free_decode_buffer = true; > > + } > > > > Maybe change XLogReaderSetDecodeBuffer to also handle allocation and use it > > here too? Otherwise XLogReaderSetDecodeBuffer should probably go in 0002 as > > the only caller is the recovery prefetching. > > I don't think it matters much? The thing is that for now the only caller to XLogReaderSetDecodeBuffer (in 0002) only uses it to set the length, so a buffer is actually never passed to that function. Since frontend code can rely on a palloc emulation, is there really a use case to use e.g. some stack buffer there, or something in a specific memory context? It seems to be the only use cases for having XLogReaderSetDecodeBuffer() rather than simply a XLogReaderSetDecodeBufferSize(). But overall I agree it doesn't matter much, so no objection to keep it as-is. > > It's also not particulary obvious why XLogFindNextRecord() doesn't check for > > this value. 
AFAICS callers don't (and should never) call it with a > > nonblocking == true state, maybe add an assert for that? > > Fair point. I have now explicitly cleared that flag. (I don't much > like state->nonblocking, which might be better as an argument to > page_read(), but in fact I don't like the fact that page_read > callbacks are blocking in the first place, which is why I liked > Horiguchi-san's patch to get rid of that... but that can be a subject > for later work.) Agreed. > > static void > > ResetDecoder(XLogReaderState *state) > > { > > [...] > > + /* Reset the decoded record queue, freeing any oversized records. */ > > + while ((r = state->decode_queue_tail)) > > > > nit: I think it's better to explicitly check for the assignment being != NULL, > > and existing code is more frequently written this way AFAICS. > > I think it's perfectly normal idiomatic C, but if you think it's > clearer that way, OK, done like that. The thing I don't like about this form is that you can never be sure that an assignment was really meant unless you read the rest of the nearby code. Other than that agreed, if perfectly normal idiomatic C. > I realised that this version has broken -DWAL_DEBUG. I'll fix that > shortly, but I wanted to post this update ASAP, so here's a new > version. + * Returns XLREAD_WOULDBLOCK if he requested data can't be read without + * waiting. This can be returned only if the installed page_read callback typo: "the" requested data. Other than that it all looks good to me! > The other thing I need to change is that I should turn on > recovery_prefetch for platforms that support it (ie Linux and maybe > NetBSD only for now), in the tests. Right now you need to put > recovery_prefetch=on in a file and then run the tests with > "TEMP_CONFIG=path_to_that make -C src/test/recovery check" to > excercise much of 0002. +1 with Andres' idea to have a "try" setting.
On Fri, Mar 11, 2022 at 9:27 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > > Also, is it worth an assert (likely at the top of the function) for that? > > > > How could I assert that EndRecPtr has the right value? > > Sorry, I meant to assert that some value was assigned (!XLogRecPtrIsInvalid). > It can only make sure that the first call is done after XLogBeginRead / > XLogFindNextRecord, but that's better than nothing and consistent with the top > comment. Done. > + * Returns XLREAD_WOULDBLOCK if he requested data can't be read without > + * waiting. This can be returned only if the installed page_read callback > > typo: "the" requested data. Fixed. > Other than that it all looks good to me! Thanks! > > The other thing I need to change is that I should turn on > > recovery_prefetch for platforms that support it (ie Linux and maybe > > NetBSD only for now), in the tests. Right now you need to put > > recovery_prefetch=on in a file and then run the tests with > > "TEMP_CONFIG=path_to_that make -C src/test/recovery check" to > > excercise much of 0002. > > +1 with Andres' idea to have a "try" setting. Done. The default is still "off" for now, but in 027_stream_regress.pl I set it to "try". I also fixed the compile failure with -DWAL_DEBUG, and checked that output looks sane with wal_debug=on.
Attachment
On Mon, Mar 14, 2022 at 06:15:59PM +1300, Thomas Munro wrote: > On Fri, Mar 11, 2022 at 9:27 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > Also, is it worth an assert (likely at the top of the function) for that? > > > > > > How could I assert that EndRecPtr has the right value? > > > > Sorry, I meant to assert that some value was assigned (!XLogRecPtrIsInvalid). > > It can only make sure that the first call is done after XLogBeginRead / > > XLogFindNextRecord, but that's better than nothing and consistent with the top > > comment. > > Done. Just a small detail: I would move that assert at the top of the function as it should always be valid. > > I also fixed the compile failure with -DWAL_DEBUG, and checked that > output looks sane with wal_debug=on. Great! I'm happy with 0001 and I think it's good to go! > > > > The other thing I need to change is that I should turn on > > > recovery_prefetch for platforms that support it (ie Linux and maybe > > > NetBSD only for now), in the tests. Right now you need to put > > > recovery_prefetch=on in a file and then run the tests with > > > "TEMP_CONFIG=path_to_that make -C src/test/recovery check" to > > > excercise much of 0002. > > > > +1 with Andres' idea to have a "try" setting. > > Done. The default is still "off" for now, but in > 027_stream_regress.pl I set it to "try". Great too! Unless you want to commit both patches right now I'd like to review 0002 too (this week), as I barely look into it for now.
On Mon, Mar 14, 2022 at 8:17 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > Great! I'm happy with 0001 and I think it's good to go! I'll push 0001 today to let the build farm chew on it for a few days before moving to 0002.
On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro <thomas.munro@gmail.com> wrote: > I'll push 0001 today to let the build farm chew on it for a few days > before moving to 0002. Clearly 018_wal_optimize.pl is flapping and causing recoveryCheck to fail occasionally, but that predates the above commit. I didn't follow the existing discussion on that, so I'll try to look into that tomorrow. Here's a rebase of the 0002 patch, now called 0001
Attachment
On Sun, Mar 20, 2022 at 5:36 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Clearly 018_wal_optimize.pl is flapping Correction, 019_replslot_limit.pl, discussed at https://www.postgresql.org/message-id/flat/83b46e5f-2a52-86aa-fa6c-8174908174b8%40iki.fi .
Hi, On Sun, Mar 20, 2022 at 05:36:38PM +1300, Thomas Munro wrote: > On Fri, Mar 18, 2022 at 9:59 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > I'll push 0001 today to let the build farm chew on it for a few days > > before moving to 0002. > > Clearly 018_wal_optimize.pl is flapping and causing recoveryCheck to > fail occasionally, but that predates the above commit. I didn't > follow the existing discussion on that, so I'll try to look into that > tomorrow. > > Here's a rebase of the 0002 patch, now called 0001 So I finally finished looking at this patch. Here again, AFAICS the feature is working as expected and I didn't find any problem. I just have some minor comments, like for the previous patch. For the docs: + Whether to try to prefetch blocks that are referenced in the WAL that + are not yet in the buffer pool, during recovery. Valid values are + <literal>off</literal> (the default), <literal>on</literal> and + <literal>try</literal>. The setting <literal>try</literal> enables + prefetching only if the operating system provides the + <function>posix_fadvise</function> function, which is currently used + to implement prefetching. Note that some operating systems provide the + function, but don't actually perform any prefetching. Is there any reason not to change it to try? I'm wondering if some system says that the function exists but simply raise an error if you actually try to use it. I think that at least WSL does that for some functions. + <para> + The <xref linkend="guc-recovery-prefetch"/> parameter can + be used to improve I/O performance during recovery by instructing + <productname>PostgreSQL</productname> to initiate reads + of disk blocks that will soon be needed but are not currently in + <productname>PostgreSQL</productname>'s buffer pool. + The <xref linkend="guc-maintenance-io-concurrency"/> and + <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching + concurrency and distance, respectively. + By default, prefetching in recovery is disabled. + </para> I think that "improving I/O performance" is a bit misleading, maybe reduce I/O wait time or something like that? Also, I don't know if we need to be that precise, but maybe we should say that it's the underlying kernel that will (asynchronously) initiate the reads, and postgres will simply notifies it. + <para> + The <structname>pg_stat_prefetch_recovery</structname> view will contain only + one row. It is filled with nulls if recovery is not running or WAL + prefetching is not enabled. See <xref linkend="guc-recovery-prefetch"/> + for more information. + </para> That's not the implemented behavior as far as I can see. It just prints whatever is in SharedStats regardless of the recovery state or the prefetch_wal setting (assuming that there's no pending reset request). Similarly, there's a mention that pg_stat_reset_shared('wal') will reset the stats, but I don't see anything calling XLogPrefetchRequestResetStats(). Finally, I think we should documented what are the cumulated counters in that view (that should get reset) and the dynamic counters (that shouldn't get reset). For the code: bool XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id, RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum) +{ + return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL); +} + +bool +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id, + RelFileNode *rnode, ForkNumber *forknum, + BlockNumber *blknum, + Buffer *prefetch_buffer) { It's missing comments on that function. 
XLogRecGetBlockTag comments should probably be reworded at the same time. +ReadRecord(XLogPrefetcher *xlogprefetcher, int emode, bool fetching_ckpt, TimeLineID replayTLI) { XLogRecord *record; + XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher); nit: maybe name it XLogPrefetcherGetReader()? * containing it (if not open already), and returns true. When end of standby * mode is triggered by the user, and there is no more WAL available, returns * false. + * + * If nonblocking is true, then give up immediately if we can't satisfy the + * request, returning XLREAD_WOULDBLOCK instead of waiting. */ -static bool +static XLogPageReadResult WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, The comment still mentions a couple of time returning true/false rather than XLREAD_*, same for at least XLogPageRead(). @@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, */ if (lastSourceFailed) { + /* + * Don't allow any retry loops to occur during nonblocking + * readahead. Let the caller process everything that has been + * decoded already first. + */ + if (nonblocking) + return XLREAD_WOULDBLOCK; Is that really enough? I'm wondering if the code path in ReadRecord() that forces lastSourceFailed to False while it actually failed when switching into archive recovery (xlogrecovery.c around line 3044) can be problematic here. {"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY, gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."), gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referenced blocks."), GUC_UNIT_BYTE }, &wal_decode_buffer_size, 512 * 1024, 64 * 1024, INT_MAX, Should the max be MaxAllocSize? + /* Do we have a clue where the buffer might be already? */ + if (BufferIsValid(recent_buffer) && + mode == RBM_NORMAL && + ReadRecentBuffer(rnode, forknum, blkno, recent_buffer)) + { + buffer = recent_buffer; + goto recent_buffer_fast_path; + } Should this increment (local|shared)_blks_hit, since ReadRecentBuffer doesn't? Missed in the previous patch: XLogDecodeNextRecord() isn't a trivial function, so some comments would be helpful. xlogprefetcher.c: + * data. XLogRecBufferForRedo() cooperates uses information stored in the + * decoded record to find buffers efficiently. I'm not sure what you wanted to say here. Also, I don't see any XLogRecBufferForRedo() anywhere, I'm assuming it's XLogReadBufferForRedo? +/* + * A callback that reads ahead in the WAL and tries to initiate one IO. + */ +static LsnReadQueueNextStatus +XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn) Should there be a bit more comments about what this function is supposed to enforce? I'm wondering if it's a bit overkill to implement this as a callback. Do you have near future use cases in mind? For now no other code could use the infrastructure at all as the lrq is private, so some changes will be needed to make it truly configurable anyway. If we keep it as a callback, I think it would make sense to extract some part, like the main prefetch filters / global-limit logic, so other possible implementations can use it if needed. It would also help to reduce this function a bit, as it's somewhat long. Also, about those filters: + if (rmid == RM_XLOG_ID) + { + if (record_type == XLOG_CHECKPOINT_SHUTDOWN || + record_type == XLOG_END_OF_RECOVERY) + { + /* + * These records might change the TLI. 
Avoid potential + * bugs if we were to allow "read TLI" and "replay TLI" to + * differ without more analysis. + */ + prefetcher->no_readahead_until = record->lsn; + } + } Should there be a note that it's still ok to process this record in the loop just after, as it won't contain any prefetchable data, or simply jump to the end of that loop? +/* + * Increment a counter in shared memory. This is equivalent to *counter++ on a + * plain uint64 without any memory barrier or locking, except on platforms + * where readers can't read uint64 without possibly observing a torn value. + */ +static inline void +XLogPrefetchIncrement(pg_atomic_uint64 *counter) +{ + Assert(AmStartupProcess() || !IsUnderPostmaster); + pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1); +} I'm curious about this one. Is it to avoid expensive locking on platforms that don't have a lockless pg_atomic_fetch_add_u64? Also, it's only correct because there can only be a single prefetcher, so you can't have concurrent increment of the same counter right? +Datum +pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS) +{ [...] This function could use the new SetSingleFuncCall() function introduced in 9e98583898c. And finally: diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 4cf5b26a36..0a6c7bd83e 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -241,6 +241,11 @@ #max_wal_size = 1GB #min_wal_size = 80MB +# - Prefetching during recovery - + +#wal_decode_buffer_size = 512kB # lookahead window used for prefetching This one should be documented as "(change requires restart)"
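To illustrate the counter pattern quoted just above: the following standalone sketch uses C11 atomics instead of PostgreSQL's pg_atomic_* wrappers, and ignores the shared-memory placement of the real counters. It assumes exactly one writer, which is why an ordinary load followed by an atomic store is enough; readers can never observe a torn 64-bit value, and the writer never pays for a locked fetch-and-add.

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t prefetch_count;

/* Called only by the single writer (the startup process in the patch). */
static inline void
counter_increment(void)
{
    uint64_t v = atomic_load_explicit(&prefetch_count, memory_order_relaxed);

    atomic_store_explicit(&prefetch_count, v + 1, memory_order_relaxed);
}

/* Readers can sample the counter at any time without locking. */
static inline uint64_t
counter_read(void)
{
    return atomic_load_explicit(&prefetch_count, memory_order_relaxed);
}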
On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > So I finally finished looking at this patch. Here again, AFAICS the feature is > working as expected and I didn't find any problem. I just have some minor > comments, like for the previous patch. Thanks very much for the review. I've attached a new version addressing most of your feedback, and also rebasing over the new WAL-logged CREATE DATABASE. I've also fixed a couple of bugs (see end). > For the docs: > > + Whether to try to prefetch blocks that are referenced in the WAL that > + are not yet in the buffer pool, during recovery. Valid values are > + <literal>off</literal> (the default), <literal>on</literal> and > + <literal>try</literal>. The setting <literal>try</literal> enables > + prefetching only if the operating system provides the > + <function>posix_fadvise</function> function, which is currently used > + to implement prefetching. Note that some operating systems provide the > + function, but don't actually perform any prefetching. > > Is there any reason not to change it to try? I'm wondering if some system says > that the function exists but simply raise an error if you actually try to use > it. I think that at least WSL does that for some functions. Yeah, we could just default it to try. Whether we should ship that way is another question, but done for now. I don't think there are any supported systems that have a posix_fadvise() that fails with -1, or we'd know about it, because we already use it in other places. We do support one OS that provides a dummy function in libc that does nothing at all (Solaris/illumos), and at least a couple that enter the kernel but are known to do nothing at all for WILLNEED (AIX, FreeBSD). > + <para> > + The <xref linkend="guc-recovery-prefetch"/> parameter can > + be used to improve I/O performance during recovery by instructing > + <productname>PostgreSQL</productname> to initiate reads > + of disk blocks that will soon be needed but are not currently in > + <productname>PostgreSQL</productname>'s buffer pool. > + The <xref linkend="guc-maintenance-io-concurrency"/> and > + <xref linkend="guc-wal-decode-buffer-size"/> settings limit prefetching > + concurrency and distance, respectively. > + By default, prefetching in recovery is disabled. > + </para> > > I think that "improving I/O performance" is a bit misleading, maybe reduce I/O > wait time or something like that? Also, I don't know if we need to be that > precise, but maybe we should say that it's the underlying kernel that will > (asynchronously) initiate the reads, and postgres will simply notifies it. Updated with this new text: The <xref linkend="guc-recovery-prefetch"/> parameter can be used to reduce I/O wait times during recovery by instructing the kernel to initiate reads of disk blocks that will soon be needed but are not currently in <productname>PostgreSQL</productname>'s buffer pool and will soon be read. > + <para> > + The <structname>pg_stat_prefetch_recovery</structname> view will contain only > + one row. It is filled with nulls if recovery is not running or WAL > + prefetching is not enabled. See <xref linkend="guc-recovery-prefetch"/> > + for more information. > + </para> > > That's not the implemented behavior as far as I can see. It just prints whatever is in SharedStats > regardless of the recovery state or the prefetch_wal setting (assuming that > there's no pending reset request). Yeah. Updated text: "It is filled with nulls if recovery has not run or ...". 
> Similarly, there's a mention that > pg_stat_reset_shared('wal') will reset the stats, but I don't see anything > calling XLogPrefetchRequestResetStats(). It's 'prefetch_recovery', not 'wal', but yeah, oops, it looks like I got carried away between v18 and v19 while simplifying the stats and lost a hunk I should have kept. Fixed. > Finally, I think we should documented what are the cumulated counters in that > view (that should get reset) and the dynamic counters (that shouldn't get > reset). OK, done. > For the code: > > bool > XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id, > RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum) > +{ > + return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL); > +} > + > +bool > +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id, > + RelFileNode *rnode, ForkNumber *forknum, > + BlockNumber *blknum, > + Buffer *prefetch_buffer) > { > > It's missing comments on that function. XLogRecGetBlockTag comments should > probably be reworded at the same time. New comment added for XLogRecGetBlockInfo(). Wish I could come up with a better name for that... Not quite sure what you thought I should change about XLogRecGetBlockTag(). > +ReadRecord(XLogPrefetcher *xlogprefetcher, int emode, > bool fetching_ckpt, TimeLineID replayTLI) > { > XLogRecord *record; > + XLogReaderState *xlogreader = XLogPrefetcherReader(xlogprefetcher); > > nit: maybe name it XLogPrefetcherGetReader()? OK. > * containing it (if not open already), and returns true. When end of standby > * mode is triggered by the user, and there is no more WAL available, returns > * false. > + * > + * If nonblocking is true, then give up immediately if we can't satisfy the > + * request, returning XLREAD_WOULDBLOCK instead of waiting. > */ > -static bool > +static XLogPageReadResult > WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, > > The comment still mentions a couple of time returning true/false rather than > XLREAD_*, same for at least XLogPageRead(). Fixed. > @@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, > */ > if (lastSourceFailed) > { > + /* > + * Don't allow any retry loops to occur during nonblocking > + * readahead. Let the caller process everything that has been > + * decoded already first. > + */ > + if (nonblocking) > + return XLREAD_WOULDBLOCK; > > Is that really enough? I'm wondering if the code path in ReadRecord() that > forces lastSourceFailed to False while it actually failed when switching into > archive recovery (xlogrecovery.c around line 3044) can be problematic here. I don't see the problem scenario, could you elaborate? > {"wal_decode_buffer_size", PGC_POSTMASTER, WAL_ARCHIVE_RECOVERY, > gettext_noop("Maximum buffer size for reading ahead in the WAL during recovery."), > gettext_noop("This controls the maximum distance we can read ahead in the WAL to prefetch referencedblocks."), > GUC_UNIT_BYTE > }, > &wal_decode_buffer_size, > 512 * 1024, 64 * 1024, INT_MAX, > > Should the max be MaxAllocSize? Hmm. OK, done. > + /* Do we have a clue where the buffer might be already? */ > + if (BufferIsValid(recent_buffer) && > + mode == RBM_NORMAL && > + ReadRecentBuffer(rnode, forknum, blkno, recent_buffer)) > + { > + buffer = recent_buffer; > + goto recent_buffer_fast_path; > + } > > Should this increment (local|shared)_blks_hit, since ReadRecentBuffer doesn't? Hmm. I guess ReadRecentBuffer() should really do that. Done. 
> Missed in the previous patch: XLogDecodeNextRecord() isn't a trivial function, > so some comments would be helpful. OK, I'll come back to that. > xlogprefetcher.c: > > + * data. XLogRecBufferForRedo() cooperates uses information stored in the > + * decoded record to find buffers ently. > > I'm not sure what you wanted to say here. Also, I don't see any > XLogRecBufferForRedo() anywhere, I'm assuming it's > XLogReadBufferForRedo? Yeah, typos. I rewrote that comment. > +/* > + * A callback that reads ahead in the WAL and tries to initiate one IO. > + */ > +static LsnReadQueueNextStatus > +XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn) > > Should there be a bit more comments about what this function is supposed to > enforce? I have added a comment to explain. > I'm wondering if it's a bit overkill to implement this as a callback. Do you > have near future use cases in mind? For now no other code could use the > infrastructure at all as the lrq is private, so some changes will be needed to > make it truly configurable anyway. Yeah. Actually, in the next step I want to throw away the lrq part, and keep just the XLogPrefetcherNextBlock() function, with some small modifications. Admittedly the control flow is a little confusing, but the point of this architecture is to separate "how to prefetch one more thing" from "when to prefetch, considering I/O depth and related constraints". The first thing, "how", is represented by XLogPrefetcherNextBlock(). The second thing, "when", is represented here by the LsnReadQueue/lrq_XXX stuff that is private in this file for now, but later I will propose to replace that second thing with the pg_streaming_read facility of commitfest entry 38/3316. This is a way of getting there step by step. I also wrote briefly about that here: https://www.postgresql.org/message-id/CA%2BhUKGJ7OqpdnbSTq5oK%3DdjSeVW2JMnrVPSm8JC-_dbN6Y7bpw%40mail.gmail.com > If we keep it as a callback, I think it would make sense to extract some part, > like the main prefetch filters / global-limit logic, so other possible > implementations can use it if needed. It would also help to reduce this > function a bit, as it's somewhat long. I can't imagine reusing any of those filtering things anywhere else. I admit that the function is kinda long... > Also, about those filters: > > + if (rmid == RM_XLOG_ID) > + { > + if (record_type == XLOG_CHECKPOINT_SHUTDOWN || > + record_type == XLOG_END_OF_RECOVERY) > + { > + /* > + * These records might change the TLI. Avoid potential > + * bugs if we were to allow "read TLI" and "replay TLI" to > + * differ without more analysis. > + */ > + prefetcher->no_readahead_until = record->lsn; > + } > + } > > Should there be a note that it's still ok to process this record in the loop > just after, as it won't contain any prefetchable data, or simply jump to the > end of that loop? Comment added. > +/* > + * Increment a counter in shared memory. This is equivalent to *counter++ on a > + * plain uint64 without any memory barrier or locking, except on platforms > + * where readers can't read uint64 without possibly observing a torn value. > + */ > +static inline void > +XLogPrefetchIncrement(pg_atomic_uint64 *counter) > +{ > + Assert(AmStartupProcess() || !IsUnderPostmaster); > + pg_atomic_write_u64(counter, pg_atomic_read_u64(counter) + 1); > +} > > I'm curious about this one. Is it to avoid expensive locking on platforms that > don't have a lockless pg_atomic_fetch_add_u64? 
My goal here is only to make sure that systems without PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY don't see bogus/torn values. On more typical systems, I just want plain old counter++, for the CPU to feel free to reorder, without the overheads of LOCK XADD. > +Datum > +pg_stat_get_prefetch_recovery(PG_FUNCTION_ARGS) > +{ > [...] > > This function could use the new SetSingleFuncCall() function introduced in > 9e98583898c. Oh, yeah, that looks much nicer! > +# - Prefetching during recovery - > + > +#wal_decode_buffer_size = 512kB # lookahead window used for prefetching > > This one should be documented as "(change requires restart)" Done. Other changes: 1. The logic for handling relations and blocks that don't exist (presumably, yet) wasn't quite right. The previous version could raise an error in smgrnblocks() if a referenced relation doesn't exist at all on disk. I don't know how to actually reach that case (considering the analysis this thing does of SMGR create etc to avoid touching relations that haven't been created yet), but if it is possible somehow, then it will handle this gracefully. To check for missing relations I use smgrexists(). To make that fast, I changed it to not close segments when in recovery, which is OK because recovery already closes SMGR relations when replaying anything that would unlink files. 2. The logic for filtering out access to an entire database wasn't quite right. In this new version, that's necessary only for file-based CREATE DATABASE, since that does bulk creation of relations without any individual WAL records to analyse. This works by using {inv, dbNode, inv} as a key in the filter hash table, but I was trying to look things up by {spcNode, dbNode, inv}. Fixed. 3. The handling for XLOG_SMGR_CREATE was firing for every fork, but it really only needed to fire for the main fork, for now. (There's no reason at all this thing shouldn't prefetch other forks, that's just left for later). 4. To make it easier to see the filtering logic at work, I added code to log messages about that if you #define XLOGPREFETCHER_DEBUG_LEVEL. Could be extended to show more internal state and events... 5. While retesting various scenarios, it bothered me that big seq scan UPDATEs would repeatedly issue posix_fadvise() for the same block (because multiple rows in a page are touched by consecutive records, and the page doesn't make it into the buffer pool until a bit later). I resurrected the defences I had against that a few versions back using a small window of recent prefetches, which I'd originally developed as a way to avoid explicit prefetches of sequential scans (prefetch 1, 2, 3, ...). That turned out to be useless superstition based on ancient discussions in this mailing list, but I think it's still useful to avoid obviously stupid sequences of repeat system calls (prefetch 1, 1, 1, ...). So now it has a little one-cache-line sized window of history, to avoid doing that. I need to re-profile a few workloads after these changes, and then there are a couple of bikeshed-colour items: 1. It's completely arbitrary that it limits its lookahead to maintenance_io_concurrency * 4 blockrefs ahead in the WAL. I have no principled reason to choose 4. In the AIO version of this (to follow), that number of blocks finishes up getting pinned at the same time, so more thought might be needed on that, but that doesn't apply here yet, so it's a bit arbitrary. 2. Defaults for wal_decode_buffer_size and maintenance_io_concurrency are likewise arbitrary. 3. 
At some point in this long thread I was convinced to name the view pg_stat_prefetch_recovery, but the GUC is called recovery_prefetch. That seems silly...
Attachment
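To make point 5 above a little more concrete, here is a tiny standalone sketch of a recent-prefetch window that suppresses repeated advice for the same block. The window size, the reduction of a block reference to a single key, and all names are invented for illustration; the real xlogprefetcher.c keeps its history in roughly one cache line and keys on relation and block number.

#include <stdbool.h>
#include <stdint.h>

#define RECENT_PREFETCH_WINDOW 8

typedef struct RecentPrefetchWindow
{
    uint64_t keys[RECENT_PREFETCH_WINDOW];  /* recently advised block keys */
    int      next;                          /* next slot to recycle */
} RecentPrefetchWindow;

static void
recent_prefetch_init(RecentPrefetchWindow *w)
{
    /* UINT64_MAX marks an unused slot; real keys never take that value here */
    for (int i = 0; i < RECENT_PREFETCH_WINDOW; i++)
        w->keys[i] = UINT64_MAX;
    w->next = 0;
}

/*
 * Return true if this block was advised recently and the syscall should be
 * skipped; otherwise remember it and return false.
 */
static bool
recent_prefetch_seen(RecentPrefetchWindow *w, uint64_t block_key)
{
    for (int i = 0; i < RECENT_PREFETCH_WINDOW; i++)
    {
        if (w->keys[i] == block_key)
            return true;
    }
    w->keys[w->next] = block_key;
    w->next = (w->next + 1) % RECENT_PREFETCH_WINDOW;
    return false;
}

A caller would consult the window immediately before issuing posix_fadvise() and skip the call on a hit.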
On Thu, Mar 31, 2022 at 10:49:32PM +1300, Thomas Munro wrote: > On Mon, Mar 21, 2022 at 9:29 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > So I finally finished looking at this patch. Here again, AFAICS the feature is > > working as expected and I didn't find any problem. I just have some minor > > comments, like for the previous patch. > > Thanks very much for the review. I've attached a new version > addressing most of your feedback, and also rebasing over the new > WAL-logged CREATE DATABASE. I've also fixed a couple of bugs (see > end). > > > For the docs: > > > > + Whether to try to prefetch blocks that are referenced in the WAL that > > + are not yet in the buffer pool, during recovery. Valid values are > > + <literal>off</literal> (the default), <literal>on</literal> and > > + <literal>try</literal>. The setting <literal>try</literal> enables > > + prefetching only if the operating system provides the > > + <function>posix_fadvise</function> function, which is currently used > > + to implement prefetching. Note that some operating systems provide the > > + function, but don't actually perform any prefetching. > > > > Is there any reason not to change it to try? I'm wondering if some system says > > that the function exists but simply raise an error if you actually try to use > > it. I think that at least WSL does that for some functions. > > Yeah, we could just default it to try. Whether we should ship that > way is another question, but done for now. Should there be an associated pg15 open item for that, when the patch will be committed? Note that in wal.sgml, the patch still says: + [...] By default, prefetching in + recovery is disabled. I guess this should be changed even if we eventually choose to disable it by default? > I don't think there are any supported systems that have a > posix_fadvise() that fails with -1, or we'd know about it, because > we already use it in other places. We do support one OS that provides > a dummy function in libc that does nothing at all (Solaris/illumos), > and at least a couple that enter the kernel but are known to do > nothing at all for WILLNEED (AIX, FreeBSD). Ah, I didn't know that, thanks for the info! > > bool > > XLogRecGetBlockTag(XLogReaderState *record, uint8 block_id, > > RelFileNode *rnode, ForkNumber *forknum, BlockNumber *blknum) > > +{ > > + return XLogRecGetBlockInfo(record, block_id, rnode, forknum, blknum, NULL); > > +} > > + > > +bool > > +XLogRecGetBlockInfo(XLogReaderState *record, uint8 block_id, > > + RelFileNode *rnode, ForkNumber *forknum, > > + BlockNumber *blknum, > > + Buffer *prefetch_buffer) > > { > > > > It's missing comments on that function. XLogRecGetBlockTag comments should > > probably be reworded at the same time. > > New comment added for XLogRecGetBlockInfo(). Wish I could come up > with a better name for that... Not quite sure what you thought I should > change about XLogRecGetBlockTag(). Since XLogRecGetBlockTag is now a wrapper for XLogRecGetBlockInfo, I thought it would be better to document only the specific behavior for this one (so no prefetch_buffer), rather than duplicating the whole description in both places. It seems like a good recipe to miss one of the comments the next time something is changed there. For the name, why not the usual XLogRecGetBlockTagExtended()? > > @@ -3350,6 +3392,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, > > */ > > if (lastSourceFailed) > > { > > + /* > > + * Don't allow any retry loops to occur during nonblocking > > + * readahead. 
Let the caller process everything that has been > > + * decoded already first. > > + */ > > + if (nonblocking) > > + return XLREAD_WOULDBLOCK; > > > > Is that really enough? I'm wondering if the code path in ReadRecord() that > > forces lastSourceFailed to False while it actually failed when switching into > > archive recovery (xlogrecovery.c around line 3044) can be problematic here. > > I don't see the problem scenario, could you elaborate? Sorry, I missed that in standby mode ReadRecord would keep going until a record is found, so no problem indeed. > > + /* Do we have a clue where the buffer might be already? */ > > + if (BufferIsValid(recent_buffer) && > > + mode == RBM_NORMAL && > > + ReadRecentBuffer(rnode, forknum, blkno, recent_buffer)) > > + { > > + buffer = recent_buffer; > > + goto recent_buffer_fast_path; > > + } > > > > Should this increment (local|shared)_blks_hit, since ReadRecentBuffer doesn't? > > Hmm. I guess ReadRecentBuffer() should really do that. Done. Ah, I also thought it be be better there but was assuming that there was some possible usage where it's not wanted. Good then! Should ReadRecentBuffer comment be updated to mention that pgBufferUsage is incremented as appropriate? FWIW that's the first place I looked when checking if the stats would be incremented. > > Missed in the previous patch: XLogDecodeNextRecord() isn't a trivial function, > > so some comments would be helpful. > > OK, I'll come back to that. Ok! > > > +/* > > + * A callback that reads ahead in the WAL and tries to initiate one IO. > > + */ > > +static LsnReadQueueNextStatus > > +XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn) > > > > Should there be a bit more comments about what this function is supposed to > > enforce? > > I have added a comment to explain. small typos: + * Returns LRQ_NEXT_IO if the next block reference and it isn't in the buffer + * pool, [...] I guess s/if the next block/if there's a next block/ or s/and it//. + * Returns LRQ_NO_IO if we examined the next block reference and found that it + * was already in the buffer pool. should be LRQ_NEXT_NO_IO, and also this is returned if prefetching is disabled or it the next block isn't prefetchable. > > I'm wondering if it's a bit overkill to implement this as a callback. Do you > > have near future use cases in mind? For now no other code could use the > > infrastructure at all as the lrq is private, so some changes will be needed to > > make it truly configurable anyway. > > Yeah. Actually, in the next step I want to throw away the lrq part, > and keep just the XLogPrefetcherNextBlock() function, with some small > modifications. Ah I see, that makes sense then. > > Admittedly the control flow is a little confusing, but the point of > this architecture is to separate "how to prefetch one more thing" from > "when to prefetch, considering I/O depth and related constraints". > The first thing, "how", is represented by XLogPrefetcherNextBlock(). > The second thing, "when", is represented here by the > LsnReadQueue/lrq_XXX stuff that is private in this file for now, but > later I will propose to replace that second thing with the > pg_streaming_read facility of commitfest entry 38/3316. This is a way > of getting there step by step. I also wrote briefly about that here: > > https://www.postgresql.org/message-id/CA%2BhUKGJ7OqpdnbSTq5oK%3DdjSeVW2JMnrVPSm8JC-_dbN6Y7bpw%40mail.gmail.com I unsurprisingly didn't read the direct IO patch, and also joined the prefetching thread quite recently so I missed that mail. 
Thanks for the pointer! > > > If we keep it as a callback, I think it would make sense to extract some part, > > like the main prefetch filters / global-limit logic, so other possible > > implementations can use it if needed. It would also help to reduce this > > function a bit, as it's somewhat long. > > I can't imagine reusing any of those filtering things anywhere else. > I admit that the function is kinda long... Yeah, I thought your plan was to provide custom prefetching method or something like that. As-is, apart from making the function less long it wouldn't do much. > Other changes: > [...] > 3. The handling for XLOG_SMGR_CREATE was firing for every fork, but > it really only needed to fire for the main fork, for now. (There's no > reason at all this thing shouldn't prefetch other forks, that's just > left for later). Ah indeed. While at it, should there some comments on top of the file mentioning that only the main fork is prefetched? > 4. To make it easier to see the filtering logic at work, I added code > to log messages about that if you #define XLOGPREFETCHER_DEBUG_LEVEL. > Could be extended to show more internal state and events... FTR I also tested the patch defining this. I will probably define it on my buildfarm animal when the patch is committed to make sure it doesn't get broken. > 5. While retesting various scenarios, it bothered me that big seq > scan UPDATEs would repeatedly issue posix_fadvise() for the same block > (because multiple rows in a page are touched by consecutive records, > and the page doesn't make it into the buffer pool until a bit later). > I resurrected the defences I had against that a few versions back > using a small window of recent prefetches, which I'd originally > developed as a way to avoid explicit prefetches of sequential scans > (prefetch 1, 2, 3, ...). That turned out to be useless superstition > based on ancient discussions in this mailing list, but I think it's > still useful to avoid obviously stupid sequences of repeat system > calls (prefetch 1, 1, 1, ...). So now it has a little one-cache-line > sized window of history, to avoid doing that. Nice! + * To detect repeat access to the same block and skip useless extra system + * calls, we remember a small windows of recently prefetched blocks. Should it be "repeated" access, and small window (singular)? Also, I'm wondering if the "seq" part of the related pieces is a bit too much specific, as there could be other workloads that lead to repeated update of the same blocks. Maybe it's ok to use it for internal variables, but the new skip_seq field seems a bit too obscure for some user facing thing. Maybe skip_same, skip_repeated or something like that? > I need to re-profile a few workloads after these changes, and then > there are a couple of bikeshed-colour items: > > 1. It's completely arbitrary that it limits its lookahead to > maintenance_io_concurrency * 4 blockrefs ahead in the WAL. I have no > principled reason to choose 4. In the AIO version of this (to > follow), that number of blocks finishes up getting pinned at the same > time, so more thought might be needed on that, but that doesn't apply > here yet, so it's a bit arbitrary. Yeah, I don't see that as a blocker for now. Maybe use some #define to make it more obvious though, as it's a bit hidden in the code right now? > 3. At some point in this long thread I was convinced to name the view > pg_stat_prefetch_recovery, but the GUC is called recovery_prefetch. > That seems silly... 
FWIW I prefer recovery_prefetch to prefetch_recovery.
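To restate the "how to prefetch one more thing" versus "when to prefetch" separation described above in code form, here is a minimal sketch with invented names: one callback examines the next block reference and possibly starts an I/O, while a small driver loop decides how many such I/Os may be in flight. The real code also reclaims in-flight slots as reads complete; that part is omitted here.

/* What the 'how' side reports for one step of lookahead. */
typedef enum
{
    SKETCH_NEXT_IO,     /* examined a block reference and started an I/O */
    SKETCH_NEXT_NO_IO,  /* examined a block reference, no I/O was needed */
    SKETCH_NEXT_NONE    /* nothing more to look at right now */
} SketchNextStatus;

typedef SketchNextStatus (*sketch_next_block_cb) (void *arg);

/*
 * The 'when' side: keep asking for one more block until we either reach the
 * configured I/O depth or run out of decoded WAL to look at.
 */
static void
sketch_issue_until_full(sketch_next_block_cb next_block, void *arg,
                        int *ios_in_flight, int max_ios_in_flight)
{
    while (*ios_in_flight < max_ios_in_flight)
    {
        SketchNextStatus status = next_block(arg);

        if (status == SKETCH_NEXT_IO)
            (*ios_in_flight)++;
        else if (status == SKETCH_NEXT_NONE)
            break;
        /* SKETCH_NEXT_NO_IO: just keep looking ahead */
    }
}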
On Mon, Apr 4, 2022 at 3:12 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > [review] Thanks! I took almost all of your suggestions about renaming things, comments, docs and moving a magic number into a macro. Minor changes: 1. Rebased over the shmem stats changes and others that have just landed today (woo!). The way my simple SharedStats object works and is reset looks a little primitive next to the shiny new stats infrastructure, but I can always adjust that in a follow-up patch if required. 2. It was a bit annoying that the pg_stat_recovery_prefetch view would sometimes show stale numbers when waiting for WAL to be streamed, since that happens at arbitrary points X bytes apart in the WAL. Now it also happens before sleeping/waiting and when recovery ends. 3. Last year, commit a55a9847 synchronised config.sgml with guc.c's categories. A couple of hunks in there that modified the previous version of this work before it all got reverted. So I've re-added the WAL_RECOVERY GUC category, to match the new section in config.sgml. About test coverage, the most interesting lines of xlogprefetcher.c that stand out as unreached in a gcov report are in the special handling for the new CREATE DATABASE in file-copy mode -- but that's probably something to raise in the thread that introduced that new functionality without a test. I've tested that code locally; if you define XLOGPREFETCHER_DEBUG_LEVEL you'll see that it won't touch anything in the new database until recovery has replayed the file-copy. As for current CI-vs-buildfarm blind spots that recently bit me and others, I also tested -m32 and -fsanitize=undefined,unaligned builds. I reran one of the quick pgbench/crash/drop-caches/recover tests I had lying around and saw a 17s -> 6s speedup with FPW off (you need much longer tests to see speedup with them on, so this is a good way for quick sanity checks -- see Tomas V's results for long runs with FPWs and curved effects). With that... I've finally pushed the 0002 patch and will be watching the build farm.
The docs seem to be wrong about the default. + are not yet in the buffer pool, during recovery. Valid values are + <literal>off</literal> (the default), <literal>on</literal> and + <literal>try</literal>. The setting <literal>try</literal> enables + concurrency and distance, respectively. By default, it is set to + <literal>try</literal>, which enabled the feature on systems where + <function>posix_fadvise</function> is available. Should say "which enables". + { + {"recovery_prefetch", PGC_SIGHUP, WAL_RECOVERY, + gettext_noop("Prefetch referenced blocks during recovery"), + gettext_noop("Look ahead in the WAL to find references to uncached data.") + }, + &recovery_prefetch, + RECOVERY_PREFETCH_TRY, recovery_prefetch_options, + check_recovery_prefetch, assign_recovery_prefetch, NULL + }, Curiously, I reported a similar issue last year. On Thu, Apr 08, 2021 at 10:37:04PM -0500, Justin Pryzby wrote: > --- a/doc/src/sgml/wal.sgml > +++ b/doc/src/sgml/wal.sgml > @@ -816,9 +816,7 @@ > prefetching mechanism is most likely to be effective on systems > with <varname>full_page_writes</varname> set to > <varname>off</varname> (where that is safe), and where the working > - set is larger than RAM. By default, prefetching in recovery is enabled > - on operating systems that have <function>posix_fadvise</function> > - support. > + set is larger than RAM. By default, prefetching in recovery is disabled. > </para> > </sect1>
On Fri, Apr 8, 2022 at 12:55 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > The docs seem to be wrong about the default. > > + are not yet in the buffer pool, during recovery. Valid values are > + <literal>off</literal> (the default), <literal>on</literal> and > + <literal>try</literal>. The setting <literal>try</literal> enables Fixed. > + concurrency and distance, respectively. By default, it is set to > + <literal>try</literal>, which enabled the feature on systems where > + <function>posix_fadvise</function> is available. > > Should say "which enables". Fixed. > Curiously, I reported a similar issue last year. Sorry. I guess both times we only agreed on what the default should be in the final review round before commit, and I let the docs get out of sync (well, the default is mentioned in two places and I apparently ended my search too soon, changing only one). I also found another recently obsoleted sentence: the one about showing nulls sometimes was no longer true. Removed.
Hi, Thank you for developing the great feature. I tested this feature and checked the documentation. Currently, the documentation for the pg_stat_prefetch_recovery view is included in the description for the pg_stat_subscription view. https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION It is also not displayed in the list of "28.2. The Statistics Collector". https://www.postgresql.org/docs/devel/monitoring.html The attached patch modifies the pg_stat_prefetch_recovery view to appear as a separate view. Regards, Noriyoshi Shinoda -----Original Message----- From: Thomas Munro <thomas.munro@gmail.com> Sent: Friday, April 8, 2022 10:47 AM To: Justin Pryzby <pryzby@telsasoft.com> Cc: Tomas Vondra <tomas.vondra@enterprisedb.com>; Stephen Frost <sfrost@snowman.net>; Andres Freund <andres@anarazel.de>; Jakub Wartak <Jakub.Wartak@tomtom.com>; Alvaro Herrera <alvherre@2ndquadrant.com>; Tomas Vondra <tomas.vondra@2ndquadrant.com>; Dmitry Dolgov <9erthalion6@gmail.com>; David Steele <david@pgmasters.net>; pgsql-hackers <pgsql-hackers@postgresql.org> Subject: Re: WIP: WAL prefetch (another approach) On Fri, Apr 8, 2022 at 12:55 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > The docs seem to be wrong about the default. > > + are not yet in the buffer pool, during recovery. Valid values are > + <literal>off</literal> (the default), <literal>on</literal> and > + <literal>try</literal>. The setting <literal>try</literal> > + enables Fixed. > + concurrency and distance, respectively. By default, it is set to > + <literal>try</literal>, which enabled the feature on systems where > + <function>posix_fadvise</function> is available. > > Should say "which enables". Fixed. > Curiously, I reported a similar issue last year. Sorry. I guess both times we only agreed on what the default should be in the final review round before commit, and I let the docs get out of sync (well, the default is mentioned in two places and I apparently ended my search too soon, changing only one). I also found another recently obsoleted sentence: the one about showing nulls sometimes was no longer true. Removed.
Attachment
On Tue, Apr 12, 2022 at 9:03 PM Shinoda, Noriyoshi (PN Japan FSIP) <noriyoshi.shinoda@hpe.com> wrote: > Thank you for developing the great feature. I tested this feature and checked the documentation. Currently, the documentationfor the pg_stat_prefetch_recovery view is included in the description for the pg_stat_subscription view. > > https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION Hi! Thanks. I had just committed a fix before I saw your message, because there was already another report here: https://www.postgresql.org/message-id/flat/CAKrAKeVk-LRHMdyT6x_p33eF6dCorM2jed5h_eHdRdv0reSYTA%40mail.gmail.com
Hi, Thank you for your reply. I missed the message, sorry. Regards, Noriyoshi Shinoda -----Original Message----- From: Thomas Munro <thomas.munro@gmail.com> Sent: Tuesday, April 12, 2022 6:28 PM To: Shinoda, Noriyoshi (PN Japan FSIP) <noriyoshi.shinoda@hpe.com> Cc: Justin Pryzby <pryzby@telsasoft.com>; Tomas Vondra <tomas.vondra@enterprisedb.com>; Stephen Frost <sfrost@snowman.net>; Andres Freund <andres@anarazel.de>; Jakub Wartak <Jakub.Wartak@tomtom.com>; Alvaro Herrera <alvherre@2ndquadrant.com>; Tomas Vondra <tomas.vondra@2ndquadrant.com>; Dmitry Dolgov <9erthalion6@gmail.com>; David Steele <david@pgmasters.net>; pgsql-hackers <pgsql-hackers@postgresql.org> Subject: Re: WIP: WAL prefetch (another approach) On Tue, Apr 12, 2022 at 9:03 PM Shinoda, Noriyoshi (PN Japan FSIP) <noriyoshi.shinoda@hpe.com> wrote: > Thank you for developing the great feature. I tested this feature and checked the documentation. Currently, the documentation for the pg_stat_prefetch_recovery view is included in the description for the pg_stat_subscription view. > > https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION Hi! Thanks. I had just committed a fix before I saw your message, because there was already another report here: https://www.postgresql.org/message-id/flat/CAKrAKeVk-LRHMdyT6x_p33eF6dCorM2jed5h_eHdRdv0reSYTA@mail.gmail.com
On Thu, 7 Apr 2022 at 08:46, Thomas Munro <thomas.munro@gmail.com> wrote: > With that... I've finally pushed the 0002 patch and will be watching > the build farm. This is a nice feature if it is safe to turn off full_page_writes. When is it safe to do that? On which platform? I am not aware of any released software that allows full_page_writes to be safely disabled. Perhaps something has been released recently that allows this? I think we have substantial documentation about safety of other settings, so we should carefully document things here also. -- Simon Riggs http://www.EnterpriseDB.com/
On 4/12/22 15:58, Simon Riggs wrote: > On Thu, 7 Apr 2022 at 08:46, Thomas Munro <thomas.munro@gmail.com> wrote: > >> With that... I've finally pushed the 0002 patch and will be watching >> the build farm. > > This is a nice feature if it is safe to turn off full_page_writes. > > When is it safe to do that? On which platform? > > I am not aware of any released software that allows full_page_writes > to be safely disabled. Perhaps something has been released recently > that allows this? I think we have substantial documentation about > safety of other settings, so we should carefully document things here > also. > I don't see why/how would an async prefetch make FPW unnecessary. Did anyone claim that be the case? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 12 Apr 2022 at 16:41, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > > On 4/12/22 15:58, Simon Riggs wrote: > > On Thu, 7 Apr 2022 at 08:46, Thomas Munro <thomas.munro@gmail.com> wrote: > > > >> With that... I've finally pushed the 0002 patch and will be watching > >> the build farm. > > > > This is a nice feature if it is safe to turn off full_page_writes. > > > > When is it safe to do that? On which platform? > > > > I am not aware of any released software that allows full_page_writes > > to be safely disabled. Perhaps something has been released recently > > that allows this? I think we have substantial documentation about > > safety of other settings, so we should carefully document things here > > also. > > > > I don't see why/how would an async prefetch make FPW unnecessary. Did > anyone claim that be the case? Other way around. FPWs make prefetch unnecessary. Therefore you would only want prefetch with FPW=off, AFAIK. Or put this another way: when is it safe and sensible to use recovery_prefetch != off? -- Simon Riggs http://www.EnterpriseDB.com/
Simon Riggs <simon.riggs@enterprisedb.com> writes: > On Thu, 7 Apr 2022 at 08:46, Thomas Munro <thomas.munro@gmail.com> wrote: > >> With that... I've finally pushed the 0002 patch and will be watching >> the build farm. > > This is a nice feature if it is safe to turn off full_page_writes. > > When is it safe to do that? On which platform? > > I am not aware of any released software that allows full_page_writes > to be safely disabled. Perhaps something has been released recently > that allows this? I think we have substantial documentation about > safety of other settings, so we should carefully document things here > also. Our WAL reliability docs claim that ZFS is safe against torn pages: https://www.postgresql.org/docs/current/wal-reliability.html: If you have file-system software that prevents partial page writes (e.g., ZFS), you can turn off this page imaging by turning off the full_page_writes parameter. - ilmari
On 4/12/22 17:46, Simon Riggs wrote: > On Tue, 12 Apr 2022 at 16:41, Tomas Vondra > <tomas.vondra@enterprisedb.com> wrote: >> >> On 4/12/22 15:58, Simon Riggs wrote: >>> On Thu, 7 Apr 2022 at 08:46, Thomas Munro <thomas.munro@gmail.com> wrote: >>> >>>> With that... I've finally pushed the 0002 patch and will be watching >>>> the build farm. >>> >>> This is a nice feature if it is safe to turn off full_page_writes. >>> >>> When is it safe to do that? On which platform? >>> >>> I am not aware of any released software that allows full_page_writes >>> to be safely disabled. Perhaps something has been released recently >>> that allows this? I think we have substantial documentation about >>> safety of other settings, so we should carefully document things here >>> also. >>> >> >> I don't see why/how would an async prefetch make FPW unnecessary. Did >> anyone claim that be the case? > > Other way around. FPWs make prefetch unnecessary. > Therefore you would only want prefetch with FPW=off, AFAIK. > > Or put this another way: when is it safe and sensible to use > recovery_prefetch != off? > That assumes the FPI stays in memory until the next modification, and that can be untrue for a number of reasons. A long checkpoint interval with enough random accesses in between is a nice example. See the benchmarks I did a year ago (regular pgbench). Or imagine a r/o replica used to run analytics queries that access so much data that they evict the buffers initialized by the FPI records. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
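To make the mechanism concrete: the stall Tomas describes is a synchronous read of a page whose FPI-initialized buffer has already been evicted. The standalone toy below is not code from the patch, and the file and block-number arguments are just placeholders; it only shows what a WILLNEED-based prefetch amounts to at the kernel level. You hint the block early, do other work, and the later pread() should find the data already in the page cache on filesystems that implement the advice.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
	if (argc < 3)
	{
		fprintf(stderr, "usage: %s <relation-file> <blockno>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	off_t offset = (off_t) strtol(argv[2], NULL, 10) * BLCKSZ;

	/*
	 * Tell the kernel we'll need this block soon.  The call returns
	 * immediately; where POSIX_FADV_WILLNEED is implemented, the read is
	 * started in the background.
	 */
	int rc = posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);

	if (rc != 0)
		fprintf(stderr, "posix_fadvise: %d\n", rc);

	/* ... replay of other, unrelated WAL records would happen here ... */

	/* By the time the page is actually needed, it should already be cached. */
	char page[BLCKSZ];

	if (pread(fd, page, BLCKSZ, offset) != BLCKSZ)
		perror("pread");

	close(fd);
	return 0;
}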
> Other way around. FPWs make prefetch unnecessary.
> Therefore you would only want prefetch with FPW=off, AFAIK.

A few scenarios where I can imagine page prefetch helping:

1/ A DR replica instance that is a smaller instance size than the primary. Page prefetch can bring pages back into memory in advance when they have been evicted. This speeds up replay and is cost effective.

2/ It allows a larger checkpoint_timeout for the same recovery SLA, and perhaps improved performance?

3/ Prefetching the WAL itself (not the data pages) can improve replay on its own (not sure if it was measured in isolation; Tomas V can comment on it).

4/ The read replica running an analytical workload scenario Tomas V mentioned earlier.

> Or put this another way: when is it safe and sensible to use
> recovery_prefetch != off?

When checkpoint_timeout is set large and there is heavy write activity, on a read replica whose working set is larger than memory and which receives constant updates from the primary. This covers 1 & 4 above.

> --
> Simon Riggs http://www.EnterpriseDB.com/
On Wed, Apr 13, 2022 at 3:57 AM Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote: > Simon Riggs <simon.riggs@enterprisedb.com> writes: > > This is a nice feature if it is safe to turn off full_page_writes.

As others have said/shown, it also helps if a block with an FPW is evicted and then read back in during one checkpoint cycle, in other words if the working set is larger than shared buffers.

This also provides infrastructure for proposals in the next cycle, as part of commitfest #3316:

* in direct I/O mode, I/O stalls become more likely due to lack of kernel prefetching/double-buffering, so prefetching becomes more essential
* even in buffered I/O mode when benefiting from free double-buffering, the copy from kernel buffer to user space buffer can be finished in the background instead of calling pread() when you need the page, but you need to start it sooner
* adjacent blocks accessed by nearby records can be merged into a single scatter-read, for example with preadv() in the background
* repeated buffer lookups, pins, locks (and maybe eventually replay) to the same page can be consolidated

Pie-in-the-sky ideas:

* someone might eventually want to be able to replay in parallel (hard, but certainly requires lookahead)
* I sure hope we'll eventually use different techniques for torn-page protection to avoid the high online costs of FPW

> > When is it safe to do that? On which platform? > > > > I am not aware of any released software that allows full_page_writes > > to be safely disabled. Perhaps something has been released recently > > that allows this? I think we have substantial documentation about > > safety of other settings, so we should carefully document things here > > also. > > Our WAL reliability docs claim that ZFS is safe against torn pages: > > https://www.postgresql.org/docs/current/wal-reliability.html: > > If you have file-system software that prevents partial page writes > (e.g., ZFS), you can turn off this page imaging by turning off the > full_page_writes parameter.

Unfortunately, posix_fadvise(WILLNEED) doesn't do anything on ZFS right now :-(. I have some patches to fix that on Linux[1] and FreeBSD and it seems like there's a good chance of getting them committed based on feedback, but it needs some more work on tests and mmap integration. If anyone's interested in helping get that landed faster, please ping me off-list.

[1] https://github.com/openzfs/zfs/pull/9807
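For the scatter-read bullet above, here is a minimal standalone sketch (not proposed PostgreSQL code; the relation-file and first-block arguments are placeholders) of merging two adjacent 8kB block reads into a single system call with preadv():

#define _DEFAULT_SOURCE			/* for preadv() with glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
	if (argc < 3)
	{
		fprintf(stderr, "usage: %s <relation-file> <first-blockno>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	long blkno = strtol(argv[2], NULL, 10);
	char buf0[BLCKSZ];
	char buf1[BLCKSZ];

	/* Two adjacent blocks, two destination buffers, one system call. */
	struct iovec iov[2] = {
		{.iov_base = buf0, .iov_len = BLCKSZ},
		{.iov_base = buf1, .iov_len = BLCKSZ}
	};

	ssize_t nread = preadv(fd, iov, 2, (off_t) blkno * BLCKSZ);

	if (nread < 0)
		perror("preadv");
	else
		printf("read %zd bytes covering blocks %ld and %ld\n",
			   nread, blkno, blkno + 1);

	close(fd);
	return 0;
}

In the patch's context the destinations would presumably be separate shared buffers rather than stack arrays, but the point is the same: one vectored read replaces two single-block reads.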
I believe that the WAL prefetch patch probably accounts for the intermittent errors that buildfarm member topminnow has shown since it went in, eg [1]:

diff -U3 /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out
--- /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out  2022-04-10 03:05:15.972622440 +0200
+++ /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out  2022-04-25 05:09:49.861642059 +0200
@@ -34,11 +34,7 @@
 (1 row)
 
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_records_info_till_end_of_wal(:'wal_lsn1');
- ok
-----
- t
-(1 row)
-
+ERROR: could not read WAL at 0/1903E40
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats(:'wal_lsn1', :'wal_lsn2');
  ok
 ----
@@ -46,11 +42,7 @@
 (1 row)
 
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
- ok
-----
- t
-(1 row)
-
+ERROR: could not read WAL at 0/1903E40
 -- ===================================================================
 -- Test for filtering out WAL records of a particular table
 -- ===================================================================

I've reproduced this manually on that machine, and confirmed that the proximate cause is that XLogNextRecord() is returning NULL because state->decode_queue_head == NULL, without bothering to provide an errormsg (which doesn't seem very well thought out in itself). I obtained the contents of the xlogreader struct at failure:

(gdb) p *xlogreader
$1 = {routine = {page_read = 0x594270 <read_local_xlog_page_no_wait>,
    segment_open = 0x593b44 <wal_segment_open>,
    segment_close = 0x593d38 <wal_segment_close>},
  system_identifier = 0, private_data = 0x0, ReadRecPtr = 26230672,
  EndRecPtr = 26230752, abortedRecPtr = 26230752,
  missingContrecPtr = 26230784, overwrittenRecPtr = 0,
  DecodeRecPtr = 26230672, NextRecPtr = 26230752, PrevRecPtr = 0,
  record = 0x0, decode_buffer = 0xf25428 "\240", decode_buffer_size = 65536,
  free_decode_buffer = true, decode_buffer_head = 0xf25428 "\240",
  decode_buffer_tail = 0xf25428 "\240", decode_queue_head = 0x0,
  decode_queue_tail = 0x0, readBuf = 0xf173f0 "\020\321\005", readLen = 0,
  segcxt = {ws_dir = '\000' <repeats 1023 times>, ws_segsize = 16777216},
  seg = {ws_file = 25, ws_segno = 0, ws_tli = 1}, segoff = 0,
  latestPagePtr = 26222592, latestPageTLI = 1, currRecPtr = 26230752,
  currTLI = 1, currTLIValidUntil = 0, nextTLI = 0,
  readRecordBuf = 0xf1b3f8 "<", readRecordBufSize = 40960,
  errormsg_buf = 0xef3270 "", errormsg_deferred = false, nonblocking = false}

I don't have an intuition about where to look beyond that, any suggestions? What I do know so far is that while the failure reproduces fairly reliably under "make check" (more than half the time, which squares with topminnow's history), it doesn't reproduce at all under "make installcheck" (after removing NO_INSTALLCHECK), which seems odd. Maybe it's dependent on how much WAL history the installation has accumulated? It could be that this is a bug in pg_walinspect or a fault in its test case; hard to tell since that got committed at about the same time as the prefetch changes.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2022-04-25%2001%3A48%3A47
Oh, one more bit of data: here's an excerpt from pg_waldump output after the failed test:

rmgr: Btree len (rec/tot): 72/ 72, tx: 727, lsn: 0/01903BC8, prev 0/01903B70, desc: INSERT_LEAF off 111, blkref #0: rel 1663/16384/2673 blk 9
rmgr: Btree len (rec/tot): 72/ 72, tx: 727, lsn: 0/01903C10, prev 0/01903BC8, desc: INSERT_LEAF off 141, blkref #0: rel 1663/16384/2674 blk 7
rmgr: Standby len (rec/tot): 42/ 42, tx: 727, lsn: 0/01903C58, prev 0/01903C10, desc: LOCK xid 727 db 16384 rel 16391
rmgr: Transaction len (rec/tot): 437/ 437, tx: 727, lsn: 0/01903C88, prev 0/01903C58, desc: COMMIT 2022-04-25 20:16:03.374197 CEST; inval msgs: catcache 80 catcache 79 catcache 80 catcache 79 catcache 55 catcache 54 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 catcache 7 catcache 6 snapshot 2608 relcache 16391
rmgr: Heap len (rec/tot): 59/ 59, tx: 728, lsn: 0/01903E40, prev 0/01903C88, desc: INSERT+INIT off 1 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Heap len (rec/tot): 59/ 59, tx: 728, lsn: 0/01903E80, prev 0/01903E40, desc: INSERT off 2 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 728, lsn: 0/01903EC0, prev 0/01903E80, desc: COMMIT 2022-04-25 20:16:03.379323 CEST
rmgr: Heap len (rec/tot): 59/ 59, tx: 729, lsn: 0/01903EE8, prev 0/01903EC0, desc: INSERT off 3 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Heap len (rec/tot): 59/ 59, tx: 729, lsn: 0/01903F28, prev 0/01903EE8, desc: INSERT off 4 flags 0x00, blkref #0: rel 1663/16384/16391 blk 0
rmgr: Transaction len (rec/tot): 34/ 34, tx: 729, lsn: 0/01903F68, prev 0/01903F28, desc: COMMIT 2022-04-25 20:16:03.381720 CEST

The error is complaining about not being able to read 0/01903E40, which AFAICT is from the first "INSERT INTO sample_tbl" command, which most certainly ought to be down to disk at this point.

Also, I modified the test script to see what WAL LSNs it thought it was dealing with, and got

+\echo 'wal_lsn1 = ' :wal_lsn1
+wal_lsn1 = 0/1903E40
+\echo 'wal_lsn2 = ' :wal_lsn2
+wal_lsn2 = 0/1903EE8

confirming that idea of where 0/01903E40 is in the WAL history. So this is sure looking like a bug somewhere in xlogreader.c, not in pg_walinspect.

regards, tom lane
On Tue, Apr 26, 2022 at 6:11 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I believe that the WAL prefetch patch probably accounts for the > intermittent errors that buildfarm member topminnow has shown > since it went in, eg [1]: > > diff -U3 /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out /home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out Hmm, maybe but I suspect not. I think I might see what's happening here. > +ERROR: could not read WAL at 0/1903E40 > I've reproduced this manually on that machine, and confirmed that the > proximate cause is that XLogNextRecord() is returning NULL because > state->decode_queue_head == NULL, without bothering to provide an errormsg > (which doesn't seem very well thought out in itself). I obtained the Thanks for doing that. After several hours of trying I also managed to reproduce it on that gcc23 system (not at all sure why it doesn't show up elsewhere; MIPS 32 bit layout may be a factor), and added some trace to get some more clues. Still looking into it, but here is the current hypothesis I'm testing: 1. The reason there's a messageless ERROR in this case is because there is new read_page callback logic introduced for pg_walinspect, called via read_local_xlog_page_no_wait(), which is like the old read_local_xlog_page() except that it returns -1 if you try to read past the current "flushed" LSN, and we have no queued message. An error is then reported by XLogReadRecord(), and appears to the user. 2. The reason pg_walinspect tries to read WAL data past the flushed LSN is because its GetWALRecordsInfo() function keeps calling XLogReadRecord() until EndRecPtr >= end_lsn, where end_lsn is taken from a snapshot of the flushed LSN, but I don't see where it takes into account that the flushed LSN might momentarily fall in the middle of a record. In that case, xlogreader.c will try to read the next page, which fails because it's past the flushed LSN (see point 1). I will poke some more tomorrow to try to confirm this and try to come up with a fix.
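To illustrate point 2 with made-up numbers: the toy below (not pg_walinspect code, all LSNs invented) models a loop whose stop condition looks only at the end of the last record already read. If the flushed-LSN snapshot lands inside a record, the loop still attempts the straddling record, and fetching it requires WAL beyond the flushed LSN, which is exactly where a no-wait page callback gives up without queuing a message.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t FakeLSN;		/* stand-in for XLogRecPtr */

int
main(void)
{
	/* Three made-up records and their [start, end) LSN ranges. */
	FakeLSN rec_start[] = {100, 180, 260};
	FakeLSN rec_end[] = {180, 260, 300};
	int nrecs = 3;

	/* Snapshot of the flushed LSN, taken while the second record was only partly written out. */
	FakeLSN flushed = 220;
	FakeLSN end_lsn = flushed;

	FakeLSN EndRecPtr = 0;		/* end of the last record read; 0 = nothing read yet */

	for (int i = 0; i < nrecs && EndRecPtr < end_lsn; i++)
	{
		if (rec_end[i] > flushed)
		{
			/*
			 * Reading this record needs WAL beyond the flushed LSN; this is
			 * where the no-wait callback would return -1 and the caller
			 * would see an ERROR with no error message queued.
			 */
			printf("record [%llu, %llu) straddles flushed LSN %llu -> ERROR\n",
				   (unsigned long long) rec_start[i],
				   (unsigned long long) rec_end[i],
				   (unsigned long long) flushed);
			return 1;
		}
		printf("read record [%llu, %llu)\n",
			   (unsigned long long) rec_start[i],
			   (unsigned long long) rec_end[i]);
		EndRecPtr = rec_end[i];
	}
	return 0;
}

Presumably any fix has to allow for end_lsn falling inside a record, but that is speculation pending the follow-up below.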
On Tue, Apr 26, 2022 at 6:11 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I will poke some more tomorrow to try to confirm this and try to come > up with a fix. Done, and moved over to the pg_walinspect commit thread to reach the right eyeballs: https://www.postgresql.org/message-id/CA%2BhUKGLtswFk9ZO3WMOqnDkGs6dK5kCdQK9gxJm0N8gip5cpiA%40mail.gmail.com
On Wed, Apr 13, 2022 at 8:05 AM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Apr 13, 2022 at 3:57 AM Dagfinn Ilmari Mannsåker > <ilmari@ilmari.org> wrote: > > Simon Riggs <simon.riggs@enterprisedb.com> writes: > > > This is a nice feature if it is safe to turn off full_page_writes. > > > When is it safe to do that? On which platform? > > > > > > I am not aware of any released software that allows full_page_writes > > > to be safely disabled. Perhaps something has been released recently > > > that allows this? I think we have substantial documentation about > > > safety of other settings, so we should carefully document things here > > > also. > > > > Our WAL reliability docs claim that ZFS is safe against torn pages: > > > > https://www.postgresql.org/docs/current/wal-reliability.html: > > > > If you have file-system software that prevents partial page writes > > (e.g., ZFS), you can turn off this page imaging by turning off the > > full_page_writes parameter. > > Unfortunately, posix_fadvise(WILLNEED) doesn't do anything on ZFS > right now :-(. Update: OpenZFS now has this working in its master branch (Linux only for now), so fingers crossed for the next release.