Thread: Is there a way to speed up WAL replay?
Hi,
I am working on restoring a database from a base backup + WAL. With the default settings the database replays about 3-4 WAL files per second. The startup process takes about 65% of a CPU and writes data with something between 50 and 100 MB/sec.
Is there a way to speed that up? The disk can easily sustain 400-500 MB/sec.
Thanks,
Torsten
On Wed, Oct 31, 2018 at 1:38 AM Torsten Förtsch <tfoertsch123@gmail.com> wrote:
> Hi,
>
> I am working on restoring a database from a base backup + WAL. With the default settings the database replays about 3-4 WAL files per second. The startup process takes about 65% of a CPU and writes data with something between 50 and 100 MB/sec.
>
> Is there a way to speed that up? The disk can easily sustain 400-500 MB/sec.
WAL replay is single-threaded, so the most you would be able to speed it up is 50%, to where it would be taking 100% CPU.
Is the time spent not on the CPU being spent waiting for WAL files to arrive from the restore_command, or waiting for the blocks it needs to replay into to get read from disk, or waiting for dirty pages to get written to disk so they can be evicted to make way for new ones?
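For example, something along these lines can show which of those it is (a rough sketch, assuming Linux with sysstat and strace available; the pgrep pattern is only a guess at the startup process title):

# Find the startup/recovery process (title is usually "postgres: startup ...").
STARTUP_PID=$(pgrep -f 'postgres.*startup' | head -n1)

# Per-process CPU versus disk I/O, sampled every 5 seconds:
pidstat -u -d -p "$STARTUP_PID" 5

# Or tally which system calls it spends its time blocked in:
strace -c -p "$STARTUP_PID"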
One way I found to speed up restore_command is to have another program run a few WAL files ahead of it, copying the WAL from the real archive into a scratch space which is on the same filesystem as pg_xlog/pg_wal. Then have restore_command simply move (not copy) the requested files from the scratch space onto %p. The intra-filesystem move completes much faster than a copy.
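A minimal sketch of that setup (the paths are made up and this is illustrative rather than a tested script):

#!/bin/sh
# Pre-stager: keep a scratch directory, located on the same filesystem
# as pg_wal, topped up with WAL copied from the real (slower) archive.
ARCHIVE=/mnt/wal_archive                # made-up archive location
SCRATCH=/var/lib/pgsql/wal_scratch      # made-up path, same fs as $PGDATA/pg_wal
while true; do
    rsync -a "$ARCHIVE"/ "$SCRATCH"/
    sleep 5
done

The restore_command then only has to do a cheap same-filesystem rename, for example:

restore_command = 'mv /var/lib/pgsql/wal_scratch/%f "%p"'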
If it spends time waiting for blocks that need to be recovered into to get read from disk, and you have enough RAM, you could speed it up by pre-warming the file system cache. Something like:
tar -cf - $PGDATA | wc -c
Cheers,
Jeff
Greetings,

* Torsten Förtsch (tfoertsch123@gmail.com) wrote:
> I am working on restoring a database from a base backup + WAL. With the
> default settings the database replays about 3-4 WAL files per second. The
> startup process takes about 65% of a CPU and writes data with something
> between 50 and 100 MB/sec.

What are you using for a restore_command..? You can typically get some improvement by using a restore_command that's faster or pre-stages files locally to minimize the time required to run, or dumping all the WAL into the xlog, but that's not something I typically recommend.

> Is there a way to speed that up? The disk can easily sustain 400-500 MB/sec.

Kind of depends what you're ultimately going for.. If you've made everything else faster and you're still only seeing 65% CPU utilization on the one CPU, then maybe there's just enough latency to the disks to be an issue, in which case you could possibly restore on to a ramdisk if you've got enough memory/space and don't mind that.

There's other things that can be done too, like adjusting the amount of shared buffers; depending on what you're doing that can also make a difference in replay speed (we've been working to improve that though).

I tend to find that it's better to just reduce the amount of WAL that needs to be replayed by taking incremental backups more frequently and using things like pgbackrest's delta restore ability for doing repeated restores to the same location. Of course, that's not likely to help you out much here.

Thanks!

Stephen
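For reference, that kind of delta restore looks roughly like this (the stanza name is made up):

pgbackrest --stanza=main --delta restore

With --delta, pgbackrest reuses the files already present in the data directory and copies only what has changed since the backup was taken.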
Greetings,

* Jeff Janes (jeff.janes@gmail.com) wrote:
> One way I found to speed up restore_command is to have another program run
> a few WAL files ahead of it, copying the WAL from the real archive into a
> scratch space which is on the same filesystem as pg_xlog/pg_wal. Then have
> restore_command simply move (not copy) the requested files from the scratch
> space onto %p. The intra-filesystem move completes much faster than a copy.

Right, that can definitely be a big help and is more-or-less what pgbackrest does too.

> If it spends time waiting for blocks that need to be recovered into to get
> read from disk, and you have enough RAM, you could speed it up by
> pre-warming the file system cache. Something like:
>
> tar -cf - $PGDATA | wc -c

Yeah, that's also a good idea. We've discussed having something in pgbackrest to basically go pre-load things off disk in advance of WAL replay, and I seem to recall someone had also written an external tool to do that.

Thanks!

Stephen
This tool may be useful:
Faults pages into PostgreSQL shared_buffers or filesystem caches in advance of WAL apply
Nicolas
On Wed, Oct 31, 2018 at 6:38 AM Torsten Förtsch <tfoertsch123@gmail.com> wrote:
> Hi,
>
> I am working on restoring a database from a base backup + WAL. With the default settings the database replays about 3-4 WAL files per second. The startup process takes about 65% of a CPU and writes data with something between 50 and 100 MB/sec.
>
> Is there a way to speed that up? The disk can easily sustain 400-500 MB/sec.
>
> Thanks,
> Torsten
On Thu, Nov 1, 2018 at 4:25 AM Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Oct 31, 2018 at 1:38 AM Torsten Förtsch <tfoertsch123@gmail.com> wrote:
>> I am working on restoring a database from a base backup + WAL. With the default settings the database replays about 3-4 WAL files per second. The startup process takes about 65% of a CPU and writes data with something between 50 and 100 MB/sec.
>>
>> Is there a way to speed that up? The disk can easily sustain 400-500 MB/sec.
>
> WAL replay is single-threaded, so the most you would be able to speed it up is 50%, to where it would be taking 100% CPU.
>
> Is the time spent not on the CPU being spent waiting for WAL files to arrive from the restore_command, or waiting for the blocks it needs to replay into to get read from disk, or waiting for dirty pages to get written to disk so they can be evicted to make way for new ones?
>
> One way I found to speed up restore_command is to have another program run a few WAL files ahead of it, copying the WAL from the real archive into a scratch space which is on the same filesystem as pg_xlog/pg_wal. Then have restore_command simply move (not copy) the requested files from the scratch space onto %p. The intra-filesystem move completes much faster than a copy.
>
> If it spends time waiting for blocks that need to be recovered into to get read from disk, and you have enough RAM, you could speed it up by pre-warming the file system cache. Something like:
>
> tar -cf - $PGDATA | wc -c

For more targeted prewarming of large systems that don't fit in RAM, and to get all the way into PostgreSQL's buffer pool, I suppose you could write a small Python/whatever script that extracts the relfilenode + block references from the output of pg_waldump (one file ahead, or whatever), sorts and uniques them, merges them into block ranges, converts the relfilenode reference to relation OID, and then calls pg_prewarm() for each range.

--
Thomas Munro
http://www.enterprisedb.com
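A very rough shell sketch of that idea (the WAL directory, segment name, and database name are placeholders; it assumes a single database, the pg_prewarm extension installed, and a standby that is already consistent and accepting read-only connections; only main-fork block references are picked up):

#!/bin/sh
WALDIR=/var/lib/pgsql/wal_scratch      # wherever the staged WAL lives
SEGMENT=00000001000000AB000000CD       # the segment one step ahead of replay

# Pull out "blkref #N: rel spc/db/filenode blk B" lines, reduce them to
# (relfilenode, block) pairs, de-duplicate, and prewarm each block.
pg_waldump "$WALDIR/$SEGMENT" |
  grep -o 'blkref #[0-9]*: rel [0-9/]* blk [0-9]*' |
  awk '{ split($4, r, "/"); print r[3], $6 }' |
  sort -n | uniq |
  while read filenode block; do
    psql -d mydb -Atq -c "
      SELECT pg_prewarm(c.oid::regclass, 'buffer', 'main', $block, $block)
      FROM pg_class c WHERE c.relfilenode = $filenode;"
  done

A real version would merge adjacent blocks into ranges and batch the pg_prewarm() calls instead of running one psql per block.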