While reading pg_rewind code I found two things could speed up pg_rewind.
Attached are the patches.
First one: pg_rewind would fsync the whole pgdata directory on the target by default,
but that is a waste since usually just part of the files/directories on
the target are modified. Other files on the target should have been flushed
since pg_rewind requires a clean shutdown before doing the real work. This
would help the scenario that the target postgres instance includes millions of
files, which has been seen in a real environment.
There are several things that may need further discussions:
1. PG_FLUSH_DATA_WORKS was introduced as "Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data”,
but now the code guarded by it is just pre_sync_fname() relevant so we might want
to rename it as HAVE_PRE_SYNC kind of name?
2. Pre_sync_fname() implementation
The code looks like this:
#if defined(HAVE_SYNC_FILE_RANGE)
(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
I’m a bit suspicious about calling posix_fadvise() with POSIX_FADV_DONTNEED.
I did not check the Linux Kernel code but according to the man
page I suspect that this option might cause the kernel tends to evict the related kernel
pages from the page cache, which might not be something we expect. This is
not a big issue since sync_file_range() should exist on many widely used Linux.
Also I’m not sure how much we could benefit from the pre_sync code. Also note if the
directory has a lot of files or the IO is fast, pre_sync_fname() might slow down
the process instead. The reasons are: If there are a lot of files it is possible that we need
to read the already-synced-and-evicted inode from disk (by open()-ing) after rewinding since
the inode cache in Linux Kernel is limited; also if the IO is faster and kernel do background
dirty page flush quickly, pre_sync_fname() might just waste cpu cycles.
A better solution might be launch a separate pthread and do fsync one by one
when pg_rewind finishes handling one file. pg_basebackup could use the solution also.
Anyway this is independent of this patch.
Second one is use copy_file_range() for the local rewind case to replace read()+write().
This introduces copy_file_range() check and HAVE_COPY_FILE_RANGE so other
code could use copy_file_range() if needed. copy_file_range() was introduced
In high-version Linux Kernel, in low-version Linux or other Unix-like OS mmap()
might be better than read()+write() but copy_file_range() is more interesting
given that it could skip the data copying in some file systems - this could benefit more
on Linux fs on network-based block storage.
Regards,
Paul