Two patches to speed up pg_rewind. - Mailing list pgsql-hackers

From Paul Guo
Subject Two patches to speed up pg_rewind.
Msg-id 7C1703E7-F3F3-43FA-86EB-177C671BF33C@vmware.com
List pgsql-hackers
While reading the pg_rewind code I found two things that could speed up pg_rewind.
Patches for both are attached.

First one: by default pg_rewind fsyncs the whole pgdata directory on the target,
but that is wasteful since usually only some of the files/directories on
the target are modified. The other files on the target should already have been flushed,
since pg_rewind requires a clean shutdown before doing the real work. This
helps when the target postgres instance contains millions of
files, which has been seen in a real environment.
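
To illustrate the idea, here is a minimal sketch (not the code from the attached patch) that fsyncs only a list of touched paths instead of walking the whole data directory. The helper names fsync_path() and fsync_modified() are hypothetical, and the sketch assumes the caller has recorded which files and directories were modified during the rewind.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: fsync a single file or directory.
 * Directories are opened read-only; regular files are opened read-write,
 * since some platforms refuse to fsync a read-only descriptor.
 * Returns 0 on success, -1 on failure.
 */
static int
fsync_path(const char *path)
{
    struct stat st;
    int         fd;
    int         rc;

    if (stat(path, &st) != 0)
        return -1;
    fd = open(path, S_ISDIR(st.st_mode) ? O_RDONLY : O_RDWR);
    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Hypothetical helper: flush only the paths that were modified.
 * Returns the number of paths that could not be synced.
 */
static int
fsync_modified(const char *const *paths, int npaths)
{
    int         failures = 0;

    for (int i = 0; i < npaths; i++)
    {
        if (fsync_path(paths[i]) != 0)
        {
            perror(paths[i]);
            failures++;
        }
    }
    return failures;
}
```

A real implementation would also need to sync the parent directory of any file that was created, removed, or renamed, so those directories would have to be on the list as well.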

There are several things that may need further discussion:

1. PG_FLUSH_DATA_WORKS was introduced as "Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data",
    but now the only code guarded by it is related to pre_sync_fname(), so we might want
    to rename it to something like HAVE_PRE_SYNC?

2. pre_sync_fname() implementation

    The code looks like this:

    #if defined(HAVE_SYNC_FILE_RANGE)
        (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
    #elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    #else
    #error PG_FLUSH_DATA_WORKS should not have been defined
    #endif

    I’m a bit suspicious about calling posix_fadvise() with POSIX_FADV_DONTNEED.
    I did not check the Linux kernel code, but according to the man
    page I suspect that this option might cause the kernel to evict the related
    pages from the page cache, which might not be what we expect. This is
    not a big issue since sync_file_range() should exist on most widely used Linux systems.

    Also, I’m not sure how much we benefit from the pre_sync code. Note that if the
    directory has a lot of files or the IO is fast, pre_sync_fname() might slow down
    the process instead. The reasons are: if there are a lot of files, we may need
    to re-read already-synced-and-evicted inodes from disk (by open()-ing them) after rewinding, since
    the inode cache in the Linux kernel is limited; and if the IO is fast and the kernel flushes
    dirty pages in the background quickly, pre_sync_fname() might just waste CPU cycles.

    A better solution might be to launch a separate thread that fsyncs each file
    as soon as pg_rewind finishes handling it. pg_basebackup could use the same solution.

    Anyway, this is independent of this patch.
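
    For what it's worth, the separate-thread idea could be sketched roughly as below. This is only an illustration under my own assumptions, not code from either patch: the SyncQueue type and the sync_queue_start()/sync_queue_add()/sync_queue_finish() functions are hypothetical names. The main thread queues each path once it is done with the file, and a single pthread worker fsyncs them one by one.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct SyncJob
{
    char       *path;
    struct SyncJob *next;
} SyncJob;

typedef struct SyncQueue
{
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    SyncJob    *head;
    SyncJob    *tail;
    int         done;           /* no more jobs will be queued */
    int         synced;         /* files fsynced successfully */
    int         failed;         /* files that could not be fsynced */
    pthread_t   worker;
} SyncQueue;

/* Worker: pop paths from the queue and fsync them until told to stop. */
static void *
sync_worker(void *arg)
{
    SyncQueue  *q = arg;

    for (;;)
    {
        pthread_mutex_lock(&q->lock);
        while (q->head == NULL && !q->done)
            pthread_cond_wait(&q->nonempty, &q->lock);
        SyncJob    *job = q->head;
        if (job == NULL)
        {
            /* queue drained and producer is finished */
            pthread_mutex_unlock(&q->lock);
            break;
        }
        q->head = job->next;
        if (q->head == NULL)
            q->tail = NULL;
        pthread_mutex_unlock(&q->lock);

        /* The slow part runs without holding the lock. */
        int         fd = open(job->path, O_RDWR);
        int         ok = (fd >= 0 && fsync(fd) == 0);
        if (fd >= 0)
            close(fd);

        pthread_mutex_lock(&q->lock);
        if (ok)
            q->synced++;
        else
            q->failed++;
        pthread_mutex_unlock(&q->lock);
        free(job->path);
        free(job);
    }
    return NULL;
}

/* Start the background worker; returns 0 on success. */
static int
sync_queue_start(SyncQueue *q)
{
    memset(q, 0, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
    return pthread_create(&q->worker, NULL, sync_worker, q);
}

/* Queue one finished file for fsync; returns 0 on success. */
static int
sync_queue_add(SyncQueue *q, const char *path)
{
    size_t      len = strlen(path) + 1;
    SyncJob    *job = malloc(sizeof(*job));

    if (job == NULL)
        return -1;
    job->path = malloc(len);
    if (job->path == NULL)
    {
        free(job);
        return -1;
    }
    memcpy(job->path, path, len);
    job->next = NULL;

    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL)
        q->tail->next = job;
    else
        q->head = job;
    q->tail = job;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Wait for all queued files to be synced; returns the failure count. */
static int
sync_queue_finish(SyncQueue *q)
{
    pthread_mutex_lock(&q->lock);
    q->done = 1;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
    pthread_join(q->worker, NULL);
    pthread_mutex_destroy(&q->lock);
    pthread_cond_destroy(&q->nonempty);
    return q->failed;
}
```

    The caller would do sync_queue_start() once, sync_queue_add() after each file is finished, and sync_queue_finish() before exiting. Whether this actually beats the pre-sync hint would need to be measured.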

The second one uses copy_file_range() instead of read()+write() for the local rewind case.
This introduces a copy_file_range() check and HAVE_COPY_FILE_RANGE, so other
code could use copy_file_range() if needed. copy_file_range() was introduced
in recent Linux kernels; on older Linux kernels or other Unix-like OSes, mmap()
might be better than read()+write(), but copy_file_range() is more interesting
given that it can skip the data copying on some file systems. This could help even more
for Linux file systems on network-based block storage.

Regards,
Paul
