Re: Two patches to speed up pg_rewind. - Mailing list pgsql-hackers

From Paul Guo
Subject Re: Two patches to speed up pg_rewind.
Date
Msg-id CABQrize7pf4hbKpeCMT_gsh0Q9QROqvseoS7GqVfyhNqVis4nQ@mail.gmail.com
Whole thread Raw
In response to Re: Two patches to speed up pg_rewind.  (Michael Paquier <michael@paquier.xyz>)
Responses Re: Two patches to speed up pg_rewind.  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
Thanks for reviewing, please see the replies below.

On Tue, Aug 17, 2021 at 3:47 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Aug 05, 2021 at 06:18:03PM +0800, Paul Guo wrote:
> > I modified the copy_file_range() patch using the below logic:
> >
> > If the first call of copy_file_range() fails with errno EXDEV or
> > ENOTSUP, pg_rewind
> > would not use copy_file_range() in rest code, and if copy_file_range() fails we
> > fallback to use the previous read()+write() code logic for the file.
>
> I have looked at 0001, and I don't really like it.  One argument
> against this approach is that if pg_rewind fails in the middle of its
> operation then we would have done a set of fsync() for nothing, with
> the data folder still unusable.  I would be curious to see some
> numbers to see how much it matters with many physical files (say cases
> with thousands of small relations?).
> +/* Define PG_FLUSH_DATA_WORKS if we have an implementation for pg_flush_data */
> +#if defined(HAVE_SYNC_FILE_RANGE)
> +#define PG_FLUSH_DATA_WORKS 1
> +#elif !defined(WIN32) && defined(MS_ASYNC)
> +#define PG_FLUSH_DATA_WORKS 1
> +#elif defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
> +#define PG_FLUSH_DATA_WORKS 1
>
> This is wrong for the code frontend on platforms that may finish by
> using MS_ASYNC, no?  There is no such implementation in file_utils.c
> but there is one in fd.c.

Yes, it seems that we need to add the MS_ASYNC code (refer that in fd.c) in
src/common/file_utils.c:pre_sync_fname().

> +   fsync_fname("global/pg_control", false);
> +   fsync_fname("backup_label", false);
> +   if (access("recovery.conf", F_OK) == 0)
> +       fsync_fname("recovery.conf", false);
> +   if (access("postgresql.auto.conf", F_OK) == 0)
> +       fsync_fname("postgresql.auto.conf", false);
>
> This list is wrong on various aspects, no?  This would miss custom
> configuration files, or included files.

I did not understand this. Can you please clarify? Anyway let me
explain, here we fsync
these files additionally because pg_rewind (possibly) modified these
files after rewinding.
These files may not be handled/logged in filemap

pg_control action is FILE_ACTION_NONE
backup_label is excluded
recovery.conf is not logged in filemap
postgresql.auto.conf may be logged but let's fsync this file for safety.

>
> -   if (showprogress)
> -       pg_log_info("syncing target data directory");
> -   sync_target_dir();
> -
>     /* Also update the standby configuration, if requested. */
>     if (writerecoveryconf && !dry_run)
>         WriteRecoveryConfig(conn, datadir_target,
>                             GenerateRecoveryConfig(conn, NULL));
>
> +   if (showprogress)
> +       pg_log_info("syncing target data directory");
> +   perform_sync(filemap);
>
> Why inverting the order here?

We need to synchronize the recoveryconf change finally in perform_sync().

>
> +           * Pre Linux 5.3 does not allow cross-fs copy_file_range() call
> +           * (return EXDEV). Some fs do not support copy_file_range() (return
> +           * ENOTSUP). Here we explicitly disable copy_file_range() for the
> +           * two scenarios. For other failures we still allow subsequent
> +           * copy_file_range() try.
> +           */
> +           if (errno == ENOTSUP || errno == EXDEV)
> +               copy_file_range_support = false;
> Are you sure that it is right to always cancel the support of
> copy_file_range() after it does not work once?  Couldn't it be
> possible that things work properly depending on the tablespace being
> worked on by pg_rewind?

Ideally we should retry when first running into a symlink (e.g.
tablespace, wal),
but it seems not easy to do gracefully.

> Having the facility for copy_file_range() in pg_rewind is not nice at
> the end, and we are going to need a run-time check to fallback
> dynamically to an equivalent implementation on errno={EXDEV,ENOTSUP}.
> Hmm.  What about locating all that in file_utils.c instead, with a
> brand new routine name (pg_copy_file_range would be one choice)?  We
> still need the ./configure check, except that the conditions to use
> the fallback implementation is in this routine, aka fallback on EXDEV,
> ENOTSUP or !HAVE_COPY_FILE_RANGE.  The backend may need to solve this
> problem at some point, but logging and fd handling will likely just
> locate that in fd.c, so having one routine for the purpose of all
> frontends looks like a step in the right direction.

Yes, seems better to make it generic.



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Nitpick/question: Use of aliases for global variables in functions
Next
From: Andrey Borodin
Date:
Subject: Re: Subtransactions + a long-running transaction leading to performance degradation on standbys