pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions) - Mailing list pgsql-hackers

From Michael Paquier
Subject pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
Msg-id CAB7nPqQ_B0j3n1t=8c1ZLHXF1b8Tf4XsXoUC9bP9t5Hab--SMg@mail.gmail.com
List pgsql-hackers
Hi all,

I am starting a new thread because the ext4 issues are closed, and
because pg_basebackup data durability merits a thread of its own. The
problem in short: pg_basebackup makes no effort to ensure that the
data it backs up has reached disk, which is bad. One possible
recommendation is to run initdb -S after pg_basebackup, but making
sure that the data is on disk should be done before pg_basebackup
exits.

On Thu, May 12, 2016 at 8:09 PM, I wrote:
> And actually this won't fly high if there is no equivalent of
> walkdir() or if the fsync()'s are not applied recursively. On master
> at least the refactoring had better be done cleanly first... For the
> back branches, we could just have some recursive call like
> fsync_recursively and keep that in src/bin/pg_basebackup. Andres, do
> you think that this should be part of fe_utils or src/common/? I'd
> tend to think the latter is better suited as there is an equivalent in
> the backend. On back-branches, we could just have something like
> fsync_recursively that walks through the paths. An even simpler
> approach would be to fsync() each written file individually, but
> performance would suffer.
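
The recursive walk discussed in the quoted text could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code from the attached patches: the helper fsync_path and the body of fsync_recursively are hypothetical, symbolic links are skipped rather than followed, and failures are counted instead of aborting the walk.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: fsync one file or directory. Returns 0 or -1. */
static int
fsync_path(const char *path)
{
	int			fd = open(path, O_RDONLY);
	int			rc;

	if (fd < 0)
		return -1;
	rc = fsync(fd);
	close(fd);
	return rc;
}

/*
 * Hypothetical sketch: fsync everything below "path", children first and
 * the containing directory last. Returns the number of paths that could
 * not be synced. Symbolic links are not followed (an assumption made to
 * keep the sketch short).
 */
int
fsync_recursively(const char *path)
{
	struct stat st;
	int			failures = 0;

	assert(path != NULL);

	if (lstat(path, &st) != 0)
		return 1;
	if (!S_ISDIR(st.st_mode) && !S_ISREG(st.st_mode))
		return 0;				/* symlinks and special files are skipped */

	if (S_ISDIR(st.st_mode))
	{
		DIR		   *dir = opendir(path);
		struct dirent *de;

		if (dir == NULL)
			return 1;
		while ((de = readdir(dir)) != NULL)
		{
			char		child[4096];

			if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
				continue;
			if (snprintf(child, sizeof(child), "%s/%s", path, de->d_name)
				>= (int) sizeof(child))
			{
				failures++;		/* path too long for this sketch */
				continue;
			}
			failures += fsync_recursively(child);
		}
		closedir(dir);
	}

	/* Sync the file itself, or the directory once its entries are done. */
	if (fsync_path(path) != 0)
		failures++;
	return failures;
}
```

Calling fsync_recursively on the target directory after the backup has been written returns 0 when every entry was flushed. Some platforms refuse fsync() on a directory, so a real implementation would probably need to tolerate specific errors there.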

So, attached are two patches that apply on HEAD to address the problem
of pg_basebackup not syncing the data it writes. pg_basebackup cannot
simply call initdb -S because, as a client-side utility, it may be
installed where initdb is not (see Fedora and RHEL). So I have
refactored the code so that the routines in initdb.c that fsync PGDATA,
along with the related fsync helpers, now live in src/fe_utils/. This
is 0001.

Patch 0002 is a set of fixes for pg_basebackup:
- In plain mode, fsync_pgdata is used so that all the tablespaces are
fsync'd at once. This also takes care of the case where pg_xlog is
a symlink.
- In tar mode (when not writing to stdout), each tar file is synced
individually, and the base directory is synced once at the end.
In both cases, failures are not considered fatal.
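
To illustrate the tar-mode behavior described above, here is a minimal sketch in which each tar file is synced on its own, the base directory is synced once at the end, and any failure only produces a warning. The names sync_or_warn and sync_tar_backup, and the idea of passing the tar file names as an array, are assumptions made for the example and are not taken from the patch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync a path and warn on failure, never abort. */
static int
sync_or_warn(const char *path)
{
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
	{
		fprintf(stderr, "WARNING: could not open \"%s\": %s\n",
				path, strerror(errno));
		return -1;
	}
	if (fsync(fd) != 0)
	{
		fprintf(stderr, "WARNING: could not fsync \"%s\": %s\n",
				path, strerror(errno));
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

/*
 * Hypothetical sketch of the tar-mode case: sync every tar file, then the
 * directory holding them. Returns the number of warnings emitted; the
 * caller is expected to carry on regardless.
 */
int
sync_tar_backup(const char *basedir, const char *const *tarnames, int ntars)
{
	int			warnings = 0;

	assert(basedir != NULL && ntars >= 0);

	for (int i = 0; i < ntars; i++)
	{
		char		path[4096];

		snprintf(path, sizeof(path), "%s/%s", basedir, tarnames[i]);
		if (sync_or_warn(path) != 0)
			warnings++;
	}

	/* One directory sync at the end makes the new entries durable. */
	if (sync_or_warn(basedir) != 0)
		warnings++;
	return warnings;
}
```

Returning a warning count rather than exiting mirrors the statement that failures are not considered fatal; whether the real patch reports them the same way is not shown here.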

With pg_basebackup -X and pg_receivexlog, the manipulation of WAL
files is made durable by using fsync and durable_rename where needed
(credit mainly to Andres for this part).
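
As a rough illustration of what a durable rename involves, the sketch below flushes the file, renames it, flushes it again under its new name, and finally flushes the containing directory so that the new directory entry survives a crash. It is a simplified stand-in and not PostgreSQL's durable_rename; the function name durable_rename_sketch and its three-argument form (directory plus two file names) are assumptions made for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: fsync one file or directory. Returns 0 or -1. */
static int
fsync_path(const char *path)
{
	int			fd = open(path, O_RDONLY);
	int			rc;

	if (fd < 0)
		return -1;
	rc = fsync(fd);
	close(fd);
	return rc;
}

/*
 * Hypothetical sketch: rename dir/oldname to dir/newname so that both the
 * file contents and the new name are on disk when the function returns 0.
 * Returns -1 as soon as any step fails.
 */
int
durable_rename_sketch(const char *dir, const char *oldname, const char *newname)
{
	char		oldpath[4096];
	char		newpath[4096];

	assert(dir != NULL && oldname != NULL && newname != NULL);

	snprintf(oldpath, sizeof(oldpath), "%s/%s", dir, oldname);
	snprintf(newpath, sizeof(newpath), "%s/%s", dir, newname);

	/* 1. Flush the contents before the name changes. */
	if (fsync_path(oldpath) != 0)
		return -1;

	/* 2. rename() is atomic within one file system. */
	if (rename(oldpath, newpath) != 0)
		return -1;

	/* 3. Flush the file under its new name. */
	if (fsync_path(newpath) != 0)
		return -1;

	/* 4. Flush the directory so the new entry itself is durable. */
	if (fsync_path(dir) != 0)
		return -1;

	return 0;
}
```

Without the final directory sync, a crash shortly after rename() could leave the file under its old name, or under neither name, depending on the file system.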

This set of patches is aimed only at HEAD. Back-patchable versions of
this patch would need to copy fsync_pgdata and friends into
streamutil.c for example.

I am adding that to the next CF for review as a bug fix.
Regards,
--
Michael
