Thread: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)

Hi all,

Beginning a new thread because the ext4 issues are closed, and because
pg_basebackup data durability meritates a new thread. And in short
about the problem: pg_basebackup makes no effort in being sure that
the data it backs up is on disk, which is bad... One possible
recommendation is to use initdb -S after running pg_basebackup, but
making sure that data is on disk should be done before pg_basebackup
ends.

On Thu, May 12, 2016 at 8:09 PM, I wrote:
> And actually this won't fly high if there is no equivalent of
> walkdir() or if the fsync()'s are not applied recursively. On master
> at least the refactoring had better be done cleanly first... For the
> back branches, we could just have some recursive call like
> fsync_recursively and keep that in src/bin/pg_basebackup. Andres, do
> you think that this should be part of fe_utils or src/common/? I'd
> tend to think the latter is more adapted as there is an equivalent in
> the backend. On back-branches, we could just have something like
> fsync_recursively that walks though the paths. An even more simple
> approach would be to fsync() individually things that have been
> written, but that would suck in performance.

So, attached are two patches that apply on HEAD to address the problem
of pg_basebackup that does not sync the data it writes. As
pg_basebackup cannot use directly initdb -S because, as a client-side
utility, it may be installed while initdb is not (see Fedora and
RHEL), I have refactored the code so as the routines in initdb.c doing
the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and
this is 0001.

Patch 0002 is a set of fixes for pg_basebackup:
- In plain mode, fsync_pgdata is used so as all the tablespaces are
fsync'd at once. This takes care as well of the case where pg_xlog is
a symlink.
- In tar mode (no stdout), each tar file is synced individually, and
the base directory is synced once at the end.
In both cases, failures are not considered fatal.

With pg_basebackup -X and pg_receivexlog, the manipulation of WAL
files is made durable by using fsync and durable_rename where needed
(credits to Andres mainly for this part).

This set of patches is aimed only at HEAD. Back-patchable versions of
this patch would need to copy fsync_pgdata and friends into
streamutil.c for example.

I am adding that to the next CF for review as a bug fix.
Regards,
--
Michael

Attachment
On 13.05.2016 9:39, Michael Paquier wrote:
> Hi all,
>
> Beginning a new thread because the ext4 issues are closed, and because
> pg_basebackup data durability meritates a new thread. And in short
> about the problem: pg_basebackup makes no effort in being sure that
> the data it backs up is on disk, which is bad... One possible
> recommendation is to use initdb -S after running pg_basebackup, but
> making sure that data is on disk should be done before pg_basebackup
> ends.
>
> On Thu, May 12, 2016 at 8:09 PM, I wrote:
>> And actually this won't fly high if there is no equivalent of
>> walkdir() or if the fsync()'s are not applied recursively. On master
>> at least the refactoring had better be done cleanly first... For the
>> back branches, we could just have some recursive call like
>> fsync_recursively and keep that in src/bin/pg_basebackup. Andres, do
>> you think that this should be part of fe_utils or src/common/? I'd
>> tend to think the latter is more adapted as there is an equivalent in
>> the backend. On back-branches, we could just have something like
>> fsync_recursively that walks though the paths. An even more simple
>> approach would be to fsync() individually things that have been
>> written, but that would suck in performance.
>
> So, attached are two patches that apply on HEAD to address the problem
> of pg_basebackup that does not sync the data it writes. As
> pg_basebackup cannot use directly initdb -S because, as a client-side
> utility, it may be installed while initdb is not (see Fedora and
> RHEL), I have refactored the code so as the routines in initdb.c doing
> the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and
> this is 0001.
>
> Patch 0002 is a set of fixes for pg_basebackup:
> - In plain mode, fsync_pgdata is used so as all the tablespaces are
> fsync'd at once. This takes care as well of the case where pg_xlog is
> a symlink.
> - In tar mode (no stdout), each tar file is synced individually, and
> the base directory is synced once at the end.
> In both cases, failures are not considered fatal.
>
> With pg_basebackup -X and pg_receivexlog, the manipulation of WAL
> files is made durable by using fsync and durable_rename where needed
> (credits to Andres mainly for this part).
>
> This set of patches is aimed only at HEAD. Back-patchable versions of
> this patch would need to copy fsync_pgdata and friends into
> streamutil.c for example.
>
> I am adding that to the next CF for review as a bug fix.
> Regards,
>
>
>
>
Hi!
Do we have any confidence that data file is not being corrupted? I.e 
contains some corrupted page? Can pg_basebackup check page checksum (db 
init with initdb -k) while backing up files?

Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



On Fri, May 13, 2016 at 11:49 PM, Alex Ignatov <a.ignatov@postgrespro.ru> wrote:
>
> On 13.05.2016 9:39, Michael Paquier wrote:
> Do we have any confidence that data file is not being corrupted? I.e
> contains some corrupted page? Can pg_basebackup check page checksum (db init
> with initdb -k) while backing up files?

That may be an idea to consider, for one class of users, though this
patch is not aimed at doing something regarding that.
-- 
Michael



On 5/13/16 2:39 AM, Michael Paquier wrote:
> So, attached are two patches that apply on HEAD to address the problem
> of pg_basebackup that does not sync the data it writes. As
> pg_basebackup cannot use directly initdb -S because, as a client-side
> utility, it may be installed while initdb is not (see Fedora and
> RHEL), I have refactored the code so as the routines in initdb.c doing
> the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and
> this is 0001.

Why fe_utils?  initdb is not a front-end program.

> Patch 0002 is a set of fixes for pg_basebackup:
> - In plain mode, fsync_pgdata is used so as all the tablespaces are
> fsync'd at once. This takes care as well of the case where pg_xlog is
> a symlink.
> - In tar mode (no stdout), each tar file is synced individually, and
> the base directory is synced once at the end.
> In both cases, failures are not considered fatal.

Maybe there should be --nosync options like initdb has?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 5/13/16 2:39 AM, Michael Paquier wrote:
>> So, attached are two patches that apply on HEAD to address the problem
>> of pg_basebackup that does not sync the data it writes. As
>> pg_basebackup cannot use directly initdb -S because, as a client-side
>> utility, it may be installed while initdb is not (see Fedora and
>> RHEL), I have refactored the code so as the routines in initdb.c doing
>> the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and
>> this is 0001.
>
> Why fe_utils?  initdb is not a front-end program.

Thinking about that, you are right. Let's move it to src/common,
frontend-only though.

>> Patch 0002 is a set of fixes for pg_basebackup:
>> - In plain mode, fsync_pgdata is used so as all the tablespaces are
>> fsync'd at once. This takes care as well of the case where pg_xlog is
>> a symlink.
>> - In tar mode (no stdout), each tar file is synced individually, and
>> the base directory is synced once at the end.
>> In both cases, failures are not considered fatal.
>
> Maybe there should be --nosync options like initdb has?

What do others think about that? I could implement that on top of 0002
with some extra options. But to be honest that looks to be just some
extra sugar for what is basically a bug fix... And I am feeling that
providing such a switch to users would be a way for one to shoot
himself badly, particularly for pg_receivexlog where a crash can cause
segments to go missing.
--
Michael

Attachment

On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 5/13/16 2:39 AM, Michael Paquier wrote:
>> So, attached are two patches that apply on HEAD to address the problem
>> of pg_basebackup that does not sync the data it writes. As
>> pg_basebackup cannot use directly initdb -S because, as a client-side
>> utility, it may be installed while initdb is not (see Fedora and
>> RHEL), I have refactored the code so as the routines in initdb.c doing
>> the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and
>> this is 0001.
>
> Why fe_utils?  initdb is not a front-end program.

Thinking about that, you are right. Let's move it to src/common,
frontend-only though.

>> Patch 0002 is a set of fixes for pg_basebackup:
>> - In plain mode, fsync_pgdata is used so as all the tablespaces are
>> fsync'd at once. This takes care as well of the case where pg_xlog is
>> a symlink.
>> - In tar mode (no stdout), each tar file is synced individually, and
>> the base directory is synced once at the end.
>> In both cases, failures are not considered fatal.
>
> Maybe there should be --nosync options like initdb has?

What do others think about that? I could implement that on top of 0002
with some extra options. But to be honest that looks to be just some
extra sugar for what is basically a bug fix... And I am feeling that
providing such a switch to users would be a way for one to shoot
himself badly, particularly for pg_receivexlog where a crash can cause
segments to go missing.


Well, why do we provide a --nosync option for initdb? Wouldn't the argument basically be the same? 

I agree it kind of feels like overkill, but it would be consistent overkill? :)

--
On Sat, Sep 3, 2016 at 12:42 AM, Magnus Hagander <magnus@hagander.net> wrote:
> On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
>> <peter.eisentraut@2ndquadrant.com> wrote:
>> > On 5/13/16 2:39 AM, Michael Paquier wrote:
>> What do others think about that? I could implement that on top of 0002
>> with some extra options. But to be honest that looks to be just some
>> extra sugar for what is basically a bug fix... And I am feeling that
>> providing such a switch to users would be a way for one to shoot
>> himself badly, particularly for pg_receivexlog where a crash can cause
>> segments to go missing.
>>
>
> Well, why do we provide a --nosync option for initdb? Wouldn't the argument
> basically be the same?

Yes, the good-for-testing-but-not-production argument.

> I agree it kind of feels like overkill, but it would be consistent overkill?
> :)

Oh, well. I have just implemented it on top of the two other patches
for pg_basebackup. For pg_receivexlog, I am wondering if it makes
sense to have it. That would be trivial to implement it, and I think
that we had better make the combination of --synchronous and --nosync
just leave with an error. Thoughts about having that for
pg_receivexlog?
-- 
Michael



On Sat, Sep 3, 2016 at 10:21 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Sep 3, 2016 at 12:42 AM, Magnus Hagander <magnus@hagander.net> wrote:
>> On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <michael.paquier@gmail.com>
>> wrote:
>>> On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
>>> <peter.eisentraut@2ndquadrant.com> wrote:
>>> > On 5/13/16 2:39 AM, Michael Paquier wrote:
>>> What do others think about that? I could implement that on top of 0002
>>> with some extra options. But to be honest that looks to be just some
>>> extra sugar for what is basically a bug fix... And I am feeling that
>>> providing such a switch to users would be a way for one to shoot
>>> himself badly, particularly for pg_receivexlog where a crash can cause
>>> segments to go missing.
>>>
>>
>> Well, why do we provide a --nosync option for initdb? Wouldn't the argument
>> basically be the same?
>
> Yes, the good-for-testing-but-not-production argument.
>
>> I agree it kind of feels like overkill, but it would be consistent overkill?
>> :)
>
> Oh, well. I have just implemented it on top of the two other patches
> for pg_basebackup. For pg_receivexlog, I am wondering if it makes
> sense to have it. That would be trivial to implement it, and I think
> that we had better make the combination of --synchronous and --nosync
> just leave with an error. Thoughts about having that for
> pg_receivexlog?

With patches that's actually better..
-- 
Michael



On Sat, Sep 3, 2016 at 10:22 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Sep 3, 2016 at 10:21 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Oh, well. I have just implemented it on top of the two other patches
>> for pg_basebackup. For pg_receivexlog, I am wondering if it makes
>> sense to have it. That would be trivial to implement it, and I think
>> that we had better make the combination of --synchronous and --nosync
>> just leave with an error. Thoughts about having that for
>> pg_receivexlog?
>
> With patches that's actually better..

Meh, meh et meh.
--
Michael

Attachment
On Sat, Sep 3, 2016 at 3:21 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Sat, Sep 3, 2016 at 12:42 AM, Magnus Hagander <magnus@hagander.net> wrote:
> On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
>> <peter.eisentraut@2ndquadrant.com> wrote:
>> > On 5/13/16 2:39 AM, Michael Paquier wrote:
>> What do others think about that? I could implement that on top of 0002
>> with some extra options. But to be honest that looks to be just some
>> extra sugar for what is basically a bug fix... And I am feeling that
>> providing such a switch to users would be a way for one to shoot
>> himself badly, particularly for pg_receivexlog where a crash can cause
>> segments to go missing.
>>
>
> Well, why do we provide a --nosync option for initdb? Wouldn't the argument
> basically be the same?

Yes, the good-for-testing-but-not-production argument.

> I agree it kind of feels like overkill, but it would be consistent overkill?
> :)

Oh, well. I have just implemented it on top of the two other patches
for pg_basebackup. For pg_receivexlog, I am wondering if it makes
sense to have it. That would be trivial to implement it, and I think
that we had better make the combination of --synchronous and --nosync
just leave with an error. Thoughts about having that for
pg_receivexlog?

Yes, we should definitely not allow that combination. In fact, maybe that argument in itself is enough not to have it for pg_receivexlog -- the cause of confusion.

I can see how you might want to avoid it for pg_basebackup during testing for example,. but the overhead on pg_receivexlog shouldn't be as bad in testing, should it? 


--
On Sat, Sep 3, 2016 at 10:26 PM, Magnus Hagander <magnus@hagander.net> wrote:
> Yes, we should definitely not allow that combination. In fact, maybe that
> argument in itself is enough not to have it for pg_receivexlog -- the cause
> of confusion.
>
> I can see how you might want to avoid it for pg_basebackup during testing
> for example,. but the overhead on pg_receivexlog shouldn't be as bad in
> testing, should it?

No, I haven't tested directly but it should not be.. pg_basebackup
does always a bulk write, while pg_receivexlog depends on the server
activity.
-- 
Michael



On 9/3/16 9:23 AM, Michael Paquier wrote:
> On Sat, Sep 3, 2016 at 10:22 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Sat, Sep 3, 2016 at 10:21 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Oh, well. I have just implemented it on top of the two other patches
>>> for pg_basebackup. For pg_receivexlog, I am wondering if it makes
>>> sense to have it. That would be trivial to implement it, and I think
>>> that we had better make the combination of --synchronous and --nosync
>>> just leave with an error. Thoughts about having that for
>>> pg_receivexlog?
>>
>> With patches that's actually better..
> 
> Meh, meh et meh.

In fsync_pgdata(), you have removed the progname from one error
message, even though it is passed into the function.  Also, I think
fsync_pgdata() should not be printing initdb's progress messages.
That should stay in initdb.  Worse, in 0002 you are then changing the
output, presumably to suit pg_basebackup or something else, which
messes up the initdb output.

durable_rename() does not fsync an existing newfile before renaming,
in contrast to the backend code in fd.c.

I'm slightly concerned about lots of code duplication in
durable_rename, fsync_parent_path between backend and standalone code.
Also, I'm equally slightly concerned that having duplicate function
names will mess up tag search for someone.

This is a preexisting mistake, but fsync_fname_ext() should just be
fsync_fname() to match the backend naming.  In the backend,
fsync_fname_ext() "extends" fsync_fname() by accepting an ignore_perm
argument, but the initdb code doesn't do that.

I don't think tar file output in pg_basebackup needs to be fsynced.
It's just a backup file like what pg_dump produces, and we don't fsync
that either.  The point of this change is to leave a data directory in
a synced state equivalent to what initdb leaves behind.

Some of the changes in receivelog.c seem excessive.  Replacing a plain
fsync() with fsync_fname_ext() plus fsync_parent_path() isn't
justified by the references you have provided.  This would need more
explanation.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



<p dir="ltr"><p dir="ltr">On 3 Sep. 2016 9:22 pm, "Michael Paquier" <<a
href="mailto:michael.paquier@gmail.com">michael.paquier@gmail.com</a>>wrote:<br /> ><br /> > On Sat, Sep 3,
2016at 12:42 AM, Magnus Hagander <<a href="mailto:magnus@hagander.net">magnus@hagander.net</a>> wrote:<br /> >
>On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <<a
href="mailto:michael.paquier@gmail.com">michael.paquier@gmail.com</a>><br/> > > wrote:<br /> > >> On
Fri,Sep 2, 2016 at 2:20 AM, Peter Eisentraut<br /> > >> <<a
href="mailto:peter.eisentraut@2ndquadrant.com">peter.eisentraut@2ndquadrant.com</a>>wrote:<br /> > >> >
On5/13/16 2:39 AM, Michael Paquier wrote:<br /> > >> What do others think about that? I could implement that
ontop of 0002<br /> > >> with some extra options. But to be honest that looks to be just some<br /> >
>>extra sugar for what is basically a bug fix... And I am feeling that<br /> > >> providing such a
switchto users would be a way for one to shoot<br /> > >> himself badly, particularly for pg_receivexlog where
acrash can cause<br /> > >> segments to go missing.<br /> > >><br /> > ><br /> > > Well,
whydo we provide a --nosync option for initdb? Wouldn't the argument<br /> > > basically be the same?<br />
><br/> > Yes, the good-for-testing-but-not-production argument.<p dir="ltr">We need it for tap tests. More and
morewill use pg_basebackup and it'll start hurting test speeds badly.<br /> 
On Sat, Sep 10, 2016 at 7:36 PM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:
> We need it for tap tests. More and more will use pg_basebackup and it'll
> start hurting test speeds badly.

Ah yes, that's a good argument. hamster would suffer pretty badly
after that if nothing is done. I'll get an extra patch out for that,
with --no-sync and not --nosync by the way.
-- 
Michael



On 2016-09-13 10:35:38 +0900, Michael Paquier wrote:
> On Sat, Sep 10, 2016 at 7:36 PM, Craig Ringer
> <craig.ringer@2ndquadrant.com> wrote:
> > We need it for tap tests. More and more will use pg_basebackup and it'll
> > start hurting test speeds badly.
> 
> Ah yes, that's a good argument. hamster would suffer pretty badly
> after that if nothing is done. I'll get an extra patch out for that,
> with --no-sync and not --nosync by the way.

FWIW, it might be better to instead use eatmydata in the cron
invocations on such slow machines, that way we also test the fsync paths
in them.



On Tue, Sep 13, 2016 at 10:37 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-09-13 10:35:38 +0900, Michael Paquier wrote:
>> On Sat, Sep 10, 2016 at 7:36 PM, Craig Ringer
>> <craig.ringer@2ndquadrant.com> wrote:
>> > We need it for tap tests. More and more will use pg_basebackup and it'll
>> > start hurting test speeds badly.
>>
>> Ah yes, that's a good argument. hamster would suffer pretty badly
>> after that if nothing is done. I'll get an extra patch out for that,
>> with --no-sync and not --nosync by the way.
>
> FWIW, it might be better to instead use eatmydata in the cron
> invocations on such slow machines, that way we also test the fsync paths
> in them.

Er, well.. libeatmydata is only available in AUR for Archlinux x86_64,
and not in sight for Arch ARM...
-- 
Michael



On Sat, Sep 10, 2016 at 6:27 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> In fsync_pgdata(), you have removed the progname from one error
> message, even though it is passed into the function.

Right. Good catch.

> Also, I think
> fsync_pgdata() should not be printing initdb's progress messages.
> That should stay in initdb.

That's a reference to that.
+   fputs(_("syncing data to disk ... "), stdout);
+   fflush(stdout);
Oops, fixed. I missed that now that I look at it. Perhaps I was
thinking something else when hacking that lastly, like getting this
message out for pg_basebackup as well.

> Worse, in 0002 you are then changing the
> output, presumably to suit pg_basebackup or something else, which
> messes up the initdb output.

Yes, fixed.

> durable_rename() does not fsync an existing newfile before renaming,
> in contrast to the backend code in fd.c.

That's in 0002. In the case of initdb it did not really matter, but
that matters for pg_receivexlog, so let's do it unconditionally. I
reworked the code to check if the existing new file is here, and
fsync() if it exists.

> I'm slightly concerned about lots of code duplication in
> durable_rename, fsync_parent_path between backend and standalone code.

Based on the ground of code readability, I am not sure that it would
be worth sharing the code of durable_rename or even fsync_parent_path
between the backend and the frontend, particularly because they are
actually not really duplicated... One good step in favor of that would
be to get a elog()/ereport() layer in src/common as well, but that's
really something that this set of patches should tackle, no?

> Also, I'm equally slightly concerned that having duplicate function
> names will mess up tag search for someone.

There are similar cases, take palloc().. Now perhaps some of those
functions could be renamed pg_fsync_... Thoughts?

> This is a preexisting mistake, but fsync_fname_ext() should just be
> fsync_fname() to match the backend naming.  In the backend,
> fsync_fname_ext() "extends" fsync_fname() by accepting an ignore_perm
> argument, but the initdb code doesn't do that.

OK.

> I don't think tar file output in pg_basebackup needs to be fsynced.
> It's just a backup file like what pg_dump produces, and we don't fsync
> that either.  The point of this change is to leave a data directory in
> a synced state equivalent to what initdb leaves behind.

Here I don't agree. The point of the patch is to make sure that what
gets generated by pg_basebackup is consistent on disk, for all the
formats. It seems weird to me that we could say that the plain format
makes things consistent while the tar format may cause some files to
be lost in case of power outage. The docs make it clear I think:
+        By default, <command>pg_basebackup</command> will wait for all files
+        to be written safely to disk.
But perhaps this should directly mention that this is done for all the formats?

> Some of the changes in receivelog.c seem excessive.  Replacing a plain
> fsync() with fsync_fname_ext() plus fsync_parent_path() isn't
> justified by the references you have provided.  This would need more
> explanation.

In mark_file_as_archived(), we had better do it or in case of a crash
the .done file would just be gone. open_walfile() as well need that,
to ensure that the segment is created and zeroed, and in the case
where it was padded (is this one overthinking it?).

Patch 0003 had conflicts caused by 9083353.
--
Michael

Attachment
On 9/12/16 11:16 PM, Michael Paquier wrote:
>> I don't think tar file output in pg_basebackup needs to be fsynced.
>> > It's just a backup file like what pg_dump produces, and we don't fsync
>> > that either.  The point of this change is to leave a data directory in
>> > a synced state equivalent to what initdb leaves behind.
> Here I don't agree. The point of the patch is to make sure that what
> gets generated by pg_basebackup is consistent on disk, for all the
> formats. It seems weird to me that we could say that the plain format
> makes things consistent while the tar format may cause some files to
> be lost in case of power outage. The docs make it clear I think:
> +        By default, <command>pg_basebackup</command> will wait for all files
> +        to be written safely to disk.
> But perhaps this should directly mention that this is done for all the formats?

That doesn't really explain why we fsync.  If we think that all
"important" files should be fsynced, why aren't we doing it in pg_dump?
Or psql, or server-side copy?  Similarly, there is no straightforward
mechanism by which you can unpack the tar file generated by
pg_basebackup and get the unpacked directory fsynced properly.  (I
suppose initdb -S should be recommended.)

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Thu, Sep 15, 2016 at 9:44 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 9/12/16 11:16 PM, Michael Paquier wrote:
>>> I don't think tar file output in pg_basebackup needs to be fsynced.
>>> > It's just a backup file like what pg_dump produces, and we don't fsync
>>> > that either.  The point of this change is to leave a data directory in
>>> > a synced state equivalent to what initdb leaves behind.
>> Here I don't agree. The point of the patch is to make sure that what
>> gets generated by pg_basebackup is consistent on disk, for all the
>> formats. It seems weird to me that we could say that the plain format
>> makes things consistent while the tar format may cause some files to
>> be lost in case of power outage. The docs make it clear I think:
>> +        By default, <command>pg_basebackup</command> will wait for all files
>> +        to be written safely to disk.
>> But perhaps this should directly mention that this is done for all the formats?
>
> That doesn't really explain why we fsync.

Data durability, particularly on ext4 as it has been discussed a
couple of months back [1]. In case if a crash it could be perfectly
possible to lose files, hence we need to be sure that the files
themselves are flushed, as well as their parent directory to prevent
any problems. I think that we should protect users' backups as much as
we can, in the range we can.

> If we think that all
> "important" files should be fsynced, why aren't we doing it in pg_dump?

pg_dump should do it where it can, see thread [2]. I am tackling
problems once at a time, and that's as well a reason why I would like
us to have a common set of routines in src/common or src/fe_utils to
help improve this handling.

> Or psql, or server-side copy?  Similarly, there is no straightforward
> mechanism by which you can unpack the tar file generated by
> pg_basebackup and get the unpacked directory fsynced properly.  (I
> suppose initdb -S should be recommended.)

Yes, those are cases that you cannot cover. Imagine for example
pg_basebackup's tar or pg_dump data sent to stdout. There is nothing
we can actually do for all those cases. However what we can do it
giving a set of options making possible to get consistent backups for
users.

[1] Silent data loss on ext4:
https://www.postgresql.org/message-id/56583BDD.9060302@2ndquadrant.com

[2] Data durability:
https://www.postgresql.org/message-id/20160327233033.GD20662@awork2.anarazel.de
-- 
Michael



This is looking pretty good.  More comments on this patch set:

0001:

Keep the file order alphabetical in Mkvcbuild.pm.

Needs nls.mk updates for file move (in initdb and pg_basebackup
directories).

0002:

durable_rename needs close(fd) in non-error path (compare backend code).

Changing from fsync() to fsync_name() in some cases means that
inaccessible files are now ignored.  I guess this would only happen in
very obscure circumstances, but it's worth considering if that is OK.

You added this comment:
    * XXX: This means that we might not restart if a crash occurs
before the    * fsync below. We probably should create the file in a temporary path    * like the backend does...

pg_receivexlog uses the .partial suffix for this reason.  Why doesn't
pg_basebackup do that?

In open_walfile, could we move the fsync calls before the fstat or
somewhere around there so we don't have to make the same calls in two
different branches?

0003:

There was a discussion about renaming the --nosync option in initdb to
--no-sync, but until that is done, the option in pg_basebackup should
stay what initdb has right now.

There is a whitespace alignment error in usage().

The -N option should be listed after -n.

Some fsync calls are not covered by a do_sync conditional.  I see some
in close_walfile(), HandleCopyStream(), ProcessKeepaliveMsg().

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Fri, Sep 23, 2016 at 10:54 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> This is looking pretty good.  More comments on this patch set:

Thanks for the review.

> 0001:
>
> Keep the file order alphabetical in Mkvcbuild.pm.
>
> Needs nls.mk updates for file move (in initdb and pg_basebackup
> directories).

Fixed.

> 0002:
>
> durable_rename needs close(fd) in non-error path (compare backend code).

Oops, leak.

> Changing from fsync() to fsync_name() in some cases means that
> inaccessible files are now ignored.  I guess this would only happen in
> very obscure circumstances, but it's worth considering if that is OK.

In writeTimeLineHistoryFile() that's fine, the user should have
ownership of it as it has been created by receivelog.c. Similar remark
for .
In mark_file_as_archived, the previous open() call would have just failed.

> You added this comment:
>      * XXX: This means that we might not restart if a crash occurs
> before the
>      * fsync below. We probably should create the file in a temporary path
>      * like the backend does...
> pg_receivexlog uses the .partial suffix for this reason.  Why doesn't
> pg_basebackup do that?

Because it matters for pg_receivexlog as it has a retry logic and it
is able to rework on a partial segment. Thinking more about that this
comment does not make much sense, because for pg_basebackup a crash
would just cause the backup to be incomplete, so I have removed it.

> In open_walfile, could we move the fsync calls before the fstat or
> somewhere around there so we don't have to make the same calls in two
> different branches?

We could refactor a bit the code so as follows:
if (size == 16MB)
{
    walfile = f;
}
else if (size == 0)
{
    //do padding and lseek
}
else
    error();
fsync();
I am not sure if this gains in clarity though, perhaps I looked too
much the current code.

> 0003:
>
> There was a discussion about renaming the --nosync option in initdb to
> --no-sync, but until that is done, the option in pg_basebackup should
> stay what initdb has right now.

Changed.

> There is a whitespace alignment error in usage().

Not sure I follow here.

> The -N option should be listed after -n.

In the switch()? Fixed.

> Some fsync calls are not covered by a do_sync conditional.  I see some
> in close_walfile(), HandleCopyStream(), ProcessKeepaliveMsg().

Right. I looked at the rest and did not notice any extra mistakes.
--
Michael

Attachment
On 9/23/16 11:15 AM, Michael Paquier wrote:
>> 0002:
>> >
>> > durable_rename needs close(fd) in non-error path (compare backend code).
> Oops, leak.

Why did you remove the errno save and restore?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Tue, Sep 27, 2016 at 11:16 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 9/23/16 11:15 AM, Michael Paquier wrote:
>>> 0002:
>>> >
>>> > durable_rename needs close(fd) in non-error path (compare backend code).
>> Oops, leak.
>
> Why did you remove the errno save and restore?

That's this bit in durable_rename, patch 0002:
+        if (fsync(fd) != 0)
+        {
+            fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"),
+                    progname, newfile, strerror(errno));
+            close(fd);
+            return -1;
+        }
+        close(fd);
I thought that as long as the error string is shown to the user, it
does not matter much if errno is still saved or not. All the callers
of durable_rename() on frontends don't check for strerrno(errno)
afterwards. Do you think it matters? Changing that back is trivial.
Sorry for not mentioning directly in the thread that I modified that
when dropping the last patch set.
-- 
Michael



On 9/26/16 10:34 PM, Michael Paquier wrote:
> I thought that as long as the error string is shown to the user, it
> does not matter much if errno is still saved or not. All the callers
> of durable_rename() on frontends don't check for strerrno(errno)
> afterwards. Do you think it matters? Changing that back is trivial.
> Sorry for not mentioning directly in the thread that I modified that
> when dropping the last patch set.

Actually, I think the equivalent backend code only does this to save the
errno across the close call because the elog call comes after the close.So it's OK to not do that in the frontend.

With that in mind, I have committed the v3 series now.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Thu, Sep 29, 2016 at 8:33 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 9/26/16 10:34 PM, Michael Paquier wrote:
> I thought that as long as the error string is shown to the user, it
> does not matter much if errno is still saved or not. All the callers
> of durable_rename() on frontends don't check for strerrno(errno)
> afterwards. Do you think it matters? Changing that back is trivial.
> Sorry for not mentioning directly in the thread that I modified that
> when dropping the last patch set.

Actually, I think the equivalent backend code only does this to save the
errno across the close call because the elog call comes after the close.
 So it's OK to not do that in the frontend.

With that in mind, I have committed the v3 series now.

I'm getting compiler warnings:

file_utils.c: In function 'fsync_pgdata':
file_utils.c:86: warning: passing argument 2 of 'walkdir' from incompatible pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool,  const char *)' but argument is of type 'void (*)(const char *, bool,  const char *)'
file_utils.c:88: warning: passing argument 2 of 'walkdir' from incompatible pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool,  const char *)' but argument is of type 'void (*)(const char *, bool,  const char *)'
file_utils.c:89: warning: passing argument 2 of 'walkdir' from incompatible pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool,  const char *)' but argument is of type 'void (*)(const char *, bool,  const char *)'

Cheers,

Jeff

Re: pg_basebackup, pg_receivexlog and data durability

From
Peter Eisentraut
Date:
On 9/29/16 12:31 PM, Jeff Janes wrote:
>     With that in mind, I have committed the v3 series now.
> 
> 
> I'm getting compiler warnings:

Fixed.

> 
> file_utils.c: In function 'fsync_pgdata':
> file_utils.c:86: warning: passing argument 2 of 'walkdir' from
> incompatible pointer type
> file_utils.c:36: note: expected 'int (*)(const char *, bool,  const char
> *)' but argument is of type 'void (*)(const char *, bool,  const char *)'
> file_utils.c:88: warning: passing argument 2 of 'walkdir' from
> incompatible pointer type
> file_utils.c:36: note: expected 'int (*)(const char *, bool,  const char
> *)' but argument is of type 'void (*)(const char *, bool,  const char *)'
> file_utils.c:89: warning: passing argument 2 of 'walkdir' from
> incompatible pointer type
> file_utils.c:36: note: expected 'int (*)(const char *, bool,  const char
> *)' but argument is of type 'void (*)(const char *, bool,  const char *)'

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Fri, Sep 30, 2016 at 1:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Sep 29, 2016 at 8:33 AM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>>
>> On 9/26/16 10:34 PM, Michael Paquier wrote:
>> > I thought that as long as the error string is shown to the user, it
>> > does not matter much if errno is still saved or not. All the callers
>> > of durable_rename() on frontends don't check for strerrno(errno)
>> > afterwards. Do you think it matters? Changing that back is trivial.
>> > Sorry for not mentioning directly in the thread that I modified that
>> > when dropping the last patch set.
>>
>> Actually, I think the equivalent backend code only does this to save the
>> errno across the close call because the elog call comes after the close.
>>  So it's OK to not do that in the frontend.
>>
>> With that in mind, I have committed the v3 series now.

Thanks!

> I'm getting compiler warnings:
>
> file_utils.c: In function 'fsync_pgdata':
> file_utils.c:86: warning: passing argument 2 of 'walkdir' from incompatible
> pointer type
> file_utils.c:36: note: expected 'int (*)(const char *, bool,  const char *)'
> but argument is of type 'void (*)(const char *, bool,  const char *)'
> file_utils.c:88: warning: passing argument 2 of 'walkdir' from incompatible
> pointer type
> file_utils.c:36: note: expected 'int (*)(const char *, bool,  const char *)'
> but argument is of type 'void (*)(const char *, bool,  const char *)'
> file_utils.c:89: warning: passing argument 2 of 'walkdir' from incompatible
> pointer type
> file_utils.c:36: note: expected 'int (*)(const char *, bool,  const char *)'
> but argument is of type 'void (*)(const char *, bool,  const char *)'

Oops. Are you using gcc to see that? I compiled with clang and on
Windows without noticing it.
-- 
Michael



Michael Paquier <michael.paquier@gmail.com> writes:
> Oops. Are you using gcc to see that? I compiled with clang and on
> Windows without noticing it.

Peter already noted that you'd only see it on platforms that define
PG_FLUSH_DATA_WORKS.
        regards, tom lane