Thread: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
Hi all, Beginning a new thread because the ext4 issues are closed, and because pg_basebackup data durability meritates a new thread. And in short about the problem: pg_basebackup makes no effort in being sure that the data it backs up is on disk, which is bad... One possible recommendation is to use initdb -S after running pg_basebackup, but making sure that data is on disk should be done before pg_basebackup ends. On Thu, May 12, 2016 at 8:09 PM, I wrote: > And actually this won't fly high if there is no equivalent of > walkdir() or if the fsync()'s are not applied recursively. On master > at least the refactoring had better be done cleanly first... For the > back branches, we could just have some recursive call like > fsync_recursively and keep that in src/bin/pg_basebackup. Andres, do > you think that this should be part of fe_utils or src/common/? I'd > tend to think the latter is more adapted as there is an equivalent in > the backend. On back-branches, we could just have something like > fsync_recursively that walks though the paths. An even more simple > approach would be to fsync() individually things that have been > written, but that would suck in performance. So, attached are two patches that apply on HEAD to address the problem of pg_basebackup that does not sync the data it writes. As pg_basebackup cannot use directly initdb -S because, as a client-side utility, it may be installed while initdb is not (see Fedora and RHEL), I have refactored the code so as the routines in initdb.c doing the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and this is 0001. Patch 0002 is a set of fixes for pg_basebackup: - In plain mode, fsync_pgdata is used so as all the tablespaces are fsync'd at once. This takes care as well of the case where pg_xlog is a symlink. - In tar mode (no stdout), each tar file is synced individually, and the base directory is synced once at the end. In both cases, failures are not considered fatal. With pg_basebackup -X and pg_receivexlog, the manipulation of WAL files is made durable by using fsync and durable_rename where needed (credits to Andres mainly for this part). This set of patches is aimed only at HEAD. Back-patchable versions of this patch would need to copy fsync_pgdata and friends into streamutil.c for example. I am adding that to the next CF for review as a bug fix. Regards, -- Michael
Attachment
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Alex Ignatov
Date:
On 13.05.2016 9:39, Michael Paquier wrote: > Hi all, > > Beginning a new thread because the ext4 issues are closed, and because > pg_basebackup data durability meritates a new thread. And in short > about the problem: pg_basebackup makes no effort in being sure that > the data it backs up is on disk, which is bad... One possible > recommendation is to use initdb -S after running pg_basebackup, but > making sure that data is on disk should be done before pg_basebackup > ends. > > On Thu, May 12, 2016 at 8:09 PM, I wrote: >> And actually this won't fly high if there is no equivalent of >> walkdir() or if the fsync()'s are not applied recursively. On master >> at least the refactoring had better be done cleanly first... For the >> back branches, we could just have some recursive call like >> fsync_recursively and keep that in src/bin/pg_basebackup. Andres, do >> you think that this should be part of fe_utils or src/common/? I'd >> tend to think the latter is more adapted as there is an equivalent in >> the backend. On back-branches, we could just have something like >> fsync_recursively that walks though the paths. An even more simple >> approach would be to fsync() individually things that have been >> written, but that would suck in performance. > > So, attached are two patches that apply on HEAD to address the problem > of pg_basebackup that does not sync the data it writes. As > pg_basebackup cannot use directly initdb -S because, as a client-side > utility, it may be installed while initdb is not (see Fedora and > RHEL), I have refactored the code so as the routines in initdb.c doing > the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and > this is 0001. > > Patch 0002 is a set of fixes for pg_basebackup: > - In plain mode, fsync_pgdata is used so as all the tablespaces are > fsync'd at once. This takes care as well of the case where pg_xlog is > a symlink. > - In tar mode (no stdout), each tar file is synced individually, and > the base directory is synced once at the end. > In both cases, failures are not considered fatal. > > With pg_basebackup -X and pg_receivexlog, the manipulation of WAL > files is made durable by using fsync and durable_rename where needed > (credits to Andres mainly for this part). > > This set of patches is aimed only at HEAD. Back-patchable versions of > this patch would need to copy fsync_pgdata and friends into > streamutil.c for example. > > I am adding that to the next CF for review as a bug fix. > Regards, > > > > Hi! Do we have any confidence that data file is not being corrupted? I.e contains some corrupted page? Can pg_basebackup check page checksum (db init with initdb -k) while backing up files? Alex Ignatov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Fri, May 13, 2016 at 11:49 PM, Alex Ignatov <a.ignatov@postgrespro.ru> wrote: > > On 13.05.2016 9:39, Michael Paquier wrote: > Do we have any confidence that data file is not being corrupted? I.e > contains some corrupted page? Can pg_basebackup check page checksum (db init > with initdb -k) while backing up files? That may be an idea to consider, for one class of users, though this patch is not aimed at doing something regarding that. -- Michael
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Peter Eisentraut
Date:
On 5/13/16 2:39 AM, Michael Paquier wrote: > So, attached are two patches that apply on HEAD to address the problem > of pg_basebackup that does not sync the data it writes. As > pg_basebackup cannot use directly initdb -S because, as a client-side > utility, it may be installed while initdb is not (see Fedora and > RHEL), I have refactored the code so as the routines in initdb.c doing > the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and > this is 0001. Why fe_utils? initdb is not a front-end program. > Patch 0002 is a set of fixes for pg_basebackup: > - In plain mode, fsync_pgdata is used so as all the tablespaces are > fsync'd at once. This takes care as well of the case where pg_xlog is > a symlink. > - In tar mode (no stdout), each tar file is synced individually, and > the base directory is synced once at the end. > In both cases, failures are not considered fatal. Maybe there should be --nosync options like initdb has? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 5/13/16 2:39 AM, Michael Paquier wrote: >> So, attached are two patches that apply on HEAD to address the problem >> of pg_basebackup that does not sync the data it writes. As >> pg_basebackup cannot use directly initdb -S because, as a client-side >> utility, it may be installed while initdb is not (see Fedora and >> RHEL), I have refactored the code so as the routines in initdb.c doing >> the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and >> this is 0001. > > Why fe_utils? initdb is not a front-end program. Thinking about that, you are right. Let's move it to src/common, frontend-only though. >> Patch 0002 is a set of fixes for pg_basebackup: >> - In plain mode, fsync_pgdata is used so as all the tablespaces are >> fsync'd at once. This takes care as well of the case where pg_xlog is >> a symlink. >> - In tar mode (no stdout), each tar file is synced individually, and >> the base directory is synced once at the end. >> In both cases, failures are not considered fatal. > > Maybe there should be --nosync options like initdb has? What do others think about that? I could implement that on top of 0002 with some extra options. But to be honest that looks to be just some extra sugar for what is basically a bug fix... And I am feeling that providing such a switch to users would be a way for one to shoot himself badly, particularly for pg_receivexlog where a crash can cause segments to go missing. -- Michael
Attachment
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Magnus Hagander
Date:
On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
-- On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 5/13/16 2:39 AM, Michael Paquier wrote:
>> So, attached are two patches that apply on HEAD to address the problem
>> of pg_basebackup that does not sync the data it writes. As
>> pg_basebackup cannot use directly initdb -S because, as a client-side
>> utility, it may be installed while initdb is not (see Fedora and
>> RHEL), I have refactored the code so as the routines in initdb.c doing
>> the fsync of PGDATA and other fsync stuff are in src/fe_utils/, and
>> this is 0001.
>
> Why fe_utils? initdb is not a front-end program.
Thinking about that, you are right. Let's move it to src/common,
frontend-only though.
>> Patch 0002 is a set of fixes for pg_basebackup:
>> - In plain mode, fsync_pgdata is used so as all the tablespaces are
>> fsync'd at once. This takes care as well of the case where pg_xlog is
>> a symlink.
>> - In tar mode (no stdout), each tar file is synced individually, and
>> the base directory is synced once at the end.
>> In both cases, failures are not considered fatal.
>
> Maybe there should be --nosync options like initdb has?
What do others think about that? I could implement that on top of 0002
with some extra options. But to be honest that looks to be just some
extra sugar for what is basically a bug fix... And I am feeling that
providing such a switch to users would be a way for one to shoot
himself badly, particularly for pg_receivexlog where a crash can cause
segments to go missing.
Well, why do we provide a --nosync option for initdb? Wouldn't the argument basically be the same?
I agree it kind of feels like overkill, but it would be consistent overkill? :)
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Sat, Sep 3, 2016 at 12:42 AM, Magnus Hagander <magnus@hagander.net> wrote: > On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <michael.paquier@gmail.com> > wrote: >> On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut >> <peter.eisentraut@2ndquadrant.com> wrote: >> > On 5/13/16 2:39 AM, Michael Paquier wrote: >> What do others think about that? I could implement that on top of 0002 >> with some extra options. But to be honest that looks to be just some >> extra sugar for what is basically a bug fix... And I am feeling that >> providing such a switch to users would be a way for one to shoot >> himself badly, particularly for pg_receivexlog where a crash can cause >> segments to go missing. >> > > Well, why do we provide a --nosync option for initdb? Wouldn't the argument > basically be the same? Yes, the good-for-testing-but-not-production argument. > I agree it kind of feels like overkill, but it would be consistent overkill? > :) Oh, well. I have just implemented it on top of the two other patches for pg_basebackup. For pg_receivexlog, I am wondering if it makes sense to have it. That would be trivial to implement it, and I think that we had better make the combination of --synchronous and --nosync just leave with an error. Thoughts about having that for pg_receivexlog? -- Michael
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Sat, Sep 3, 2016 at 10:21 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Sat, Sep 3, 2016 at 12:42 AM, Magnus Hagander <magnus@hagander.net> wrote: >> On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <michael.paquier@gmail.com> >> wrote: >>> On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut >>> <peter.eisentraut@2ndquadrant.com> wrote: >>> > On 5/13/16 2:39 AM, Michael Paquier wrote: >>> What do others think about that? I could implement that on top of 0002 >>> with some extra options. But to be honest that looks to be just some >>> extra sugar for what is basically a bug fix... And I am feeling that >>> providing such a switch to users would be a way for one to shoot >>> himself badly, particularly for pg_receivexlog where a crash can cause >>> segments to go missing. >>> >> >> Well, why do we provide a --nosync option for initdb? Wouldn't the argument >> basically be the same? > > Yes, the good-for-testing-but-not-production argument. > >> I agree it kind of feels like overkill, but it would be consistent overkill? >> :) > > Oh, well. I have just implemented it on top of the two other patches > for pg_basebackup. For pg_receivexlog, I am wondering if it makes > sense to have it. That would be trivial to implement it, and I think > that we had better make the combination of --synchronous and --nosync > just leave with an error. Thoughts about having that for > pg_receivexlog? With patches that's actually better.. -- Michael
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Sat, Sep 3, 2016 at 10:22 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Sat, Sep 3, 2016 at 10:21 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> Oh, well. I have just implemented it on top of the two other patches >> for pg_basebackup. For pg_receivexlog, I am wondering if it makes >> sense to have it. That would be trivial to implement it, and I think >> that we had better make the combination of --synchronous and --nosync >> just leave with an error. Thoughts about having that for >> pg_receivexlog? > > With patches that's actually better.. Meh, meh et meh. -- Michael
Attachment
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Magnus Hagander
Date:
On Sat, Sep 3, 2016 at 3:21 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Sat, Sep 3, 2016 at 12:42 AM, Magnus Hagander <magnus@hagander.net> wrote:
> On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> On Fri, Sep 2, 2016 at 2:20 AM, Peter Eisentraut
>> <peter.eisentraut@2ndquadrant.com> wrote:
>> > On 5/13/16 2:39 AM, Michael Paquier wrote:
>> What do others think about that? I could implement that on top of 0002
>> with some extra options. But to be honest that looks to be just some
>> extra sugar for what is basically a bug fix... And I am feeling that
>> providing such a switch to users would be a way for one to shoot
>> himself badly, particularly for pg_receivexlog where a crash can cause
>> segments to go missing.
>>
>
> Well, why do we provide a --nosync option for initdb? Wouldn't the argument
> basically be the same?
Yes, the good-for-testing-but-not-production argument.
> I agree it kind of feels like overkill, but it would be consistent overkill?
> :)
Oh, well. I have just implemented it on top of the two other patches
for pg_basebackup. For pg_receivexlog, I am wondering if it makes
sense to have it. That would be trivial to implement it, and I think
that we had better make the combination of --synchronous and --nosync
just leave with an error. Thoughts about having that for
pg_receivexlog?
Yes, we should definitely not allow that combination. In fact, maybe that argument in itself is enough not to have it for pg_receivexlog -- the cause of confusion.
I can see how you might want to avoid it for pg_basebackup during testing for example,. but the overhead on pg_receivexlog shouldn't be as bad in testing, should it?
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Sat, Sep 3, 2016 at 10:26 PM, Magnus Hagander <magnus@hagander.net> wrote: > Yes, we should definitely not allow that combination. In fact, maybe that > argument in itself is enough not to have it for pg_receivexlog -- the cause > of confusion. > > I can see how you might want to avoid it for pg_basebackup during testing > for example,. but the overhead on pg_receivexlog shouldn't be as bad in > testing, should it? No, I haven't tested directly but it should not be.. pg_basebackup does always a bulk write, while pg_receivexlog depends on the server activity. -- Michael
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Peter Eisentraut
Date:
On 9/3/16 9:23 AM, Michael Paquier wrote: > On Sat, Sep 3, 2016 at 10:22 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: >> On Sat, Sep 3, 2016 at 10:21 PM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >>> Oh, well. I have just implemented it on top of the two other patches >>> for pg_basebackup. For pg_receivexlog, I am wondering if it makes >>> sense to have it. That would be trivial to implement it, and I think >>> that we had better make the combination of --synchronous and --nosync >>> just leave with an error. Thoughts about having that for >>> pg_receivexlog? >> >> With patches that's actually better.. > > Meh, meh et meh. In fsync_pgdata(), you have removed the progname from one error message, even though it is passed into the function. Also, I think fsync_pgdata() should not be printing initdb's progress messages. That should stay in initdb. Worse, in 0002 you are then changing the output, presumably to suit pg_basebackup or something else, which messes up the initdb output. durable_rename() does not fsync an existing newfile before renaming, in contrast to the backend code in fd.c. I'm slightly concerned about lots of code duplication in durable_rename, fsync_parent_path between backend and standalone code. Also, I'm equally slightly concerned that having duplicate function names will mess up tag search for someone. This is a preexisting mistake, but fsync_fname_ext() should just be fsync_fname() to match the backend naming. In the backend, fsync_fname_ext() "extends" fsync_fname() by accepting an ignore_perm argument, but the initdb code doesn't do that. I don't think tar file output in pg_basebackup needs to be fsynced. It's just a backup file like what pg_dump produces, and we don't fsync that either. The point of this change is to leave a data directory in a synced state equivalent to what initdb leaves behind. Some of the changes in receivelog.c seem excessive. Replacing a plain fsync() with fsync_fname_ext() plus fsync_parent_path() isn't justified by the references you have provided. This would need more explanation. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Craig Ringer
Date:
<p dir="ltr"><p dir="ltr">On 3 Sep. 2016 9:22 pm, "Michael Paquier" <<a href="mailto:michael.paquier@gmail.com">michael.paquier@gmail.com</a>>wrote:<br /> ><br /> > On Sat, Sep 3, 2016at 12:42 AM, Magnus Hagander <<a href="mailto:magnus@hagander.net">magnus@hagander.net</a>> wrote:<br /> > >On Fri, Sep 2, 2016 at 8:50 AM, Michael Paquier <<a href="mailto:michael.paquier@gmail.com">michael.paquier@gmail.com</a>><br/> > > wrote:<br /> > >> On Fri,Sep 2, 2016 at 2:20 AM, Peter Eisentraut<br /> > >> <<a href="mailto:peter.eisentraut@2ndquadrant.com">peter.eisentraut@2ndquadrant.com</a>>wrote:<br /> > >> > On5/13/16 2:39 AM, Michael Paquier wrote:<br /> > >> What do others think about that? I could implement that ontop of 0002<br /> > >> with some extra options. But to be honest that looks to be just some<br /> > >>extra sugar for what is basically a bug fix... And I am feeling that<br /> > >> providing such a switchto users would be a way for one to shoot<br /> > >> himself badly, particularly for pg_receivexlog where acrash can cause<br /> > >> segments to go missing.<br /> > >><br /> > ><br /> > > Well, whydo we provide a --nosync option for initdb? Wouldn't the argument<br /> > > basically be the same?<br /> ><br/> > Yes, the good-for-testing-but-not-production argument.<p dir="ltr">We need it for tap tests. More and morewill use pg_basebackup and it'll start hurting test speeds badly.<br />
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Sat, Sep 10, 2016 at 7:36 PM, Craig Ringer <craig.ringer@2ndquadrant.com> wrote: > We need it for tap tests. More and more will use pg_basebackup and it'll > start hurting test speeds badly. Ah yes, that's a good argument. hamster would suffer pretty badly after that if nothing is done. I'll get an extra patch out for that, with --no-sync and not --nosync by the way. -- Michael
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Andres Freund
Date:
On 2016-09-13 10:35:38 +0900, Michael Paquier wrote: > On Sat, Sep 10, 2016 at 7:36 PM, Craig Ringer > <craig.ringer@2ndquadrant.com> wrote: > > We need it for tap tests. More and more will use pg_basebackup and it'll > > start hurting test speeds badly. > > Ah yes, that's a good argument. hamster would suffer pretty badly > after that if nothing is done. I'll get an extra patch out for that, > with --no-sync and not --nosync by the way. FWIW, it might be better to instead use eatmydata in the cron invocations on such slow machines, that way we also test the fsync paths in them.
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Tue, Sep 13, 2016 at 10:37 AM, Andres Freund <andres@anarazel.de> wrote: > On 2016-09-13 10:35:38 +0900, Michael Paquier wrote: >> On Sat, Sep 10, 2016 at 7:36 PM, Craig Ringer >> <craig.ringer@2ndquadrant.com> wrote: >> > We need it for tap tests. More and more will use pg_basebackup and it'll >> > start hurting test speeds badly. >> >> Ah yes, that's a good argument. hamster would suffer pretty badly >> after that if nothing is done. I'll get an extra patch out for that, >> with --no-sync and not --nosync by the way. > > FWIW, it might be better to instead use eatmydata in the cron > invocations on such slow machines, that way we also test the fsync paths > in them. Er, well.. libeatmydata is only available in AUR for Archlinux x86_64, and not in sight for Arch ARM... -- Michael
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Sat, Sep 10, 2016 at 6:27 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > In fsync_pgdata(), you have removed the progname from one error > message, even though it is passed into the function. Right. Good catch. > Also, I think > fsync_pgdata() should not be printing initdb's progress messages. > That should stay in initdb. That's a reference to that. + fputs(_("syncing data to disk ... "), stdout); + fflush(stdout); Oops, fixed. I missed that now that I look at it. Perhaps I was thinking something else when hacking that lastly, like getting this message out for pg_basebackup as well. > Worse, in 0002 you are then changing the > output, presumably to suit pg_basebackup or something else, which > messes up the initdb output. Yes, fixed. > durable_rename() does not fsync an existing newfile before renaming, > in contrast to the backend code in fd.c. That's in 0002. In the case of initdb it did not really matter, but that matters for pg_receivexlog, so let's do it unconditionally. I reworked the code to check if the existing new file is here, and fsync() if it exists. > I'm slightly concerned about lots of code duplication in > durable_rename, fsync_parent_path between backend and standalone code. Based on the ground of code readability, I am not sure that it would be worth sharing the code of durable_rename or even fsync_parent_path between the backend and the frontend, particularly because they are actually not really duplicated... One good step in favor of that would be to get a elog()/ereport() layer in src/common as well, but that's really something that this set of patches should tackle, no? > Also, I'm equally slightly concerned that having duplicate function > names will mess up tag search for someone. There are similar cases, take palloc().. Now perhaps some of those functions could be renamed pg_fsync_... Thoughts? > This is a preexisting mistake, but fsync_fname_ext() should just be > fsync_fname() to match the backend naming. In the backend, > fsync_fname_ext() "extends" fsync_fname() by accepting an ignore_perm > argument, but the initdb code doesn't do that. OK. > I don't think tar file output in pg_basebackup needs to be fsynced. > It's just a backup file like what pg_dump produces, and we don't fsync > that either. The point of this change is to leave a data directory in > a synced state equivalent to what initdb leaves behind. Here I don't agree. The point of the patch is to make sure that what gets generated by pg_basebackup is consistent on disk, for all the formats. It seems weird to me that we could say that the plain format makes things consistent while the tar format may cause some files to be lost in case of power outage. The docs make it clear I think: + By default, <command>pg_basebackup</command> will wait for all files + to be written safely to disk. But perhaps this should directly mention that this is done for all the formats? > Some of the changes in receivelog.c seem excessive. Replacing a plain > fsync() with fsync_fname_ext() plus fsync_parent_path() isn't > justified by the references you have provided. This would need more > explanation. In mark_file_as_archived(), we had better do it or in case of a crash the .done file would just be gone. open_walfile() as well need that, to ensure that the segment is created and zeroed, and in the case where it was padded (is this one overthinking it?). Patch 0003 had conflicts caused by 9083353. -- Michael
Attachment
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Peter Eisentraut
Date:
On 9/12/16 11:16 PM, Michael Paquier wrote: >> I don't think tar file output in pg_basebackup needs to be fsynced. >> > It's just a backup file like what pg_dump produces, and we don't fsync >> > that either. The point of this change is to leave a data directory in >> > a synced state equivalent to what initdb leaves behind. > Here I don't agree. The point of the patch is to make sure that what > gets generated by pg_basebackup is consistent on disk, for all the > formats. It seems weird to me that we could say that the plain format > makes things consistent while the tar format may cause some files to > be lost in case of power outage. The docs make it clear I think: > + By default, <command>pg_basebackup</command> will wait for all files > + to be written safely to disk. > But perhaps this should directly mention that this is done for all the formats? That doesn't really explain why we fsync. If we think that all "important" files should be fsynced, why aren't we doing it in pg_dump? Or psql, or server-side copy? Similarly, there is no straightforward mechanism by which you can unpack the tar file generated by pg_basebackup and get the unpacked directory fsynced properly. (I suppose initdb -S should be recommended.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Thu, Sep 15, 2016 at 9:44 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 9/12/16 11:16 PM, Michael Paquier wrote: >>> I don't think tar file output in pg_basebackup needs to be fsynced. >>> > It's just a backup file like what pg_dump produces, and we don't fsync >>> > that either. The point of this change is to leave a data directory in >>> > a synced state equivalent to what initdb leaves behind. >> Here I don't agree. The point of the patch is to make sure that what >> gets generated by pg_basebackup is consistent on disk, for all the >> formats. It seems weird to me that we could say that the plain format >> makes things consistent while the tar format may cause some files to >> be lost in case of power outage. The docs make it clear I think: >> + By default, <command>pg_basebackup</command> will wait for all files >> + to be written safely to disk. >> But perhaps this should directly mention that this is done for all the formats? > > That doesn't really explain why we fsync. Data durability, particularly on ext4 as it has been discussed a couple of months back [1]. In case if a crash it could be perfectly possible to lose files, hence we need to be sure that the files themselves are flushed, as well as their parent directory to prevent any problems. I think that we should protect users' backups as much as we can, in the range we can. > If we think that all > "important" files should be fsynced, why aren't we doing it in pg_dump? pg_dump should do it where it can, see thread [2]. I am tackling problems once at a time, and that's as well a reason why I would like us to have a common set of routines in src/common or src/fe_utils to help improve this handling. > Or psql, or server-side copy? Similarly, there is no straightforward > mechanism by which you can unpack the tar file generated by > pg_basebackup and get the unpacked directory fsynced properly. (I > suppose initdb -S should be recommended.) Yes, those are cases that you cannot cover. Imagine for example pg_basebackup's tar or pg_dump data sent to stdout. There is nothing we can actually do for all those cases. However what we can do it giving a set of options making possible to get consistent backups for users. [1] Silent data loss on ext4: https://www.postgresql.org/message-id/56583BDD.9060302@2ndquadrant.com [2] Data durability: https://www.postgresql.org/message-id/20160327233033.GD20662@awork2.anarazel.de -- Michael
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Peter Eisentraut
Date:
This is looking pretty good. More comments on this patch set: 0001: Keep the file order alphabetical in Mkvcbuild.pm. Needs nls.mk updates for file move (in initdb and pg_basebackup directories). 0002: durable_rename needs close(fd) in non-error path (compare backend code). Changing from fsync() to fsync_name() in some cases means that inaccessible files are now ignored. I guess this would only happen in very obscure circumstances, but it's worth considering if that is OK. You added this comment: * XXX: This means that we might not restart if a crash occurs before the * fsync below. We probably should create the file in a temporary path * like the backend does... pg_receivexlog uses the .partial suffix for this reason. Why doesn't pg_basebackup do that? In open_walfile, could we move the fsync calls before the fstat or somewhere around there so we don't have to make the same calls in two different branches? 0003: There was a discussion about renaming the --nosync option in initdb to --no-sync, but until that is done, the option in pg_basebackup should stay what initdb has right now. There is a whitespace alignment error in usage(). The -N option should be listed after -n. Some fsync calls are not covered by a do_sync conditional. I see some in close_walfile(), HandleCopyStream(), ProcessKeepaliveMsg(). -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Fri, Sep 23, 2016 at 10:54 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > This is looking pretty good. More comments on this patch set: Thanks for the review. > 0001: > > Keep the file order alphabetical in Mkvcbuild.pm. > > Needs nls.mk updates for file move (in initdb and pg_basebackup > directories). Fixed. > 0002: > > durable_rename needs close(fd) in non-error path (compare backend code). Oops, leak. > Changing from fsync() to fsync_name() in some cases means that > inaccessible files are now ignored. I guess this would only happen in > very obscure circumstances, but it's worth considering if that is OK. In writeTimeLineHistoryFile() that's fine, the user should have ownership of it as it has been created by receivelog.c. Similar remark for . In mark_file_as_archived, the previous open() call would have just failed. > You added this comment: > * XXX: This means that we might not restart if a crash occurs > before the > * fsync below. We probably should create the file in a temporary path > * like the backend does... > pg_receivexlog uses the .partial suffix for this reason. Why doesn't > pg_basebackup do that? Because it matters for pg_receivexlog as it has a retry logic and it is able to rework on a partial segment. Thinking more about that this comment does not make much sense, because for pg_basebackup a crash would just cause the backup to be incomplete, so I have removed it. > In open_walfile, could we move the fsync calls before the fstat or > somewhere around there so we don't have to make the same calls in two > different branches? We could refactor a bit the code so as follows: if (size == 16MB) { walfile = f; } else if (size == 0) { //do padding and lseek } else error(); fsync(); I am not sure if this gains in clarity though, perhaps I looked too much the current code. > 0003: > > There was a discussion about renaming the --nosync option in initdb to > --no-sync, but until that is done, the option in pg_basebackup should > stay what initdb has right now. Changed. > There is a whitespace alignment error in usage(). Not sure I follow here. > The -N option should be listed after -n. In the switch()? Fixed. > Some fsync calls are not covered by a do_sync conditional. I see some > in close_walfile(), HandleCopyStream(), ProcessKeepaliveMsg(). Right. I looked at the rest and did not notice any extra mistakes. -- Michael
Attachment
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Peter Eisentraut
Date:
On 9/23/16 11:15 AM, Michael Paquier wrote: >> 0002: >> > >> > durable_rename needs close(fd) in non-error path (compare backend code). > Oops, leak. Why did you remove the errno save and restore? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Tue, Sep 27, 2016 at 11:16 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 9/23/16 11:15 AM, Michael Paquier wrote: >>> 0002: >>> > >>> > durable_rename needs close(fd) in non-error path (compare backend code). >> Oops, leak. > > Why did you remove the errno save and restore? That's this bit in durable_rename, patch 0002: + if (fsync(fd) != 0) + { + fprintf(stderr, _("%s: could not fsync file \"%s\": %s\n"), + progname, newfile, strerror(errno)); + close(fd); + return -1; + } + close(fd); I thought that as long as the error string is shown to the user, it does not matter much if errno is still saved or not. All the callers of durable_rename() on frontends don't check for strerrno(errno) afterwards. Do you think it matters? Changing that back is trivial. Sorry for not mentioning directly in the thread that I modified that when dropping the last patch set. -- Michael
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Peter Eisentraut
Date:
On 9/26/16 10:34 PM, Michael Paquier wrote: > I thought that as long as the error string is shown to the user, it > does not matter much if errno is still saved or not. All the callers > of durable_rename() on frontends don't check for strerrno(errno) > afterwards. Do you think it matters? Changing that back is trivial. > Sorry for not mentioning directly in the thread that I modified that > when dropping the last patch set. Actually, I think the equivalent backend code only does this to save the errno across the close call because the elog call comes after the close.So it's OK to not do that in the frontend. With that in mind, I have committed the v3 series now. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Jeff Janes
Date:
On Thu, Sep 29, 2016 at 8:33 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 9/26/16 10:34 PM, Michael Paquier wrote:
> I thought that as long as the error string is shown to the user, it
> does not matter much if errno is still saved or not. All the callers
> of durable_rename() on frontends don't check for strerrno(errno)
> afterwards. Do you think it matters? Changing that back is trivial.
> Sorry for not mentioning directly in the thread that I modified that
> when dropping the last patch set.
Actually, I think the equivalent backend code only does this to save the
errno across the close call because the elog call comes after the close.
So it's OK to not do that in the frontend.
With that in mind, I have committed the v3 series now.
I'm getting compiler warnings:
file_utils.c: In function 'fsync_pgdata':
file_utils.c:86: warning: passing argument 2 of 'walkdir' from incompatible pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char *)' but argument is of type 'void (*)(const char *, bool, const char *)'
file_utils.c:88: warning: passing argument 2 of 'walkdir' from incompatible pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char *)' but argument is of type 'void (*)(const char *, bool, const char *)'
file_utils.c:89: warning: passing argument 2 of 'walkdir' from incompatible pointer type
file_utils.c:36: note: expected 'int (*)(const char *, bool, const char *)' but argument is of type 'void (*)(const char *, bool, const char *)'
Cheers,
Jeff
On 9/29/16 12:31 PM, Jeff Janes wrote: > With that in mind, I have committed the v3 series now. > > > I'm getting compiler warnings: Fixed. > > file_utils.c: In function 'fsync_pgdata': > file_utils.c:86: warning: passing argument 2 of 'walkdir' from > incompatible pointer type > file_utils.c:36: note: expected 'int (*)(const char *, bool, const char > *)' but argument is of type 'void (*)(const char *, bool, const char *)' > file_utils.c:88: warning: passing argument 2 of 'walkdir' from > incompatible pointer type > file_utils.c:36: note: expected 'int (*)(const char *, bool, const char > *)' but argument is of type 'void (*)(const char *, bool, const char *)' > file_utils.c:89: warning: passing argument 2 of 'walkdir' from > incompatible pointer type > file_utils.c:36: note: expected 'int (*)(const char *, bool, const char > *)' but argument is of type 'void (*)(const char *, bool, const char *)' -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Michael Paquier
Date:
On Fri, Sep 30, 2016 at 1:31 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Thu, Sep 29, 2016 at 8:33 AM, Peter Eisentraut > <peter.eisentraut@2ndquadrant.com> wrote: >> >> On 9/26/16 10:34 PM, Michael Paquier wrote: >> > I thought that as long as the error string is shown to the user, it >> > does not matter much if errno is still saved or not. All the callers >> > of durable_rename() on frontends don't check for strerrno(errno) >> > afterwards. Do you think it matters? Changing that back is trivial. >> > Sorry for not mentioning directly in the thread that I modified that >> > when dropping the last patch set. >> >> Actually, I think the equivalent backend code only does this to save the >> errno across the close call because the elog call comes after the close. >> So it's OK to not do that in the frontend. >> >> With that in mind, I have committed the v3 series now. Thanks! > I'm getting compiler warnings: > > file_utils.c: In function 'fsync_pgdata': > file_utils.c:86: warning: passing argument 2 of 'walkdir' from incompatible > pointer type > file_utils.c:36: note: expected 'int (*)(const char *, bool, const char *)' > but argument is of type 'void (*)(const char *, bool, const char *)' > file_utils.c:88: warning: passing argument 2 of 'walkdir' from incompatible > pointer type > file_utils.c:36: note: expected 'int (*)(const char *, bool, const char *)' > but argument is of type 'void (*)(const char *, bool, const char *)' > file_utils.c:89: warning: passing argument 2 of 'walkdir' from incompatible > pointer type > file_utils.c:36: note: expected 'int (*)(const char *, bool, const char *)' > but argument is of type 'void (*)(const char *, bool, const char *)' Oops. Are you using gcc to see that? I compiled with clang and on Windows without noticing it. -- Michael
Re: pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes: > Oops. Are you using gcc to see that? I compiled with clang and on > Windows without noticing it. Peter already noted that you'd only see it on platforms that define PG_FLUSH_DATA_WORKS. regards, tom lane