Thread: pgsql: Recursively fsync() the data directory after a crash.

pgsql: Recursively fsync() the data directory after a crash.

From
Robert Haas
Date:
Recursively fsync() the data directory after a crash.

Otherwise, if there's another crash, some writes from after the first
crash might make it to disk while writes from before the crash fail
to make it to disk.  This could lead to data corruption.

Back-patch to all supported versions.

Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised
by me.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/2ce439f3379aed857517c8ce207485655000fc8e

Modified Files
--------------
src/backend/access/transam/xlog.c |   42 ++++++++++++++
src/backend/storage/file/fd.c     |  115 +++++++++++++++++++++++++++++++++++++
src/include/storage/fd.h          |    2 +
3 files changed, 159 insertions(+)


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Andrew Dunstan
Date:
On 05/04/2015 02:23 PM, Robert Haas wrote:
> Recursively fsync() the data directory after a crash.
>
> Otherwise, if there's another crash, some writes from after the first
> crash might make it to disk while writes from before the crash fail
> to make it to disk.  This could lead to data corruption.
>
> Back-patch to all supported versions.
>

This appears to have broken Windows builds. At least it's broken by
using the two argument form of open instead of the three argument form,
for which we have a #define in the win32 case, and pg_win32_is_junction
is a typo - the first _ should not be there.

cheers

andrew


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Robert Haas
Date:
On Mon, May 4, 2015 at 11:37 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> On 05/04/2015 02:23 PM, Robert Haas wrote:
>> Recursively fsync() the data directory after a crash.
>>
>> Otherwise, if there's another crash, some writes from after the first
>> crash might make it to disk while writes from before the crash fail
>> to make it to disk.  This could lead to data corruption.
>>
>> Back-patch to all supported versions.
>>
>
> This appears to have broken Windows builds. At least it's broken by using
> the two argument form of open instead of the three argument form, for which
> we have a #define in the win32 case, and pg_win32_is_junction is a typo -
> the first _ should not be there.

Sorry about that.  Working on it now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Peter Eisentraut
Date:
On 5/4/15 2:23 PM, Robert Haas wrote:
> Recursively fsync() the data directory after a crash.
>
> Otherwise, if there's another crash, some writes from after the first
> crash might make it to disk while writes from before the crash fail
> to make it to disk.  This could lead to data corruption.

If there a reason why in pre_sync_fname(), the error message does not
contain a %m?



Re: pgsql: Recursively fsync() the data directory after a crash.

From
Robert Haas
Date:
On Sun, May 17, 2015 at 9:44 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On 5/4/15 2:23 PM, Robert Haas wrote:
>> Recursively fsync() the data directory after a crash.
>>
>> Otherwise, if there's another crash, some writes from after the first
>> crash might make it to disk while writes from before the crash fail
>> to make it to disk.  This could lead to data corruption.
>
> If there a reason why in pre_sync_fname(), the error message does not
> contain a %m?

For consistency with the rest of the file, I suppose.  Not sure why
it's like that, but all the functions in the file do it that way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, May 17, 2015 at 9:44 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>> If there a reason why in pre_sync_fname(), the error message does not
>> contain a %m?

> For consistency with the rest of the file, I suppose.  Not sure why
> it's like that, but all the functions in the file do it that way.

Huh?  All the ones that are reporting kernel call failures certainly
have %m.

            regards, tom lane


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Robert Haas
Date:
On Mon, May 18, 2015 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, May 17, 2015 at 9:44 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
>>> If there a reason why in pre_sync_fname(), the error message does not
>>> contain a %m?
>
>> For consistency with the rest of the file, I suppose.  Not sure why
>> it's like that, but all the functions in the file do it that way.
>
> Huh?  All the ones that are reporting kernel call failures certainly
> have %m.

Oops, you're right.  I was looking at the wrong code.  Yeah, that
should probably be fixed.  I'm not sure it's a good idea to do that
right now though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, May 18, 2015 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Huh?  All the ones that are reporting kernel call failures certainly
>> have %m.

> Oops, you're right.  I was looking at the wrong code.  Yeah, that
> should probably be fixed.  I'm not sure it's a good idea to do that
> right now though.

It's a trivial enough change that I wouldn't see a problem with doing it
now.  But if you do, please get it in by say 2PM EDT.

            regards, tom lane


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Bruce Momjian
Date:
On Mon, May 18, 2015 at 12:15:03PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Mon, May 18, 2015 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Huh?  All the ones that are reporting kernel call failures certainly
> >> have %m.
>
> > Oops, you're right.  I was looking at the wrong code.  Yeah, that
> > should probably be fixed.  I'm not sure it's a good idea to do that
> > right now though.
>
> It's a trivial enough change that I wouldn't see a problem with doing it
> now.  But if you do, please get it in by say 2PM EDT.

Uh, doesn't that change the translated strings?  Is that a problem?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Mon, May 18, 2015 at 12:15:03PM -0400, Tom Lane wrote:
>> It's a trivial enough change that I wouldn't see a problem with doing it
>> now.  But if you do, please get it in by say 2PM EDT.

> Uh, doesn't that change the translated strings?  Is that a problem?

Better an untranslated message than a translated one that lacks critical
info.  But anyway, there are likely other instances of that same string
*with* %m ...

            regards, tom lane


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Robert Haas
Date:
On Mon, May 18, 2015 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> On Mon, May 18, 2015 at 12:15:03PM -0400, Tom Lane wrote:
>>> It's a trivial enough change that I wouldn't see a problem with doing it
>>> now.  But if you do, please get it in by say 2PM EDT.
>
>> Uh, doesn't that change the translated strings?  Is that a problem?
>
> Better an untranslated message than a translated one that lacks critical
> info.  But anyway, there are likely other instances of that same string
> *with* %m ...

Probably not, because of the way the message is worded.  But we could do this:

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 6fa75d1..bed8478 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2461,7 +2461,7 @@ pre_sync_fname(char *fname, bool isdir)

        if (fd < 0)
                ereport(FATAL,
-                               (errmsg("could not open file \"%s\"
before fsync",
+                               (errmsg("could not open file \"%s\": %m",
                                                fname)));

        pg_flush_data(fd, 0, 0);

Does that sound good?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, May 18, 2015 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Better an untranslated message than a translated one that lacks critical
>> info.  But anyway, there are likely other instances of that same string
>> *with* %m ...

> Probably not, because of the way the message is worded.  But we could do this:

> diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
> index 6fa75d1..bed8478 100644
> --- a/src/backend/storage/file/fd.c
> +++ b/src/backend/storage/file/fd.c
> @@ -2461,7 +2461,7 @@ pre_sync_fname(char *fname, bool isdir)

>         if (fd < 0)
>                 ereport(FATAL,
> -                               (errmsg("could not open file \"%s\"
> before fsync",
> +                               (errmsg("could not open file \"%s\": %m",
>                                                 fname)));

>         pg_flush_data(fd, 0, 0);

> Does that sound good?

Works for me.  The file/line info for the message would provide the
"before fsync" context anyway, if someone needed it.

            regards, tom lane


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Mon, May 18, 2015 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> >> Uh, doesn't that change the translated strings?  Is that a problem?
> >
> > Better an untranslated message than a translated one that lacks critical
> > info.  But anyway, there are likely other instances of that same string
> > *with* %m ...
>
> Probably not, because of the way the message is worded.  But we could do this:
>
> diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
> index 6fa75d1..bed8478 100644
> --- a/src/backend/storage/file/fd.c
> +++ b/src/backend/storage/file/fd.c
> @@ -2461,7 +2461,7 @@ pre_sync_fname(char *fname, bool isdir)
>
>         if (fd < 0)
>                 ereport(FATAL,
> -                               (errmsg("could not open file \"%s\"
> before fsync",
> +                               (errmsg("could not open file \"%s\": %m",
>                                                 fname)));
>
>         pg_flush_data(fd, 0, 0);
>
> Does that sound good?

+1.  It's not like the extra two words of context would be of much use
anyway.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgsql: Recursively fsync() the data directory after a crash.

From
Robert Haas
Date:
On Mon, May 18, 2015 at 1:00 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> On Mon, May 18, 2015 at 12:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> >> Uh, doesn't that change the translated strings?  Is that a problem?
>> >
>> > Better an untranslated message than a translated one that lacks critical
>> > info.  But anyway, there are likely other instances of that same string
>> > *with* %m ...
>>
>> Probably not, because of the way the message is worded.  But we could do this:
>>
>> diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
>> index 6fa75d1..bed8478 100644
>> --- a/src/backend/storage/file/fd.c
>> +++ b/src/backend/storage/file/fd.c
>> @@ -2461,7 +2461,7 @@ pre_sync_fname(char *fname, bool isdir)
>>
>>         if (fd < 0)
>>                 ereport(FATAL,
>> -                               (errmsg("could not open file \"%s\"
>> before fsync",
>> +                               (errmsg("could not open file \"%s\": %m",
>>                                                 fname)));
>>
>>         pg_flush_data(fd, 0, 0);
>>
>> Does that sound good?
>
> +1.  It's not like the extra two words of context would be of much use
> anyway.

Done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company