Thread: fsync reliability
Daniel Farina points out to me that the Linux man page for fsync() says:

"Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed."
http://www.kernel.org/doc/man-pages/online/pages/man2/fsync.2.html

That phrase does not exist here:
http://pubs.opengroup.org/onlinepubs/007908799/xsh/fsync.html

This point appears to have been discussed before:
http://postgresql.1045698.n5.nabble.com/ALTER-DATABASE-SET-TABLESPACE-vs-crash-safety-td1995703.html

Tom said "We don't try to "fsync the directory" after a normal table create for instance", which is fine because we don't need to. In the event of a crash a missing table would be recreated during crash recovery.

However, that begs the question of what happens with WAL. At present, we do nothing to ensure that "the entry in the directory containing the file has also reached disk".

ISTM that we can easily do this, since we preallocate WAL files during RemoveOldXlogFiles() and rarely extend the number of files. So it seems easily possible to fsync the pg_xlog directory at the end of RemoveOldXlogFiles(), which is mostly performed by the bgwriter anyway.

It was also noted that "we've always expected the filesystem to take care of its own metadata", which isn't actually stated anywhere in the docs, AFAIK.

Perhaps this is an irrelevant problem these days, but would it hurt to fix? Happy to do the patch if we agree.

-- 
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Excerpts from Simon Riggs's message of jue abr 21 05:26:06 -0300 2011:

> ISTM that we can easily do this, since we preallocate WAL files during
> RemoveOldXlogFiles() and rarely extend the number of files.
> So it seems easily possible to fsync the pg_xlog directory at the end
> of RemoveOldXlogFiles(), which is mostly performed by the bgwriter
> anyway.
>
> It was also noted that "we've always expected the filesystem to take
> care of its own metadata"
> which isn't actually stated anywhere in the docs, AFAIK.
>
> Perhaps this is an irrelevant problem these days, but would it hurt to fix?

I don't think it's irrelevant (yet). Even Greg Smith's book suggests using ext2 for the WAL partition in extreme cases.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Simon Riggs <simon@2ndQuadrant.com> writes:
> Daniel Farina points out to me that the Linux man page for fsync() says
> "Calling fsync() does not necessarily ensure that the entry in the directory
> containing the file has also reached disk. For that an explicit fsync() on a
> file descriptor for the directory is also needed."
> http://www.kernel.org/doc/man-pages/online/pages/man2/fsync.2.html

> This point appears to have been discussed before

Yes ...

> Tom said
> "We don't try to "fsync the directory" after a normal table create for instance"
> which is fine because we don't need to. In the event of a crash a
> missing table would be recreated during crash recovery.

Nonsense. Once a checkpoint occurs after the WAL record that says to create the table, we won't replay that action. Or are you proposing to have checkpoints run around and fsync every directory in the data tree?

The traditional standard is that the filesystem is supposed to take care of its own metadata, and even Linux filesystems have pretty much figured that out. I don't really see a need for us to be nursemaiding the filesystem. At most there's a documentation issue here, ie, we ought to be more explicit about which filesystems and which mount options we recommend.

			regards, tom lane
On Thu, Apr 21, 2011 at 11:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The traditional standard is that the filesystem is supposed to take
> care of its own metadata, and even Linux filesystems have pretty much
> figured that out.  I don't really see a need for us to be nursemaiding
> the filesystem.  At most there's a documentation issue here, ie, we
> ought to be more explicit about which filesystems and which mount
> options we recommend.

I think it would be illuminating to shine upon this conversation the light of some actual facts, as to whether or not this can be demonstrated to be broken on systems people actually use, and to what extent it can be mitigated by the sorts of configuration choices you mention. Neither Simon's comments nor yours give me any clear feeling as to how likely this is to cause problems for real users, nor how easily those problems can be mitigated.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 21, 2011 at 4:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The traditional standard is that the filesystem is supposed to take
> care of its own metadata, and even Linux filesystems have pretty much
> figured that out.  I don't really see a need for us to be nursemaiding
> the filesystem.  At most there's a documentation issue here, ie,

I'm surprised by your response. If we've not documented something that turns out to be essential to the reliability of production databases, then our users have a problem. If our users have a data loss problem, my understanding was that we fixed it.

As it turns out, I've never personally advised anyone to use a non-journalled filesystem, so my hands are clean in this. But it is something we can fix, if we choose.

> we
> ought to be more explicit about which filesystems and which mount
> options we recommend.

Please be explicit then. What should the docs have said? I will update them.

-- 
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 21, 2011 at 5:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 21, 2011 at 11:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The traditional standard is that the filesystem is supposed to take
>> care of its own metadata, and even Linux filesystems have pretty much
>> figured that out.  I don't really see a need for us to be nursemaiding
>> the filesystem.  At most there's a documentation issue here, ie, we
>> ought to be more explicit about which filesystems and which mount
>> options we recommend.
>
> I think it would be illuminating to shine upon this conversation the
> light of some actual facts, as to whether or not this can be
> demonstrated to be broken on systems people actually use, and to what
> extent it can be mitigated by the sorts of configuration choices you
> mention.  Neither Simon's comments nor yours give me any clear feeling
> as to how likely this is to cause problems for real users, nor how
> easily those problems can be mitigated.

If you have some actual facts yourself, add them. Or listen for people that do.

-- 
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 21, 2011 at 12:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, Apr 21, 2011 at 5:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Apr 21, 2011 at 11:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The traditional standard is that the filesystem is supposed to take
>>> care of its own metadata, and even Linux filesystems have pretty much
>>> figured that out.  I don't really see a need for us to be nursemaiding
>>> the filesystem.  At most there's a documentation issue here, ie, we
>>> ought to be more explicit about which filesystems and which mount
>>> options we recommend.
>>
>> I think it would be illuminating to shine upon this conversation the
>> light of some actual facts, as to whether or not this can be
>> demonstrated to be broken on systems people actually use, and to what
>> extent it can be mitigated by the sorts of configuration choices you
>> mention.  Neither Simon's comments nor yours give me any clear feeling
>> as to how likely this is to cause problems for real users, nor how
>> easily those problems can be mitigated.
>
> If you have some actual facts yourself, add them. Or listen for people that do.

Since I don't have any actual facts, listening for people who do is precisely what I am doing. Since the proposed change was your suggestion, perhaps you would like to provide some.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 04/21/2011 04:26 AM, Simon Riggs wrote:
> However, that begs the question of what happens with WAL. At present,
> we do nothing to ensure that "the entry in the directory containing
> the file has also reached disk".

Well, we do, but it's not obvious why that is unless you've stared at this for far too many hours. A clear description of the possible issue you and Dan are raising showed up on LKML a few years ago: http://lwn.net/Articles/270891/

Here's the most relevant part, which directly addresses the WAL case:

"[fsync] is unsafe for write-ahead logging, because it doesn't really guarantee any _ordering_ for the writes at the hard storage level. So aside from losing committed data, it can also corrupt structural metadata. With ext3 it's quite easy to verify that fsync/fdatasync don't always write a journal entry. (Apart from looking at the kernel code :-)

Just write some data, fsync(), and observe the number of writes in /proc/diskstats. If the current mtime second _hasn't_ changed, the inode isn't written. If you write data, say, 10 times a second to the same place followed by fsync(), you'll see a little more than 10 write I/Os, and less than 20."

There's a terrible hack suggested there, where you run fchmod to force the journal out in the next fsync, that makes me want to track the poster down and shoot him; but this part raises a reasonable question.

The main issue he's complaining about here is a moot one for PostgreSQL. If the WAL writes have been reordered but have not completed, the minute WAL replay hits the spot with a missing block the CRC32 will be busted and replay is finished. The fact that he assumes a database would have such a naive WAL implementation that it would corrupt itself if blocks are written out of order before the fsync call returns is one of the reasons this whole idea never got more traction--it's hard to get excited about a proposal whose fundamentals rest on an assumption that turns out not to be true of real databases.
There's still the "fsync'd a data block but not the directory entry yet" issue as fall-out from this too. Why doesn't PostgreSQL run into this problem? Because the exact code sequence used is this one:

open
write
fsync
close

And Linux shouldn't ever screw that up, or the similar rename path. Here's what the close man page says, from http://linux.die.net/man/2/close :

"A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)"

What this is alluding to is that if you fsync before closing, the close will write all the metadata out too. You're busted if your write cache lies, but we already know all about that issue.

There was a discussion of issues around this on LKML a few years ago, with Alan Cox getting the good pull quote at http://lkml.org/lkml/2009/3/27/268 : "fsync/close() as a pair allows the user to correctly indicate their requirements." While fsync doesn't guarantee that metadata is written out, and neither does close, kernel developers seem to all agree that fsync-before-close means you want everything on disk. Filesystems that don't honor that will break all sorts of software.

It is of course possible there are bugs in some part of this code path, where a clever enough test case might expose a window of strange file/metadata ordering. I think it's too weak of a theorized problem to go specifically chasing after, though.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
On Fri, Apr 22, 2011 at 4:51 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 04/21/2011 04:26 AM, Simon Riggs wrote:
>>
>> However, that begs the question of what happens with WAL. At present,
>> we do nothing to ensure that "the entry in the directory containing
>> the file has also reached disk".
>
> Well, we do, but it's not obvious why that is unless you've stared at this
> for far too many hours.  A clear description of the possible issue you and
> Dan are raising showed up on LKML a few years ago:
> http://lwn.net/Articles/270891/
>
> Here's the most relevant part, which directly addresses the WAL case:
>
> "[fsync] is unsafe for write-ahead logging, because it doesn't really
> guarantee any _ordering_ for the writes at the hard storage level.  So aside
> from losing committed data, it can also corrupt structural metadata.  With
> ext3 it's quite easy to verify that fsync/fdatasync don't always write a
> journal entry.  (Apart from looking at the kernel code :-)
>
> Just write some data, fsync(), and observe the number of writes in
> /proc/diskstats.  If the current mtime second _hasn't_ changed, the inode
> isn't written.  If you write data, say, 10 times a second to the same place
> followed by fsync(), you'll see a little more than 10 write I/Os, and less
> than 20."
>
> There's a terrible hack suggested where you run fchmod to force the journal
> out in the next fsync that makes me want to track the poster down and shoot
> him, but this part raises a reasonable question.
>
> The main issue he's complaining about here is a moot one for PostgreSQL.  If
> the WAL writes have been reordered but have not completed, the minute WAL
> replay hits the spot with a missing block the CRC32 will be busted and
> replay is finished.
> The fact that he assumes a database would have such a naive WAL
> implementation that it would corrupt itself if blocks are written out of
> order before the fsync call returns is one of the reasons this whole idea
> never got more traction--it's hard to get excited about a proposal whose
> fundamentals rest on an assumption that turns out not to be true of real
> databases.
>
> There's still the "fsync'd a data block but not the directory entry yet"
> issue as fall-out from this too.  Why doesn't PostgreSQL run into this
> problem?  Because the exact code sequence used is this one:
>
> open
> write
> fsync
> close
>
> And Linux shouldn't ever screw that up, or the similar rename path.  Here's
> what the close man page says, from http://linux.die.net/man/2/close :
>
> "A successful close does not guarantee that the data has been successfully
> saved to disk, as the kernel defers writes.  It is not common for a
> filesystem to flush the buffers when the stream is closed.  If you need to
> be sure that the data is physically stored use fsync(2).  (It will depend on
> the disk hardware at this point.)"
>
> What this is alluding to is that if you fsync before closing, the close will
> write all the metadata out too.  You're busted if your write cache lies, but
> we already know all about that issue.
>
> There was a discussion of issues around this on LKML a few years ago, with
> Alan Cox getting the good pull quote at http://lkml.org/lkml/2009/3/27/268 :
> "fsync/close() as a pair allows the user to correctly indicate their
> requirements."  While fsync doesn't guarantee that metadata is written out,
> and neither does close, kernel developers seem to all agree that
> fsync-before-close means you want everything on disk.  Filesystems that
> don't honor that will break all sorts of software.
>
> It is of course possible there are bugs in some part of this code path,
> where a clever enough test case might expose a window of strange
> file/metadata ordering.
> I think it's too weak of a theorized problem to go
> specifically chasing after though.

We do issue fsync and then close, but only when we switch log files. We don't do that as part of the normal commit path.

I agree that there isn't a "crash bug" here. If WAL metadata is wrong, or if WAL data blocks are missing, then this will just show up as an "end of WAL" condition on crash recovery. Postgres will still work at the end of it. What worries me is that because we always end on an error, we have no real way of knowing whether this has happened never or lots.

Now that I think about it, I can't really see a good reason why we apply WAL files in sequence trusting just the file name sequence during crash recovery. The files contain information to allow us to identify the contents, so if we can't see a file with the right name we can always scan other files to see if they are the right ones. I would prefer a WAL file ordering that wasn't dependent at all on file name. If we did that we wouldn't need to do the file rename thing; we could just have files called log1, log2, etc. Archiving can still use current file names.

The issue you raise above where "fsync is not safe for Write Ahead Logging" doesn't sound good. I don't think what you've said has fully addressed that yet. We could replace the commit path with O_DIRECT and physically order the data blocks, but I would guess the code path to durable storage has way too many bits of code tweaking it for me to feel happy that was worth it.

-- 
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote:
> We do issue fsync and then close, but only when we switch log files.
> We don't do that as part of the normal commit path.

Since all these files are zero-filled before use, the space is allocated for them, and the remaining important filesystem layout metadata gets flushed during the close. The only metadata that changes after that--things like the last access time--isn't important to the WAL functioning. So the metadata doesn't need to be updated after a normal commit; it's already there. There are two main risks when crashing while fsync is in the middle of executing a push out to physical storage: torn pages due to partial data writes, and other out of order writes. The only filesystems where this isn't true are the copy-on-write ones, where the blocks move around on disk too. But those all have their own more careful guarantees about metadata.

> The issue you raise above where "fsync is not safe for Write Ahead
> Logging" doesn't sound good. I don't think what you've said has fully
> addressed that yet. We could replace the commit path with O_DIRECT and
> physically order the data blocks, but I would guess the code path to
> durable storage has way too many bits of code tweaking it for me to
> feel happy that was worth it.

As far as I can tell the CRC is sufficient protection against that. This is all data that hasn't really been committed being torn up here. Once you trust that the metadata problem isn't real, reordered writes are only going to destroy things that are in the middle of being flushed to disk. A synchronous commit mangled this way will be rolled back regardless because it never really finished (fsync didn't return); an asynchronous one was never guaranteed to be on disk.

On many older Linux systems O_DIRECT is a less reliable code path than write/fsync is, so you're right that isn't necessarily a useful step forward.
-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
On Fri, Apr 22, 2011 at 1:35 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Simon Riggs wrote:
>>
>> We do issue fsync and then close, but only when we switch log files.
>> We don't do that as part of the normal commit path.
>
> Since all these files are zero-filled before use, the space is allocated for
> them, and the remaining important filesystem layout metadata gets flushed
> during the close.  The only metadata that changes after that--things like
> the last access time--isn't important to the WAL functioning.  So the
> metadata doesn't need to be updated after a normal commit; it's already
> there.  There are two main risks when crashing while fsync is in the middle
> of executing a push out to physical storage: torn pages due to partial data
> writes, and other out of order writes.  The only filesystems where this
> isn't true are the copy-on-write ones, where the blocks move around on disk
> too.  But those all have their own more careful guarantees about metadata.

OK, that's good, but ISTM we still have a hole during RemoveOldXlogFiles() where we don't fsync or open/close the file, just rename it. The WAL filename is critical in identifying the next batch of data; incorrect metadata will have an effect on crash recovery. So we are relying on the metadata being safe.

>> The issue you raise above where "fsync is not safe for Write Ahead
>> Logging" doesn't sound good. I don't think what you've said has fully
>> addressed that yet. We could replace the commit path with O_DIRECT and
>> physically order the data blocks, but I would guess the code path to
>> durable storage has way too many bits of code tweaking it for me to
>> feel happy that was worth it.
>
> As far as I can tell the CRC is sufficient protection against that.  This is
> all data that hasn't really been committed being torn up here.
> Once you trust that the metadata problem isn't real, reordered writes are
> only going to destroy things that are in the middle of being flushed to
> disk.  A synchronous commit mangled this way will be rolled back regardless
> because it never really finished (fsync didn't return); an asynchronous one
> was never guaranteed to be on disk.

OK, that's clear. Thanks for putting my mind at rest.

-- 
Simon Riggs  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 21, 2011 at 4:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The traditional standard is that the filesystem is supposed to take
> care of its own metadata, and even Linux filesystems have pretty much
> figured that out.  I don't really see a need for us to be nursemaiding
> the filesystem.  At most there's a documentation issue here, ie, we
> ought to be more explicit about which filesystems and which mount
> options we recommend.

To be fair, the traditional standard was that filesystem metadata was written synchronously. That is, the creat/rename/unlink calls didn't finish until the data had been written. That was never brilliant, but it was simple.

It's unclear to me whether that API was decided on because the implementation of anything else was hard, or whether it was implemented that way because it was deemed a good idea to define the API that way. I suspect it was the former. As APIs go, having metadata operations be buffered and reusing fsync on the directory to block until they're written seems as sane as anything else.

It's a bit of a pain for us to keep track of which files have been created or deleted in a directory and fsync the directory on checkpoint, but that's just because we've already gone to special efforts to keep track of what data is dirty but not done anything to keep track of which directories have been dirtied.

-- 
greg
On 04/22/2011 09:32 AM, Simon Riggs wrote:
> OK, that's good, but ISTM we still have a hole during
> RemoveOldXlogFiles() where we don't fsync or open/close the file, just
> rename it.

This is also something that many applications rely upon working as hoped for here, even though it's not technically part of POSIX. Early versions of ext4 broke that, and it caused a giant outcry of complaints. http://www.h-online.com/open/news/item/Ext4-data-loss-explanations-and-workarounds-740671.html has a good summary. This was broken on ext4 from around 2.6.28 to 2.6.30, but the fix for it was in such demand that it's even been ported by the relatively lazy distributions to their 2.6.28/2.6.29 kernels.

There may be a small window for metadata issues here if you've put the WAL on ext2 and there's a crash in the middle of a rename. That factors into why any suggestions I make about using ext2 come with a load of warnings about the risk of not journaling. It's hard to predict every type of issue that fsck might force you to come to terms with after a crash on ext2, and if there were a problem with this path I'd expect it to show up as something to be reconciled then.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
On 2011-04-22 21:55, Greg Smith wrote:
> On 04/22/2011 09:32 AM, Simon Riggs wrote:
>> OK, that's good, but ISTM we still have a hole during
>> RemoveOldXlogFiles() where we don't fsync or open/close the file, just
>> rename it.
>
> This is also something that many applications rely upon working as hoped
> for here, even though it's not technically part of POSIX.  Early
> versions of ext4 broke that, and it caused a giant outcry of complaints.
> http://www.h-online.com/open/news/item/Ext4-data-loss-explanations-and-workarounds-740671.html
> has a good summary.  This was broken on ext4 from around 2.6.28 to
> 2.6.30, but the fix for it was in such demand that it's even been ported
> by the relatively lazy distributions to their 2.6.28/2.6.29 kernels.

As far as I can make out, the current situation is that this fix (the auto_da_alloc mount option) doesn't work as advertised, and the ext4 maintainers are not treating this as a bug.

See https://bugzilla.kernel.org/show_bug.cgi?id=15910

-M-
On Thu, Apr 21, 2011 at 1:26 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Daniel Farina points out to me that the Linux man page for fsync() says
> "Calling fsync() does not necessarily ensure that the entry in the directory
> containing the file has also reached disk. For that an explicit fsync() on a
> file descriptor for the directory is also needed."
> http://www.kernel.org/doc/man-pages/online/pages/man2/fsync.2.html

I'd also like to point out that even on ext(2|3) there is a special mount option, 'dirsync', and a directory attribute (see 'chattr'), which exist mostly for the benefit of the authors of MTAs that use a lot of metadata manipulation operations, to allow all directory metadata mangling to be synchronous, getting around non-durable metadata manipulations. (Even if you use fsync(), a crash between the rename() and the fsync() will leave you in either the pre-move or post-move state: the rename is atomic, but non-durable. The synchronous directory modification ensures that the return of rename() coincides with the durability of the rename itself, or so I would think.)

I only found this from doing some research about how to perform a two-phase commit between postgres and the file system, and from reading the kernel source. I admit, it's a dusty and obscure corner, but it still seems in use by said MTAs.

Would a reading and exploration of the kernel code at hand perhaps help resolve this discussion, one way or another?

-- 
fdr
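For reference, the options Daniel mentions look like this in practice. These are administrator configuration commands, not anything PostgreSQL issues itself; the device, mount point, and paths below are illustrative, and chattr +D only applies on ext* filesystems:

```shell
# Mount-time option: make all directory updates on this filesystem synchronous
mount -o dirsync /dev/sdb1 /pgdata

# Per-directory alternative: set the 'D' (synchronous directory updates)
# attribute on just the WAL directory
chattr +D /pgdata/pg_xlog
lsattr -d /pgdata/pg_xlog   # lists the attribute flags, 'D' among them
```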
On Thu, Apr 21, 2011 at 8:51 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> There's still the "fsync'd a data block but not the directory entry yet"
> issue as fall-out from this too.  Why doesn't PostgreSQL run into this
> problem?  Because the exact code sequence used is this one:
>
> open
> write
> fsync
> close
>
> And Linux shouldn't ever screw that up, or the similar rename path.  Here's
> what the close man page says, from http://linux.die.net/man/2/close :

Theodore Ts'o addresses this *exact* sequence of events, and suggests that if you want that rename to definitely stick you must fsync the directory:

http://www.linuxfoundation.org/news-media/blogs/browse/2009/03/don%E2%80%99t-fear-fsync

"""
One argument that has commonly been made on the various comment streams is that when replacing a file by writing a new file and then renaming “file.new” to “file”, most applications don’t need a guarantee that the new contents of the file are committed to stable store at a certain point in time; only that either the new or the old contents of the file will be present on the disk. So the argument is essentially that the sequence:

fd = open(”foo.new”, O_WRONLY);
write(fd, buf, bufsize);
fsync(fd);
close(fd);
rename(”foo.new”, “foo”);

… is too expensive, since it provides “atomicity and durability”, when in fact all the application needed was “atomicity” (i.e., either the new or the old contents of foo should be present after a crash), but not durability (i.e., the application doesn’t need the new version of foo now, but rather at some intermediate time in the future when it’s convenient for the OS).

This argument is flawed for two reasons. First of all, the sequence above exactly provides the desired “atomicity without durability”.
It doesn’t guarantee which version of the file will appear in the event of an unexpected crash; if the application needs a guarantee that the new version of the file will be present after a crash, ***it’s necessary to fsync the containing directory***
"""

Emphasis mine.

So, all in all, I think the creation of, deletion of, and renaming of files in the write-ahead log area should be followed by a pg_xlog fsync.

I think it is also necessary to fsync directories in the cluster directory at checkpoint time: if a chunk of directory metadata doesn't make it to disk, a checkpoint occurs, and then there's a crash, then it's possible that replaying the WAL post-checkpoint won't create/move/delete the file in the cluster.

The fact this hasn't been happening (or hasn't triggered an error, which would be scarier) may just be a happy accident of that data being flushed most of the time, meaning that the fsync() on the directory file descriptor won't cost very much anyway.

-- 
fdr
On 04/24/2011 10:06 PM, Daniel Farina wrote:
> On Thu, Apr 21, 2011 at 8:51 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> There's still the "fsync'd a data block but not the directory entry yet"
>> issue as fall-out from this too.  Why doesn't PostgreSQL run into this
>> problem?  Because the exact code sequence used is this one:
>>
>> open
>> write
>> fsync
>> close
>>
>> And Linux shouldn't ever screw that up, or the similar rename path.  Here's
>> what the close man page says, from http://linux.die.net/man/2/close :
>
> Theodore Ts'o addresses this *exact* sequence of events, and suggests
> if you want that rename to definitely stick that you must fsync the
> directory:
>
> http://www.linuxfoundation.org/news-media/blogs/browse/2009/03/don%E2%80%99t-fear-fsync

Not exactly. That's talking about the sequence used for creating a file, plus a rename. When new WAL files are being created, I believe the ugly part of this is avoided. The path where WAL files are recycled using rename does seem to be the one with the most likely edge case.

The difficult case Ts'o's discussion is trying to satisfy involves creating a new file and then swapping it for an old one atomically. PostgreSQL never does that exactly. It creates new files, pads them with zeros, and then starts writing to them; it also renames old files that are already of the correct length. Combined with the fact that there are always fsyncs after writes to the files, this case really isn't exactly the same as any of the others people are complaining about.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
On 04/23/2011 09:58 AM, Matthew Woodcraft wrote:
> As far as I can make out, the current situation is that this fix (the
> auto_da_alloc mount option) doesn't work as advertised, and the ext4
> maintainers are not treating this as a bug.
>
> See https://bugzilla.kernel.org/show_bug.cgi?id=15910

I agree with the resolution that this isn't a bug. As pointed out there, XFS does the same thing, and this behavior isn't going away any time soon. Leaving behind zero-length files in situations where developers tried to optimize away a necessary fsync happens. Here's the part where the submitter goes wrong:

"We first added a fsync() call for each extracted file. But scattered fsyncs resulted in a massive performance degradation during package installation (factor 10 or more, some reported that it took over an hour to unpack a linux-headers-* package!) In order to reduce the I/O performance degradation, fsync calls were deferred..."

Stop right there; the slow path was the only one that had any hope of being correct. It can actually slow things by a factor of 100X or more, worst-case. "So, we currently have the choice between filesystem corruption or major performance loss": yes, you do. Writing files is tricky, and it can be either slow or safe. If you're going to avoid even trying to enforce the right thing here, you're going to get burned.

It's unfortunate that so many people are used to the speed you get in the common situation for a while now with ext3 and cheap hard drives: all writes are cached unsafely, but the filesystem resists a few bad behaviors. Much of the struggle where people say "this is so much slower, I won't put up with it" and try to code around it is futile, and it's hard to separate the attempts to find such optimizations from the legitimate complaints.
Anyway, you're right to point out that the filesystem is not necessarily going to save anyone from some of the tricky rename situations even with the improvements made to delayed allocation. They've fixed some of the worst behavior of the earlier implementation, but there are still potential issues in that area it seems. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
On Mon, Apr 25, 2011 at 5:00 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Stop right there; the slow path was the only one that had any hope of being
> correct.  It can actually slow things by a factor of 100X or more,
> worst-case.  "So, we currently have the choice between filesystem corruption
> or major performance loss": yes, you do.  Writing files is tricky and it
> can either be slow or safe.  If you're going to avoid even trying to enforce
> the right thing here, you're really going to get really burned.

Well, no.  That's like saying the whole database can't possibly process
transactions faster than the rate at which fsyncs can happen.  It's not
true, because we can process transactions in parallel and fsync a whole
bunch of them simultaneously.

The API Ts'o and company are suggesting is that if you want reasonable
performance you should create a thread for each file, fsync in that
thread, and then do your rename.  Hardly the sanest API one could
imagine.

And if you fail to do that, you don't just risk losing data.  You get a
filesystem state that *never* existed.  It's as if we said that if the
database crashes your transaction might be rolled back, it might be
committed, or we might just replace your data with zeros.  Huh?

-- 
greg
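One way the per-file fsync cost can be amortized--a sketch of my own, not anything from the thread or from any real API, under the assumption that the files all live in one directory--is to flush each file's data and then pay for a single fsync of the shared directory, which makes all of the new entries durable at once:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Fsync a batch of already-written files that share one parent
 * directory, then fsync the directory once so every new entry
 * reaches disk together.  Returns 0 on success, -1 on failure.
 * The function name and structure are illustrative only.
 */
static int
sync_batch(int fds[], int nfds, const char *dir)
{
    int dirfd, i;

    for (i = 0; i < nfds; i++)
        if (fsync(fds[i]) != 0)     /* flush each file's data blocks */
            return -1;

    dirfd = open(dir, O_RDONLY);    /* one metadata sync covers them all */
    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) != 0)
    {
        close(dirfd);
        return -1;
    }
    return close(dirfd);
}
```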
On Mon, Apr 25, 2011 at 8:26 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> On 04/24/2011 10:06 PM, Daniel Farina wrote:
>>
>> On Thu, Apr 21, 2011 at 8:51 PM, Greg Smith<greg@2ndquadrant.com> wrote:
>>
>>>
>>> There's still the "fsync'd a data block but not the directory entry yet"
>>> issue as fall-out from this too.  Why doesn't PostgreSQL run into this
>>> problem?  Because the exact code sequence used is this one:
>>>
>>> open
>>> write
>>> fsync
>>> close
>>>
>>> And Linux shouldn't ever screw that up, or the similar rename path.
>>> Here's
>>> what the close man page says, from http://linux.die.net/man/2/close :
>>>
>>
>> Theodore Ts'o addresses this *exact* sequence of events, and suggests
>> if you want that rename to definitely stick that you must fsync the
>> directory:
>>
>> http://www.linuxfoundation.org/news-media/blogs/browse/2009/03/don%E2%80%99t-fear-fsync
>>
>
> Not exactly.  That's talking about the sequence used for creating a file,
> plus a rename.  When new WAL files are being created, I believe the ugly
> part of this is avoided.  The path when WAL files are recycled using rename
> does seem to be the one with the most likely edge case.

Hmm, how do we avoid this in the creation case?  My current suspicion is
that there are cases where you can do open(afile), write(), fsync(),
then crash, and the file will not be linked--or at the very least is
*entitled* not to be linked--into its parent directory.

The recycling case also sucks.  Would it be insane to use the MTA
approach and just use chattr +D?  That also models the behavior of other
systems with synchronous directory modifications, which (maybe? could
very well be wrong) includes BSD.

-- 
fdr
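For the creation case being discussed, the portable belt-and-suspenders approach is the same directory fsync again: after open/write/fsync/close has made the data durable, fsync a descriptor for the parent directory so the link itself is too. A minimal sketch, assuming an invented helper name:

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/*
 * Create a file such that both its contents and its directory entry
 * are on disk before returning.  open/write/fsync/close alone only
 * covers the data; the extra fsync on the directory covers the link.
 * Returns 0 on success, -1 on failure.  Illustrative only.
 */
static int
create_file_durably(const char *path, const char *dir,
                    const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    int dirfd;

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* Now make the directory entry itself durable. */
    dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) != 0)
    {
        close(dirfd);
        return -1;
    }
    return close(dirfd);
}
```

chattr +D (or mounting with -o dirsync) would make directory updates synchronous and remove the need for the second fsync, at the cost of slowing every directory modification rather than just the ones that need durability.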
FYI, does wal.c need updated comments to explain the file system
semantics we expect, and how our code triggers them?

---------------------------------------------------------------------------

Greg Smith wrote:
> On 04/23/2011 09:58 AM, Matthew Woodcraft wrote:
> > As far as I can make out, the current situation is that this fix (the
> > auto_da_alloc mount option) doesn't work as advertised, and the ext4
> > maintainers are not treating this as a bug.
> >
> > See https://bugzilla.kernel.org/show_bug.cgi?id=15910
> >
>
> I agree with the resolution that this isn't a bug.  As pointed out
> there, XFS does the same thing, and this behavior isn't going away any
> time soon.  Leaving behind zero-length files in situations where
> developers tried to optimize away a necessary fsync happens.
>
> Here's the part where the submitter goes wrong:
>
> "We first added a fsync() call for each extracted file. But scattered
> fsyncs resulted in a massive performance degradation during package
> installation (factor 10 or more, some reported that it took over an hour
> to unpack a linux-headers-* package!) In order to reduce the I/O
> performance degradation, fsync calls were deferred..."
>
> Stop right there; the slow path was the only one that had any hope of
> being correct.  It can actually slow things by a factor of 100X or more,
> worst-case.  "So, we currently have the choice between filesystem
> corruption or major performance loss": yes, you do.  Writing files is
> tricky and it can either be slow or safe.  If you're going to avoid even
> trying to enforce the right thing here, you're really going to get
> burned.
>
> It's unfortunate that so many people are used to the speed you get in
> the common situation for a while now with ext3 and cheap hard drives:
> all writes are cached unsafely, but the filesystem resists a few bad
> behaviors.  Much of the struggle where people say "this is so much
> slower, I won't put up with it" and try to code around it is futile, and
> it's hard to separate out the attempts to find such optimizations from
> the legitimate complaints.
>
> Anyway, you're right to point out that the filesystem is not necessarily
> going to save anyone from some of the tricky rename situations even with
> the improvements made to delayed allocation.  They've fixed some of the
> worst behavior of the earlier implementation, but there are still
> potential issues in that area it seems.
>
> --
> Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +