Re: fsync-pgdata-on-recovery tries to write to more files than previously - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: fsync-pgdata-on-recovery tries to write to more files than previously
Date
Msg-id 20150526015430.GT26667@tamriel.snowman.net
In response to Re: fsync-pgdata-on-recovery tries to write to more files than previously  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: fsync-pgdata-on-recovery tries to write to more files than previously
List pgsql-hackers
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, May 25, 2015 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
>>>> Not really sure I see that as helping.
>>> On most OSs, except windows and some obscure unixes, a readonly fd is
>>> allowed to fsync a file.
>> Perhaps, but if we didn't have permission to write the file, it's hard to
>> argue that it's our responsibility to fsync it.  So this seems like it's
>> adding complexity without really adding any safety.
>
> I agree.  I think ignoring fsync failures is a very sensible approach.
> If the files are not writable, they're probably not ours.  If they are
> not writable but somehow still ours, we probably can't have written
> them before the crash, either.  If they are ours and we somehow wrote
> to them before the crash, and then while the system was down they were
> made inaccessible, and then the database was restarted, then we're
> well into the territory where the system administrator has done
> something that we cannot possibly be expected to cope with ... but
> ignoring the fsync isn't very likely to cause any real problems even
> here.  If we really did modify those blocks recently, recovery will
> try to redo the changes, and we'll fail then anyway.  So what's the
> problem?
>
> I agree with Tom's concern that if we have two lists of directories,
> they may get out of sync.  We could probably merge the two lists
> somehow, but I'm not really seeing the point, since Tom's blanket
> approach should work just fine.

I certainly see your point, but Tom also pointed out that it's not great
to ignore failures during this phase:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Greg Stark <stark@mit.edu> writes:
> > What exactly is failing?
> > Is it that fsync is returning -1 ?
> According to the original report from Christoph Berg, it was open()
> not fsync() that was failing, at least in permissions-based cases.
>
> I'm not sure if we should just uniformly ignore all failures in this
> phase.  That would have the merit of clearly not creating any new
> startup failure cases compared to the previous code, but as you say
> sometimes it might mean ignoring real problems.

If we accept this, then we still have to have the lists, to decide what
to fail on and what to ignore.  If we're going to have said lists though,
I don't really see the point in fsync'ing things we're pretty confident
aren't ours.
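
To make the "lists" idea concrete, here is a minimal sketch of a
membership check against a list of known directories.  The function name
path_is_ours and the array known_dirs are hypothetical, and the directory
names are only an illustrative subset of what lives under $PGDATA, not
the list anyone in this thread proposed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative subset of directories PostgreSQL keeps under $PGDATA.
 * Hypothetical list for this sketch only; a real list would have to be
 * kept in sync with whatever else enumerates these directories.
 */
static const char *const known_dirs[] = {
    "base", "global", "pg_xlog", "pg_clog", "pg_tblspc"
};

/*
 * Return true if relpath (relative to $PGDATA) is one of the known
 * directories or lies underneath one of them.
 */
static bool
path_is_ours(const char *relpath)
{
    size_t      i;

    for (i = 0; i < sizeof(known_dirs) / sizeof(known_dirs[0]); i++)
    {
        size_t      len = strlen(known_dirs[i]);

        /* Match whole path components only, so "basement" != "base". */
        if (strncmp(relpath, known_dirs[i], len) == 0 &&
            (relpath[len] == '/' || relpath[len] == '\0'))
            return true;
    }
    return false;
}
```

With a check like this, a stray file such as "server.key" in $PGDATA
would simply be skipped rather than opened and fsync'ed.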

Further, in any of these cases, we have to decide which failure cases
are "fatal" and which are not; being in the list or not isn't the only
criterion, it's just one part of the overall decision.  We also need to
consider what return value we get back for which system calls, all of
which may be entirely system dependent, meaning we may have to deal with
portability issues here too.
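
As a rough sketch of what such a decision could look like, the following
combines "is it ours?" with the errno we got back.  Everything here is an
assumption for illustration: the names SyncVerdict, try_fsync and
classify_sync_failure are made up, and the particular mapping of errno
values to verdicts is just one possible policy, not something agreed on
in this thread.  Only open(), fsync(), close() and the errno constants
are standard POSIX.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

typedef enum
{
    SYNC_OK,                    /* nothing went wrong */
    SYNC_IGNORE,                /* failure we choose to skip (maybe log) */
    SYNC_FATAL                  /* failure that should stop startup */
} SyncVerdict;

/*
 * Open a file read-only and fsync it.  Returns 0 on success, else the
 * errno from whichever call failed.  Note that fsync on a read-only fd
 * is not accepted on every platform.
 */
static int
try_fsync(const char *path)
{
    int         fd = open(path, O_RDONLY);
    int         err = 0;

    if (fd < 0)
        return errno;
    if (fsync(fd) != 0)
        err = errno;
    close(fd);
    return err;
}

/*
 * Hypothetical policy: list membership is one input, errno is the other.
 */
static SyncVerdict
classify_sync_failure(bool is_ours, int err)
{
    if (err == 0)
        return SYNC_OK;

    switch (err)
    {
        case ENOENT:
            /* File vanished between the directory scan and open(). */
            return SYNC_IGNORE;
        case EACCES:
        case EPERM:
        case EROFS:
            /* Permission-style failures: only alarming for our own files. */
            return is_ours ? SYNC_FATAL : SYNC_IGNORE;
        default:
            /* EIO and anything unexpected: fatal if the file is ours. */
            return is_ours ? SYNC_FATAL : SYNC_IGNORE;
    }
}
```

Which errno values actually show up, and for which call, is exactly the
system-dependent part; the switch above would need checking per platform.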

Then there are other interesting considerations like what happens with
an NFS mount (as Greg mentioned), or perhaps what happens when it's a
MAC violation (eg: SELinux).  Generally speaking, those will also return
an error code which we can contemplate, but it'll still create annoying
log noise for people running in such environments.  Perhaps that would
encourage them to move whatever files they have out of $PGDATA, which is
likely to be a good decision, but that may not always be possible.

Thanks!

Stephen
