Thread: fsync-pgdata-on-recovery tries to write to more files than previously

fsync-pgdata-on-recovery tries to write to more files than previously

From
Christoph Berg
Date:
Hi,

the new fsync-pgdata-on-recovery code tries to open all files using
O_RDWR. At least on 9.1, this can make recovery fail:

* launch postgres, hit ^\ (or otherwise shut down uncleanly)
* touch foo; chmod 444 foo
* launch postgres

LOG:  database system was interrupted; last known up at 2015-05-23 19:18:36 CEST
FATAL:  could not open file "/home/cbe/9.1/foo": Permission denied
LOG:  startup process (PID 27305) exited with exit code 1
LOG:  aborting startup due to startup process failure

The code on 9.4 looks similar to me, but I couldn't trigger the
problem there.

I think this is a real-world problem:

1) In older releases, the SSL certs were read from the data directory,
and at least the default Debian installation creates symlinks from
PGDATA/server.* to /etc/ssl/ where PostgreSQL can't write

2) It's probably a pretty common scenario that the root user will edit
postgresql.conf, and make backups or create other random files there
that are not writable. Even a non-writable postgresql.conf itself or
recovery.conf was not a problem previously.

To me, this is a serious regression because it prevents automatic
startup of a server that would otherwise just run.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/

Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Christoph Berg
Date:
Re: To PostgreSQL Hackers 2015-05-23 <20150523172627.GA24277@msg.df7cb.de>
> Hi,
>
> the new fsync-pgdata-on-recovery code tries to open all files using
> O_RDWR. At least on 9.1, this can make recovery fail:
>
> * launch postgres, hit ^\ (or otherwise shut down uncleanly)
> * touch foo; chmod 444 foo
> * launch postgres
>
> LOG:  database system was interrupted; last known up at 2015-05-23 19:18:36 CEST
> FATAL:  could not open file "/home/cbe/9.1/foo": Permission denied
> LOG:  startup process (PID 27305) exited with exit code 1
> LOG:  aborting startup due to startup process failure
>
> The code on 9.4 looks similar to me, but I couldn't trigger the
> problem there.

Correction: 9.4 is equally broken. (I was still running 9.4.1 when I
tried first.)

> I think this is a real-world problem:
>
> 1) In older releases, the SSL certs were read from the data directory,
> and at least the default Debian installation creates symlinks from
> PGDATA/server.* to /etc/ssl/ where PostgreSQL can't write
>
> 2) It's probably a pretty common scenario that the root user will edit
> postgresql.conf, and make backups or create other random files there
> that are not writable. Even a non-writable postgresql.conf itself or
> recovery.conf was not a problem previously.

3) The .postgresql.conf.swp files created by (root's) vim are 0600.

> To me, this is a serious regression because it prevents automatic
> startup of a server that would otherwise just run.

Christoph
--
cb@df7cb.de | http://www.df7cb.de/

Christoph Berg <myon@debian.org> writes:
> the new fsync-pgdata-on-recovery code tries to open all files using
> O_RDWR. At least on 9.1, this can make recovery fail:

Hm.  I wonder whether it would be all right to just skip files for which
we get EPERM on open().  The argument being that if we can't write to the
file, we should not be held responsible for fsync'ing it either.  But
I'm not sure whether EPERM would be the only relevant errno, or whether
there are cases where this would mask real problems.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Christoph Berg
Date:
Re: Tom Lane 2015-05-23 <2284.1432413209@sss.pgh.pa.us>
> Christoph Berg <myon@debian.org> writes:
> > the new fsync-pgdata-on-recovery code tries to open all files using
> > O_RDWR. At least on 9.1, this can make recovery fail:
>
> Hm.  I wonder whether it would be all right to just skip files for which
> we get EPERM on open().  The argument being that if we can't write to the
> file, we should not be held responsible for fsync'ing it either.  But
> I'm not sure whether EPERM would be the only relevant errno, or whether
> there are cases where this would mask real problems.

Maybe logging WARNINGs instead of FATAL would be enough of a fix?

Christoph
--
cb@df7cb.de | http://www.df7cb.de/



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On 2015-05-23 16:33:29 -0400, Tom Lane wrote:
> Christoph Berg <myon@debian.org> writes:
> > the new fsync-pgdata-on-recovery code tries to open all files using
> > O_RDWR. At least on 9.1, this can make recovery fail:
> 
> Hm.  I wonder whether it would be all right to just skip files for which
> we get EPERM on open().  The argument being that if we can't write to the
> file, we should not be held responsible for fsync'ing it either.  But
> I'm not sure whether EPERM would be the only relevant errno, or whether
> there are cases where this would mask real problems.

We could even try doing the a fsync with a readonly fd as a fallback,
but that's also pretty hacky.

How about, to avoid masking actual problems, we have a more
differentiated logic for the toplevel data directory? I think we could
just skip all non-directory files in there data_directory itself. None
of the files in the toplevel directory, with the exception of
postgresql.auto.conf, will ever get written to by PG itself.  And if
there's readonly files somewhere in a subdirectory, I won't feel
particularly bad.

Greetings,

Andres Freund



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Christoph Berg
Date:
Re: Andres Freund 2015-05-24 <20150524005245.GD32396@alap3.anarazel.de>
> How about, to avoid masking actual problems, we have a more
> differentiated logic for the toplevel data directory? I think we could
> just skip all non-directory files in there data_directory itself. None
> of the files in the toplevel directory, with the exception of
> postgresql.auto.conf, will ever get written to by PG itself.  And if
> there's readonly files somewhere in a subdirectory, I won't feel
> particularly bad.

I like that idea.

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Christoph Berg
Date:
Re: To Andres Freund 2015-05-24 <20150524075244.GB27048@msg.df7cb.de>
> Re: Andres Freund 2015-05-24 <20150524005245.GD32396@alap3.anarazel.de>
> > How about, to avoid masking actual problems, we have a more
> > differentiated logic for the toplevel data directory? I think we could
> > just skip all non-directory files in there data_directory itself. None
> > of the files in the toplevel directory, with the exception of
> > postgresql.auto.conf, will ever get written to by PG itself.  And if
> > there's readonly files somewhere in a subdirectory, I won't feel
> > particularly bad.

pg_log/ is also admin domain. What about only recursing into
well-known directories + postgresql.auto.conf?

(I've also been wondering if pg_basebackup shouldn't skip pg_log, but
that's a different topic...)

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/



Christoph Berg <myon@debian.org> writes:
> Re: To Andres Freund 2015-05-24 <20150524075244.GB27048@msg.df7cb.de>
>> Re: Andres Freund 2015-05-24 <20150524005245.GD32396@alap3.anarazel.de>
>>> How about, to avoid masking actual problems, we have a more
>>> differentiated logic for the toplevel data directory?

> pg_log/ is also admin domain. What about only recursing into
> well-known directories + postgresql.auto.conf?

The idea that this code would know exactly what's what under $PGDATA
scares me.  I can positively guarantee that it would diverge from reality
over time, and nobody would notice until it ate their data, failed to
start, or otherwise behaved undesirably.

pg_log/ is a perfect example, because that is not a hard-wired directory
name; somebody could point the syslogger at a different place very easily.
Wiring in special behavior for that name is just wrong.

I would *much* rather have a uniform rule for how to treat each file
the scan comes across.  It might take some tweaking to get to one that
works well; but once we did, we could have some confidence that it
wouldn't break later.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Christoph Berg <myon@debian.org> writes:
>> Re: To Andres Freund 2015-05-24 <20150524075244.GB27048@msg.df7cb.de>
>>> Re: Andres Freund 2015-05-24
><20150524005245.GD32396@alap3.anarazel.de>
>>>> How about, to avoid masking actual problems, we have a more
>>>> differentiated logic for the toplevel data directory?
>
>> pg_log/ is also admin domain. What about only recursing into
>> well-known directories + postgresql.auto.conf?
>
>The idea that this code would know exactly what's what under $PGDATA
>scares me.  I can positively guarantee that it would diverge from
>reality
>over time, and nobody would notice until it ate their data, failed to
>start, or otherwise behaved undesirably.
>
>pg_log/ is a perfect example, because that is not a hard-wired
>directory
>name; somebody could point the syslogger at a different place very
>easily.
>Wiring in special behavior for that name is just wrong.
>
>I would *much* rather have a uniform rule for how to treat each file
>the scan comes across.  It might take some tweaking to get to one that
>works well; but once we did, we could have some confidence that it
>wouldn't break later.

If we'd merge it with initdb's list I think I'd not be that bad. I'm thinking of some header declaring it, roughly like
thermgr list.
 

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.



Andres Freund <andres@anarazel.de> writes:
> On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Christoph Berg <myon@debian.org> writes:
>>> pg_log/ is also admin domain. What about only recursing into
>>> well-known directories + postgresql.auto.conf?

>> The idea that this code would know exactly what's what under $PGDATA
>> scares me.  I can positively guarantee that it would diverge from
>> reality over time, and nobody would notice until it ate their data,
>> failed to start, or otherwise behaved undesirably.
>> 
>> pg_log/ is a perfect example, because that is not a hard-wired
>> directory name; somebody could point the syslogger at a different place
>> very easily.  Wiring in special behavior for that name is just wrong.
>> 
>> I would *much* rather have a uniform rule for how to treat each file
>> the scan comes across.  It might take some tweaking to get to one that
>> works well; but once we did, we could have some confidence that it
>> wouldn't break later.

> If we'd merge it with initdb's list I think I'd not be that bad. I'm thinking of some header declaring it, roughly
likethe rmgr list.
 

pg_log/ is a counterexample to that idea too; initdb doesn't know about it
(and shouldn't).
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On 2015-05-25 13:38:01 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > If we'd merge it with initdb's list I think I'd not be that bad. I'm thinking of some header declaring it, roughly
likethe rmgr list.
 
> 
> pg_log/ is a counterexample to that idea too; initdb doesn't know about it
> (and shouldn't).

The idea would be to *only* directories that initdb knows about. Since
that's where the valuables are. So I don't see how pg_log would be a
counterexample.



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Stephen Frost
Date:
* Andres Freund (andres@anarazel.de) wrote:
> On 2015-05-25 13:38:01 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > If we'd merge it with initdb's list I think I'd not be that bad. I'm thinking of some header declaring it,
roughlylike the rmgr list. 
> >
> > pg_log/ is a counterexample to that idea too; initdb doesn't know about it
> > (and shouldn't).
>
> The idea would be to *only* directories that initdb knows about. Since
> that's where the valuables are. So I don't see how pg_log would be a
> counterexample.

Indeed, that wouldn't be included in the list of things to fsync and it
isn't listed in initdb, so that works.

I've not followed this thread all that closely, but I do tend to agree
with the idea of "only try to mess with files that are *clearly* ours to
mess with."
Thanks!
    Stephen

Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
> That seems overly complicated, for my 2c at least.  I don't particularly
> like trying to mess with files that might be rightfully considered "not
> ours" either.

I'd not consider an fsync to be "messing" with files, especially if
they're in PGDATA.

> > Additionally we could attempt to fsync with a readonly fd before trying
> > the read-write fd...
> 
> Not really sure I see that as helping.

On most OSs, except windows and some obscure unixes, a readonly fd is
allowed to fsync a file.

Greetings,

Andres Freund



Stephen Frost <sfrost@snowman.net> writes:
> I've not followed this thread all that closely, but I do tend to agree
> with the idea of "only try to mess with files that are *clearly* ours to
> mess with."

Well, that opens us to errors of omission, ie failing to fsync things we
should have.  Maybe that's an okay risk, but personally I'd judge that
"fsync everything and ignore (some?) errors" is probably a more robust
approach over time.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Stephen Frost
Date:
* Andres Freund (andres@anarazel.de) wrote:
> On 2015-05-25 14:02:28 -0400, Tom Lane wrote:
> > Stephen Frost <sfrost@snowman.net> writes:
> > > I've not followed this thread all that closely, but I do tend to agree
> > > with the idea of "only try to mess with files that are *clearly* ours to
> > > mess with."
> >
> > Well, that opens us to errors of omission, ie failing to fsync things we
> > should have.
>
> Is that really that likely? I mean we don't normally add data to the top
> level directory itself, and subdirectories hopefully won't be added
> except via initdb?

That feels like a pretty low risk, to me at least.  Certainly better
than having a failure, like what's going on now.

> > Maybe that's an okay risk, but personally I'd judge that
> > "fsync everything and ignore (some?) errors" is probably a more robust
> > approach over time.
>
> The over-the-top approach would be to combine the two. Error out in
> directories that are in the initdb list, and ignore permission errors
> otherwise...

That seems overly complicated, for my 2c at least.  I don't particularly
like trying to mess with files that might be rightfully considered "not
ours" either.  This all makes me really wonder about
postgresql.auto.conf too..  Clearly, on the one hand, we consider that
"our" file, and so we should error out if we don't own it, but on the
other hand, I've specifically recommended making that file owned by
root to some folks, to avoid DBAs playing with the startup-time
settings..

> Additionally we could attempt to fsync with a readonly fd before trying
> the read-write fd...

Not really sure I see that as helping.
Thanks!
    Stephen

Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Stephen Frost
Date:
* Andres Freund (andres@anarazel.de) wrote:
> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
> > That seems overly complicated, for my 2c at least.  I don't particularly
> > like trying to mess with files that might be rightfully considered "not
> > ours" either.
>
> I'd not consider an fsync to be "messing" with files, especially if
> they're in PGDATA.

I'm not entirely sure I agree.

> > > Additionally we could attempt to fsync with a readonly fd before trying
> > > the read-write fd...
> >
> > Not really sure I see that as helping.
>
> On most OSs, except windows and some obscure unixes, a readonly fd is
> allowed to fsync a file.

I wouldn't have thought otherwise, given that you were suggesting it,
but there's no guarantee we're going to be allowed to read it either, or
even access the directory the symlink points to, etc..
Thanks,
    Stephen

Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On 2015-05-25 14:02:28 -0400, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > I've not followed this thread all that closely, but I do tend to agree
> > with the idea of "only try to mess with files that are *clearly* ours to
> > mess with."
>
> Well, that opens us to errors of omission, ie failing to fsync things we
> should have.

Is that really that likely? I mean we don't normally add data to the top
level directory itself, and subdirectories hopefully won't be added
except via initdb?

> Maybe that's an okay risk, but personally I'd judge that
> "fsync everything and ignore (some?) errors" is probably a more robust
> approach over time.

The over-the-top approach would be to combine the two. Error out in
directories that are in the initdb list, and ignore permission errors
otherwise...

Additionally we could attempt to fsync with a readonly fd before trying
the read-write fd...



Andres Freund <andres@anarazel.de> writes:
> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
>>> Additionally we could attempt to fsync with a readonly fd before trying
>>> the read-write fd...

>> Not really sure I see that as helping.

> On most OSs, except windows and some obscure unixes, a readonly fd is
> allowed to fsync a file.

Perhaps, but if we didn't have permission to write the file, it's hard to
argue that it's our responsibility to fsync it.  So this seems like it's
adding complexity without really adding any safety.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > I've not followed this thread all that closely, but I do tend to agree
> > with the idea of "only try to mess with files that are *clearly* ours to
> > mess with."
> 
> Well, that opens us to errors of omission, ie failing to fsync things we
> should have.  Maybe that's an okay risk, but personally I'd judge that
> "fsync everything and ignore (some?) errors" is probably a more robust
> approach over time.

How is it possible to make errors of omission?  The list of directories
in initdb is the complete set of directories that are created for a
newly-initdb'd database, no?  Surely there can't be a database that
contains vital directories that are not created there?  See "subdirs"
static in initdb.c.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Well, that opens us to errors of omission, ie failing to fsync things we
>> should have.  Maybe that's an okay risk, but personally I'd judge that
>> "fsync everything and ignore (some?) errors" is probably a more robust
>> approach over time.

> How is it possible to make errors of omission?  The list of directories
> in initdb is the complete set of directories that are created for a
> newly-initdb'd database, no?  Surely there can't be a database that
> contains vital directories that are not created there?  See "subdirs"
> static in initdb.c.

Easy: all you need is to suppose that some of the plain files at top level
of $PGDATA ought to be fsync'd.  (I'm fairly sure for example that we took
steps awhile back to force postmaster.pid to be fsync'd.)  If there is a
distinction between the fsync requirements of top-level files and
everything else, it is completely accidental and not to be relied on.

And from the other direction, where exactly is it written that
distros/users will only create problematic files at the top level of
$PGDATA?  I'd have zero confidence in such an assertion applied to
tablespace directories, for sure.
        regards, tom lane



On 05/25/2015 01:21 PM, Tom Lane wrote:
> And from the other direction, where exactly is it written that
> distros/users will only create problematic files at the top level of
> $PGDATA?  I'd have zero confidence in such an assertion applied to
> tablespace directories, for sure.

Yes, absolutely.

For example: Cstore_FDW now puts its files in a subdirectory of the
tablespace directory, which for a really large cstore table, the user
will want to symlink somewhere else, or might create as a mount.  Such a
user might then, for example, annotate this with a README.txt which
happens to be root/root perms.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



On Mon, May 25, 2015 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
>>>> Additionally we could attempt to fsync with a readonly fd before trying
>>>> the read-write fd...
>
>>> Not really sure I see that as helping.
>
>> On most OSs, except windows and some obscure unixes, a readonly fd is
>> allowed to fsync a file.
>
> Perhaps, but if we didn't have permission to write the file, it's hard to
> argue that it's our responsibility to fsync it.  So this seems like it's
> adding complexity without really adding any safety.

I agree.  I think ignoring fsync failures is a very sensible approach.
If the files are not writable, they're probably not ours.  If they are
not writable but somehow still ours, we probably can't have written
them before the crash, either.  If they are ours and we somehow wrote
to them before the crash, and then while the system was down they were
made inaccessible, and then the database was restarted, then we're
well into the territory where the system administrator has done
something that we cannot possibly be expected to cope with ... but
ignoring the fsync isn't very likely to cause any real problems even
here.  If we really did modify those blocks recently, recovery will
try to redo the changes, and we'll fail then anyway.  So what's the
problem?

I agree with Tom's concern that if we have two lists of directories,
they may get out of sync.  We could probably merge the two lists
somehow, but I'm not really seeing the point, since Tom's blanket
approach should work just fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Stephen Frost
Date:
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, May 25, 2015 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote:
>>>> Not really sure I see that as helping.
>>> On most OSs, except windows and some obscure unixes, a readonly fd is
>>> allowed to fsync a file.
>> Perhaps, but if we didn't have permission to write the file, it's hard to
>> argue that it's our responsibility to fsync it.  So this seems like it's
>> adding complexity without really adding any safety.
>
> I agree.  I think ignoring fsync failures is a very sensible approach.
> If the files are not writable, they're probably not ours.  If they are
> not writable but somehow still ours, we probably can't have written
> them before the crash, either.  If they are ours and we somehow wrote
> to them before the crash, and then while the system was down they were
> made inaccessible, and then the database was restarted, then we're
> well into the territory where the system administrator has done
> something that we cannot possibly be expected to cope with ... but
> ignoring the fsync isn't very likely to cause any real problems even
> here.  If we really did modify those blocks recently, recovery will
> try to redo the changes, and we'll fail then anyway.  So what's the
> problem?
>
> I agree with Tom's concern that if we have two lists of directories,
> they may get out of sync.  We could probably merge the two lists
> somehow, but I'm not really seeing the point, since Tom's blanket
> approach should work just fine.

I certainly see your point, but Tom also pointed out that it's not great
to ignore failures during this phase:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Greg Stark <stark@mit.edu> writes:
> > What exactly is failing?
> > Is it that fsync is returning -1 ?
> According to the original report from Christoph Berg, it was open()
> not fsync() that was failing, at least in permissions-based cases.
>
> I'm not sure if we should just uniformly ignore all failures in this
> phase.  That would have the merit of clearly not creating any new
> startup failure cases compared to the previous code, but as you say
> sometimes it might mean ignoring real problems.

If we accept this, then we still have to have the lists, to decide what
to fail on and what to ignore.  If we're going to have said lists tho, I
don't really see the point in fsync'ing things we're pretty confident
aren't ours.

Further, in any of these cases, we have to decide which failure cases
are ones that are "fatal" and which are not- being in the list or not
isn't the only criteria, it's just one part of the overall decision.  We
also need to consider what return value we get back for which system
calls, all of which may entirely be system dependent, meanly we may have
to deal with portability issues here too.

Then there are other interesting considerations like what happens with
an NFS mount (as Greg mentioned), or perhaps what happens when it's a
MAC violation (eg: SELinux).  Generally speaking, those will also return
an error code which we can contemplate, but it'll still create annoying
log noise for people running in such environments.  Perhaps that would
encourage them to move whatever files they have out of $PGDATA, which is
likely to be a good decision, but that may not always be possible..
Thanks!    Stephen

Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On 2015-05-25 21:33:03 -0400, Robert Haas wrote:
> On Mon, May 25, 2015 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Perhaps, but if we didn't have permission to write the file, it's hard to
> > argue that it's our responsibility to fsync it.  So this seems like it's
> > adding complexity without really adding any safety.
> 
> I agree.  I think ignoring fsync failures is a very sensible approach.
> If the files are not writable, they're probably not ours.

The reason we started discussing this is because Tom had the - quite
reasonable - concern that this might not solely be a problem of EACCESS,
but that there could be other errors that we need to ignore to not fail
spuriously.  Say a symlink goes to a binary, which is currently being
executed: ETXTBSY. Or the file is in a readonly filesystem: EROFS.  So
we'd need to ignore a lot of errors, possibly ignoring valid ones.

I personally can see why people will put things in PGDATA itself, if you
put unreadable stuff in some subdirectory that you didn't create
yourself, I see much less reason to tolerate that.

Another thing is whether we should handle a recursive symlink in pgdata?
I personally think not, but...

It's also not just as simple as making fsync_fname fail gracefully upon
EACCESS - the opendir() could fail just as well.

Greetings,

Andres Freund



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Abhijit Menon-Sen
Date:
At 2015-05-26 03:54:51 +0200, andres@anarazel.de wrote:
>
> Say a symlink goes to a binary, which is currently being executed:
> ETXTBSY. Or the file is in a readonly filesystem: EROFS.  So we'd
> need to ignore a lot of errors, possibly ignoring valid ones.

Right. That's why I started out by being conservative and following only
the "expected" symlinks in pg_tblspc (as other parts of the code do).

> Another thing is whether we should handle a recursive symlink in
> pgdata? I personally think not, but...

I think not too.

> It's also not just as simple as making fsync_fname fail gracefully
> upon EACCESS - the opendir() could fail just as well.

I'll post a proposed patch shortly.

-- Abhijit



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Stephen Frost
Date:
* Abhijit Menon-Sen (ams@2ndQuadrant.com) wrote:
> At 2015-05-26 03:54:51 +0200, andres@anarazel.de wrote:
> > Another thing is whether we should handle a recursive symlink in
> > pgdata? I personally think not, but...
>
> I think not too.

Yikes..  That's definitely the kind of thing that's why I worry about
the whole 'fsync everything' idea- what if I symlink to /?  I've
certainly done that before from my home directory for ease of use and I
imagine there are people out there who have similar setups where they
sftp as the PG user and use the symlink to more easily navigate
somewhere else.  We have to realize that, on at least some systems,
PGDATA could be the postgres user's home directory too.  That's not the
case on Debian-based systems today, but I think it might have been
before Debian had the multi-cluster tooling.

> > It's also not just as simple as making fsync_fname fail gracefully
> > upon EACCESS - the opendir() could fail just as well.
>
> I'll post a proposed patch shortly.

Thanks!
Stephen

On Mon, May 25, 2015 at 9:54 PM, Stephen Frost <sfrost@snowman.net> wrote:
> I certainly see your point, but Tom also pointed out that it's not great
> to ignore failures during this phase:
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Greg Stark <stark@mit.edu> writes:
>> > What exactly is failing?
>> > Is it that fsync is returning -1 ?
>> According to the original report from Christoph Berg, it was open()
>> not fsync() that was failing, at least in permissions-based cases.
>>
>> I'm not sure if we should just uniformly ignore all failures in this
>> phase.  That would have the merit of clearly not creating any new
>> startup failure cases compared to the previous code, but as you say
>> sometimes it might mean ignoring real problems.
>
> If we accept this, then we still have to have the lists, to decide what
> to fail on and what to ignore.  If we're going to have said lists tho, I
> don't really see the point in fsync'ing things we're pretty confident
> aren't ours.

No, that's not right at all.  The idea, at least as I understand it,
would be decide which errors to ignore by looking at the error code,
not by looking at which file is involved.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Mon, May 25, 2015 at 9:54 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-05-25 21:33:03 -0400, Robert Haas wrote:
>> On Mon, May 25, 2015 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Perhaps, but if we didn't have permission to write the file, it's hard to
>> > argue that it's our responsibility to fsync it.  So this seems like it's
>> > adding complexity without really adding any safety.
>>
>> I agree.  I think ignoring fsync failures is a very sensible approach.
>> If the files are not writable, they're probably not ours.
>
> The reason we started discussing this is because Tom had the - quite
> reasonable - concern that this might not solely be a problem of EACCESS,
> but that there could be other errors that we need to ignore to not fail
> spuriously.  Say a symlink goes to a binary, which is currently being
> executed: ETXTBSY. Or the file is in a readonly filesystem: EROFS.  So
> we'd need to ignore a lot of errors, possibly ignoring valid ones.

But ignoring those errors wouldn't compromise data integrity, either.
So let's just ignore (but log) all errors: then we'll be demonstrably
no worse off than we were before this patch went in.  If it turns out
that there's some particular case where ignoring errors DOES
compromise data integrity, then we can plug that hole surgically when
somebody reports it.

Anything we do short of making all errors in this area non-fatal is
going to leave behind startup-failure cases that exist today, and we
have no evidence at this time that such startup failures would be
justified by any actual data loss risk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> Anything we do short of making all errors in this area non-fatal is
> going to leave behind startup-failure cases that exist today, and we
> have no evidence at this time that such startup failures would be
> justified by any actual data loss risk.

Yeah.  Perhaps I missed it, but was the original patch motivated by
actual failures that had been seen in the field, or was it just a
hypothetical concern?  Certainly, any actual failures of that sort
are few and far between compared to the number of problems we now
realize the patch introduced.

Also, we need to discuss how hard walkdir() needs to try to avoid
throwing errors of its own.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andrew Dunstan
Date:
On 05/26/2015 08:05 AM, Robert Haas wrote:
> On Mon, May 25, 2015 at 9:54 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> I certainly see your point, but Tom also pointed out that it's not great
>> to ignore failures during this phase:
>>
>> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>>> Greg Stark <stark@mit.edu> writes:
>>>> What exactly is failing?
>>>> Is it that fsync is returning -1 ?
>>> According to the original report from Christoph Berg, it was open()
>>> not fsync() that was failing, at least in permissions-based cases.
>>>
>>> I'm not sure if we should just uniformly ignore all failures in this
>>> phase.  That would have the merit of clearly not creating any new
>>> startup failure cases compared to the previous code, but as you say
>>> sometimes it might mean ignoring real problems.
>> If we accept this, then we still have to have the lists, to decide what
>> to fail on and what to ignore.  If we're going to have said lists tho, I
>> don't really see the point in fsync'ing things we're pretty confident
>> aren't ours.
> No, that's not right at all.  The idea, at least as I understand it,
> would be decide which errors to ignore by looking at the error code,
> not by looking at which file is involved.
>


OK, I'm late to the party. But why exactly are we syncing absolutely 
everything? That seems over-broad.

And might it be better to check that we can open each file using 
access() than calling open() and looking at the error code?

cheers

andrew




Andrew Dunstan <andrew@dunslane.net> writes:
> OK, I'm late to the party. But why exactly are we syncing absolutely 
> everything? That seems over-broad.

If we try to be selective, we risk errors of omission, which no one would
ever notice until someone's data got eaten in a low-probability crash
scenario.  It seems more robust (at least to me) to fsync everything we
can find.  That does require more thought about error cases than went
into the original patch ... but I think that we need more thought about
error cases even if we do try to be selective.

One thing perhaps we *should* be selective about, though, is which
symlinks we try to follow.  I think that a good case could be made
for ignoring symlinks everywhere except in the pg_tablespace directory.
If we did, that would all by itself take care of the Debian scenario,
if I understand that case correctly.

> And might it be better to check that we can open each file using 
> access() than calling open() and looking at the error code?

Don't really see the point; that's just an extra step, and access()
won't exactly prove you can open the file, anyway.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andrew Dunstan
Date:
On 05/26/2015 11:58 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> OK, I'm late to the party. But why exactly are we syncing absolutely
>> everything? That seems over-broad.
> If we try to be selective, we risk errors of omission, which no one would
> ever notice until someone's data got eaten in a low-probability crash
> scenario.  It seems more robust (at least to me) to fsync everything we
> can find.  That does require more thought about error cases than went
> into the original patch ... but I think that we need more thought about
> error cases even if we do try to be selective.
>
> One thing perhaps we *should* be selective about, though, is which
> symlinks we try to follow.  I think that a good case could be made
> for ignoring symlinks everywhere except in the pg_tablespace directory.
> If we did, that would all by itself take care of the Debian scenario,
> if I understand that case correctly.

People have symlinked the xlog directory. I've done it myself in the 
past. A better rule might be to ignore symlinks unless either they are 
in pg_tblspc or they are in the data directory and their name starts 
with "pg_".

cheers

andrew





Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Magnus Hagander
Date:
On Tue, May 26, 2015 at 6:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote:

On 05/26/2015 11:58 AM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
OK, I'm late to the party. But why exactly are we syncing absolutely
everything? That seems over-broad.
If we try to be selective, we risk errors of omission, which no one would
ever notice until someone's data got eaten in a low-probability crash
scenario.  It seems more robust (at least to me) to fsync everything we
can find.  That does require more thought about error cases than went
into the original patch ... but I think that we need more thought about
error cases even if we do try to be selective.

One thing perhaps we *should* be selective about, though, is which
symlinks we try to follow.  I think that a good case could be made
for ignoring symlinks everywhere except in the pg_tablespace directory.
If we did, that would all by itself take care of the Debian scenario,
if I understand that case correctly.

People have symlinked the xlog directory. I've done it myself in the past. A better rule might be to ignore symlinks unless either they are in pg_tblspc or they are in the data directory and their name starts with "pg_".

Not just "people". initdb will symlink the xlog directory if you use -x. 

--
Andrew Dunstan <andrew@dunslane.net> writes:
> On 05/26/2015 11:58 AM, Tom Lane wrote:
>> One thing perhaps we *should* be selective about, though, is which
>> symlinks we try to follow.  I think that a good case could be made
>> for ignoring symlinks everywhere except in the pg_tablespace directory.
>> If we did, that would all by itself take care of the Debian scenario,
>> if I understand that case correctly.

> People have symlinked the xlog directory.

Good point, we need to cover that case.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On 2015-05-26 10:41:12 -0400, Tom Lane wrote:
> Yeah.  Perhaps I missed it, but was the original patch motivated by
> actual failures that had been seen in the field, or was it just a
> hypothetical concern?

I'd mentioned that it might be a good idea to do this while
investingating a bug with unlogged tables that several parties had
reported. That patch has been fixed in a more granular fashion. From
there it took on kind of a life on its own.

It is somewhat interesting that similar code has been used in
pg_upgrade, via initdb -S, for a while now, without, to my knowledge, it
causing reported problem. I think the relevant difference is that that
code doesn't follow symlinks.  It's obviously also less exercised and
poeople might just have fixed up permissions when encountering troubles.

Abhijit, do you recall why the code was changed to follow all symlinks
in contrast to explicitly going through the tablespaces as initdb -S
does? I'm pretty sure early versions of the patch pretty much had a
verbatim copy of the initdb logic?  That logic is missing pg_xlog btw,
which is bad for pg_upgrade.



I *can* reproduce corrupted clusters without this without trying too
hard. I yesterday wrote a test for the crash testing infrastructure I've
on and off worked on (since 2011. Uhm) and I could reproduce a bunch of
corrupted clusters without this patch.  When randomly triggering crash
restarts shortly afterwards follwed by a simulated hw restart (stop
accepting writes via linux device mapper) while concurrent COPYs are
running, I can trigger a bunch of data corruptions.

Since then my computer in berlin has done 440 testruns with the patch,
and 450 without. I've gotten 7 errors without, 0 with. But the
instrumentation for detecting errors is really shitty (pkey lookup for
every expected row) and doesn't yet keep meaningful diagnosistics. So
this isn't particularly bulletproof either way.

I can't tell whether the patch is just masking yet another problem due
to different timings caused by the fsync, or whether it's fixing the
problem that we can forget to sync WAL segments.

Greetings,

Andres Freund



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On 2015-05-26 19:07:20 +0200, Andres Freund wrote:
> Abhijit, do you recall why the code was changed to follow all symlinks
> in contrast to explicitly going through the tablespaces as initdb -S
> does? I'm pretty sure early versions of the patch pretty much had a
> verbatim copy of the initdb logic?

Forgot to add Abhijit to CC list, sorry.

> That [initdb's] logic is missing pg_xlog btw, which is bad for pg_upgrade.

On second thought it's probably not actually bad for pg_upgrade's
specific usecase. It'll always end with a proper shutdown, so it'll
never need that WAL. But it'd be bad if anybody ever relied on initdb -S
in a different scenario.



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On 2015-05-26 19:07:20 +0200, Andres Freund wrote:
> It is somewhat interesting that similar code has been used in
> pg_upgrade, via initdb -S, for a while now, without, to my knowledge, it
> causing reported problem. I think the relevant difference is that that
> code doesn't follow symlinks.  It's obviously also less exercised and
> poeople might just have fixed up permissions when encountering troubles.
> 
> Abhijit, do you recall why the code was changed to follow all symlinks
> in contrast to explicitly going through the tablespaces as initdb -S
> does? I'm pretty sure early versions of the patch pretty much had a
> verbatim copy of the initdb logic?  That logic is missing pg_xlog btw,
> which is bad for pg_upgrade.

So, this was discussed in the following thread, starting at:
http://archives.postgresql.org/message-id/20150403163232.GA28444%40eldon.alvh.no-ip.org

"Actually, since surely we must follow symlinks everywhere, why do we
have to do this separately for pg_tblspc?  Shouldn't that link-following
occur automatically when walking PGDATA in the first place?"

I don't think it's true that we must follow symlinks everywhere. I
think, as argued upthread, that it's sufficient to recurse through
PGDATA, follow the symlinks in pg_tbspc, and if a symlink, also go
through pg_xlog separately.  There are no other places we it's "allowed"
to introduce symlinks and we have refuted bugreports of people having
problems after doing that.

So what I propose is:
1) Remove the automatic symlink following
2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in  initdb -S
3) Add a elevel argument to walkdir(), return if AllocateDir() fails,  continue for stat() failures in the readdir()
loop.
4) Add elevel argument to pre_sync_fname, fsync_fname, return after  errors.
5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By  virtue of not following symlinks we should not
needto worry about  EROFS
 

I'm inclined to think that 4) is a big enough compat break that a
fsync_fname_ext with the new argument is a good idea.

Arguments for/against?



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Abhijit Menon-Sen
Date:
At 2015-05-26 19:07:20 +0200, andres@anarazel.de wrote:
>
> Abhijit, do you recall why the code was changed to follow all symlinks
> in contrast to explicitly going through the tablespaces as initdb -S
> does? I'm pretty sure early versions of the patch pretty much had a
> verbatim copy of the initdb logic?

Yes, earlier versions of the patch did follow symlinks only in
pg_tblspc. I changed this because Álvaro pointed out that this
was likely to miss important stuff, e.g. in

20150403163232.GA28444@eldon.alvh.no-ip.org,
20150406131636.GD4369@alvh.no-ip.org

-- Abhijit



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Abhijit Menon-Sen
Date:
At 2015-05-26 22:44:03 +0200, andres@anarazel.de wrote:
>
> So, this was discussed in the following thread, starting at:
> http://archives.postgresql.org/message-id/20150403163232.GA28444%40eldon.alvh.no-ip.org

Sorry, I didn't see this before replying.

> There are no other places we it's "allowed" to introduce symlinks and
> we have refuted bugreports of people having problems after doing that.

OK.

> So what I propose is:
> 1) Remove the automatic symlink following
> 2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in
>    initdb -S
> 3) Add a elevel argument to walkdir(), return if AllocateDir() fails,
>    continue for stat() failures in the readdir() loop.
> 4) Add elevel argument to pre_sync_fname, fsync_fname, return after
>    errors.
> 5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By
>    virtue of not following symlinks we should not need to worry about
>    EROFS

Makes sense. I've got most of that done, I'll just remove the symlink
following (or rather, restrict it to the cases mentioned), test a bit,
and post.

> I'm inclined to think that 4) is a big enough compat break that a
> fsync_fname_ext with the new argument is a good idea.

Agreed.

-- Abhijit



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Abhijit Menon-Sen
Date:
At 2015-05-26 22:44:03 +0200, andres@anarazel.de wrote:
>
> So what I propose is:
> 1) Remove the automatic symlink following
> 2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in
>    initdb -S
> 3) Add a elevel argument to walkdir(), return if AllocateDir() fails,
>    continue for stat() failures in the readdir() loop.
> 4) Add elevel argument to pre_sync_fname, fsync_fname, return after
>    errors.
> 5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By
>    virtue of not following symlinks we should not need to worry about
>    EROFS

Here's a WIP patch for discussion.

I've (a) removed the S_ISLNK() branch in walkdir, (b) reintroduced
walktblspc_links to call walkdir on each of the entries within pg_tblspc
(simpler than trying to make walkdir follow links only for pg_xlog and
under pg_tblspc), (c) call walkdir on pg_xlog if it's a symlink (not
done for initdb -S; will submit separately), (d) add elevel arguments as
described, (e) ignore EACCES and ETXTBSY.

This correctly fsync()s stuff according to strace, and doesn't die if
there are unreadable files/links in PGDATA.

What I haven't done is return if AllocateDir() fails. I'm not convinced
that's correct, because it'll not complain if PGDATA is unreadable (but
this will break other things, so it doesn't matter), but also will die
if readdir fails rather than opendir.

I'm trying a couple of approaches to that (e.g. using readdir directly
instead of ReadDir), but other suggestions are welcome.

-- Abhijit

Attachment

Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Abhijit Menon-Sen
Date:
At 2015-05-27 11:46:39 +0530, ams@2ndQuadrant.com wrote:
>
> I'm trying a couple of approaches to that (e.g. using readdir directly
> instead of ReadDir), but other suggestions are welcome.

Here's what that looks like, but not yet fully tested.

-- Abhijit

Attachment
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> At 2015-05-27 11:46:39 +0530, ams@2ndQuadrant.com wrote:
>> I'm trying a couple of approaches to that (e.g. using readdir directly
>> instead of ReadDir), but other suggestions are welcome.

> Here's what that looks like, but not yet fully tested.

I doubt that that (not using AllocateDir) is a good idea in the backend,
mainly because you are going to leak directory references if anything
called by walkdir() throws elog(ERROR).  While that doesn't matter much
for the immediate usage, since we'd throw up our hands and die anyway,
it completely puts the kibosh on any idea that walkdir() is a
general-purpose subroutine.  It would have to be decorated with big red
warnings NEVER USE THIS FUNCTION.  Ick.

What I think is a reasonable compromise is to treat AllocateDir failure as
nonfatal, but to continue using ReadDir etc which means that any post-open
failure in reading a successfully-opened directory would be fatal.  Such
errors would suggest that something is badly wrong in the filesystem, so
I do not feel too terrible about not covering that case.

Meanwhile, it seems like you have copied a couple of very dubious
decisions in initdb's walktblspc_links():

1. It doesn't say a word if the passed-in path (which in practice
is always $PGDATA/pg_tblspc) doesn't exist.

2. It assumes that every entry within pg_tblspc must be a tablespace
directory (or symlink to one), and fails if any are plain files.

At the very least, these behaviors seem inconsistent.  #2 is making
strong assumptions about pg_tblspc being correctly set up and free
of user-added cruft, while #1 doesn't even assume it's there at all.

Now, we don't really have to do anything about these things in order
to fix the immediate problem, but I wonder if we shouldn't try to
clean up initdb's behavior while we're at it.


Independently of that, I thought the plan was to complain about any
problems at LOG message level, not DEBUG1, and definitely not WARNING
(which is lower than LOG for this purpose).  And why would you use a
different message level for pg_xlog/ than for other files anyway?
        regards, tom lane



I wrote:
> Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
>> At 2015-05-27 11:46:39 +0530, ams@2ndQuadrant.com wrote:
>>> I'm trying a couple of approaches to that (e.g. using readdir directly
>>> instead of ReadDir), but other suggestions are welcome.

>> Here's what that looks like, but not yet fully tested.

> I doubt that that (not using AllocateDir) is a good idea in the backend,

Oh, scratch that.  Reading the patch closer, I see you kept the use of
AllocateDir and only replaced ReadDir.  That's kind of ugly (and certainly
requires a comment about the inconsistency) but it's probably ok from an
error handling standpoint.  There are hard failure cases in AllocateDir,
but they seem unlikely to create problems in practice.

The other points stand though ...
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Abhijit Menon-Sen
Date:
At 2015-05-27 20:22:18 -0400, tgl@sss.pgh.pa.us wrote:
>
> I doubt that that (not using AllocateDir) […]

(Disregarding per your followup.)

> What I think is a reasonable compromise is to treat AllocateDir
> failure as nonfatal, but to continue using ReadDir etc which means
> that any post-open failure in reading a successfully-opened directory
> would be fatal.

OK, that's halfway between the two patches I posted. Will do.

> Meanwhile, it seems like you have copied a couple of very dubious
> decisions in initdb's walktblspc_links(): […]
>
> Now, we don't really have to do anything about these things in order
> to fix the immediate problem, but I wonder if we shouldn't try to
> clean up initdb's behavior while we're at it.

OK, I'll fix that in both.

> Independently of that, I thought the plan was to complain about any
> problems at LOG message level, not DEBUG1, and definitely not WARNING
> (which is lower than LOG for this purpose). 

I'll change the level to LOG for fsync_fname_ext, but I think DEBUG1 is
more appropriate for the pre_sync_fname run. Or do you think I should
use LOG for that too?

> And why would you use a different message level for pg_xlog/ than for
> other files anyway?

That was just a mistake, I forgot to change it after copying.

Thanks for having a look. I'll post a revised patch shortly.

-- Abhijit



Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> At 2015-05-27 20:22:18 -0400, tgl@sss.pgh.pa.us wrote:
>> I doubt that that (not using AllocateDir) […]

> (Disregarding per your followup.)

>> What I think is a reasonable compromise is to treat AllocateDir
>> failure as nonfatal, but to continue using ReadDir etc which means
>> that any post-open failure in reading a successfully-opened directory
>> would be fatal.

> OK, that's halfway between the two patches I posted. Will do.

Thought you were disregarding?

Anyway, I had an idea about reducing the ugliness.  What you basically
did is copy-and-paste ReadDir so you could change the elevel.  What
about turning ReadDir into a wrapper around a ReadDirExtended function
that adds an elevel argument?
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Abhijit Menon-Sen
Date:
At 2015-05-27 23:41:29 -0400, tgl@sss.pgh.pa.us wrote:
>
> What about turning ReadDir into a wrapper around a ReadDirExtended
> function that adds an elevel argument?

Sounds reasonable, doing.

-- Abhijit



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Abhijit Menon-Sen
Date:
Hi.

Here's an updated patch for the fsync problem(s).

A few points that may be worth considering:

1. I've made the ReadDir → ReadDirExtended change, but in retrospect I
   don't think it's really worth it. Using ReadDir and letting it die
   is probably fine. (Aside: I left ReadDirExtended static for now.)

2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c?
   I wasn't sure if I should move that to fd.c as well. I think it's
   borderline OK for now.

3. There's a comment in fsync_fname that we need to open directories
   with O_RDONLY and non-directories with O_RDWR. Does this also apply
   to pre_sync_fname (i.e. sync_file_range and posix_fadvise) on any
   platforms we care about? It doesn't seem so, but I'm not sure.

4. As a partial aside, why does fsync_fname use OpenTransientFile? It
   looks like it should use BasicOpenFile as pre_sync_fname does. We
   close the fd immediately after calling fsync anyway. Or is there
   some reason I'm missing that we should use OpenTransientFile in
   pre_sync_fname too?

   (Aside²: it's pointless to specify S_IRUSR|S_IWUSR in fsync_fname
   since we're not setting O_CREAT. Minor, so will submit separate
   patch.)

5. I made walktblspc_entries use stat() instead of lstat(), so that we
   can handle symlinks and directories the same way. Note that we won't
   continue to follow links inside the sub-directories because we use
   walkdir instead of recursing.

   But this depends on S_ISDIR() returning true after stat() on Windows
   with a junction. Does that work? And are the entries in pb_tblspc on
   Windows actually junctions?

I've tested this with unreadable files in PGDATA and junk inside
pb_tblspc. Feedback welcome.

I have a separate patch to initdb with the corresponding changes, which
I will post after dinner and a bit more testing.

-- Abhijit

Attachment
On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
> 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c?
>    I wasn't sure if I should move that to fd.c as well. I think it's
>    borderline OK for now.

I think if the function is specific as fsync_pgdata(), fd.c is not the
right place.  I'm not sure xlog.c is either, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
>> 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c?
>> I wasn't sure if I should move that to fd.c as well. I think it's
>> borderline OK for now.

> I think if the function is specific as fsync_pgdata(), fd.c is not the
> right place.  I'm not sure xlog.c is either, though.

ISTM xlog.c is clearly the *wrong* place; if this behavior has anything
to do with WAL logging as such, it's not apparent to me.  fd.c is not
a great place perhaps, but in view of the fact that we have things like
RemovePgTempFiles() in there, it's not unreasonable to see fsync_pgdata
as something to put there as well (perhaps with a name more chosen to
match fd.c names...)

Since Robert appears to be busy worrying about the multixact issue
reported by Steve Kehlet, I suggest he focus on that and I'll take
care of getting this thing committed.  AFAICT we have consensus on
what it should do and we're down to arguing about code style.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Abhijit Menon-Sen
Date:
At 2015-05-28 19:56:15 +0530, ams@2ndQuadrant.com wrote:
>
> I have a separate patch to initdb with the corresponding changes, which
> I will post after dinner and a bit more testing.

Here's that patch too.

I was a bit unsure how far to go with this. It fixes the problem of not
following pg_xlog if it's a symlink (Andres) and the one where it died
on bogus stuff in pg_tblspc (Tom) and unreadable files anywhere else
(it now spews errors to stderr, but carries on for as long as it can).

I've done a bit more than strictly necessary to fix those problems,
though, and made the code largely similar to what's in the other patch.
If you want something minimal, let me know and I'll cut it down to size.

-- Abhijit

P.S. If this patch gets applied, then the "Adapted from walktblspc_links
in initdb.c" comment in fd.c should be changed: s/links/entries/.

Attachment
On Thu, May 28, 2015 at 1:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
>>> 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c?
>>> I wasn't sure if I should move that to fd.c as well. I think it's
>>> borderline OK for now.
>
>> I think if the function is specific as fsync_pgdata(), fd.c is not the
>> right place.  I'm not sure xlog.c is either, though.
>
> ISTM xlog.c is clearly the *wrong* place; if this behavior has anything
> to do with WAL logging as such, it's not apparent to me.  fd.c is not
> a great place perhaps, but in view of the fact that we have things like
> RemovePgTempFiles() in there, it's not unreasonable to see fsync_pgdata
> as something to put there as well (perhaps with a name more chosen to
> match fd.c names...)

OK, sure, makes sense.

> Since Robert appears to be busy worrying about the multixact issue
> reported by Steve Kehlet, I suggest he focus on that and I'll take
> care of getting this thing committed.  AFAICT we have consensus on
> what it should do and we're down to arguing about code style.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> Here's an updated patch for the fsync problem(s).

I've committed this after some mostly-cosmetic rearrangements.

> 4. As a partial aside, why does fsync_fname use OpenTransientFile? It
>    looks like it should use BasicOpenFile as pre_sync_fname does. We
>    close the fd immediately after calling fsync anyway. Or is there
>    some reason I'm missing that we should use OpenTransientFile in
>    pre_sync_fname too?

pre_sync_fname is the one that is wrong IMO.  Using BasicOpenFile just
creates an opportunity for problems; that function should get called
from as few places as possible.

> 5. I made walktblspc_entries use stat() instead of lstat(), so that we
>    can handle symlinks and directories the same way. Note that we won't
>    continue to follow links inside the sub-directories because we use
>    walkdir instead of recursing.

Given that, walktblspc_entries was 99% duplicate, so I folded it into
walkdir with a process_symlinks boolean.

I have to leave shortly, so I'll look at the initdb cleanup tomorrow.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Abhijit Menon-Sen
Date:
At 2015-05-28 17:37:16 -0400, tgl@sss.pgh.pa.us wrote:
>
> I've committed this after some mostly-cosmetic rearrangements.

Thank you.

> I have to leave shortly, so I'll look at the initdb cleanup tomorrow.

Here's a revision of that patch that's more along the lines of what you
committed.

It occurred to me that we should probably also silently skip EACCES on
opendir and stat/lstat in walkdir. That's what the attached initdb patch
does. If you think it's a good idea, there's also a small fixup attached
for the backend version too.

-- Abhijit

Attachment
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
>> I have to leave shortly, so I'll look at the initdb cleanup tomorrow.

> Here's a revision of that patch that's more along the lines of what you
> committed.

Will look at that tomorrow ...

> It occurred to me that we should probably also silently skip EACCES on
> opendir and stat/lstat in walkdir. That's what the attached initdb patch
> does. If you think it's a good idea, there's also a small fixup attached
> for the backend version too.

Actually, I was seriously considering removing the don't-log-EACCES
special case (at least for files, perhaps not for directories).
The reason being that it's darn hard to verify that the code is doing
what it's supposed to when there is no simple way to get it to log any
complaints.  I left it as you wrote it in the committed version, but
I think both that and the ETXTBSY special case do not really have value
as long as their only effect is to suppress a LOG entry.

Speaking of which, could somebody test that the committed patch does
what it's supposed to on Windows?  You were worried upthread about
whether the tests for symlinks (aka junction points) behaved correctly,
and I have no way to check that either.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Michael Paquier
Date:
On Fri, May 29, 2015 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
>>> I have to leave shortly, so I'll look at the initdb cleanup tomorrow.
>
>> Here's a revision of that patch that's more along the lines of what you
>> committed.
>
> Will look at that tomorrow ...
>
>> It occurred to me that we should probably also silently skip EACCES on
>> opendir and stat/lstat in walkdir. That's what the attached initdb patch
>> does. If you think it's a good idea, there's also a small fixup attached
>> for the backend version too.
>
> Actually, I was seriously considering removing the don't-log-EACCES
> special case (at least for files, perhaps not for directories).
> The reason being that it's darn hard to verify that the code is doing
> what it's supposed to when there is no simple way to get it to log any
> complaints.  I left it as you wrote it in the committed version, but
> I think both that and the ETXTBSY special case do not really have value
> as long as their only effect is to suppress a LOG entry.
>
> Speaking of which, could somebody test that the committed patch does
> what it's supposed to on Windows?  You were worried upthread about
> whether the tests for symlinks (aka junction points) behaved correctly,
> and I have no way to check that either.

The error can be reproduced by using junction
(https://technet.microsoft.com/en-us/sysinternals/bb896768.aspx) to
define a junction point within PGDATA and then for example mark as
read-only a configuration file included in postgresql.conf through the
junction point.

Using this method, I have checked that 9.4.1 works, reproduced the
permission error with 9.4.2, and checked as well that a3ae3db4 fixed
the problem. So things look to work fine.
Regards,
-- 
Michael



On Fri, May 29, 2015 at 9:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>
> Speaking of which, could somebody test that the committed patch does
> what it's supposed to on Windows?  You were worried upthread about
> whether the tests for symlinks (aka junction points) behaved correctly,
> and I have no way to check that either.
>

I have done some tests for the committed patch for this issue 
(especially to check the behaviour of symlink tests in the
new code) on Windows - 7 and below are results:

Test symlink for pg_xlog
-------------------------------------
Test -1 - Junction point for pg_xlog
1. Create a database (initdb)
2. Start server and forcefully crash it
3. cp E:\PGData\pg_xlog E:\TempLog\xlog_symlink_loc
4. mv E:\PGData\pg_xlog E:\bak_pg_xlog 
5. Created a directory junction (equivalent symlink)
mklink /J E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc 
6.  Restart Server - operation is successful.
I have debugged this operation, it does what it suppose to do,
detects the junction point and does fsync.

Test - 2 - Directory Symlink for pg_xlog
First 4 steps are same as Test-1
5.  mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc
6.  Restart Server - Error 
- FATAL:  required WAL directory "pg_xlog" does not exist
This error occurs in
ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf)
7. If I manually step-through that line, it proceeds and in
SyncDataDirectory(), it detects the symlink;
8. After that in SyncDataDirectory(), when it does walkdir for
datadir, for ./pg_xlog it reports the log message and then
proceeds normal and the server is started.

Test-3 - Symlinks in pg_tblspc.
1. Create couple of tablespaces which creates symlinks
in pg_tblspc
2. Crash the server
3. Restart Server - It works fine.

I am not sure Test-2 is a valid thing and we should support it or
not, but the current patch is sane w.r.t that form of symlinks
as well.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Michael Paquier
Date:
On Fri, May 29, 2015 at 5:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, May 29, 2015 at 9:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>>
>> Speaking of which, could somebody test that the committed patch does
>> what it's supposed to on Windows?  You were worried upthread about
>> whether the tests for symlinks (aka junction points) behaved correctly,
>> and I have no way to check that either.
>>
>
> I have done some tests for the committed patch for this issue
> (especially to check the behaviour of symlink tests in the
> new code) on Windows - 7 and below are results:
>
> Test symlink for pg_xlog
> -------------------------------------
> Test -1 - Junction point for pg_xlog
> 1. Create a database (initdb)
> 2. Start server and forcefully crash it
> 3. cp E:\PGData\pg_xlog E:\TempLog\xlog_symlink_loc
> 4. mv E:\PGData\pg_xlog E:\bak_pg_xlog
> 5. Created a directory junction (equivalent symlink)
> mklink /J E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc
> 6.  Restart Server - operation is successful.
> I have debugged this operation, it does what it suppose to do,
> detects the junction point and does fsync.
>
> Test - 2 - Directory Symlink for pg_xlog
> First 4 steps are same as Test-1
> 5.  mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc
> 6.  Restart Server - Error
> - FATAL:  required WAL directory "pg_xlog" does not exist
> This error occurs in
> ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf)
> 7. If I manually step-through that line, it proceeds and in
> SyncDataDirectory(), it detects the symlink;
> 8. After that in SyncDataDirectory(), when it does walkdir for
> datadir, for ./pg_xlog it reports the log message and then
> proceeds normal and the server is started.
>
> Test-3 - Symlinks in pg_tblspc.
> 1. Create couple of tablespaces which creates symlinks
> in pg_tblspc
> 2. Crash the server
> 3. Restart Server - It works fine.
>
> I am not sure Test-2 is a valid thing and we should support it or
> not, but the current patch is sane w.r.t that form of symlinks
> as well.

It is always good to check, but is that relevant to the bug fixed? I
mean, you need to symlink a file in PGDATA that server has no write
permission on... For example tablespaces can be written to in test 3,
so that would work even with 9.4.2 after crashing the server with -m
immediate.
-- 
Michael



On Fri, May 29, 2015 at 2:28 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Fri, May 29, 2015 at 5:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Test-3 - Symlinks in pg_tblspc.
> > 1. Create couple of tablespaces which creates symlinks
> > in pg_tblspc
> > 2. Crash the server
> > 3. Restart Server - It works fine.
> >
> > I am not sure Test-2 is a valid thing and we should support it or
> > not, but the current patch is sane w.r.t that form of symlinks
> > as well.
>
> It is always good to check, but is that relevant to the bug fixed? I
> mean, you need to symlink a file in PGDATA that server has no write
> permission on... For example tablespaces can be written to in test 3,
> so that would work even with 9.4.2 after crashing the server with -m
> immediate.

I have just tried to cover the paths which has symlink related code
in the new commit (in functions SyncDataDirectory() and walkdir()),
because thats what I understood from Tom's mail.  Without test-3, it
won't cover walkdir symlink test path, I didn't knew that there was any
confusion about verification of permissions problem on Windows
and I haven't looked to verify the whole patch. 

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Christoph Berg
Date:
Re: Tom Lane 2015-05-28 <5740.1432849036@sss.pgh.pa.us>
> Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> > Here's an updated patch for the fsync problem(s).
> 
> I've committed this after some mostly-cosmetic rearrangements.

Fwiw, I can confirm that the problem is fixed for 9.5. The regression
tests I've added to postgresql-common pass. Thanks!

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Michael Paquier
Date:
On Fri, May 29, 2015 at 6:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, May 29, 2015 at 2:28 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Fri, May 29, 2015 at 5:01 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >
>> > Test-3 - Symlinks in pg_tblspc.
>> > 1. Create couple of tablespaces which creates symlinks
>> > in pg_tblspc
>> > 2. Crash the server
>> > 3. Restart Server - It works fine.
>> >
>> > I am not sure Test-2 is a valid thing and we should support it or
>> > not, but the current patch is sane w.r.t that form of symlinks
>> > as well.
>>
>> It is always good to check, but is that relevant to the bug fixed? I
>> mean, you need to symlink a file in PGDATA that server has no write
>> permission on... For example tablespaces can be written to in test 3,
>> so that would work even with 9.4.2 after crashing the server with -m
>> immediate.
>
> I have just tried to cover the paths which has symlink related code
> in the new commit (in functions SyncDataDirectory() and walkdir()),
> because thats what I understood from Tom's mail.  Without test-3, it
> won't cover walkdir symlink test path, I didn't knew that there was any
> confusion about verification of permissions problem on Windows
> and I haven't looked to verify the whole patch.

Reading once again this thread (after some good red wine + alpha), it
looks that you are right: not many checks with Windows junctions have
actually been with what has been committed...

> Test - 2 - Directory Symlink for pg_xlog
> First 4 steps are same as Test-1
> 5.  mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc
> 6.  Restart Server - Error
> - FATAL:  required WAL directory "pg_xlog" does not exist
> This error occurs in
> ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf)
> 7. If I manually step-through that line, it proceeds and in
> SyncDataDirectory(), it detects the symlink;
> 8. After that in SyncDataDirectory(), when it does walkdir for
> datadir, for ./pg_xlog it reports the log message and then
> proceeds normal and the server is started.

Regarding this test, I just tested initdb -X and it works correctly
after crashing the server, so there is at least a workaround for the
error you are triggering here. Now, shouldn't we improve things as
well with mklink? That's a rather narrow case and we have a workaround
for it at initialization with initdb. I haven't looked much at the
code so I don't have a patch at hand but thoughts on this matter are
welcome.
-- 
Michael



Michael Paquier <michael.paquier@gmail.com> writes:
> On Fri, May 29, 2015 at 6:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Test - 2 - Directory Symlink for pg_xlog
>> First 4 steps are same as Test-1
>> 5.  mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc
>> 6.  Restart Server - Error
>> - FATAL:  required WAL directory "pg_xlog" does not exist
>> This error occurs in
>> ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf)
>> 7. If I manually step-through that line, it proceeds and in
>> SyncDataDirectory(), it detects the symlink;
>> 8. After that in SyncDataDirectory(), when it does walkdir for
>> datadir, for ./pg_xlog it reports the log message and then
>> proceeds normal and the server is started.

> Regarding this test, I just tested initdb -X and it works correctly
> after crashing the server, so there is at least a workaround for the
> error you are triggering here. Now, shouldn't we improve things as
> well with mklink? That's a rather narrow case and we have a workaround
> for it at initialization with initdb. I haven't looked much at the
> code so I don't have a patch at hand but thoughts on this matter are
> welcome.

My (vague) recollection is that what mklink creates is not the same
as a junction point and we intentionally only support the latter as
pseudo-symlinks.  You'd need to trawl the archives for more detail,
unless someone else remembers.

So I think we're good here.  Thanks for all the testing!
        regards, tom lane



Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> At 2015-05-28 17:37:16 -0400, tgl@sss.pgh.pa.us wrote:
>> I have to leave shortly, so I'll look at the initdb cleanup tomorrow.

> Here's a revision of that patch that's more along the lines of what you
> committed.

Pushed with minor revisions.

> It occurred to me that we should probably also silently skip EACCES on
> opendir and stat/lstat in walkdir. That's what the attached initdb patch
> does. If you think it's a good idea, there's also a small fixup attached
> for the backend version too.

As I mentioned yesterday, I'm not really on board with ignoring EACCES,
except for the directories-on-Windows case.  Since we're only logging
the failures anyway, I think it is reasonable to log a complaint for any
unwritable file in the data directory.  I've not done it yet, but what
I think we oughta do is revert
if (errno == EACCES || (isdir && errno == EISDIR))

back to the former logic
if (isdir && (errno == EISDIR || errno == EACCES))

and even then, maybe the EACCES exclusion ought to be Windows only?

Also I want to get rid of the ETXTBSY special cases.  That one doesn't
seem like something that we should silently ignore: what the heck are
executables doing in the data directory?  Or is there some other meaning
on Windows?
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Abhijit Menon-Sen
Date:
At 2015-05-29 13:14:18 -0400, tgl@sss.pgh.pa.us wrote:
>
> Pushed with minor revisions.

Thanks, looks good.

> Since we're only logging the failures anyway, I think it is reasonable
> to log a complaint for any unwritable file in the data directory.

Sounds reasonable, patch attached. ETXTBSY has no special meaning that I
can find under Windows, so I included that change as well.

-- Abhijit

Attachment

Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On 2015-05-29 13:14:18 -0400, Tom Lane wrote:
> Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> As I mentioned yesterday, I'm not really on board with ignoring EACCES,
> except for the directories-on-Windows case.  Since we're only logging
> the failures anyway, I think it is reasonable to log a complaint for any
> unwritable file in the data directory.

That sounds like a potentially nontrivial amount of repetitive log bleat
after every crash start? One which the user can't really stop?

> Also I want to get rid of the ETXTBSY special cases.  That one doesn't
> seem like something that we should silently ignore: what the heck are
> executables doing in the data directory?  Or is there some other meaning
> on Windows?

I've seen a bunch of binaries placed in the data directory as
archive/restore commands. Those will be busy a good amount of the
time. While it'd not be my choice to do that, it's not entirely
unreasonable.

Andres



Andres Freund <andres@anarazel.de> writes:
> On 2015-05-29 13:14:18 -0400, Tom Lane wrote:
>> Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
>> As I mentioned yesterday, I'm not really on board with ignoring EACCES,
>> except for the directories-on-Windows case.  Since we're only logging
>> the failures anyway, I think it is reasonable to log a complaint for any
>> unwritable file in the data directory.

> That sounds like a potentially nontrivial amount of repetitive log bleat
> after every crash start? One which the user can't really stop?

Why can't the user stop it?  We won't be bleating about the case of a
symlink to a non-writable file someplace else, which is the Debian use
case.  I don't see a very good excuse to have a non-writable file right
in the data directory.

>> Also I want to get rid of the ETXTBSY special cases.  That one doesn't
>> seem like something that we should silently ignore: what the heck are
>> executables doing in the data directory?  Or is there some other meaning
>> on Windows?

> I've seen a bunch of binaries placed in the data directory as
> archive/restore commands. Those will be busy a good amount of the
> time. While it'd not be my choice to do that, it's not entirely
> unreasonable.

I'd say it's a pretty damn-fool arrangement: for starters, it's
an unnecessary security hazard.

In any case, if the cost of such a file is one more line of log output
during a crash restart, most people would have no problem at all in
ignoring that log output.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On 2015-05-29 13:49:16 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2015-05-29 13:14:18 -0400, Tom Lane wrote:
> >> Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> >> As I mentioned yesterday, I'm not really on board with ignoring EACCES,
> >> except for the directories-on-Windows case.  Since we're only logging
> >> the failures anyway, I think it is reasonable to log a complaint for any
> >> unwritable file in the data directory.
> 
> > That sounds like a potentially nontrivial amount of repetitive log bleat
> > after every crash start? One which the user can't really stop?
> 
> Why can't the user stop it?

Because it makes a good amount of sense to have e.g. certificates not
owned by postgres and not writeable? You don't necessarily want to
symlink them somewhere else, because that makes moving clusters around
harder than when they're self contained.

> I'd say it's a pretty damn-fool arrangement: for starters, it's an
> unnecessary security hazard.

I don't buy the security argument at all. You likely have
postgresql.conf in the data directoy. You can write to at least .auto,
which will definitely reside the data directory. That contains
archive_command.

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2015-05-29 13:49:16 -0400, Tom Lane wrote:
>> Why can't the user stop it?

> Because it makes a good amount of sense to have e.g. certificates not
> owned by postgres and not writeable? You don't necessarily want to
> symlink them somewhere else, because that makes moving clusters around
> harder than when they're self contained.

Meh.  Well, I'm willing to yield on the EACCES point, but I still find
the exclusion for ETXTBSY to be ugly and inappropriate.

>> I'd say it's a pretty damn-fool arrangement: for starters, it's an
>> unnecessary security hazard.

> I don't buy the security argument at all. You likely have
> postgresql.conf in the data directoy. You can write to at least .auto,
> which will definitely reside the data directory. That contains
> archive_command.

The fact that a superuser might have multiple ways to subvert things
doesn't make it a good idea to add another one: the attack surface
could be larger, or at least different.  But even if you don't buy
that it's a security hazard, why would it be a good idea to have
executables inside $PGDATA?  That would for example lead to them getting
copied by pg_basebackup, which seems unlikely to be a good thing.
And if you did have such executables there, why would they be active
during a postmaster restart?

I really seriously doubt that this is either common enough or useful
enough to justify suppressing warning messages about it.
        regards, tom lane



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Andres Freund
Date:
On 2015-05-29 14:15:48 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2015-05-29 13:49:16 -0400, Tom Lane wrote:
> >> Why can't the user stop it?
> 
> > Because it makes a good amount of sense to have e.g. certificates not
> > owned by postgres and not writeable? You don't necessarily want to
> > symlink them somewhere else, because that makes moving clusters around
> > harder than when they're self contained.
> 
> Meh.  Well, I'm willing to yield on the EACCES point, but I still find
> the exclusion for ETXTBSY to be ugly and inappropriate.

Ok.



Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Stephen Frost
Date:
* Andres Freund (andres@anarazel.de) wrote:
> On 2015-05-29 13:49:16 -0400, Tom Lane wrote:
> > > That sounds like a potentially nontrivial amount of repetitive log bleat
> > > after every crash start? One which the user can't really stop?
> >
> > Why can't the user stop it?
>
> Because it makes a good amount of sense to have e.g. certificates not
> owned by postgres and not writeable? You don't necessarily want to
> symlink them somewhere else, because that makes moving clusters around
> harder than when they're self contained.

A certain other file might be non-writable by PG too... (*cough* .auto
*cough*).

> > I'd say it's a pretty damn-fool arrangement: for starters, it's an
> > unnecessary security hazard.
>
> I don't buy the security argument at all. You likely have
> postgresql.conf in the data directoy. You can write to at least .auto,
> which will definitely reside the data directory. That contains
> archive_command.

I'm not sure that I see the security issue here either..  We're not
talking about setuid shell scripts or anything that isn't running as the
PG user, which a superuser could take over anyway..
Thanks!
    Stephen

Re: fsync-pgdata-on-recovery tries to write to more files than previously

From
Christoph Berg
Date:
Re: Tom Lane 2015-05-29 <13871.1432921756@sss.pgh.pa.us>
> Why can't the user stop it?  We won't be bleating about the case of a
> symlink to a non-writable file someplace else, which is the Debian use
> case.  I don't see a very good excuse to have a non-writable file right
> in the data directory.

I've repeatedly seen PGDATA or pg_xlog been put directly on a
mountpoint, which means there well be a non-writable lost+found
directory there. (A case with pg_xlog was also reported as a support
case at credativ.) I'm usually advising against using the top level
directory directly, but it's not uncommon to encounter it.

> In any case, if the cost of such a file is one more line of log output
> during a crash restart, most people would have no problem at all in
> ignoring that log output.

Nod.

Christoph
-- 
cb@df7cb.de | http://www.df7cb.de/