Thread: fsync-pgdata-on-recovery tries to write to more files than previously
Hi,

the new fsync-pgdata-on-recovery code tries to open all files using O_RDWR. At least on 9.1, this can make recovery fail:

* launch postgres, hit ^\ (or otherwise shut down uncleanly)
* touch foo; chmod 444 foo
* launch postgres

LOG: database system was interrupted; last known up at 2015-05-23 19:18:36 CEST
FATAL: could not open file "/home/cbe/9.1/foo": Permission denied
LOG: startup process (PID 27305) exited with exit code 1
LOG: aborting startup due to startup process failure

The code on 9.4 looks similar to me, but I couldn't trigger the problem there.

I think this is a real-world problem:

1) In older releases, the SSL certs were read from the data directory, and at least the default Debian installation creates symlinks from PGDATA/server.* to /etc/ssl/ where PostgreSQL can't write

2) It's probably a pretty common scenario that the root user will edit postgresql.conf, and make backups or create other random files there that are not writable. Even a non-writable postgresql.conf itself or recovery.conf was not a problem previously.

To me, this is a serious regression because it prevents automatic startup of a server that would otherwise just run.

Christoph

-- 
cb@df7cb.de | http://www.df7cb.de/
Re: To PostgreSQL Hackers 2015-05-23 <20150523172627.GA24277@msg.df7cb.de> > Hi, > > the new fsync-pgdata-on-recovery code tries to open all files using > O_RDWR. At least on 9.1, this can make recovery fail: > > * launch postgres, hit ^\ (or otherwise shut down uncleanly) > * touch foo; chmod 444 foo > * launch postgres > > LOG: database system was interrupted; last known up at 2015-05-23 19:18:36 CEST > FATAL: could not open file "/home/cbe/9.1/foo": Permission denied > LOG: startup process (PID 27305) exited with exit code 1 > LOG: aborting startup due to startup process failure > > The code on 9.4 looks similar to me, but I couldn't trigger the > problem there. Correction: 9.4 is equally broken. (I was still running 9.4.1 when I tried first.) > I think this is a real-world problem: > > 1) In older releases, the SSL certs were read from the data directory, > and at least the default Debian installation creates symlinks from > PGDATA/server.* to /etc/ssl/ where PostgreSQL can't write > > 2) It's probably a pretty common scenario that the root user will edit > postgresql.conf, and make backups or create other random files there > that are not writable. Even a non-writable postgresql.conf itself or > recovery.conf was not a problem previously. 3) The .postgresql.conf.swp files created by (root's) vim are 0600. > To me, this is a serious regression because it prevents automatic > startup of a server that would otherwise just run. Christoph -- cb@df7cb.de | http://www.df7cb.de/
Christoph Berg <myon@debian.org> writes: > the new fsync-pgdata-on-recovery code tries to open all files using > O_RDWR. At least on 9.1, this can make recovery fail: Hm. I wonder whether it would be all right to just skip files for which we get EPERM on open(). The argument being that if we can't write to the file, we should not be held responsible for fsync'ing it either. But I'm not sure whether EPERM would be the only relevant errno, or whether there are cases where this would mask real problems. regards, tom lane
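For illustration, a minimal standalone sketch of the idea Tom floats here: skip a file whose open() fails with a permission error instead of treating it as fatal. It uses plain POSIX calls and stderr logging; it is not the backend's fsync code, and the function name is invented.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Try to fsync one file; skip it quietly if we lack permission to open it. */
int
fsync_if_permitted(const char *path)
{
    int         fd = open(path, O_RDWR);

    if (fd < 0)
    {
        /* Not ours to write, so arguably not ours to fsync either. */
        if (errno == EACCES || errno == EPERM)
            return 0;
        fprintf(stderr, "could not open \"%s\": %s\n", path, strerror(errno));
        return -1;
    }
    if (fsync(fd) != 0)
        fprintf(stderr, "could not fsync \"%s\": %s\n", path, strerror(errno));
    close(fd);
    return 0;
}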
Re: Tom Lane 2015-05-23 <2284.1432413209@sss.pgh.pa.us> > Christoph Berg <myon@debian.org> writes: > > the new fsync-pgdata-on-recovery code tries to open all files using > > O_RDWR. At least on 9.1, this can make recovery fail: > > Hm. I wonder whether it would be all right to just skip files for which > we get EPERM on open(). The argument being that if we can't write to the > file, we should not be held responsible for fsync'ing it either. But > I'm not sure whether EPERM would be the only relevant errno, or whether > there are cases where this would mask real problems. Maybe logging WARNINGs instead of FATAL would be enough of a fix? Christoph -- cb@df7cb.de | http://www.df7cb.de/
On 2015-05-23 16:33:29 -0400, Tom Lane wrote: > Christoph Berg <myon@debian.org> writes: > > the new fsync-pgdata-on-recovery code tries to open all files using > > O_RDWR. At least on 9.1, this can make recovery fail: > > Hm. I wonder whether it would be all right to just skip files for which > we get EPERM on open(). The argument being that if we can't write to the > file, we should not be held responsible for fsync'ing it either. But > I'm not sure whether EPERM would be the only relevant errno, or whether > there are cases where this would mask real problems. We could even try doing a fsync with a readonly fd as a fallback, but that's also pretty hacky. How about, to avoid masking actual problems, we have a more differentiated logic for the toplevel data directory? I think we could just skip all non-directory files in the data_directory itself. None of the files in the toplevel directory, with the exception of postgresql.auto.conf, will ever get written to by PG itself. And if there are readonly files somewhere in a subdirectory, I won't feel particularly bad. Greetings, Andres Freund
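A similarly hedged sketch of the read-only-descriptor fallback mentioned above; whether fsync on a read-only fd is allowed is platform-dependent, as comes up later in the thread, and the helper name is invented.

#include <errno.h>
#include <fcntl.h>

/*
 * Open for read-write if possible; on a permission error, fall back to a
 * read-only descriptor, which most Unixes still let us fsync.
 */
int
open_for_fsync(const char *path)
{
    int         fd = open(path, O_RDWR);

    if (fd < 0 && errno == EACCES)
        fd = open(path, O_RDONLY);
    return fd;                  /* caller checks for -1, fsyncs, and closes */
}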
Re: Andres Freund 2015-05-24 <20150524005245.GD32396@alap3.anarazel.de> > How about, to avoid masking actual problems, we have a more > differentiated logic for the toplevel data directory? I think we could > just skip all non-directory files in there data_directory itself. None > of the files in the toplevel directory, with the exception of > postgresql.auto.conf, will ever get written to by PG itself. And if > there's readonly files somewhere in a subdirectory, I won't feel > particularly bad. I like that idea. Christoph -- cb@df7cb.de | http://www.df7cb.de/
Re: To Andres Freund 2015-05-24 <20150524075244.GB27048@msg.df7cb.de> > Re: Andres Freund 2015-05-24 <20150524005245.GD32396@alap3.anarazel.de> > > How about, to avoid masking actual problems, we have a more > > differentiated logic for the toplevel data directory? I think we could > > just skip all non-directory files in there data_directory itself. None > > of the files in the toplevel directory, with the exception of > > postgresql.auto.conf, will ever get written to by PG itself. And if > > there's readonly files somewhere in a subdirectory, I won't feel > > particularly bad. pg_log/ is also admin domain. What about only recursing into well-known directories + postgresql.auto.conf? (I've also been wondering if pg_basebackup shouldn't skip pg_log, but that's a different topic...) Christoph -- cb@df7cb.de | http://www.df7cb.de/
Christoph Berg <myon@debian.org> writes: > Re: To Andres Freund 2015-05-24 <20150524075244.GB27048@msg.df7cb.de> >> Re: Andres Freund 2015-05-24 <20150524005245.GD32396@alap3.anarazel.de> >>> How about, to avoid masking actual problems, we have a more >>> differentiated logic for the toplevel data directory? > pg_log/ is also admin domain. What about only recursing into > well-known directories + postgresql.auto.conf? The idea that this code would know exactly what's what under $PGDATA scares me. I can positively guarantee that it would diverge from reality over time, and nobody would notice until it ate their data, failed to start, or otherwise behaved undesirably. pg_log/ is a perfect example, because that is not a hard-wired directory name; somebody could point the syslogger at a different place very easily. Wiring in special behavior for that name is just wrong. I would *much* rather have a uniform rule for how to treat each file the scan comes across. It might take some tweaking to get to one that works well; but once we did, we could have some confidence that it wouldn't break later. regards, tom lane
On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Christoph Berg <myon@debian.org> writes: >> Re: To Andres Freund 2015-05-24 <20150524075244.GB27048@msg.df7cb.de> >>> Re: Andres Freund 2015-05-24 ><20150524005245.GD32396@alap3.anarazel.de> >>>> How about, to avoid masking actual problems, we have a more >>>> differentiated logic for the toplevel data directory? > >> pg_log/ is also admin domain. What about only recursing into >> well-known directories + postgresql.auto.conf? > >The idea that this code would know exactly what's what under $PGDATA >scares me. I can positively guarantee that it would diverge from >reality >over time, and nobody would notice until it ate their data, failed to >start, or otherwise behaved undesirably. > >pg_log/ is a perfect example, because that is not a hard-wired >directory >name; somebody could point the syslogger at a different place very >easily. >Wiring in special behavior for that name is just wrong. > >I would *much* rather have a uniform rule for how to treat each file >the scan comes across. It might take some tweaking to get to one that >works well; but once we did, we could have some confidence that it >wouldn't break later. If we'd merge it with initdb's list I think it'd not be that bad. I'm thinking of some header declaring it, roughly like the rmgr list. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
Andres Freund <andres@anarazel.de> writes: > On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Christoph Berg <myon@debian.org> writes: >>> pg_log/ is also admin domain. What about only recursing into >>> well-known directories + postgresql.auto.conf? >> The idea that this code would know exactly what's what under $PGDATA >> scares me. I can positively guarantee that it would diverge from >> reality over time, and nobody would notice until it ate their data, >> failed to start, or otherwise behaved undesirably. >> >> pg_log/ is a perfect example, because that is not a hard-wired >> directory name; somebody could point the syslogger at a different place >> very easily. Wiring in special behavior for that name is just wrong. >> >> I would *much* rather have a uniform rule for how to treat each file >> the scan comes across. It might take some tweaking to get to one that >> works well; but once we did, we could have some confidence that it >> wouldn't break later. > If we'd merge it with initdb's list I think it'd not be that bad. I'm thinking of some header declaring it, roughly like the rmgr list. pg_log/ is a counterexample to that idea too; initdb doesn't know about it (and shouldn't). regards, tom lane
On 2015-05-25 13:38:01 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > If we'd merge it with initdb's list I think it'd not be that bad. I'm thinking of some header declaring it, roughly like the rmgr list. > > pg_log/ is a counterexample to that idea too; initdb doesn't know about it > (and shouldn't). The idea would be to fsync *only* directories that initdb knows about. Since that's where the valuables are. So I don't see how pg_log would be a counterexample.
* Andres Freund (andres@anarazel.de) wrote: > On 2015-05-25 13:38:01 -0400, Tom Lane wrote: > > Andres Freund <andres@anarazel.de> writes: > > > On May 24, 2015 7:52:53 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > If we'd merge it with initdb's list I think it'd not be that bad. I'm thinking of some header declaring it, roughly like the rmgr list. > > > > pg_log/ is a counterexample to that idea too; initdb doesn't know about it > > (and shouldn't). > > The idea would be to fsync *only* directories that initdb knows about. Since > that's where the valuables are. So I don't see how pg_log would be a > counterexample. Indeed, that wouldn't be included in the list of things to fsync and it isn't listed in initdb, so that works. I've not followed this thread all that closely, but I do tend to agree with the idea of "only try to mess with files that are *clearly* ours to mess with." Thanks! Stephen
On 2015-05-25 14:14:10 -0400, Stephen Frost wrote: > That seems overly complicated, for my 2c at least. I don't particularly > like trying to mess with files that might be rightfully considered "not > ours" either. I'd not consider an fsync to be "messing" with files, especially if they're in PGDATA. > > Additionally we could attempt to fsync with a readonly fd before trying > > the read-write fd... > > Not really sure I see that as helping. On most OSs, except windows and some obscure unixes, a readonly fd is allowed to fsync a file. Greetings, Andres Freund
Stephen Frost <sfrost@snowman.net> writes: > I've not followed this thread all that closely, but I do tend to agree > with the idea of "only try to mess with files that are *clearly* ours to > mess with." Well, that opens us to errors of omission, ie failing to fsync things we should have. Maybe that's an okay risk, but personally I'd judge that "fsync everything and ignore (some?) errors" is probably a more robust approach over time. regards, tom lane
* Andres Freund (andres@anarazel.de) wrote: > On 2015-05-25 14:02:28 -0400, Tom Lane wrote: > > Stephen Frost <sfrost@snowman.net> writes: > > > I've not followed this thread all that closely, but I do tend to agree > > > with the idea of "only try to mess with files that are *clearly* ours to > > > mess with." > > > > Well, that opens us to errors of omission, ie failing to fsync things we > > should have. > > Is that really that likely? I mean we don't normally add data to the top > level directory itself, and subdirectories hopefully won't be added > except via initdb? That feels like a pretty low risk, to me at least. Certainly better than having a failure, like what's going on now. > > Maybe that's an okay risk, but personally I'd judge that > > "fsync everything and ignore (some?) errors" is probably a more robust > > approach over time. > > The over-the-top approach would be to combine the two. Error out in > directories that are in the initdb list, and ignore permission errors > otherwise... That seems overly complicated, for my 2c at least. I don't particularly like trying to mess with files that might be rightfully considered "not ours" either. This all makes me really wonder about postgresql.auto.conf too.. Clearly, on the one hand, we consider that "our" file, and so we should error out if we don't own it, but on the other hand, I've specifically recommended making that file owned by root to some folks, to avoid DBAs playing with the startup-time settings.. > Additionally we could attempt to fsync with a readonly fd before trying > the read-write fd... Not really sure I see that as helping. Thanks! Stephen
* Andres Freund (andres@anarazel.de) wrote: > On 2015-05-25 14:14:10 -0400, Stephen Frost wrote: > > That seems overly complicated, for my 2c at least. I don't particularly > > like trying to mess with files that might be rightfully considered "not > > ours" either. > > I'd not consider an fsync to be "messing" with files, especially if > they're in PGDATA. I'm not entirely sure I agree. > > > Additionally we could attempt to fsync with a readonly fd before trying > > > the read-write fd... > > > > Not really sure I see that as helping. > > On most OSs, except windows and some obscure unixes, a readonly fd is > allowed to fsync a file. I wouldn't have thought otherwise, given that you were suggesting it, but there's no guarantee we're going to be allowed to read it either, or even access the directory the symlink points to, etc.. Thanks, Stephen
On 2015-05-25 14:02:28 -0400, Tom Lane wrote: > Stephen Frost <sfrost@snowman.net> writes: > > I've not followed this thread all that closely, but I do tend to agree > > with the idea of "only try to mess with files that are *clearly* ours to > > mess with." > > Well, that opens us to errors of omission, ie failing to fsync things we > should have. Is that really that likely? I mean we don't normally add data to the top level directory itself, and subdirectories hopefully won't be added except via initdb? > Maybe that's an okay risk, but personally I'd judge that > "fsync everything and ignore (some?) errors" is probably a more robust > approach over time. The over-the-top approach would be to combine the two. Error out in directories that are in the initdb list, and ignore permission errors otherwise... Additionally we could attempt to fsync with a readonly fd before trying the read-write fd...
Andres Freund <andres@anarazel.de> writes: > On 2015-05-25 14:14:10 -0400, Stephen Frost wrote: >>> Additionally we could attempt to fsync with a readonly fd before trying >>> the read-write fd... >> Not really sure I see that as helping. > On most OSs, except windows and some obscure unixes, a readonly fd is > allowed to fsync a file. Perhaps, but if we didn't have permission to write the file, it's hard to argue that it's our responsibility to fsync it. So this seems like it's adding complexity without really adding any safety. regards, tom lane
Tom Lane wrote: > Stephen Frost <sfrost@snowman.net> writes: > > I've not followed this thread all that closely, but I do tend to agree > > with the idea of "only try to mess with files that are *clearly* ours to > > mess with." > > Well, that opens us to errors of omission, ie failing to fsync things we > should have. Maybe that's an okay risk, but personally I'd judge that > "fsync everything and ignore (some?) errors" is probably a more robust > approach over time. How is it possible to make errors of omission? The list of directories in initdb is the complete set of directories that are created for a newly-initdb'd database, no? Surely there can't be a database that contains vital directories that are not created there? See "subdirs" static in initdb.c. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> Well, that opens us to errors of omission, ie failing to fsync things we >> should have. Maybe that's an okay risk, but personally I'd judge that >> "fsync everything and ignore (some?) errors" is probably a more robust >> approach over time. > How is it possible to make errors of omission? The list of directories > in initdb is the complete set of directories that are created for a > newly-initdb'd database, no? Surely there can't be a database that > contains vital directories that are not created there? See "subdirs" > static in initdb.c. Easy: all you need is to suppose that some of the plain files at top level of $PGDATA ought to be fsync'd. (I'm fairly sure for example that we took steps awhile back to force postmaster.pid to be fsync'd.) If there is a distinction between the fsync requirements of top-level files and everything else, it is completely accidental and not to be relied on. And from the other direction, where exactly is it written that distros/users will only create problematic files at the top level of $PGDATA? I'd have zero confidence in such an assertion applied to tablespace directories, for sure. regards, tom lane
On 05/25/2015 01:21 PM, Tom Lane wrote: > And from the other direction, where exactly is it written that > distros/users will only create problematic files at the top level of > $PGDATA? I'd have zero confidence in such an assertion applied to > tablespace directories, for sure. Yes, absolutely. For example: Cstore_FDW now puts its files in a subdirectory of the tablespace directory, which for a really large cstore table, the user will want to symlink somewhere else, or might create as a mount. Such a user might then, for example, annotate this with a README.txt which happens to be root/root perms. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, May 25, 2015 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@anarazel.de> writes: >> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote: >>>> Additionally we could attempt to fsync with a readonly fd before trying >>>> the read-write fd... > >>> Not really sure I see that as helping. > >> On most OSs, except windows and some obscure unixes, a readonly fd is >> allowed to fsync a file. > > Perhaps, but if we didn't have permission to write the file, it's hard to > argue that it's our responsibility to fsync it. So this seems like it's > adding complexity without really adding any safety. I agree. I think ignoring fsync failures is a very sensible approach. If the files are not writable, they're probably not ours. If they are not writable but somehow still ours, we probably can't have written them before the crash, either. If they are ours and we somehow wrote to them before the crash, and then while the system was down they were made inaccessible, and then the database was restarted, then we're well into the territory where the system administrator has done something that we cannot possibly be expected to cope with ... but ignoring the fsync isn't very likely to cause any real problems even here. If we really did modify those blocks recently, recovery will try to redo the changes, and we'll fail then anyway. So what's the problem? I agree with Tom's concern that if we have two lists of directories, they may get out of sync. We could probably merge the two lists somehow, but I'm not really seeing the point, since Tom's blanket approach should work just fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > On Mon, May 25, 2015 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Andres Freund <andres@anarazel.de> writes: >>> On 2015-05-25 14:14:10 -0400, Stephen Frost wrote: >>>> Not really sure I see that as helping. >>> On most OSs, except windows and some obscure unixes, a readonly fd is >>> allowed to fsync a file. >> Perhaps, but if we didn't have permission to write the file, it's hard to >> argue that it's our responsibility to fsync it. So this seems like it's >> adding complexity without really adding any safety. > > I agree. I think ignoring fsync failures is a very sensible approach. > If the files are not writable, they're probably not ours. If they are > not writable but somehow still ours, we probably can't have written > them before the crash, either. If they are ours and we somehow wrote > to them before the crash, and then while the system was down they were > made inaccessible, and then the database was restarted, then we're > well into the territory where the system administrator has done > something that we cannot possibly be expected to cope with ... but > ignoring the fsync isn't very likely to cause any real problems even > here. If we really did modify those blocks recently, recovery will > try to redo the changes, and we'll fail then anyway. So what's the > problem? > > I agree with Tom's concern that if we have two lists of directories, > they may get out of sync. We could probably merge the two lists > somehow, but I'm not really seeing the point, since Tom's blanket > approach should work just fine. I certainly see your point, but Tom also pointed out that it's not great to ignore failures during this phase: * Tom Lane (tgl@sss.pgh.pa.us) wrote: > Greg Stark <stark@mit.edu> writes: > > What exactly is failing? > > Is it that fsync is returning -1 ? > According to the original report from Christoph Berg, it was open() > not fsync() that was failing, at least in permissions-based cases. > > I'm not sure if we should just uniformly ignore all failures in this > phase. That would have the merit of clearly not creating any new > startup failure cases compared to the previous code, but as you say > sometimes it might mean ignoring real problems. If we accept this, then we still have to have the lists, to decide what to fail on and what to ignore. If we're going to have said lists tho, I don't really see the point in fsync'ing things we're pretty confident aren't ours. Further, in any of these cases, we have to decide which failure cases are ones that are "fatal" and which are not- being in the list or not isn't the only criterion, it's just one part of the overall decision. We also need to consider what return value we get back for which system calls, all of which may entirely be system dependent, meaning we may have to deal with portability issues here too. Then there are other interesting considerations like what happens with an NFS mount (as Greg mentioned), or perhaps what happens when it's a MAC violation (eg: SELinux). Generally speaking, those will also return an error code which we can contemplate, but it'll still create annoying log noise for people running in such environments. Perhaps that would encourage them to move whatever files they have out of $PGDATA, which is likely to be a good decision, but that may not always be possible.. Thanks! Stephen
On 2015-05-25 21:33:03 -0400, Robert Haas wrote: > On Mon, May 25, 2015 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Perhaps, but if we didn't have permission to write the file, it's hard to > > argue that it's our responsibility to fsync it. So this seems like it's > > adding complexity without really adding any safety. > > I agree. I think ignoring fsync failures is a very sensible approach. > If the files are not writable, they're probably not ours. The reason we started discussing this is because Tom had the - quite reasonable - concern that this might not solely be a problem of EACCESS, but that there could be other errors that we need to ignore to not fail spuriously. Say a symlink goes to a binary, which is currently being executed: ETXTBSY. Or the file is in a readonly filesystem: EROFS. So we'd need to ignore a lot of errors, possibly ignoring valid ones. I personally can see why people will put things in PGDATA itself; if you put unreadable stuff in some subdirectory that you didn't create yourself, I see much less reason to tolerate that. Another thing is whether we should handle a recursive symlink in pgdata? I personally think not, but... It's also not just as simple as making fsync_fname fail gracefully upon EACCESS - the opendir() could fail just as well. Greetings, Andres Freund
From: Abhijit Menon-Sen
At 2015-05-26 03:54:51 +0200, andres@anarazel.de wrote: > > Say a symlink goes to a binary, which is currently being executed: > ETXTBSY. Or the file is in a readonly filesystem: EROFS. So we'd > need to ignore a lot of errors, possibly ignoring valid ones. Right. That's why I started out by being conservative and following only the "expected" symlinks in pg_tblspc (as other parts of the code do). > Another thing is whether we should handle a recursive symlink in > pgdata? I personally think not, but... I think not too. > It's also not just as simple as making fsync_fname fail gracefully > upon EACCESS - the opendir() could fail just as well. I'll post a proposed patch shortly. -- Abhijit
* Abhijit Menon-Sen (ams@2ndQuadrant.com) wrote: > At 2015-05-26 03:54:51 +0200, andres@anarazel.de wrote: > > Another thing is whether we should handle a recursive symlink in > > pgdata? I personally think not, but... > > I think not too. Yikes.. That's definitely the kind of thing that's why I worry about the whole 'fsync everything' idea- what if I symlink to /? I've certainly done that before from my home directory for ease of use and I imagine there are people out there who have similar setups where they sftp as the PG user and use the symlink to more easily navigate somewhere else. We have to realize that, on at least some systems, PGDATA could be the postgres user's home directory too. That's not the case on Debian-based systems today, but I think it might have been before Debian had the multi-cluster tooling. > > It's also not just as simple as making fsync_fname fail gracefully > > upon EACCESS - the opendir() could fail just as well. > > I'll post a proposed patch shortly. Thanks! Stephen
On Mon, May 25, 2015 at 9:54 PM, Stephen Frost <sfrost@snowman.net> wrote: > I certainly see your point, but Tom also pointed out that it's not great > to ignore failures during this phase: > > * Tom Lane (tgl@sss.pgh.pa.us) wrote: >> Greg Stark <stark@mit.edu> writes: >> > What exactly is failing? >> > Is it that fsync is returning -1 ? >> According to the original report from Christoph Berg, it was open() >> not fsync() that was failing, at least in permissions-based cases. >> >> I'm not sure if we should just uniformly ignore all failures in this >> phase. That would have the merit of clearly not creating any new >> startup failure cases compared to the previous code, but as you say >> sometimes it might mean ignoring real problems. > > If we accept this, then we still have to have the lists, to decide what > to fail on and what to ignore. If we're going to have said lists tho, I > don't really see the point in fsync'ing things we're pretty confident > aren't ours. No, that's not right at all. The idea, at least as I understand it, would be to decide which errors to ignore by looking at the error code, not by looking at which file is involved. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, May 25, 2015 at 9:54 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-05-25 21:33:03 -0400, Robert Haas wrote: >> On Mon, May 25, 2015 at 2:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > Perhaps, but if we didn't have permission to write the file, it's hard to >> > argue that it's our responsibility to fsync it. So this seems like it's >> > adding complexity without really adding any safety. >> >> I agree. I think ignoring fsync failures is a very sensible approach. >> If the files are not writable, they're probably not ours. > > The reason we started discussing this is because Tom had the - quite > reasonable - concern that this might not solely be a problem of EACCESS, > but that there could be other errors that we need to ignore to not fail > spuriously. Say a symlink goes to a binary, which is currently being > executed: ETXTBSY. Or the file is in a readonly filesystem: EROFS. So > we'd need to ignore a lot of errors, possibly ignoring valid ones. But ignoring those errors wouldn't compromise data integrity, either. So let's just ignore (but log) all errors: then we'll be demonstrably no worse off than we were before this patch went in. If it turns out that there's some particular case where ignoring errors DOES compromise data integrity, then we can plug that hole surgically when somebody reports it. Anything we do short of making all errors in this area non-fatal is going to leave behind startup-failure cases that exist today, and we have no evidence at this time that such startup failures would be justified by any actual data loss risk. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Anything we do short of making all errors in this area non-fatal is > going to leave behind startup-failure cases that exist today, and we > have no evidence at this time that such startup failures would be > justified by any actual data loss risk. Yeah. Perhaps I missed it, but was the original patch motivated by actual failures that had been seen in the field, or was it just a hypothetical concern? Certainly, any actual failures of that sort are few and far between compared to the number of problems we now realize the patch introduced. Also, we need to discuss how hard walkdir() needs to try to avoid throwing errors of its own. regards, tom lane
On 05/26/2015 08:05 AM, Robert Haas wrote: > On Mon, May 25, 2015 at 9:54 PM, Stephen Frost <sfrost@snowman.net> wrote: >> I certainly see your point, but Tom also pointed out that it's not great >> to ignore failures during this phase: >> >> * Tom Lane (tgl@sss.pgh.pa.us) wrote: >>> Greg Stark <stark@mit.edu> writes: >>>> What exactly is failing? >>>> Is it that fsync is returning -1 ? >>> According to the original report from Christoph Berg, it was open() >>> not fsync() that was failing, at least in permissions-based cases. >>> >>> I'm not sure if we should just uniformly ignore all failures in this >>> phase. That would have the merit of clearly not creating any new >>> startup failure cases compared to the previous code, but as you say >>> sometimes it might mean ignoring real problems. >> If we accept this, then we still have to have the lists, to decide what >> to fail on and what to ignore. If we're going to have said lists tho, I >> don't really see the point in fsync'ing things we're pretty confident >> aren't ours. > No, that's not right at all. The idea, at least as I understand it, > would be to decide which errors to ignore by looking at the error code, > not by looking at which file is involved. OK, I'm late to the party. But why exactly are we syncing absolutely everything? That seems over-broad. And might it be better to check that we can open each file using access() rather than calling open() and looking at the error code? cheers andrew
Andrew Dunstan <andrew@dunslane.net> writes: > OK, I'm late to the party. But why exactly are we syncing absolutely > everything? That seems over-broad. If we try to be selective, we risk errors of omission, which no one would ever notice until someone's data got eaten in a low-probability crash scenario. It seems more robust (at least to me) to fsync everything we can find. That does require more thought about error cases than went into the original patch ... but I think that we need more thought about error cases even if we do try to be selective. One thing perhaps we *should* be selective about, though, is which symlinks we try to follow. I think that a good case could be made for ignoring symlinks everywhere except in the pg_tablespace directory. If we did, that would all by itself take care of the Debian scenario, if I understand that case correctly. > And might it be better to check that we can open each file using > access() than calling open() and looking at the error code? Don't really see the point; that's just an extra step, and access() won't exactly prove you can open the file, anyway. regards, tom lane
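For illustration, a standalone sketch of that kind of selectivity, assuming the walker passes down whether it is currently scanning the tablespace directory; the name and signature are invented, and the follow-ups below refine the rule to also cover a symlinked pg_xlog.

#include <stdbool.h>
#include <sys/stat.h>

/*
 * Decide whether the walker should process an entry.  Symlinks are
 * followed only while scanning the top-level tablespace directory;
 * everything else is processed only if it is a plain file or directory.
 */
bool
should_process(const char *path, bool in_tblspc)
{
    struct stat st;

    if (lstat(path, &st) < 0)
        return false;               /* skip entries we cannot even stat */
    if (S_ISLNK(st.st_mode))
        return in_tblspc;
    return S_ISREG(st.st_mode) || S_ISDIR(st.st_mode);
}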
On 05/26/2015 11:58 AM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> OK, I'm late to the party. But why exactly are we syncing absolutely >> everything? That seems over-broad. > If we try to be selective, we risk errors of omission, which no one would > ever notice until someone's data got eaten in a low-probability crash > scenario. It seems more robust (at least to me) to fsync everything we > can find. That does require more thought about error cases than went > into the original patch ... but I think that we need more thought about > error cases even if we do try to be selective. > > One thing perhaps we *should* be selective about, though, is which > symlinks we try to follow. I think that a good case could be made > for ignoring symlinks everywhere except in the pg_tablespace directory. > If we did, that would all by itself take care of the Debian scenario, > if I understand that case correctly. People have symlinked the xlog directory. I've done it myself in the past. A better rule might be to ignore symlinks unless either they are in pg_tblspc or they are in the data directory and their name starts with "pg_". cheers andrew
From: Magnus Hagander
On Tue, May 26, 2015 at 6:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> On 05/26/2015 11:58 AM, Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> OK, I'm late to the party. But why exactly are we syncing absolutely
>>> everything? That seems over-broad.
>>
>> If we try to be selective, we risk errors of omission, which no one would
>> ever notice until someone's data got eaten in a low-probability crash
>> scenario. It seems more robust (at least to me) to fsync everything we
>> can find. That does require more thought about error cases than went
>> into the original patch ... but I think that we need more thought about
>> error cases even if we do try to be selective.
>>
>> One thing perhaps we *should* be selective about, though, is which
>> symlinks we try to follow. I think that a good case could be made
>> for ignoring symlinks everywhere except in the pg_tablespace directory.
>> If we did, that would all by itself take care of the Debian scenario,
>> if I understand that case correctly.
>
> People have symlinked the xlog directory. I've done it myself in the past. A better rule might be to ignore symlinks unless either they are in pg_tblspc or they are in the data directory and their name starts with "pg_".

Not just "people". initdb will symlink the xlog directory if you use -x.
Andrew Dunstan <andrew@dunslane.net> writes: > On 05/26/2015 11:58 AM, Tom Lane wrote: >> One thing perhaps we *should* be selective about, though, is which >> symlinks we try to follow. I think that a good case could be made >> for ignoring symlinks everywhere except in the pg_tablespace directory. >> If we did, that would all by itself take care of the Debian scenario, >> if I understand that case correctly. > People have symlinked the xlog directory. Good point, we need to cover that case. regards, tom lane
On 2015-05-26 10:41:12 -0400, Tom Lane wrote: > Yeah. Perhaps I missed it, but was the original patch motivated by > actual failures that had been seen in the field, or was it just a > hypothetical concern? I'd mentioned that it might be a good idea to do this while investigating a bug with unlogged tables that several parties had reported. That bug has been fixed in a more granular fashion. From there it took on kind of a life of its own. It is somewhat interesting that similar code has been used in pg_upgrade, via initdb -S, for a while now, without, to my knowledge, it causing reported problems. I think the relevant difference is that that code doesn't follow symlinks. It's obviously also less exercised and people might just have fixed up permissions when encountering troubles. Abhijit, do you recall why the code was changed to follow all symlinks in contrast to explicitly going through the tablespaces as initdb -S does? I'm pretty sure early versions of the patch pretty much had a verbatim copy of the initdb logic? That logic is missing pg_xlog btw, which is bad for pg_upgrade. I *can* reproduce corrupted clusters without this without trying too hard. I yesterday wrote a test for the crash testing infrastructure I've on and off worked on (since 2011. Uhm) and I could reproduce a bunch of corrupted clusters without this patch. When randomly triggering crash restarts shortly afterwards followed by a simulated hw restart (stop accepting writes via linux device mapper) while concurrent COPYs are running, I can trigger a bunch of data corruptions. Since then my computer in Berlin has done 440 testruns with the patch, and 450 without. I've gotten 7 errors without, 0 with. But the instrumentation for detecting errors is really shitty (pkey lookup for every expected row) and doesn't yet keep meaningful diagnostics. So this isn't particularly bulletproof either way. I can't tell whether the patch is just masking yet another problem due to different timings caused by the fsync, or whether it's fixing the problem that we can forget to sync WAL segments. Greetings, Andres Freund
On 2015-05-26 19:07:20 +0200, Andres Freund wrote: > Abhijit, do you recall why the code was changed to follow all symlinks > in contrast to explicitly going through the tablespaces as initdb -S > does? I'm pretty sure early versions of the patch pretty much had a > verbatim copy of the initdb logic? Forgot to add Abhijit to CC list, sorry. > That [initdb's] logic is missing pg_xlog btw, which is bad for pg_upgrade. On second thought it's probably not actually bad for pg_upgrade's specific usecase. It'll always end with a proper shutdown, so it'll never need that WAL. But it'd be bad if anybody ever relied on initdb -S in a different scenario.
On 2015-05-26 19:07:20 +0200, Andres Freund wrote:
> It is somewhat interesting that similar code has been used in
> pg_upgrade, via initdb -S, for a while now, without, to my knowledge, it
> causing reported problems. I think the relevant difference is that that
> code doesn't follow symlinks. It's obviously also less exercised and
> people might just have fixed up permissions when encountering troubles.
>
> Abhijit, do you recall why the code was changed to follow all symlinks
> in contrast to explicitly going through the tablespaces as initdb -S
> does? I'm pretty sure early versions of the patch pretty much had a
> verbatim copy of the initdb logic? That logic is missing pg_xlog btw,
> which is bad for pg_upgrade.

So, this was discussed in the following thread, starting at:
http://archives.postgresql.org/message-id/20150403163232.GA28444%40eldon.alvh.no-ip.org

"Actually, since surely we must follow symlinks everywhere, why do we have to do this separately for pg_tblspc? Shouldn't that link-following occur automatically when walking PGDATA in the first place?"

I don't think it's true that we must follow symlinks everywhere. I think, as argued upthread, that it's sufficient to recurse through PGDATA, follow the symlinks in pg_tblspc, and, if it's a symlink, also go through pg_xlog separately. There are no other places where it's "allowed" to introduce symlinks, and we have refuted bug reports from people who had problems after doing that.

So what I propose is:

1) Remove the automatic symlink following
2) Follow pg_tblspc/*, and pg_xlog if it's a symlink; fix the latter in initdb -S
3) Add an elevel argument to walkdir(), return if AllocateDir() fails, continue for stat() failures in the readdir() loop.
4) Add an elevel argument to pre_sync_fname and fsync_fname, return after errors.
5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By virtue of not following symlinks we should not need to worry about EROFS.

I'm inclined to think that 4) is a big enough compat break that a fsync_fname_ext with the new argument is a good idea. Arguments for/against?
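As a rough illustration of points 4) and 5), here is a standalone sketch of an fsync_fname_ext-style helper: directories opened read-only, plain files read-write, EACCES (and ETXTBSY where defined) skipped on open(), other failures logged without dying. Error reporting is reduced to stderr; the real backend function would use ereport() with a caller-supplied elevel and its own file-descriptor wrappers, so treat this only as a sketch of the proposed behaviour.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

void
fsync_fname_sketch(const char *fname, int isdir)
{
    int         fd = open(fname, isdir ? O_RDONLY : O_RDWR);

    if (fd < 0)
    {
        if (errno == EACCES)
            return;                 /* unreadable/unwritable: not ours to sync */
#ifdef ETXTBSY
        if (errno == ETXTBSY)
            return;                 /* e.g. a link to a currently-running binary */
#endif
        fprintf(stderr, "could not open \"%s\": %s\n", fname, strerror(errno));
        return;
    }

    if (fsync(fd) != 0)
        fprintf(stderr, "could not fsync \"%s\": %s\n", fname, strerror(errno));

    close(fd);
}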
From: Abhijit Menon-Sen
At 2015-05-26 19:07:20 +0200, andres@anarazel.de wrote: > > Abhijit, do you recall why the code was changed to follow all symlinks > in contrast to explicitly going through the tablespaces as initdb -S > does? I'm pretty sure early versions of the patch pretty much had a > verbatim copy of the initdb logic? Yes, earlier versions of the patch did follow symlinks only in pg_tblspc. I changed this because Álvaro pointed out that this was likely to miss important stuff, e.g. in 20150403163232.GA28444@eldon.alvh.no-ip.org, 20150406131636.GD4369@alvh.no-ip.org -- Abhijit
From: Abhijit Menon-Sen
At 2015-05-26 22:44:03 +0200, andres@anarazel.de wrote: > > So, this was discussed in the following thread, starting at: > http://archives.postgresql.org/message-id/20150403163232.GA28444%40eldon.alvh.no-ip.org Sorry, I didn't see this before replying. > There are no other places we it's "allowed" to introduce symlinks and > we have refuted bugreports of people having problems after doing that. OK. > So what I propose is: > 1) Remove the automatic symlink following > 2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in > initdb -S > 3) Add a elevel argument to walkdir(), return if AllocateDir() fails, > continue for stat() failures in the readdir() loop. > 4) Add elevel argument to pre_sync_fname, fsync_fname, return after > errors. > 5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By > virtue of not following symlinks we should not need to worry about > EROFS Makes sense. I've got most of that done, I'll just remove the symlink following (or rather, restrict it to the cases mentioned), test a bit, and post. > I'm inclined to think that 4) is a big enough compat break that a > fsync_fname_ext with the new argument is a good idea. Agreed. -- Abhijit
From: Abhijit Menon-Sen
At 2015-05-26 22:44:03 +0200, andres@anarazel.de wrote: > > So what I propose is: > 1) Remove the automatic symlink following > 2) Follow pg_tbspc/*, pg_xlog if it's a symlink, fix the latter in > initdb -S > 3) Add a elevel argument to walkdir(), return if AllocateDir() fails, > continue for stat() failures in the readdir() loop. > 4) Add elevel argument to pre_sync_fname, fsync_fname, return after > errors. > 5) Accept EACCESS, ETXTBSY (if defined) when open()ing the files. By > virtue of not following symlinks we should not need to worry about > EROFS Here's a WIP patch for discussion. I've (a) removed the S_ISLNK() branch in walkdir, (b) reintroduced walktblspc_links to call walkdir on each of the entries within pg_tblspc (simpler than trying to make walkdir follow links only for pg_xlog and under pg_tblspc), (c) call walkdir on pg_xlog if it's a symlink (not done for initdb -S; will submit separately), (d) add elevel arguments as described, (e) ignore EACCES and ETXTBSY. This correctly fsync()s stuff according to strace, and doesn't die if there are unreadable files/links in PGDATA. What I haven't done is return if AllocateDir() fails. I'm not convinced that's correct, because it'll not complain if PGDATA is unreadable (but this will break other things, so it doesn't matter), but also will die if readdir fails rather than opendir. I'm trying a couple of approaches to that (e.g. using readdir directly instead of ReadDir), but other suggestions are welcome. -- Abhijit
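To make the walking logic described above easier to picture, here is a self-contained POSIX sketch with invented names and stderr logging in place of elog(): recurse without following symlinks, fsync plain files, and fsync each directory after its contents. The caller would still walk pg_xlog and each pg_tblspc entry explicitly when they are symlinks, as the patch description says.

#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* fsync one path: directories read-only, plain files read-write. */
static void
fsync_entry(const char *path, int isdir)
{
    int         fd = open(path, isdir ? O_RDONLY : O_RDWR);

    if (fd < 0)
    {
        if (errno != EACCES)        /* permission problems are skipped */
            fprintf(stderr, "could not open \"%s\": %s\n", path, strerror(errno));
        return;
    }
    if (fsync(fd) != 0)
        fprintf(stderr, "could not fsync \"%s\": %s\n", path, strerror(errno));
    close(fd);
}

/* Walk dir recursively; symlinks are deliberately not followed. */
void
walk_and_fsync(const char *dir)
{
    DIR        *d = opendir(dir);
    struct dirent *de;

    if (d == NULL)
    {
        fprintf(stderr, "could not open directory \"%s\": %s\n",
                dir, strerror(errno));
        return;
    }
    while ((de = readdir(d)) != NULL)
    {
        char        path[PATH_MAX];
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
        if (lstat(path, &st) < 0)
            continue;               /* the real code would log and carry on */
        if (S_ISREG(st.st_mode))
            fsync_entry(path, 0);
        else if (S_ISDIR(st.st_mode))
            walk_and_fsync(path);
    }
    closedir(d);
    fsync_entry(dir, 1);            /* sync the directory itself last */
}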
From: Abhijit Menon-Sen
At 2015-05-27 11:46:39 +0530, ams@2ndQuadrant.com wrote: > > I'm trying a couple of approaches to that (e.g. using readdir directly > instead of ReadDir), but other suggestions are welcome. Here's what that looks like, but not yet fully tested. -- Abhijit
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > At 2015-05-27 11:46:39 +0530, ams@2ndQuadrant.com wrote: >> I'm trying a couple of approaches to that (e.g. using readdir directly >> instead of ReadDir), but other suggestions are welcome. > Here's what that looks like, but not yet fully tested. I doubt that that (not using AllocateDir) is a good idea in the backend, mainly because you are going to leak directory references if anything called by walkdir() throws elog(ERROR). While that doesn't matter much for the immediate usage, since we'd throw up our hands and die anyway, it completely puts the kibosh on any idea that walkdir() is a general-purpose subroutine. It would have to be decorated with big red warnings NEVER USE THIS FUNCTION. Ick. What I think is a reasonable compromise is to treat AllocateDir failure as nonfatal, but to continue using ReadDir etc which means that any post-open failure in reading a successfully-opened directory would be fatal. Such errors would suggest that something is badly wrong in the filesystem, so I do not feel too terrible about not covering that case. Meanwhile, it seems like you have copied a couple of very dubious decisions in initdb's walktblspc_links(): 1. It doesn't say a word if the passed-in path (which in practice is always $PGDATA/pg_tblspc) doesn't exist. 2. It assumes that every entry within pg_tblspc must be a tablespace directory (or symlink to one), and fails if any are plain files. At the very least, these behaviors seem inconsistent. #2 is making strong assumptions about pg_tblspc being correctly set up and free of user-added cruft, while #1 doesn't even assume it's there at all. Now, we don't really have to do anything about these things in order to fix the immediate problem, but I wonder if we shouldn't try to clean up initdb's behavior while we're at it. Independently of that, I thought the plan was to complain about any problems at LOG message level, not DEBUG1, and definitely not WARNING (which is lower than LOG for this purpose). And why would you use a different message level for pg_xlog/ than for other files anyway? regards, tom lane
I wrote: > Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: >> At 2015-05-27 11:46:39 +0530, ams@2ndQuadrant.com wrote: >>> I'm trying a couple of approaches to that (e.g. using readdir directly >>> instead of ReadDir), but other suggestions are welcome. >> Here's what that looks like, but not yet fully tested. > I doubt that that (not using AllocateDir) is a good idea in the backend, Oh, scratch that. Reading the patch closer, I see you kept the use of AllocateDir and only replaced ReadDir. That's kind of ugly (and certainly requires a comment about the inconsistency) but it's probably ok from an error handling standpoint. There are hard failure cases in AllocateDir, but they seem unlikely to create problems in practice. The other points stand though ... regards, tom lane
From: Abhijit Menon-Sen
At 2015-05-27 20:22:18 -0400, tgl@sss.pgh.pa.us wrote: > > I doubt that that (not using AllocateDir) […] (Disregarding per your followup.) > What I think is a reasonable compromise is to treat AllocateDir > failure as nonfatal, but to continue using ReadDir etc which means > that any post-open failure in reading a successfully-opened directory > would be fatal. OK, that's halfway between the two patches I posted. Will do. > Meanwhile, it seems like you have copied a couple of very dubious > decisions in initdb's walktblspc_links(): […] > > Now, we don't really have to do anything about these things in order > to fix the immediate problem, but I wonder if we shouldn't try to > clean up initdb's behavior while we're at it. OK, I'll fix that in both. > Independently of that, I thought the plan was to complain about any > problems at LOG message level, not DEBUG1, and definitely not WARNING > (which is lower than LOG for this purpose). I'll change the level to LOG for fsync_fname_ext, but I think DEBUG1 is more appropriate for the pre_sync_fname run. Or do you think I should use LOG for that too? > And why would you use a different message level for pg_xlog/ than for > other files anyway? That was just a mistake, I forgot to change it after copying. Thanks for having a look. I'll post a revised patch shortly. -- Abhijit
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > At 2015-05-27 20:22:18 -0400, tgl@sss.pgh.pa.us wrote: >> I doubt that that (not using AllocateDir) […] > (Disregarding per your followup.) >> What I think is a reasonable compromise is to treat AllocateDir >> failure as nonfatal, but to continue using ReadDir etc which means >> that any post-open failure in reading a successfully-opened directory >> would be fatal. > OK, that's halfway between the two patches I posted. Will do. Thought you were disregarding? Anyway, I had an idea about reducing the ugliness. What you basically did is copy-and-paste ReadDir so you could change the elevel. What about turning ReadDir into a wrapper around a ReadDirExtended function that adds an elevel argument? regards, tom lane
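For illustration, roughly the shape such a wrapper pair could take in backend terms; the details are approximate and this is not the committed fd.c code.

#include "postgres.h"

#include <dirent.h>

/* Like ReadDir, but the caller chooses how loudly to complain. */
static struct dirent *
ReadDirExtended(DIR *dir, const char *dirname, int elevel)
{
    struct dirent *dent;

    errno = 0;
    if ((dent = readdir(dir)) != NULL)
        return dent;

    if (errno)
        ereport(elevel,
                (errcode_for_file_access(),
                 errmsg("could not read directory \"%s\": %m", dirname)));

    return NULL;                /* end of directory, or an ignored error */
}

/* The existing entry point keeps its die-on-error behaviour. */
struct dirent *
ReadDir(DIR *dir, const char *dirname)
{
    return ReadDirExtended(dir, dirname, ERROR);
}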
From: Abhijit Menon-Sen
At 2015-05-27 23:41:29 -0400, tgl@sss.pgh.pa.us wrote: > > What about turning ReadDir into a wrapper around a ReadDirExtended > function that adds an elevel argument? Sounds reasonable, doing. -- Abhijit
From: Abhijit Menon-Sen
Hi.

Here's an updated patch for the fsync problem(s). A few points that may be worth considering:

1. I've made the ReadDir → ReadDirExtended change, but in retrospect I don't think it's really worth it. Using ReadDir and letting it die is probably fine. (Aside: I left ReadDirExtended static for now.)

2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c? I wasn't sure if I should move that to fd.c as well. I think it's borderline OK for now.

3. There's a comment in fsync_fname that we need to open directories with O_RDONLY and non-directories with O_RDWR. Does this also apply to pre_sync_fname (i.e. sync_file_range and posix_fadvise) on any platforms we care about? It doesn't seem so, but I'm not sure.

4. As a partial aside, why does fsync_fname use OpenTransientFile? It looks like it should use BasicOpenFile as pre_sync_fname does. We close the fd immediately after calling fsync anyway. Or is there some reason I'm missing that we should use OpenTransientFile in pre_sync_fname too? (Aside²: it's pointless to specify S_IRUSR|S_IWUSR in fsync_fname since we're not setting O_CREAT. Minor, so will submit a separate patch.)

5. I made walktblspc_entries use stat() instead of lstat(), so that we can handle symlinks and directories the same way. Note that we won't continue to follow links inside the sub-directories, because we use walkdir instead of recursing. But this depends on S_ISDIR() returning true after stat() on Windows with a junction. Does that work? And are the entries in pg_tblspc on Windows actually junctions?

I've tested this with unreadable files in PGDATA and junk inside pg_tblspc. Feedback welcome.

I have a separate patch to initdb with the corresponding changes, which I will post after dinner and a bit more testing.

-- Abhijit
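As background for point 3, a hedged sketch of what the pre-sync hint amounts to; the function name is invented, the choice of call is platform-dependent, and this is not the backend's pre_sync_fname().

#define _GNU_SOURCE             /* for sync_file_range() on Linux */
#include <fcntl.h>

/*
 * Ask the kernel to start writing the file back now, so the later fsync
 * pass blocks for less time.  Purely a hint; errors are ignored.
 */
void
pre_sync_hint(int fd)
{
#if defined(__linux__)
    (void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#elif defined(POSIX_FADV_DONTNEED)
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#endif
}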
On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: > 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c? > I wasn't sure if I should move that to fd.c as well. I think it's > borderline OK for now. I think if the function is specific as fsync_pgdata(), fd.c is not the right place. I'm not sure xlog.c is either, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: >> 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c? >> I wasn't sure if I should move that to fd.c as well. I think it's >> borderline OK for now. > I think if the function is specific as fsync_pgdata(), fd.c is not the > right place. I'm not sure xlog.c is either, though. ISTM xlog.c is clearly the *wrong* place; if this behavior has anything to do with WAL logging as such, it's not apparent to me. fd.c is not a great place perhaps, but in view of the fact that we have things like RemovePgTempFiles() in there, it's not unreasonable to see fsync_pgdata as something to put there as well (perhaps with a name more chosen to match fd.c names...) Since Robert appears to be busy worrying about the multixact issue reported by Steve Kehlet, I suggest he focus on that and I'll take care of getting this thing committed. AFAICT we have consensus on what it should do and we're down to arguing about code style. regards, tom lane
From: Abhijit Menon-Sen
At 2015-05-28 19:56:15 +0530, ams@2ndQuadrant.com wrote: > > I have a separate patch to initdb with the corresponding changes, which > I will post after dinner and a bit more testing. Here's that patch too. I was a bit unsure how far to go with this. It fixes the problem of not following pg_xlog if it's a symlink (Andres) and the one where it died on bogus stuff in pg_tblspc (Tom) and unreadable files anywhere else (it now spews errors to stderr, but carries on for as long as it can). I've done a bit more than strictly necessary to fix those problems, though, and made the code largely similar to what's in the other patch. If you want something minimal, let me know and I'll cut it down to size. -- Abhijit P.S. If this patch gets applied, then the "Adapted from walktblspc_links in initdb.c" comment in fd.c should be changed: s/links/entries/.
On Thu, May 28, 2015 at 1:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, May 28, 2015 at 10:26 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote: >>> 2. Robert, are you comfortable with what fsync_pgdata() does in xlog.c? >>> I wasn't sure if I should move that to fd.c as well. I think it's >>> borderline OK for now. > >> I think if the function is specific as fsync_pgdata(), fd.c is not the >> right place. I'm not sure xlog.c is either, though. > > ISTM xlog.c is clearly the *wrong* place; if this behavior has anything > to do with WAL logging as such, it's not apparent to me. fd.c is not > a great place perhaps, but in view of the fact that we have things like > RemovePgTempFiles() in there, it's not unreasonable to see fsync_pgdata > as something to put there as well (perhaps with a name more chosen to > match fd.c names...) OK, sure, makes sense. > Since Robert appears to be busy worrying about the multixact issue > reported by Steve Kehlet, I suggest he focus on that and I'll take > care of getting this thing committed. AFAICT we have consensus on > what it should do and we're down to arguing about code style. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > Here's an updated patch for the fsync problem(s). I've committed this after some mostly-cosmetic rearrangements. > 4. As a partial aside, why does fsync_fname use OpenTransientFile? It > looks like it should use BasicOpenFile as pre_sync_fname does. We > close the fd immediately after calling fsync anyway. Or is there > some reason I'm missing that we should use OpenTransientFile in > pre_sync_fname too? pre_sync_fname is the one that is wrong IMO. Using BasicOpenFile just creates an opportunity for problems; that function should get called from as few places as possible. > 5. I made walktblspc_entries use stat() instead of lstat(), so that we > can handle symlinks and directories the same way. Note that we won't > continue to follow links inside the sub-directories because we use > walkdir instead of recursing. Given that, walktblspc_entries was 99% duplicate, so I folded it into walkdir with a process_symlinks boolean. I have to leave shortly, so I'll look at the initdb cleanup tomorrow. regards, tom lane
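To make the folding concrete, here is a minimal sketch of a walkdir() that takes a process_symlinks flag: follow links only where the caller asks, lstat() everywhere below that. This is an illustration under those assumptions, not the committed function:

#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Sketch only: apply "action" to every regular file, recurse into
 * subdirectories, and follow symlinks only when process_symlinks is true.
 * Recursion always passes false, so links below the top level are not
 * chased.  Error handling is reduced to a stderr message.
 */
static void
walkdir_sketch(const char *path,
               void (*action) (const char *fname, bool isdir),
               bool process_symlinks)
{
    DIR        *dir = opendir(path);
    struct dirent *de;

    if (dir == NULL)
    {
        fprintf(stderr, "could not open directory \"%s\"\n", path);
        return;
    }

    while ((de = readdir(dir)) != NULL)
    {
        char        subpath[4096];
        struct stat st;
        int         rc;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        snprintf(subpath, sizeof(subpath), "%s/%s", path, de->d_name);

        /* stat() follows symlinks, lstat() does not. */
        rc = process_symlinks ? stat(subpath, &st) : lstat(subpath, &st);
        if (rc < 0)
            continue;

        if (S_ISREG(st.st_mode))
            action(subpath, false);
        else if (S_ISDIR(st.st_mode))
            walkdir_sketch(subpath, action, false);
    }

    closedir(dir);

    /* Sync the directory itself only after its contents. */
    action(path, true);
}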
At 2015-05-28 17:37:16 -0400, tgl@sss.pgh.pa.us wrote: > > I've committed this after some mostly-cosmetic rearrangements. Thank you. > I have to leave shortly, so I'll look at the initdb cleanup tomorrow. Here's a revision of that patch that's more along the lines of what you committed. It occurred to me that we should probably also silently skip EACCES on opendir and stat/lstat in walkdir. That's what the attached initdb patch does. If you think it's a good idea, there's also a small fixup attached for the backend version too. -- Abhijit
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: >> I have to leave shortly, so I'll look at the initdb cleanup tomorrow. > Here's a revision of that patch that's more along the lines of what you > committed. Will look at that tomorrow ... > It occurred to me that we should probably also silently skip EACCES on > opendir and stat/lstat in walkdir. That's what the attached initdb patch > does. If you think it's a good idea, there's also a small fixup attached > for the backend version too. Actually, I was seriously considering removing the don't-log-EACCES special case (at least for files, perhaps not for directories). The reason being that it's darn hard to verify that the code is doing what it's supposed to when there is no simple way to get it to log any complaints. I left it as you wrote it in the committed version, but I think both that and the ETXTBSY special case do not really have value as long as their only effect is to suppress a LOG entry. Speaking of which, could somebody test that the committed patch does what it's supposed to on Windows? You were worried upthread about whether the tests for symlinks (aka junction points) behaved correctly, and I have no way to check that either. regards, tom lane
On Fri, May 29, 2015 at 12:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: >>> I have to leave shortly, so I'll look at the initdb cleanup tomorrow. > >> Here's a revision of that patch that's more along the lines of what you >> committed. > > Will look at that tomorrow ... > >> It occurred to me that we should probably also silently skip EACCES on >> opendir and stat/lstat in walkdir. That's what the attached initdb patch >> does. If you think it's a good idea, there's also a small fixup attached >> for the backend version too. > > Actually, I was seriously considering removing the don't-log-EACCES > special case (at least for files, perhaps not for directories). > The reason being that it's darn hard to verify that the code is doing > what it's supposed to when there is no simple way to get it to log any > complaints. I left it as you wrote it in the committed version, but > I think both that and the ETXTBSY special case do not really have value > as long as their only effect is to suppress a LOG entry. > > Speaking of which, could somebody test that the committed patch does > what it's supposed to on Windows? You were worried upthread about > whether the tests for symlinks (aka junction points) behaved correctly, > and I have no way to check that either. The error can be reproduced by using junction (https://technet.microsoft.com/en-us/sysinternals/bb896768.aspx) to define a junction point within PGDATA and then for example mark as read-only a configuration file included in postgresql.conf through the junction point. Using this method, I have checked that 9.4.1 works, reproduced the permission error with 9.4.2, and checked as well that a3ae3db4 fixed the problem. So things look to work fine. Regards, -- Michael
On Fri, May 29, 2015 at 9:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>
> Speaking of which, could somebody test that the committed patch does
> what it's supposed to on Windows? You were worried upthread about
> whether the tests for symlinks (aka junction points) behaved correctly,
> and I have no way to check that either.
>
I have done some tests for the committed patch for this issue
(especially to check the behaviour of symlink tests in the
new code) on Windows 7, and below are the results:
Test symlink for pg_xlog
-------------------------------------
Test-1 - Junction point for pg_xlog
1. Create a database (initdb)
2. Start server and forcefully crash it
3. cp E:\PGData\pg_xlog E:\TempLog\xlog_symlink_loc
4. mv E:\PGData\pg_xlog E:\bak_pg_xlog
5. Created a directory junction (equivalent to a symlink):
mklink /J E:\PGData\pg_xlog E:\TempLog\xlog_symlink_loc
6. Restart Server - operation is successful.
I have debugged this operation; it does what it is supposed to do:
detects the junction point and does fsync.
Test-2 - Directory Symlink for pg_xlog
First 4 steps are the same as in Test-1
5. mklink /D E:\PGData\pg_xlog E:\TempLog\xlog_symlink_loc
6. Restart Server - Error
- FATAL: required WAL directory "pg_xlog" does not exist
This error occurs in
ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf)
7. If I manually step through that line, it proceeds and in
SyncDataDirectory(), it detects the symlink;
8. After that in SyncDataDirectory(), when it does walkdir for
datadir, for ./pg_xlog it reports the log message and then
proceeds normally and the server is started.
Test-3 - Symlinks in pg_tblspc.
1. Create a couple of tablespaces, which creates symlinks
in pg_tblspc
2. Crash the server
3. Restart Server - It works fine.
I am not sure whether Test-2 is a valid scenario that we should
support, but the current patch is sane w.r.t. that form of symlinks
as well.
On Fri, May 29, 2015 at 5:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, May 29, 2015 at 9:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> >> Speaking of which, could somebody test that the committed patch does >> what it's supposed to on Windows? You were worried upthread about >> whether the tests for symlinks (aka junction points) behaved correctly, >> and I have no way to check that either. >> > > I have done some tests for the committed patch for this issue > (especially to check the behaviour of symlink tests in the > new code) on Windows - 7 and below are results: > > Test symlink for pg_xlog > ------------------------------------- > Test -1 - Junction point for pg_xlog > 1. Create a database (initdb) > 2. Start server and forcefully crash it > 3. cp E:\PGData\pg_xlog E:\TempLog\xlog_symlink_loc > 4. mv E:\PGData\pg_xlog E:\bak_pg_xlog > 5. Created a directory junction (equivalent symlink) > mklink /J E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc > 6. Restart Server - operation is successful. > I have debugged this operation, it does what it suppose to do, > detects the junction point and does fsync. > > Test - 2 - Directory Symlink for pg_xlog > First 4 steps are same as Test-1 > 5. mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc > 6. Restart Server - Error > - FATAL: required WAL directory "pg_xlog" does not exist > This error occurs in > ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf) > 7. If I manually step-through that line, it proceeds and in > SyncDataDirectory(), it detects the symlink; > 8. After that in SyncDataDirectory(), when it does walkdir for > datadir, for ./pg_xlog it reports the log message and then > proceeds normal and the server is started. > > Test-3 - Symlinks in pg_tblspc. > 1. Create couple of tablespaces which creates symlinks > in pg_tblspc > 2. Crash the server > 3. Restart Server - It works fine. > > I am not sure Test-2 is a valid thing and we should support it or > not, but the current patch is sane w.r.t that form of symlinks > as well. It is always good to check, but is that relevant to the bug fixed? I mean, you need to symlink a file in PGDATA that server has no write permission on... For example tablespaces can be written to in test 3, so that would work even with 9.4.2 after crashing the server with -m immediate. -- Michael
On Fri, May 29, 2015 at 2:28 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Fri, May 29, 2015 at 5:01 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Test-3 - Symlinks in pg_tblspc.
> > 1. Create couple of tablespaces which creates symlinks
> > in pg_tblspc
> > 2. Crash the server
> > 3. Restart Server - It works fine.
> >
> > I am not sure Test-2 is a valid thing and we should support it or
> > not, but the current patch is sane w.r.t that form of symlinks
> > as well.
>
> It is always good to check, but is that relevant to the bug fixed? I
> mean, you need to symlink a file in PGDATA that server has no write
> permission on... For example tablespaces can be written to in test 3,
> so that would work even with 9.4.2 after crashing the server with -m
> immediate.
>
I have just tried to cover the paths that have symlink-related code
in the new commit (in functions SyncDataDirectory() and walkdir()),
because that's what I understood from Tom's mail. Without Test-3, it
wouldn't cover the walkdir symlink path. I didn't know there was any
confusion about verifying the permissions problem on Windows,
and I haven't looked at verifying the whole patch.
Re: Tom Lane 2015-05-28 <5740.1432849036@sss.pgh.pa.us> > Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > > Here's an updated patch for the fsync problem(s). > > I've committed this after some mostly-cosmetic rearrangements. Fwiw, I can confirm that the problem is fixed for 9.5. The regression tests I've added to postgresql-common pass. Thanks! Christoph -- cb@df7cb.de | http://www.df7cb.de/
On Fri, May 29, 2015 at 6:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, May 29, 2015 at 2:28 PM, Michael Paquier <michael.paquier@gmail.com> > wrote: >> >> On Fri, May 29, 2015 at 5:01 PM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > >> > Test-3 - Symlinks in pg_tblspc. >> > 1. Create couple of tablespaces which creates symlinks >> > in pg_tblspc >> > 2. Crash the server >> > 3. Restart Server - It works fine. >> > >> > I am not sure Test-2 is a valid thing and we should support it or >> > not, but the current patch is sane w.r.t that form of symlinks >> > as well. >> >> It is always good to check, but is that relevant to the bug fixed? I >> mean, you need to symlink a file in PGDATA that server has no write >> permission on... For example tablespaces can be written to in test 3, >> so that would work even with 9.4.2 after crashing the server with -m >> immediate. > > I have just tried to cover the paths which has symlink related code > in the new commit (in functions SyncDataDirectory() and walkdir()), > because thats what I understood from Tom's mail. Without test-3, it > won't cover walkdir symlink test path, I didn't knew that there was any > confusion about verification of permissions problem on Windows > and I haven't looked to verify the whole patch. Reading once again this thread (after some good red wine + alpha), it looks like you are right: not many checks with Windows junctions have actually been done with what has been committed... > Test - 2 - Directory Symlink for pg_xlog > First 4 steps are same as Test-1 > 5. mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc > 6. Restart Server - Error > - FATAL: required WAL directory "pg_xlog" does not exist > This error occurs in > ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf) > 7. If I manually step-through that line, it proceeds and in > SyncDataDirectory(), it detects the symlink; > 8. After that in SyncDataDirectory(), when it does walkdir for > datadir, for ./pg_xlog it reports the log message and then > proceeds normal and the server is started. Regarding this test, I just tested initdb -X and it works correctly after crashing the server, so there is at least a workaround for the error you are triggering here. Now, shouldn't we improve things as well with mklink? That's a rather narrow case and we have a workaround for it at initialization with initdb. I haven't looked much at the code so I don't have a patch at hand but thoughts on this matter are welcome. -- Michael
Michael Paquier <michael.paquier@gmail.com> writes: > On Fri, May 29, 2015 at 6:49 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Test - 2 - Directory Symlink for pg_xlog >> First 4 steps are same as Test-1 >> 5. mklink /D E:\ PGData\pg_xlog E:\TempLog\xlog_symlink_loc >> 6. Restart Server - Error >> - FATAL: required WAL directory "pg_xlog" does not exist >> This error occurs in >> ValidateXLOGDirectoryStructure()->stat(XLOGDIR, &stat_buf) >> 7. If I manually step-through that line, it proceeds and in >> SyncDataDirectory(), it detects the symlink; >> 8. After that in SyncDataDirectory(), when it does walkdir for >> datadir, for ./pg_xlog it reports the log message and then >> proceeds normal and the server is started. > Regarding this test, I just tested initdb -X and it works correctly > after crashing the server, so there is at least a workaround for the > error you are triggering here. Now, shouldn't we improve things as > well with mklink? That's a rather narrow case and we have a workaround > for it at initialization with initdb. I haven't looked much at the > code so I don't have a patch at hand but thoughts on this matter are > welcome. My (vague) recollection is that what mklink creates is not the same as a junction point and we intentionally only support the latter as pseudo-symlinks. You'd need to trawl the archives for more detail, unless someone else remembers. So I think we're good here. Thanks for all the testing! regards, tom lane
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > At 2015-05-28 17:37:16 -0400, tgl@sss.pgh.pa.us wrote: >> I have to leave shortly, so I'll look at the initdb cleanup tomorrow. > Here's a revision of that patch that's more along the lines of what you > committed. Pushed with minor revisions. > It occurred to me that we should probably also silently skip EACCES on > opendir and stat/lstat in walkdir. That's what the attached initdb patch > does. If you think it's a good idea, there's also a small fixup attached > for the backend version too. As I mentioned yesterday, I'm not really on board with ignoring EACCES, except for the directories-on-Windows case. Since we're only logging the failures anyway, I think it is reasonable to log a complaint for any unwritable file in the data directory. I've not done it yet, but what I think we oughta do is revert if (errno == EACCES || (isdir && errno == EISDIR)) back to the former logic if (isdir && (errno == EISDIR || errno == EACCES)) and even then, maybe the EACCES exclusion ought to be Windows only? Also I want to get rid of the ETXTBSY special cases. That one doesn't seem like something that we should silently ignore: what the heck are executables doing in the data directory? Or is there some other meaning on Windows? regards, tom lane
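As a self-contained illustration of the two checks quoted above (simplified, not the committed fsync_fname, which goes through the backend's error-reporting machinery): variant A skips EACCES for any entry, while variant B, the older form being reverted to, gives only directories a pass.

#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Illustration only: a simplified fsync-one-path helper showing where the
 * EACCES/EISDIR special case sits.  Define VARIANT_A to compare behaviours.
 */
static void
fsync_fname_sketch(const char *fname, bool isdir)
{
    /* Directories are opened read-only; regular files read-write. */
    int         fd = open(fname, isdir ? O_RDONLY : O_RDWR, 0);

    if (fd < 0)
    {
#ifdef VARIANT_A
        /* Variant A: silently skip any unreadable entry. */
        if (errno == EACCES || (isdir && errno == EISDIR))
            return;
#else
        /* Variant B: only directories get a pass. */
        if (isdir && (errno == EISDIR || errno == EACCES))
            return;
#endif
        fprintf(stderr, "could not open \"%s\": %s\n", fname, strerror(errno));
        return;
    }

    if (fsync(fd) != 0)
        fprintf(stderr, "could not fsync \"%s\": %s\n", fname, strerror(errno));

    close(fd);
}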
At 2015-05-29 13:14:18 -0400, tgl@sss.pgh.pa.us wrote: > > Pushed with minor revisions. Thanks, looks good. > Since we're only logging the failures anyway, I think it is reasonable > to log a complaint for any unwritable file in the data directory. Sounds reasonable, patch attached. ETXTBSY has no special meaning that I can find under Windows, so I included that change as well. -- Abhijit
On 2015-05-29 13:14:18 -0400, Tom Lane wrote: > Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > As I mentioned yesterday, I'm not really on board with ignoring EACCES, > except for the directories-on-Windows case. Since we're only logging > the failures anyway, I think it is reasonable to log a complaint for any > unwritable file in the data directory. That sounds like a potentially nontrivial amount of repetitive log bleat after every crash start? One which the user can't really stop? > Also I want to get rid of the ETXTBSY special cases. That one doesn't > seem like something that we should silently ignore: what the heck are > executables doing in the data directory? Or is there some other meaning > on Windows? I've seen a bunch of binaries placed in the data directory as archive/restore commands. Those will be busy a good amount of the time. While it'd not be my choice to do that, it's not entirely unreasonable. Andres
Andres Freund <andres@anarazel.de> writes: > On 2015-05-29 13:14:18 -0400, Tom Lane wrote: >> Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: >> As I mentioned yesterday, I'm not really on board with ignoring EACCES, >> except for the directories-on-Windows case. Since we're only logging >> the failures anyway, I think it is reasonable to log a complaint for any >> unwritable file in the data directory. > That sounds like a potentially nontrivial amount of repetitive log bleat > after every crash start? One which the user can't really stop? Why can't the user stop it? We won't be bleating about the case of a symlink to a non-writable file someplace else, which is the Debian use case. I don't see a very good excuse to have a non-writable file right in the data directory. >> Also I want to get rid of the ETXTBSY special cases. That one doesn't >> seem like something that we should silently ignore: what the heck are >> executables doing in the data directory? Or is there some other meaning >> on Windows? > I've seen a bunch of binaries placed in the data directory as > archive/restore commands. Those will be busy a good amount of the > time. While it'd not be my choice to do that, it's not entirely > unreasonable. I'd say it's a pretty damn-fool arrangement: for starters, it's an unnecessary security hazard. In any case, if the cost of such a file is one more line of log output during a crash restart, most people would have no problem at all in ignoring that log output. regards, tom lane
On 2015-05-29 13:49:16 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2015-05-29 13:14:18 -0400, Tom Lane wrote: > >> Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > >> As I mentioned yesterday, I'm not really on board with ignoring EACCES, > >> except for the directories-on-Windows case. Since we're only logging > >> the failures anyway, I think it is reasonable to log a complaint for any > >> unwritable file in the data directory. > > > That sounds like a potentially nontrivial amount of repetitive log bleat > > after every crash start? One which the user can't really stop? > > Why can't the user stop it? Because it makes a good amount of sense to have e.g. certificates not owned by postgres and not writeable? You don't necessarily want to symlink them somewhere else, because that makes moving clusters around harder than when they're self contained. > I'd say it's a pretty damn-fool arrangement: for starters, it's an > unnecessary security hazard. I don't buy the security argument at all. You likely have postgresql.conf in the data directory. You can write to at least .auto, which will definitely reside in the data directory. That contains archive_command. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2015-05-29 13:49:16 -0400, Tom Lane wrote: >> Why can't the user stop it? > Because it makes a good amount of sense to have e.g. certificates not > owned by postgres and not writeable? You don't necessarily want to > symlink them somewhere else, because that makes moving clusters around > harder than when they're self contained. Meh. Well, I'm willing to yield on the EACCES point, but I still find the exclusion for ETXTBSY to be ugly and inappropriate. >> I'd say it's a pretty damn-fool arrangement: for starters, it's an >> unnecessary security hazard. > I don't buy the security argument at all. You likely have > postgresql.conf in the data directoy. You can write to at least .auto, > which will definitely reside the data directory. That contains > archive_command. The fact that a superuser might have multiple ways to subvert things doesn't make it a good idea to add another one: the attack surface could be larger, or at least different. But even if you don't buy that it's a security hazard, why would it be a good idea to have executables inside $PGDATA? That would for example lead to them getting copied by pg_basebackup, which seems unlikely to be a good thing. And if you did have such executables there, why would they be active during a postmaster restart? I really seriously doubt that this is either common enough or useful enough to justify suppressing warning messages about it. regards, tom lane
On 2015-05-29 14:15:48 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2015-05-29 13:49:16 -0400, Tom Lane wrote: > >> Why can't the user stop it? > > > Because it makes a good amount of sense to have e.g. certificates not > > owned by postgres and not writeable? You don't necessarily want to > > symlink them somewhere else, because that makes moving clusters around > > harder than when they're self contained. > > Meh. Well, I'm willing to yield on the EACCES point, but I still find > the exclusion for ETXTBSY to be ugly and inappropriate. Ok.
* Andres Freund (andres@anarazel.de) wrote: > On 2015-05-29 13:49:16 -0400, Tom Lane wrote: > > > That sounds like a potentially nontrivial amount of repetitive log bleat > > > after every crash start? One which the user can't really stop? > > > > Why can't the user stop it? > > Because it makes a good amount of sense to have e.g. certificates not > owned by postgres and not writeable? You don't necessarily want to > symlink them somewhere else, because that makes moving clusters around > harder than when they're self contained. A certain other file might be non-writable by PG too... (*cough* .auto *cough*). > > I'd say it's a pretty damn-fool arrangement: for starters, it's an > > unnecessary security hazard. > > I don't buy the security argument at all. You likely have > postgresql.conf in the data directoy. You can write to at least .auto, > which will definitely reside the data directory. That contains > archive_command. I'm not sure that I see the security issue here either.. We're not talking about setuid shell scripts or anything that isn't running as the PG user, which a superuser could take over anyway.. Thanks! Stephen
Re: Tom Lane 2015-05-29 <13871.1432921756@sss.pgh.pa.us> > Why can't the user stop it? We won't be bleating about the case of a > symlink to a non-writable file someplace else, which is the Debian use > case. I don't see a very good excuse to have a non-writable file right > in the data directory. I've repeatedly seen PGDATA or pg_xlog put directly on a mountpoint, which means there will be a non-writable lost+found directory there. (A case with pg_xlog was also reported as a support case at credativ.) I'm usually advising against using the top level directory directly, but it's not uncommon to encounter it. > In any case, if the cost of such a file is one more line of log output > during a crash restart, most people would have no problem at all in > ignoring that log output. Nod. Christoph -- cb@df7cb.de | http://www.df7cb.de/