Thread: EINTR error in SunOS
I encountered an error today (can't repeat) on SunOS 5.8: --test that we read consecutive LFs properly CREATE TEMP TABLE testnl (a int, b text, c int); + ERROR: could not open relation 1663/16384/37713: Interrupted system call The reason I guess is the open() call is interrupted by a signal (what signal BTW?). This error may be specific to SunOS/Solaris, but POSIX does say that an EINTR is possible on open(), close(), read(), write() and also the fopen() family: http://www.opengroup.org/onlinepubs/007908799/xsh/open.html We have patched read()/write(), shall we do so to open()/close() and also fopen() family? Patching files other than fd.c seems unnecessary for two reasons: (1) they are not frequently exercised; (2) they don't have the basic errno-check code there. Regards, Qingqing
Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > + ERROR: could not open relation 1663/16384/37713: Interrupted system call > The reason I guess is the open() call is interrupted by a signal (what > signal BTW?). I've heard of this in connection with NFS ... is your DB on an NFS filesystem by any chance? regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > Qingqing Zhou <zhouqq@cs.toronto.edu> writes: >> + ERROR: could not open relation 1663/16384/37713: Interrupted system >> call > >> The reason I guess is the open() call is interrupted by a signal (what >> signal BTW?). > > I've heard of this in connection with NFS ... is your DB on an NFS > filesystem by any chance? > Exactly. I guess school machines love NFS. Regards, Qingqing
On Fri, 30 Dec 2005, Tom Lane wrote: > > I've heard of this in connection with NFS ... is your DB on an NFS > filesystem by any chance? > I have patched the IO routines in backend/storage for which POSIX says EINTR is possible, that is, all except unlink(). Though POSIX says EINTR is not possible for unlink(), during many regression runs I found it sometimes sets this errno on NFS (I still don't know where the smoking gun is): TRUNCATE TABLE trunc_c,trunc_d,trunc_e; -- ok + WARNING: could not remove relation 1663/16384/37822: Interrupted system call There are many other unlink() calls scattered in the backend, some even without an error check. Shall we add a pg_unlink for this situation and replace them like this:

    int pg_unlink(const char *path, int errlevel)
    {
        int returnCode;
    retry:
        returnCode = unlink(path);
        if (returnCode < 0 && errno == EINTR)
            goto retry;
        if other_errors
            elog(errlevel, ...);
        return returnCode;
    }

Or

    int pg_unlink(const char *path)
    {
        /* no elog -- but we still have to do error check */
    }

Or let it be ... If we decide to do something for unlink(), then we'd better do something for the other EINTR-possible IO routines for fairness :-) By the way, it seems POSIX is not very consistent about EINTR. For example, closedir() can set EINTR, but opendir()/readdir() can't. Any magic in it? Regards, Qingqing
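The first pg_unlink variant proposed above could be sketched in plain C roughly as follows. This is only an illustration, not PostgreSQL code: the name unlink_retry_eintr and the max_retries bound are assumptions made for the example, and error reporting is left to the caller instead of calling elog().

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of PostgreSQL): call unlink() and retry
 * while it fails with EINTR. Any other failure is returned to the caller
 * with errno untouched, so the caller decides how to report it.
 * max_retries is an illustrative bound so that a steady stream of
 * signals cannot keep the loop spinning forever.
 */
int
unlink_retry_eintr(const char *path, int max_retries)
{
    int rc;
    int attempts = 0;

    do
    {
        rc = unlink(path);
    } while (rc < 0 && errno == EINTR && attempts++ < max_retries);

    return rc;
}
```

A caller would use it like unlink(): a return of 0 means the file is gone, and -1 with errno set means either a real error or EINTR that persisted past the retry bound.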
Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > On Fri, 30 Dec 2005, Tom Lane wrote: > > > > I've heard of this in connection with NFS ... is your DB on an NFS > > filesystem by any chance? > > I have patched IO routines in backend/storage that POSIX says EINTR is > possible except unlink(). Though POSIX says EINTR is not possible, during > many regressions, I found it sometimes sets this errno on NFS (I still > don't know where is the smoking-gun): Well there is a reason intr is not the default for NFS mounts. It's precisely because it breaks the traditional unix filesystem interface. Syscalls that historically are not interruptible become interruptible and not all programs behave properly when that occurs. In any case POSIX explicitly allows functions to return other errors aside from those specified as long as it's for error conditions not listed. [Chapter 2 Section 3, paragraph 6] Implementations may support additional errors not included in this list, may generate errors included in this list under circumstances other than those described here, or may contain extensions or limitations that prevent some errors from occurring. The ERRORS section on each reference page specifies whether an error shall be returned, or whether it may be returned. Implementations shall not generate a different error number from the ones described here for error conditions described in this volume of IEEE Std 1003.1-2001, but may generate additional errors unless explicitly disallowed for a particular function. Ironically EINTR *is* singled out to be specifically forbidden to be returned from some system calls but only those in the Threads option which are mostly pthread* functions. unlink isn't covered by that prohibition. -- greg
Greg Stark <gsstark@mit.edu> writes: > Qingqing Zhou <zhouqq@cs.toronto.edu> writes: >> I have patched IO routines in backend/storage that POSIX says EINTR is >> possible except unlink(). Though POSIX says EINTR is not possible, during >> many regressions, I found it sometimes sets this errno on NFS (I still >> don't know where is the smoking-gun): > Well there is a reason intr is not the default for NFS mounts. It's precisely > because it breaks the traditional unix filesystem interface. Yeah. We have looked at this before and decided that trying to defend against it is too invasive and too fragile (how will you ever be sure you've fixed everyplace, or keep other places from sneaking in later?) What I'd rather do is document prominently that running a DB over NFS isn't recommended, and running it over NFS with interrupts allowed is just not going to work. regards, tom lane
On Sat, 31 Dec 2005, Tom Lane wrote: > > What I'd rather do is document prominently that running a DB over NFS > isn't recommended, and running it over NFS with interrupts allowed is > just not going to work. > Agreed. IO syscalls is not the only problem for NFS -- if we can't fix them in a run, then don't do it. Regards, Qingqing
On Sat, 2005-12-31 at 14:40 -0500, Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > >> I have patched IO routines in backend/storage that POSIX says EINTR is > >> possible except unlink(). Though POSIX says EINTR is not possible, during > >> many regressions, I found it sometimes sets this errno on NFS (I still > >> don't know where is the smoking-gun): > > > Well there is a reason intr is not the default for NFS mounts. It's precisely > > because it breaks the traditional unix filesystem interface. > What I'd rather do is document prominently that running a DB over NFS > isn't recommended, and running it over NFS with interrupts allowed is > just not going to work. Are there issues with having an archive_command which does things with NFS based filesystems? --
Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > On Sat, 31 Dec 2005, Tom Lane wrote: > > > > What I'd rather do is document prominently that running a DB over NFS > > isn't recommended, and running it over NFS with interrupts allowed is > > just not going to work. > > Agreed. IO syscalls is not the only problem for NFS -- if we can't fix > them in a run, then don't do it. I don't think that's reasonable. The NFS intr option breaks the traditional unix filesystem semantics which breaks a lot of older or naive programs. But that's no reason to decide that Postgres can't handle the new semantics. Handling EINTR after all file system calls doesn't sound like it would be terribly hard. And Postgres of all systems has the infrastructure necessary to handle error conditions, abort and roll back the transaction when a file system error occurs. I think mainly this means it would be possible to hit C-c or shut down postgres (uncleanly) when there's a network outage. -- greg
On Sat, 31 Dec 2005, Greg Stark wrote: > > I don't think that's reasonable. The NFS intr option breaks the traditional > unix filesystem semantics which breaks a lot of older or naive programs. But > that's no reason to decide that Postgres can't handle the new semantics. > Is EINTR turned off by default in NFS? If so, I don't see that it will be a problem. Sorry for my limited knowledge: are there any requirements or benefits that lead people to turn EINTR on? > Handling EINTR after all file system calls doesn't sound like it would be > terribly hard. The problem is not restricted to the file system. Actually my patched version (only backend/storage) passed hundreds of regression runs without any problem, but EINTR can hurt other syscalls as well. Finding *all* the EINTR situations may need a big effort AFAICS. Regards, Qingqing
On Sat, Dec 31, 2005 at 04:46:02PM -0500, Qingqing Zhou wrote: > Is that by default the EINTR is truned off in NFS? If so, I don't see that > will be a problem. Sorry for my limited knowledge, is there any > requirements/benefits that people turn on EINTR? I won't speak for anyone else, but the reason I set intr on for NFS mounts is so that if I turn off the file server I don't get unkillable processes on the client. Messy, sure, and maybe a better solution has been made since, but I really don't like processes stuck in D state (i.e. kill -9 won't work). Better the program die in some weird way than that... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > On Sat, 31 Dec 2005, Greg Stark wrote: > > > > > I don't think that's reasonable. The NFS intr option breaks the traditional > > unix filesystem semantics which breaks a lot of older or naive programs. But > > that's no reason to decide that Postgres can't handle the new semantics. > > > > Is that by default the EINTR is truned off in NFS? If so, I don't see that > will be a problem. Sorry for my limited knowledge, is there any > requirements/benefits that people turn on EINTR? That's why the "intr" option (and the "soft") option has traditionally not been enabled by default in NFS implementations. But many people don't like that when their NFS server disappears their client applications become unkillable. They like to be able to hit C-c and stop whatever is running. In the case of Postgres having "intr" off on the NFS mount point would mean you couldn't C-c a query stuck because the database is on NFS. Of course it's not like you would be able to run any more queries after that, but you might want your terminal back. You wouldn't even be able to shut down Postgres, even with kill -9. If your NFS server is unrecoverable and you want to bring up a Postgres instance using a backup restored some other place you would have to bring it up on another port or reboot your machine. That's the kind of thing that leads lots of sysadmins to use the "intr" and "soft" options. And those sysadmins generally aren't aware of these kinds of consequences since it's more of a programming level issue. > > Handling EINTR after all file system calls doesn't sound like it would be > > terribly hard. > > The problem is not restricted to file system. Actually my patched > version(only backend/storage) passed hundreds times of regression without > any problem, but EINTR can hurt other syscalls as well. Find out *all* the > EINTR situtations may need big efforts AFAICS. Well NFS is only going to affect filesystem calls. 
If there are other syscalls that can signal EINTR on some obscure platform where Postgres isn't handling it then that's just a run-of-the-mill porting issue. But like I mentioned in the other thread POSIX is of no help here. With the exception of the pthreads syscalls POSIX doesn't prohibit functions from signalling errors other than the ones documented in the specification. So in other words, just about any function can signal just about any error including errors that are proprietary additions any time. Good luck :) -- greg
On Sat, 31 Dec 2005, Greg Stark wrote: > > Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > > > > > Is that by default the EINTR is truned off in NFS? If so, I don't see that > > will be a problem. Sorry for my limited knowledge, is there any > > requirements/benefits that people turn on EINTR? > > That's why the "intr" option (and the "soft") option has traditionally not > been enabled by default in NFS implementations. But many people don't like > that when their NFS server disappears their client applications become > unkillable. They like to be able to hit C-c and stop whatever is running. > Thanks Greg and Martijn, I now understand intr better :-) So whether we can kill Postgres or not depends on our signal handler. The query cancel signal won't work because "ImmediateInterruptOK" forbids it and the retry-style code in read/write will put the Postgres process into uninterruptible sleep again. But the die signal will work, I think. Regards, Qingqing
Rod Taylor <pg@rbt.ca> writes: > Are there issues with having an archive_command which does things with > NFS based filesystems? Well, whatever command you use for archive_command -- probably just "cp" if you're using NFS would hang if the NFS server went away. What would happen then might be interesting. If Postgres finds the archive_command hanging indefinitely will it correctly avoid recycling the WAL log indefinitely? I assume so. What's nonoptimal here is that I don't think there would be any warning that anything was wrong until the WAL logs eventually filled up their filesystem and then postgres stopped running. In the meantime your archived WAL logs would be getting older and older and you would have no indication that anything was failing. This was the intention with the NFS error handling. The theory being that eventually the server comes back up and things resume functioning exactly where they left off with no lost operations. The upside is you don't have things failing, then resuming later and unhandled errors in the meantime leading to data corruption. The downside is there's no way for "cp" and ultimately Postgres to know anything's wrong except to have a timeout itself and an arbitrary maximum amount of time to expect operations to take. -- greg
EINTR on read() or write() is not unique to NFS. It can happen on many file systems - it is just seen less frequently on most of them. The code should be able to handle ANY valid read() and write() errno. And EINTR is documented on Linux, BSD, Solaris (1 and 2), and POSIX. Even the Linux man pages say that read() and write() can return EINTR. This can happen on soft-mirrors, SCSI disks, and SOME other disk drivers when they have errors. The 'intr' option to NFS is not the same as EINTR. It means 'if the server does not respond for a while, then return an EINTR', just like any other disk read() or write() does when it fails to reply. I have seen lots of open source code that assumes that all disk reads and writes work 100% or fail 100%. Many do not check the return value to see if all data was written or read from disk. And many do not look at errno at all. I have NOT looked to see how postgres does it. If storage/*.c is where the reads occur, it does very LITTLE when checking for errors. >>>Handling EINTR after all file system calls doesn't sound like it would be >>>terribly hard. >> >>The problem is not restricted to file system. Actually my patched >>version(only backend/storage) passed hundreds times of regression without >>any problem, but EINTR can hurt other syscalls as well. Find out *all* the >>EINTR situtations may need big efforts AFAICS. > > > Well NFS is only going to affect filesystem calls. If there are other syscalls > that can signal EINTR on some obscure platform where Postgres isn't handling > it then that's just a run-of-the-mill porting issue. > > But like I mentioned in the other thread POSIX is of no help here. With the > exception of the pthreads syscalls POSIX doesn't prohibit functions from > signalling errors other than the ones documented in the specification. So in > other words, just about any function can signal just about any error including > errors that are proprietary additions any time. 
Good luck :) > -- Doug Royer | http://INET-Consulting.com -------------------------------|----------------------------- We Do Standards - You Need Standards
Doug Royer <Doug@Royer.com> writes: > The 'intr' option to NFS is not the same as EINTR. It > it means 'if the server does not respond for a while, > then return an EINTR', just like any other disk read() > or write() does when it fails to reply. No, you're thinking of 'soft'. 'intr' (which is actually a modifier to the 'hard' setting) causes the I/O to hang until the server comes back or the process gets a signal (in which case EINTR is returned). -Doug
"Greg Stark" <gsstark@mit.edu> wrote > > Well NFS is only going to affect filesystem calls. If there are other > syscalls > that can signal EINTR on some obscure platform where Postgres isn't > handling > it then that's just a run-of-the-mill porting issue. > Ok, NFS just affects filesystem calls(I mix it with another problem). If possible, I hope we can draw some conclusion / schetch a fix plan here for future developers who want to come up with a patch. The question is: Where and how should we fix exactly in order to incorporate intr NFS in server side? More details we write down here, more feasible/infeasible plan we can get. I could think of these places: + direct file system calls - open() family, fopen() family in backend/storage - scattered open() etc in the whole backend(seems unlink is with biggest problem) The problem of above is if a signal sneaks in, these syscalls will fail. With a retry, we can fix it. + indirect file system calls - system("xxx") calls, xxx = cp, etc. If intr NFS is enabled, what's the problem exactly? Any others? Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > The problem of above is if a signal sneaks in, these syscalls will fail. > With a retry, we can fix it. It's a bit stickier than that but only a bit. If you just retry then you're saying users have to use kill -9 to get away from the situation. For some filesystem operations that may be the best we can do. But for most it ought to be possible to CHECK_FOR_INTERRUPTS() and handle the regular signals like C-c or kill -1 normally. Even having the single backend exit (to avoid file resource leaks) is nicer than having to restart the entire instance. -- greg
On Sun, 1 Jan 2006, Greg Stark wrote: > > "Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > > > The problem of above is if a signal sneaks in, these syscalls will fail. > > With a retry, we can fix it. > > It's a bit stickier than that but only a bit. If you just retry then you're > saying users have to use kill -9 to get away from the situation. For some > filesystem operations that may be the best we can do. But for most it ought to > be possible to CHECK_FOR_INTERRUPTS() and handle the regular signals like C-c > or kill -1 normally. Even having the single backend exit (to avoid file > resource leaks) is nicer than having to restart the entire instance. > I understand that putting a CHECK_FOR_INTERRUPTS() in the retry loop may make for a more graceful stop, but it won't work in some cases -- notice that the I/O routines we would patch can be used before the signal mechanism is set up. Regards, Qingqing
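The retry-plus-interrupt-check idea discussed above might look roughly like the sketch below. It is an assumption-laden illustration, not how the backend does it: cancel_requested is a hypothetical flag that a signal handler would set, standing in for the server's real pending-interrupt state, and open_retry_unless_cancelled is an invented name.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>

/*
 * Hypothetical flag, assumed to be set from a signal handler when the
 * user asks to cancel. It only stands in for real interrupt bookkeeping.
 */
volatile sig_atomic_t cancel_requested = 0;

/*
 * Retry open() when it is interrupted by a signal, but give up and
 * report EINTR to the caller once a cancel has been requested, so the
 * process is not forced back into an uninterruptible wait.
 */
int
open_retry_unless_cancelled(const char *path, int flags, mode_t mode)
{
    for (;;)
    {
        int fd = open(path, flags, mode);

        if (fd >= 0 || errno != EINTR)
            return fd;          /* success, or a real error */

        if (cancel_requested)
        {
            errno = EINTR;      /* let the caller abort cleanly */
            return -1;
        }
        /* interrupted by an unrelated signal: try again */
    }
}
```

As the message above notes, a check like this only helps where signal handling has already been set up; called earlier, the flag is never set and the loop behaves like a plain retry.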
Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > I understand put a CHECK_FOR_INTERRUPTS() in the retry-loop may make more > graceful stop, but it won't work in some cases -- notice that the io > routines we will patch can be used before the signal mechanism is setup. I don't think it will help much at all: too many of the operations in question are invoked in places where CHECK_FOR_INTERRUPTS is a no-op. Examples: * disk writes are mostly done by the bgwriter and not backends at all * unlinks are generally done during xact commit/rollback Qingqing's point about failures in system()-invoked commands (think archive_command for PITR) is a mighty good one too. That puts a serious crimp into any illusion that we can really fix this in any reliable way. regards, tom lane
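To make the system() point concrete, here is a small sketch of what a parent process can observe about a command such as an archive_command. The helper name run_command is hypothetical. The parent only sees the child's exit status or terminating signal, so it cannot retry an individual interrupted syscall inside the child.

```c
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: run a shell command with system() and classify
 * the outcome. Returns 0 if the command exited normally (exit code in
 * *exit_code), 1 if it was killed by a signal (number in *term_signal),
 * and -1 if the child could not be started or waited for.
 */
int
run_command(const char *cmd, int *exit_code, int *term_signal)
{
    int status = system(cmd);

    *exit_code = -1;
    *term_signal = 0;

    if (status == -1)
        return -1;

    if (WIFEXITED(status))
    {
        *exit_code = WEXITSTATUS(status);
        return 0;
    }
    if (WIFSIGNALED(status))
    {
        *term_signal = WTERMSIG(status);
        return 1;
    }
    return -1;
}
```

If a "cp" inside the command fails with EINTR on an intr-mounted NFS volume, the caller sees at most a non-zero exit code; the only available remedy at this level is to rerun the whole command.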
On Sun, 1 Jan 2006, Tom Lane wrote: > Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > > I understand put a CHECK_FOR_INTERRUPTS() in the retry-loop may make more > > graceful stop, but it won't work in some cases -- notice that the io > > routines we will patch can be used before the signal mechanism is setup. > > I don't think it will help much at all: too many of the operations in > question are invoked in places where CHECK_FOR_INTERRUPTS is a no-op. > Examples: > * disk writes are mostly done by the bgwriter and not backends at all > * unlinks are generally done during xact commit/rollback > Right. > Qingqing's point about failures in system()-invoked commands (think > archive_command for PITR) is a mighty good one too. That puts a > serious crimp into any illusion that we can really fix this in any > reliable way. > Not my credit, I just collected Rod's and Greg's posts about this here :-) And I am still not sure what exactly the problem is that we want to fix here -- I think our target is that the "operation should not fail because of EINTR". Regards, Qingqing
From the Linux 'nfs' man page: intr If an NFS file operation has a major timeout and it is hard mounted, then allow signals to interupt the file operation and cause it to return EINTR to the calling program. The default is to not allow file operations to be interrupted. Solaris 'mount_nfs' man page intr | nointr Allow (do not allow) keyboard interrupts to kill a process that is hung while waiting for a response on a hard-mounted file system. The default is intr, which makes it possible for clients to interrupt applications that may be waiting for a remote mount. The Solaris and Linux defaults seem to be the opposite of each other. So I think we are saying the same thing. You can get EINTR with hard+intr mounts. I am not sure what you get with soft mounts on a timeout. Doug McNaught wrote: > Doug Royer <Doug@Royer.com> writes: > > >>The 'intr' option to NFS is not the same as EINTR. It >>it means 'if the server does not respond for a while, >>then return an EINTR', just like any other disk read() >>or write() does when it fails to reply. > > > No, you're thinking of 'soft'. 'intr' (which is actually a modifier > to the 'hard' setting) causes the I/O to hang until the server comes > back or the process gets a signal (in which case EINTR is returned). > > -Doug > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Doug Royer | http://INET-Consulting.com -------------------------------|----------------------------- We Do Standards - You Need Standards
Doug Royer <Doug@Royer.com> writes: > From the Linux 'nfs' man page: > > intr If an NFS file operation has a major timeout and it is > hard mounted, then allow signals to interupt the file > operation and cause it to return EINTR to the calling > program. The default is to not allow file operations to > be interrupted. > > Solaris 'mount_nfs' man page > > intr | nointr > Allow (do not allow) keyboard interrupts to kill > a process that is hung while waiting for a > response on a hard-mounted file system. The > default is intr, which makes it possible for > clients to interrupt applications that may be > waiting for a remote mount. > > The Solaris and Linux defaults seem to be the opposite of each other. Actually they're the same, though differently worded. "Major timeout" means the server has not responded for N milliseconds, not that the client has decided to time out the request. If 'hard' is set, the client will keep trying indefinitely, though you can interrupt it if you've specified 'intr'. > So I think we are saying the same thing. > > You can get EINTR with hard+intr mounts. Yes, *only* if the user specifically decides to send a signal, or if it uses SIGALRM or whatever. I agree that if you expect 'intr' to be used, your code needs to handle EINTR. > I am not sure what you get with soft mounts on a timeout. The Linux manpage implies you get EIO. -Doug
Let me give you a sky-high view of this. Database reliability requires that the disk drive be 100% reliable. If any part of the disk storage fails (I/O write failure, NFS failure) we have to assume that the disk storage is corrupt and the database needs to be restored from backup. The NFS failure modes seem to suggest that any kind of NFS failure makes our storage suspect, meaning we want NFS to be as non-failure mode as possible. Making PostgreSQL work on NFS system itself is risky, and allowing it to work on systems that will soft-failure on writes seems even worse. --------------------------------------------------------------------------- Doug McNaught wrote: > Doug Royer <Doug@Royer.com> writes: > > > From the Linux 'nfs' man page: > > > > intr If an NFS file operation has a major timeout and it is > > hard mounted, then allow signals to interupt the file > > operation and cause it to return EINTR to the calling > > program. The default is to not allow file operations to > > be interrupted. > > > > Solaris 'mount_nfs' man page > > > > intr | nointr > > Allow (do not allow) keyboard interrupts to kill > > a process that is hung while waiting for a > > response on a hard-mounted file system. The > > default is intr, which makes it possible for > > clients to interrupt applications that may be > > waiting for a remote mount. > > > > The Solaris and Linux defaults seem to be the opposite of each other. > > Actually they're the same, though differently worded. "Major timeout" > means the server has not responded for N milliseconds, not that the > client has decided to time out the request. If 'hard' is set, the > client will keep trying indefinitely, though you can interrupt it if > you've specified 'intr'. > > > So I think we are saying the same thing. > > > > You can get EINTR with hard+intr mounts. > > Yes, *only* if the user specifically decides to send a signal, or if > it uses SIGALRM or whatever. 
I agree that if you expect 'intr' to be > used, your code needs to handle EINTR. > > > I am not sure what you get with soft mounts on a timeout. > > The Linux manpage implies you get EIO. > > -Doug > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
The MOUNT options are opposite.
Linux NFS mount - defaults to no-intr
Solaris NFS mount - defaults to intr
Doug McNaught wrote: > Doug Royer <Doug@Royer.com> writes: > > >> From the Linux 'nfs' man page: >> >> intr If an NFS file operation has a major timeout and it is >> hard mounted, then allow signals to interupt the file >> operation and cause it to return EINTR to the calling >> program. The default is to not allow file operations to >> be interrupted. >> >>Solaris 'mount_nfs' man page >> >> intr | nointr >> Allow (do not allow) keyboard interrupts to kill >> a process that is hung while waiting for a >> response on a hard-mounted file system. The >> default is intr, which makes it possible for >> clients to interrupt applications that may be >> waiting for a remote mount. >> >>The Solaris and Linux defaults seem to be the opposite of each other. > > > Actually they're the same, though differently worded. "Major timeout" > means the server has not responded for N milliseconds, not that the > client has decided to time out the request. If 'hard' is set, the > client will keep trying indefinitely, though you can interrupt it if > you've specified 'intr'. > > >>So I think we are saying the same thing. >> >>You can get EINTR with hard+intr mounts. > > > Yes, *only* if the user specifically decides to send a signal, or if > it uses SIGALRM or whatever. I agree that if you expect 'intr' to be > used, your code needs to handle EINTR. > > >>I am not sure what you get with soft mounts on a timeout. > > > The Linux manpage implies you get EIO. > > -Doug > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster -- Doug Royer | http://INET-Consulting.com -------------------------------|----------------------------- We Do Standards - You Need Standards
Yes - if you assume that EINTR only happens on NFS mounts. My point is that independent of NFS, the error checking that I have found in the code is not complete even for non-NFS file systems. The read() and write() LINUX man pages do NOT specify that EINTR is an NFS-only error. EINTR The call was interrupted by a signal before any data was read. The read() and write() SOLARIS man pages say: EINTR A signal was caught during the read operation and no data was transferred. There are other SVR read() and write() errors: EOVERFLOW (read) The file is a regular file, nbyte is greater than 0, the starting position is before the end-of-file, and the starting position is greater than or equal to the offset maximum established in the open file descrip- tion associated with fildes. EDEADLK The write was going to go to sleep and cause a deadlock situation to occur. EDQUOT The user's quota of disk blocks on the file system containing the file has been exhausted. EFBIG (write) An attempt is made to write a file that exceeds the process's file size limit or the maximum file size (see getrlimit(2) and ulimit(2)). EFBIG The file is a regular file, nbyte is greater than 0, and the starting position is greater than or equal to the offset maximum established in the file description associated with fildes. ENOSPC During a write to an ordinary file, there is no free space left on the device. Bruce Momjian wrote: > Let me give you a sky-high view of this. Database reliability requires > that the disk drive be 100% reliable. If any part of the disk storage > fails (I/O write failure, NFS failure) we have to assume that the disk > storage is corrupt and the database needs to be restored from backup. > The NFS failure modes seem to suggest that any kind of NFS failure makes > our storage suspect, meaning we want NFS to be as non-failure mode as > possible. 
Making PostgreSQL work on NFS system itself is risky, and > allowing it to work on systems that will soft-failure on writes seems > even worse. > -- Doug Royer | http://INET-Consulting.com -------------------------------|----------------------------- We Do Standards - You Need Standards
Doug Royer <Doug@Royer.com> writes: > The MOUNT options are opposite. > > Linux NFS mount - defualts to no-intr > Solaris NFS mount - default to intr Oh, right--I didn't realize that was what you were talking about. -Doug
Doug Royer <Doug@Royer.com> writes: > Yes - if you assume that EINTR only happens on NFS mounts. > My point is that independent of NFS, the error checking > that I have found in the code is not complete even for > non-NFS file systems. > > > The read() and write() LINUX man pages do NOT specify that EINTR > is an NFS-only error. > > EINTR The call was interrupted by a signal before any data was > read. Right, but I think that's because read() and write() also work on sockets and serial ports, which are always interruptible. I have not heard of local-disk filesystem code on any Unix I've seen ever giving EINTR--a process waiting for disk is always in D state, which means it's not interruptible by signals. If I have the time maybe I'll grovel through the Linux sources and verify this, but I'm pretty sure of it. I'm not a PG internals expert by any means, but my $0.02 on this is that we should: a) recommend NOT using NFS for the database storage b) if NFS must be used, recommend 'hard,nointr' mounts c) treat EINTR as an I/O error (I don't know how easy this would be) d) say "if you mount 'soft' and lose data, tough luck for you" -Doug
Doug McNaught wrote: > c) treat EINTR as an I/O error (I don't know how easy this would be) So then at this point - it is detected, so problem solved? If a LOCAL hard drive fails to reply, you hang. Same with hard,intr NFS file system.

    bytesRead = read(fd, buffer, requestedBytes);
    if (bytesRead < 0) {
        switch (errno) {
        case EAGAIN:
    #ifdef USING_RECORD_LOCKING_OR_NON_BLOCKING_IO
            ...do the above read() again...
    #else
            /*FALLTHRU*/
    #endif
        default:
            ...log error and errno...
            break;
        }
    } else if (bytesRead == 0) {
        ...AT EOF...
    } else if (bytesRead < requestedBytes) {
        ...if you care, loop on read until remaining bytes are fetched or at EOF...
    }
    return(bytesRead);

> d) say "if you mount 'soft' and lose data, tough luck for you" I seem to recall from my days at Sun, you should NOT use soft mount for NFS writes at all. Soft mounts are for non-critical disk resources. (Solaris admin manual?) -- Doug Royer | http://INET-Consulting.com -------------------------------|----------------------------- We Do Standards - You Need Standards
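The "loop on read until remaining bytes are fetched or at EOF" step in the outline above can be filled in as below. This is a sketch under stated assumptions rather than anyone's actual patch: read_fully is an invented name, and unlike the outline (which branches on EAGAIN) this version retries on EINTR, the error the thread is about, and hands every other error back to the caller.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: keep calling read() until count bytes have been
 * read, EOF is reached, or a real error occurs. A read interrupted by a
 * signal before any data arrived (EINTR) is simply retried.
 * Returns the number of bytes read (less than count only at EOF), or -1
 * with errno set on error.
 */
ssize_t
read_fully(int fd, void *buf, size_t count)
{
    size_t done = 0;

    while (done < count)
    {
        ssize_t n = read(fd, (char *) buf + done, count - done);

        if (n > 0)
            done += (size_t) n;     /* partial read: keep going */
        else if (n == 0)
            break;                  /* EOF */
        else if (errno == EINTR)
            continue;               /* interrupted: retry */
        else
            return -1;              /* real I/O error */
    }
    return (ssize_t) done;
}
```

A caller compares the return value with the requested size: a shorter count means EOF was hit, and -1 means errno should be logged, as in the default branch of the outline.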
On Mon, Jan 02, 2006 at 08:55:47AM -0700, Doug Royer wrote: > > > Doug McNaught wrote: > > >c) treat EINTR as an I/O error (I don't know how easy this would be) > > So then at this point - it is detected, so problem solved? > > If a LOCAL hard drive fails to reply, you hang. Same with hard,intr > NFS file system. Not really. If a local hard drive fails to respond, the kernel times out the request and returns EIO to the app. That's the most annoying thing about NFS. At least even with reading bad floppies where the kernel keeps retrying, eventually the read() returns and you can cancel. With NFS, it never returns if the server never comes back. The kernel is trying to be helpful by returning EINTR to say "ok, it didn't complete. There's no error yet but it may yet work". With local hard drives if they don't respond, you assume they're broken. When NFS servers don't respond you assume someone has temporarily pulled a cable and it will come back soon. Huh? I would vote for the kernel, if the server didn't respond within 5 seconds, to simply return EIO. At least we know how to handle that... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Martijn van Oosterhout <kleptog@svana.org> writes: > I would vote for the kernel, if the server didn't respond within 5 > seconds, to simply return EIO. At least we know how to handle that... You can do this now by mounting 'soft' and setting the timeout appropriately. Whether it's really the best idea, well... -Doug
Martijn van Oosterhout <kleptog@svana.org> writes: > The kernel is trying to be helpful by returning EINTR to say "ok, it > didn't complete. There's no error yet but it may yet work". Well it only returns EINTR if a signal was received. > With local hard drives if they don't respond, you assume they're broken. > When NFS servers don't respond you assume someone has temporarily pulled a > cable and it will come back soon. Huh? Well firstly with local hard drives you never get EINTR. Interrupts won't be delivered until after the syscall returns. You don't get EINTR because in the original BSD implementation it was more efficient to implement it that way and since disk i/o was always extremely fast it didn't threaten to delay your signals. You're mixing up operations timing out with signals being received. The reason you don't want NFS filesystem operations timing out (and you really don't) is that it's *possible* it will come back later. If you're the sysadmin and you're told your NFS server is down so you fix it and it comes back up properly you should be able to expect that the world returns to normal. If you have the "soft" option enabled then you now have to run around restarting every other service in your data center because you don't know which ones might have received an error and crashed. Worse, if any of those programs failed to notice the error (and they're not wrong to, traditionally certain operations never signaled errors) then your data is now corrupt. Some updates have been made but not others, and later updates may be based on the incorrect data. Now on the other hand the "intr" option is entirely reasonable to enable as long as you know you don't have software that doesn't expect it. It only kicks in if an actual signal is received, such as the user hitting C-c. Even if the server comes back 20m later the user isn't going to be upset that his C-c got handled. The only problem is that some software doesn't expect to get EINTR and handles it poorly.
> I would vote for the kernel, if the server didn't respond within 5 > seconds, to simply return EIO. At least we know how to handle that... How do you handle it? By having Postgres shut down? And then the NFS server comes back and then what? -- greg
Greg Stark wrote: >>I would vote for the kernel, if the server didn't respond within 5 >>seconds, to simply return EIO. At least we know how to handle that... > > > How do you handle it? By having Postgres shut down? And then the NFS server > comes back and then what? Log the error if you can. Refuse new connections - until it is back up. Refuse or hang new queries - until it is back up. Retry? What should be done? -- Doug Royer | http://INET-Consulting.com -------------------------------|----------------------------- We Do Standards - You Need Standards