Thread: EINTR error in SunOS

EINTR error in SunOS

From

Qingqing Zhou

Date:

30 December 2005, 01:08:47

I encountered an error today (can't repeat) on SunOS 5.8:
 --test that we read consecutive LFs properly
 CREATE TEMP TABLE testnl (a int, b text, c int);
+ ERROR:  could not open relation 1663/16384/37713: Interrupted system call

The reason I guess is the open() call is interrupted by a signal (what
signal BTW?). This error may be specific to SunOS/Solaris, but POSIX does
say that an EINTR is possible on open(), close(), read(), write() and also
the fopen() family:
http://www.opengroup.org/onlinepubs/007908799/xsh/open.html

We have patched read()/write(), shall we do so to open()/close() and also
fopen() family? Patching files other than fd.c seems unnecessary for two
reasons: (1) they are not frequently exercised; (2) they don't have the
basic errno-check code there.
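
A minimal sketch of what such a retry around open() could look like. The helper name open_retry_eintr is made up for illustration and is not an existing fd.c function; the sketch only assumes that calling open() again is the desired reaction to EINTR.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not a PostgreSQL function): call open() again
 * whenever it fails with EINTR, and return the first other result.
 */
int
open_retry_eintr(const char *path, int flags, mode_t mode)
{
    int fd;

    do
    {
        fd = open(path, flags, mode);
    } while (fd < 0 && errno == EINTR);

    /* fd >= 0 on success; otherwise -1 with errno set to a non-EINTR error */
    return fd;
}
```

A caller would use it exactly like open() and keep its existing error report for the remaining errno values.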

Regards,
Qingqing

Re: EINTR error in SunOS

From

Tom Lane

Date:

30 December 2005, 01:34:03

Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> + ERROR:  could not open relation 1663/16384/37713: Interrupted system call

> The reason I guess is the open() call is interrupted by a signal (what
> signal BTW?).

I've heard of this in connection with NFS ... is your DB on an NFS
filesystem by any chance?
        regards, tom lane

Re: EINTR error in SunOS

From

"Qingqing Zhou"

Date:

30 December 2005, 01:37:10

"Tom Lane" <tgl@sss.pgh.pa.us> wrote
> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
>> + ERROR:  could not open relation 1663/16384/37713: Interrupted system 
>> call
>
>> The reason I guess is the open() call is interrupted by a signal (what
>> signal BTW?).
>
> I've heard of this in connection with NFS ... is your DB on an NFS
> filesystem by any chance?
>

Exactly. I guess school machines love NFS.

Regards,
Qingqing

Re: EINTR error in SunOS

From

Qingqing Zhou

Date:

30 December 2005, 21:25:06


On Fri, 30 Dec 2005, Tom Lane wrote:
>
> I've heard of this in connection with NFS ... is your DB on an NFS
> filesystem by any chance?
>

I have patched the I/O routines in backend/storage for which POSIX says EINTR
is possible; that leaves out unlink(). Though POSIX does not list EINTR for
unlink(), during many regression runs I found that it sometimes sets this errno
on NFS (I still don't know where the smoking gun is):
 TRUNCATE TABLE trunc_c,trunc_d,trunc_e;       -- ok
+ WARNING:  could not remove relation 1663/16384/37822: Interrupted system call

There are many other unlink() scattered in backend, some even without
error check. Shall we patch pg_unlink for this situation and replace them
like this:
int
pg_unlink(const char *path, int elevel)
{
    int returnCode;

retry:
    returnCode = unlink(path);
    if (returnCode < 0 && errno == EINTR)
        goto retry;
    if (returnCode < 0)        /* any other error */
        elog(elevel, ...);
    return returnCode;
}

Or
pg_unlink(const char *path)
{
    /* no elog -- but we still have to do error check */
}

Or
let it be ...

If we decide to do something for unlink(), then we'd better do something
for other EINTR-possible IO routines for fairness :-)

By the way, seems POSIX is not very consistent with EINTR. For example,
closedir() can set EINTR, but opendir()/readdir() can't. Any magic in it?

Regards,
Qingqing

Re: EINTR error in SunOS

From

Greg Stark

Date:

31 December 2005, 02:29:18

Qingqing Zhou <zhouqq@cs.toronto.edu> writes:

> On Fri, 30 Dec 2005, Tom Lane wrote:
> >
> > I've heard of this in connection with NFS ... is your DB on an NFS
> > filesystem by any chance?
> 
> I have patched IO routines in backend/storage that POSIX says EINTR is
> possible except unlink(). Though POSIX says EINTR is not possible, during
> many regressions, I found it sometimes sets this errno on NFS (I still
> don't know where is the smoking-gun):

Well there is a reason intr is not the default for NFS mounts. It's precisely
because it breaks the traditional unix filesystem interface. Syscalls that
historically are not interruptible become interruptible and not all programs
behave properly when that occurs.

In any case POSIX explicitly allows functions to return other errors aside
from those specified as long as it's for error conditions not listed.

[Chapter 2 Section 3, paragraph 6]
 Implementations may support additional errors not included in this list, may generate errors included in this list
under circumstances other than those described here, or may contain extensions or limitations that prevent some errors
from occurring. The ERRORS section on each reference page specifies whether an error shall be returned, or whether it
may be returned. Implementations shall not generate a different error number from the ones described here for error
conditions described in this volume of IEEE Std 1003.1-2001, but may generate additional errors unless explicitly
disallowed for a particular function.

Ironically EINTR *is* singled out to be specifically forbidden to be returned
from some system calls but only those in the Threads option which are mostly
pthread* functions. unlink isn't covered by that prohibition.

-- 
greg

Re: EINTR error in SunOS

From

Tom Lane

Date:

31 December 2005, 15:41:02

Greg Stark <gsstark@mit.edu> writes:
> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
>> I have patched IO routines in backend/storage that POSIX says EINTR is
>> possible except unlink(). Though POSIX says EINTR is not possible, during
>> many regressions, I found it sometimes sets this errno on NFS (I still
>> don't know where is the smoking-gun):

> Well there is a reason intr is not the default for NFS mounts. It's precisely
> because it breaks the traditional unix filesystem interface.

Yeah.  We have looked at this before and decided that trying to defend
against it is too invasive and too fragile (how will you ever be sure
you've fixed everyplace, or keep other places from sneaking in later?)

What I'd rather do is document prominently that running a DB over NFS
isn't recommended, and running it over NFS with interrupts allowed is
just not going to work.
        regards, tom lane

Re: EINTR error in SunOS

From

Qingqing Zhou

Date:

31 December 2005, 15:48:35

On Sat, 31 Dec 2005, Tom Lane wrote:
>
> What I'd rather do is document prominently that running a DB over NFS
> isn't recommended, and running it over NFS with interrupts allowed is
> just not going to work.
>

Agreed. I/O syscalls are not the only problem with NFS -- if we can't fix
them all in one go, then let's not do it.

Regards,
Qingqing

Re: EINTR error in SunOS

From

Rod Taylor

Date:

31 December 2005, 15:48:54

On Sat, 2005-12-31 at 14:40 -0500, Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> >> I have patched IO routines in backend/storage that POSIX says EINTR is
> >> possible except unlink(). Though POSIX says EINTR is not possible, during
> >> many regressions, I found it sometimes sets this errno on NFS (I still
> >> don't know where is the smoking-gun):
> 
> > Well there is a reason intr is not the default for NFS mounts. It's precisely
> > because it breaks the traditional unix filesystem interface.

> What I'd rather do is document prominently that running a DB over NFS
> isn't recommended, and running it over NFS with interrupts allowed is
> just not going to work.

Are there issues with having an archive_command which does things with
NFS based filesystems?

--

Re: EINTR error in SunOS

From

Greg Stark

Date:

31 December 2005, 16:57:05

Qingqing Zhou <zhouqq@cs.toronto.edu> writes:

> On Sat, 31 Dec 2005, Tom Lane wrote:
> >
> > What I'd rather do is document prominently that running a DB over NFS
> > isn't recommended, and running it over NFS with interrupts allowed is
> > just not going to work.
> 
> Agreed. IO syscalls is not the only problem for NFS -- if we can't fix
> them in a run, then don't do it.

I don't think that's reasonable. The NFS intr option breaks the traditional
unix filesystem semantics which breaks a lot of older or naive programs. But
that's no reason to decide that Postgres can't handle the new semantics.

Handling EINTR after all file system calls doesn't sound like it would be
terribly hard. And Postgres of all systems has the infrastructure necessary to
handle error conditions, abort and roll back the transaction when a file
system error occurs. I think mainly this means it would be possible to hit C-c
or shut down postgres (uncleanly) when there's a network outage.

-- 
greg

Re: EINTR error in SunOS

From

Qingqing Zhou

Date:

31 December 2005, 17:46:02

On Sat, 31 Dec 2005, Greg Stark wrote:

>
> I don't think that's reasonable. The NFS intr option breaks the traditional
> unix filesystem semantics which breaks a lot of older or naive programs. But
> that's no reason to decide that Postgres can't handle the new semantics.
>

Is EINTR turned off by default in NFS? If so, I don't see that it
will be a problem. Sorry for my limited knowledge: are there any
requirements or benefits that lead people to turn on EINTR?

> Handling EINTR after all file system calls doesn't sound like it would be
> terribly hard.

The problem is not restricted to the file system. Actually my patched
version (only backend/storage) passed hundreds of regression runs without
any problem, but EINTR can hurt other syscalls as well. Finding *all* the
EINTR situations may need a big effort AFAICS.

Regards,
Qingqing

Re: EINTR error in SunOS

From

Martijn van Oosterhout

Date:

31 December 2005, 17:51:09

On Sat, Dec 31, 2005 at 04:46:02PM -0500, Qingqing Zhou wrote:
> Is that by default the EINTR is turned off in NFS? If so, I don't see that
> will be a problem. Sorry for my limited knowledge, is there any
> requirements/benefits that people turn on EINTR?

I won't speak for anyone else, but the reason I set intr on for NFS
mounts is so that if I turn off the file server I don't get unkillable
processes on the client. Messy sure, and maybe there's a better
solution made since but I really don't like processes stuck in D state
(ie kill -9 won't work). Better the program die in some weird way than
that...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: EINTR error in SunOS

From

Greg Stark

Date:

31 December 2005, 18:02:05

Qingqing Zhou <zhouqq@cs.toronto.edu> writes:

> On Sat, 31 Dec 2005, Greg Stark wrote:
> 
> >
> > I don't think that's reasonable. The NFS intr option breaks the traditional
> > unix filesystem semantics which breaks a lot of older or naive programs. But
> > that's no reason to decide that Postgres can't handle the new semantics.
> >
> 
> > Is that by default the EINTR is turned off in NFS? If so, I don't see that
> will be a problem. Sorry for my limited knowledge, is there any
> requirements/benefits that people turn on EINTR?

That's why the "intr" option (and the "soft" option) has traditionally not
been enabled by default in NFS implementations. But many people don't like
that when their NFS server disappears their client applications become
unkillable. They like to be able to hit C-c and stop whatever is running.

In the case of Postgres having "intr" off on the NFS mount point would mean
you couldn't C-c a query stuck because the database is on NFS. Of course it's
not like you would be able to run any more queries after that, but you might
want your terminal back.

You wouldn't even be able to shut down Postgres, even with kill -9. If your
NFS server is unrecoverable and you want to bring up a Postgres instance using
a backup restored some other place you would have to bring it up on another
port or reboot your machine.

That's the kind of thing that leads lots of sysadmins to use the "intr" and
"soft" options. And those sysadmins generally aren't aware of these kinds of
consequences since it's more of a programming level issue.

> > Handling EINTR after all file system calls doesn't sound like it would be
> > terribly hard.
> 
> The problem is not restricted to file system. Actually my patched
> version(only backend/storage) passed hundreds times of regression without
> any problem, but EINTR can hurt other syscalls as well. Find out *all* the
> EINTR situations may need big efforts AFAICS.

Well NFS is only going to affect filesystem calls. If there are other syscalls
that can signal EINTR on some obscure platform where Postgres isn't handling
it then that's just a run-of-the-mill porting issue.

But like I mentioned in the other thread POSIX is of no help here. With the
exception of the pthreads syscalls POSIX doesn't prohibit functions from
signalling errors other than the ones documented in the specification. So in
other words, just about any function can signal just about any error including
errors that are proprietary additions any time. Good luck :)

-- 
greg

Re: EINTR error in SunOS

From

Qingqing Zhou

Date:

31 December 2005, 18:21:15

On Sat, 31 Dec 2005, Greg Stark wrote:
>
> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
>
> >
> > Is that by default the EINTR is turned off in NFS? If so, I don't see that
> > will be a problem. Sorry for my limited knowledge, is there any
> > requirements/benefits that people turn on EINTR?
>
> That's why the "intr" option (and the "soft" option) has traditionally not
> been enabled by default in NFS implementations. But many people don't like
> that when their NFS server disappears their client applications become
> unkillable. They like to be able to hit C-c and stop whatever is running.
>

Thanks Greg and Martijn, I now understand intr better :-) So whether we can
kill Postgres or not depends on our signal handler. The Query Cancel signal
won't work, because "ImmediateInterruptOK" forbids it and the retry-style
code in read/write will put the Postgres process into uninterruptible
sleep again. But the die signal will work, I think.

Regards,
Qingqing

Re: EINTR error in SunOS

From

Greg Stark

Date:

31 December 2005, 20:03:27

Rod Taylor <pg@rbt.ca> writes:

> Are there issues with having an archive_command which does things with
> NFS based filesystems?

Well, whatever command you use for archive_command -- probably just "cp" if
you're using NFS -- would hang if the NFS server went away. What would happen
then might be interesting. If Postgres finds the archive_command hanging
indefinitely will it correctly avoid recycling the WAL log indefinitely? I
assume so.

What's nonoptimal here is that I don't think there would be any warning that
anything was wrong until the WAL logs eventually filled up their filesystem
and then postgres stopped running. In the meantime your archived WAL logs
would be getting older and older and you would have no indication that
anything was failing.

This was the intention with the NFS error handling. The theory being that
eventually the server comes back up and things resume functioning exactly
where they left off with no lost operations. The upside is you don't have
things failing, then resuming later and unhandled errors in the meantime
leading to data corruption. The downside is there's no way for "cp" and
ultimately Postgres to know anything's wrong except to have a timeout itself
and an arbitrary maximum amount of time to expect operations to take.

-- 
greg

Re: EINTR error in SunOS

From

Doug Royer

Date:

31 December 2005, 20:46:59

EINTR on read() or write() is not unique to NFS.
It can happen on many file systems - it is just seen
less frequently on most of them.

The code should be able to handle ANY valid read()
and write() errno. And EINTR is documented on Linux, BSD,
Solaris (1 and 2), and POSIX.

Even the Linux man pages say that read() and write() can return
EINTR. This can happen on soft-mirrors, SCSI disks, and SOME
other disk drivers when they have errors.

The 'intr' option to NFS is not the same as EINTR. It
means 'if the server does not respond for a while,
then return an EINTR', just like any other disk read()
or write() does when it fails to reply.

I have seen lots of open source code that assumes that all
disk reads and writes work 100% or fail 100%. Many do not
check the return value to see if all data was written or
read from disk. And many do not look at errno at all.
I have NOT looked to see how postgres does it.

If storage/*.c is where the reads occur, it does
very LITTLE when checking for errors.
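
As an illustration of the checks described above (a sketch only, not taken from the PostgreSQL sources): a write loop that handles short writes and EINTR and hands every other errno back to the caller. The name write_all and the choice to report a zero-byte write as ENOSPC are assumptions made for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all len bytes of buf to fd.
 * Returns 0 on success, or -1 with errno set on failure.
 */
int
write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t      left = len;

    while (left > 0)
    {
        ssize_t n = write(fd, p, left);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;          /* real error; errno says which one */
        }
        if (n == 0)
        {
            /* Assumption: no progress is treated as "out of space". */
            errno = ENOSPC;
            return -1;
        }
        /* Short write: advance past what was written and go around again. */
        p += n;
        left -= (size_t) n;
    }
    return 0;
}
```

The same pattern applies to read(), except that a return value of 0 there means end-of-file rather than an error.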


>>>Handling EINTR after all file system calls doesn't sound like it would be
>>>terribly hard.
>>
>>The problem is not restricted to file system. Actually my patched
>>version(only backend/storage) passed hundreds times of regression without
>>any problem, but EINTR can hurt other syscalls as well. Find out *all* the
>>EINTR situations may need big efforts AFAICS.
>
>
> Well NFS is only going to affect filesystem calls. If there are other syscalls
> that can signal EINTR on some obscure platform where Postgres isn't handling
> it then that's just a run-of-the-mill porting issue.
>
> But like I mentioned in the other thread POSIX is of no help here. With the
> exception of the pthreads syscalls POSIX doesn't prohibit functions from
> signalling errors other than the ones documented in the specification. So in
> other words, just about any function can signal just about any error including
> errors that are proprietary additions any time. Good luck :)
>

--

Doug Royer                     | http://INET-Consulting.com
-------------------------------|-----------------------------

               We Do Standards - You Need Standards

Re: EINTR error in SunOS

From

Doug McNaught

Date:

31 December 2005, 21:21:50

Doug Royer <Doug@Royer.com> writes:

> The 'intr' option to NFS is not the same as EINTR. It
> means 'if the server does not respond for a while,
> then return an EINTR', just like any other disk read()
> or write() does when it fails to reply.

No, you're thinking of 'soft'.  'intr' (which is actually a modifier
to the 'hard' setting) causes the I/O to hang until the server comes
back or the process gets a signal (in which case EINTR is returned).

-Doug

Re: EINTR error in SunOS

From

"Qingqing Zhou"

Date:

01 January 2006, 02:53:45

"Greg Stark" <gsstark@mit.edu> wrote
>
> Well NFS is only going to affect filesystem calls. If there are other 
> syscalls
> that can signal EINTR on some obscure platform where Postgres isn't 
> handling
> it then that's just a run-of-the-mill porting issue.
>

Ok, NFS just affects filesystem calls (I mixed it up with another problem). If
possible, I hope we can draw some conclusions / sketch a fix plan here for
future developers who want to come up with a patch. The question is:
   Where and how exactly should we fix things in order to support intr NFS on
the server side?

The more details we write down here, the clearer it becomes whether a plan is
feasible. I can think of these places:

+ direct file system calls
   - open() family, fopen() family in backend/storage
   - scattered open() etc. in the whole backend (unlink seems to be the
     biggest problem)

The problem of above is if a signal sneaks in, these syscalls will fail. 
With a retry, we can fix it.

+ indirect file system calls
   - system("xxx") calls, xxx = cp, etc.

If intr NFS is enabled, what's the problem exactly?


Any others?

Regards,
Qingqing

Re: EINTR error in SunOS

From

Greg Stark

Date:

01 January 2006, 03:36:29

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

> The problem of above is if a signal sneaks in, these syscalls will fail. 
> With a retry, we can fix it.

It's a bit stickier than that but only a bit. If you just retry then you're
saying users have to use kill -9 to get away from the situation. For some
filesystem operations that may be the best we can do. But for most it ought to
be possible to CHECK_FOR_INTERRUPTS() and handle the regular signals like C-c
or kill -1 normally. Even having the single backend exit (to avoid file
resource leaks) is nicer than having to restart the entire instance.
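
A rough sketch of the difference, under stated assumptions: cancel_requested is a stand-in flag that a signal handler would set, not PostgreSQL's actual interrupt machinery, and open_retry_unless_cancelled is a hypothetical name. The loop retries after EINTR only while no cancel is pending, so an ordinary signal can still get the caller out of the call.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>

/* Stand-in for a "cancel requested" flag; a signal handler would set it. */
volatile sig_atomic_t cancel_requested = 0;

/*
 * Hypothetical helper: like open(), but an EINTR failure is retried
 * only while no cancel has been requested.
 */
int
open_retry_unless_cancelled(const char *path, int flags, mode_t mode)
{
    for (;;)
    {
        int fd = open(path, flags, mode);

        if (fd >= 0 || errno != EINTR)
            return fd;          /* success, or an error other than EINTR */

        if (cancel_requested)
            return -1;          /* give up; errno is still EINTR */

        /* Interrupted by some unrelated signal: try again. */
    }
}
```

With a plain retry loop the second check is missing, which is why only kill -9 would end the wait.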

-- 
greg

Re: EINTR error in SunOS

From

Qingqing Zhou

Date:

01 January 2006, 03:52:50


On Sun, 1 Jan 2006, Greg Stark wrote:
>
> "Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
>
> > The problem of above is if a signal sneaks in, these syscalls will fail.
> > With a retry, we can fix it.
>
> It's a bit stickier than that but only a bit. If you just retry then you're
> saying users have to use kill -9 to get away from the situation. For some
> filesystem operations that may be the best we can do. But for most it ought to
> be possible to CHECK_FOR_INTERRUPTS() and handle the regular signals like C-c
> or kill -1 normally. Even having the single backend exit (to avoid file
> resource leaks) is nicer than having to restart the entire instance.
>

I understand that putting a CHECK_FOR_INTERRUPTS() in the retry loop may make
for a more graceful stop, but it won't work in some cases -- notice that the I/O
routines we will patch can be used before the signal mechanism is set up.

Regards,
Qingqing

Re: EINTR error in SunOS

From

Tom Lane

Date:

01 January 2006, 13:48:59

Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> I understand put a CHECK_FOR_INTERRUPTS() in the retry-loop may make more
> graceful stop, but it won't work in some cases -- notice that the io
> routines we will patch can be used before the signal mechanism is setup.

I don't think it will help much at all: too many of the operations in
question are invoked in places where CHECK_FOR_INTERRUPTS is a no-op.
Examples:
* disk writes are mostly done by the bgwriter and not backends at all
* unlinks are generally done during xact commit/rollback

Qingqing's point about failures in system()-invoked commands (think
archive_command for PITR) is a mighty good one too.  That puts a
serious crimp into any illusion that we can really fix this in any
reliable way.
        regards, tom lane

Re: EINTR error in SunOS

From

Qingqing Zhou

Date:

01 January 2006, 14:59:55


On Sun, 1 Jan 2006, Tom Lane wrote:

> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> > I understand put a CHECK_FOR_INTERRUPTS() in the retry-loop may make more
> > graceful stop, but it won't work in some cases -- notice that the io
> > routines we will patch can be used before the signal mechanism is setup.
>
> I don't think it will help much at all: too many of the operations in
> question are invoked in places where CHECK_FOR_INTERRUPTS is a no-op.
> Examples:
> * disk writes are mostly done by the bgwriter and not backends at all
> * unlinks are generally done during xact commit/rollback
>
Right.

> Qingqing's point about failures in system()-invoked commands (think
> archive_command for PITR) is a mighty good one too.  That puts a
> serious crimp into any illusion that we can really fix this in any
> reliable way.
>

Not my credit, I just collected Rod's and Greg's posts about this here :-) And
I'm still not sure what exactly the problem is that we want to fix here -- I
think our target is "operations should not fail because of EINTR".

Regards,
Qingqing

Re: EINTR error in SunOS

From

Doug Royer

Date:

01 January 2006, 20:58:58

 From the Linux 'nfs' man page:

  intr           If  an  NFS file operation has a major timeout and it is
                 hard mounted, then allow signals to  interupt  the  file
                 operation  and  cause  it to return EINTR to the calling
                 program.  The default is to not allow file operations to
                 be interrupted.

Solaris 'mount_nfs' man page

  intr | nointr
                 Allow (do not allow) keyboard interrupts to kill
                 a  process  that  is  hung  while  waiting for a
                 response on  a  hard-mounted  file  system.  The
                 default  is  intr,  which  makes it possible for
                 clients to interrupt applications  that  may  be
                 waiting for a remote mount.

The Solaris and Linux defaults seem to be the opposite of each other.

So I think we are saying the same thing.

You can get EINTR with hard+intr mounts.

I am not sure what you get with soft mounts on a timeout.

Doug McNaught wrote:
> Doug Royer <Doug@Royer.com> writes:
>
>
>>The 'intr' option to NFS is not the same as EINTR. It
>>means 'if the server does not respond for a while,
>>then return an EINTR', just like any other disk read()
>>or write() does when it fails to reply.
>
>
> No, you're thinking of 'soft'.  'intr' (which is actually a modifier
> to the 'hard' setting) causes the I/O to hang until the server comes
> back or the process gets a signal (in which case EINTR is returned).
>
> -Doug
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--

Doug Royer                     | http://INET-Consulting.com
-------------------------------|-----------------------------

               We Do Standards - You Need Standards

Re: EINTR error in SunOS

From

Doug McNaught

Date:

01 January 2006, 21:10:25

Doug Royer <Doug@Royer.com> writes:

>  From the Linux 'nfs' man page:
>
>   intr           If  an  NFS file operation has a major timeout and it is
>                  hard mounted, then allow signals to  interupt  the  file
>                  operation  and  cause  it to return EINTR to the calling
>                  program.  The default is to not allow file operations to
>                  be interrupted.
>
> Solaris 'mount_nfs' man page
>
>   intr | nointr
>                  Allow (do not allow) keyboard interrupts to kill
>                  a  process  that  is  hung  while  waiting for a
>                  response on  a  hard-mounted  file  system.  The
>                  default  is  intr,  which  makes it possible for
>                  clients to interrupt applications  that  may  be
>                  waiting for a remote mount.
>
> The Solaris and Linux defaults seem to be the opposite of each other.

Actually they're the same, though differently worded.  "Major timeout"
means the server has not responded for N milliseconds, not that the
client has decided to time out the request.  If 'hard' is set, the
client will keep trying indefinitely, though you can interrupt it if
you've specified 'intr'.

> So I think we are saying the same thing.
>
> You can get EINTR with hard+intr mounts.

Yes, *only* if the user specifically decides to send a signal, or if
it uses SIGALRM or whatever.  I agree that if you expect 'intr' to be
used, your code needs to handle EINTR.
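
To illustrate the mechanism (a sketch on a local pipe, not on an NFS mount, so it only shows the signal side of the story): a blocked read() fails with EINTR when a signal such as SIGALRM arrives and its handler was installed without SA_RESTART. The function name errno_from_interrupted_read is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to make the signal interrupt read(). */
static void
on_alarm(int signo)
{
    (void) signo;
}

/*
 * Blocks in read() on an empty pipe and lets SIGALRM interrupt it.
 * Returns the errno left by read(), 0 if read() did not fail,
 * or -1 if the setup itself failed.
 */
int
errno_from_interrupted_read(void)
{
    struct sigaction sa, old;
    int     fds[2];
    char    c;
    int     saved = 0;

    if (pipe(fds) != 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;            /* no SA_RESTART: read() is not restarted */
    if (sigaction(SIGALRM, &sa, &old) != 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    alarm(1);                   /* deliver SIGALRM in about one second */
    if (read(fds[0], &c, 1) < 0)    /* blocks: nothing is ever written */
        saved = errno;
    alarm(0);

    sigaction(SIGALRM, &old, NULL);
    close(fds[0]);
    close(fds[1]);
    return saved;
}
```

If the handler were installed with SA_RESTART instead, the read() would normally be restarted after the handler returns and the caller would not see EINTR.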

> I am not sure what you get with soft mounts on a timeout.

The Linux manpage implies you get EIO.

-Doug

Re: EINTR error in SunOS

From

Bruce Momjian

Date:

01 January 2006, 21:14:46

Let me give you a sky-high view of this.  Database reliability requires
that the disk drive be 100% reliable.  If any part of the disk storage
fails (I/O write failure, NFS failure) we have to assume that the disk
storage is corrupt and the database needs to be restored from backup. 
The NFS failure modes seem to suggest that any kind of NFS failure makes
our storage suspect, meaning we want NFS to have as few failure modes as
possible.  Running PostgreSQL on NFS is itself risky, and
allowing it to run on systems that will soft-fail on writes seems
even worse.

---------------------------------------------------------------------------

Doug McNaught wrote:
> Doug Royer <Doug@Royer.com> writes:
> 
> >  From the Linux 'nfs' man page:
> >
> >   intr           If  an  NFS file operation has a major timeout and it is
> >                  hard mounted, then allow signals to  interupt  the  file
> >                  operation  and  cause  it to return EINTR to the calling
> >                  program.  The default is to not allow file operations to
> >                  be interrupted.
> >
> > Solaris 'mount_nfs' man page
> >
> >   intr | nointr
> >                  Allow (do not allow) keyboard interrupts to kill
> >                  a  process  that  is  hung  while  waiting for a
> >                  response on  a  hard-mounted  file  system.  The
> >                  default  is  intr,  which  makes it possible for
> >                  clients to interrupt applications  that  may  be
> >                  waiting for a remote mount.
> >
> > The Solaris and Linux defaults seem to be the opposite of each other.
> 
> Actually they're the same, though differently worded.  "Major timeout"
> means the server has not responded for N milliseconds, not that the
> client has decided to time out the request.  If 'hard' is set, the
> client will keep trying indefinitely, though you can interrupt it if
> you've specified 'intr'.
> 
> > So I think we are saying the same thing.
> >
> > You can get EINTR with hard+intr mounts.
> 
> Yes, *only* if the user specifically decides to send a signal, or if
> it uses SIGALRM or whatever.  I agree that if you expect 'intr' to be
> used, your code needs to handle EINTR.
> 
> > I am not sure what you get with soft mounts on a timeout.
> 
> The Linux manpage implies you get EIO.
> 
> -Doug
> 

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: EINTR error in SunOS

From

Doug Royer

Date:

01 January 2006, 21:46:49

The MOUNT options are opposite.

Linux NFS mount   - defaults to no-intr
Solaris NFS mount - defaults to intr


Doug McNaught wrote:
> Doug Royer <Doug@Royer.com> writes:
>
>
>> From the Linux 'nfs' man page:
>>
>>  intr           If  an  NFS file operation has a major timeout and it is
>>                 hard mounted, then allow signals to  interupt  the  file
>>                 operation  and  cause  it to return EINTR to the calling
>>                 program.  The default is to not allow file operations to
>>                 be interrupted.
>>
>>Solaris 'mount_nfs' man page
>>
>>  intr | nointr
>>                 Allow (do not allow) keyboard interrupts to kill
>>                 a  process  that  is  hung  while  waiting for a
>>                 response on  a  hard-mounted  file  system.  The
>>                 default  is  intr,  which  makes it possible for
>>                 clients to interrupt applications  that  may  be
>>                 waiting for a remote mount.
>>
>>The Solaris and Linux defaults seem to be the opposite of each other.
>
>
> Actually they're the same, though differently worded.  "Major timeout"
> means the server has not responded for N milliseconds, not that the
> client has decided to time out the request.  If 'hard' is set, the
> client will keep trying indefinitely, though you can interrupt it if
> you've specified 'intr'.
>
>
>>So I think we are saying the same thing.
>>
>>You can get EINTR with hard+intr mounts.
>
>
> Yes, *only* if the user specifically decides to send a signal, or if
> it uses SIGALRM or whatever.  I agree that if you expect 'intr' to be
> used, your code needs to handle EINTR.
>
>
>>I am not sure what you get with soft mounts on a timeout.
>
>
> The Linux manpage implies you get EIO.
>
> -Doug
>

--

Doug Royer                     | http://INET-Consulting.com
-------------------------------|-----------------------------

               We Do Standards - You Need Standards

Re: EINTR error in SunOS

From

Doug Royer

Date:

01 January 2006, 21:59:09

Yes - if you assume that EINTR only happens on NFS mounts.
My point is that independent of NFS, the error checking
that I have found in the code is not complete even for
non-NFS file systems.


The read() and write() LINUX man pages do NOT specify that EINTR
is an NFS-only error.

      EINTR  The call was interrupted by a signal before any data was
             read.

The read() and write() SOLARIS man pages say:

      EINTR A signal was caught during the read operation  and  no
            data was transferred.

There are other SVR read() and write() errors:

     EOVERFLOW (read)
            The file is a regular file, nbyte is greater  than  0,
            the  starting  position is before the end-of-file, and
            the starting position is greater than or equal to  the
            offset  maximum  established in the open file descrip-
            tion associated with fildes.

     EDEADLK
            The write was going  to  go  to  sleep   and  cause  a
            deadlock situation to occur.

      EDQUOT
            The user's quota of disk blocks  on  the  file  system
            containing the file has been exhausted.

      EFBIG  (write)
            An attempt is made to write a file  that  exceeds  the
            process's  file  size  limit  or the maximum file size
            (see getrlimit(2) and ulimit(2)).

      EFBIG The file is a regular file, nbyte is greater  than  0,
            and  the starting position is greater than or equal to
            the offset maximum established in the file description
            associated with fildes.

      ENOSPC
            During a write to an ordinary file, there is no   free
            space left on the device.
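
The errno values listed above could be handled by a single dispatch on errno. A minimal sketch, assuming a hypothetical helper named classify_io_errno and a three-way policy (retry, out of space, fatal) that is only an illustration and is not taken from the PostgreSQL source:

```c
#include <errno.h>

/* Hypothetical categories; the names and the policy are assumptions. */
typedef enum
{
    IO_RETRY,           /* transient: the same call may simply be reissued */
    IO_OUT_OF_SPACE,    /* quota, file-size limit, or device full */
    IO_FATAL            /* anything else: log errno and give up */
} io_action;

static io_action
classify_io_errno(int err)
{
    switch (err)
    {
        case EINTR:     /* interrupted by a signal before any data moved */
            return IO_RETRY;
        case ENOSPC:    /* no free space left on the device */
        case EFBIG:     /* file size limit or offset maximum exceeded */
#ifdef EDQUOT
        case EDQUOT:    /* user's disk block quota exhausted */
#endif
            return IO_OUT_OF_SPACE;
        default:        /* EIO, EDEADLK, EOVERFLOW, ... */
            return IO_FATAL;
    }
}
```

A caller would pass the errno saved right after a failed read() or write() and branch on the result.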




Bruce Momjian wrote:
> Let me give you a sky-high view of this.  Database reliability requires
> that the disk drive be 100% reliable.  If any part of the disk storage
> fails (I/O write failure, NFS failure) we have to assume that the disk
> storage is corrupt and the database needs to be restored from backup.
> The NFS failure modes seem to suggest that any kind of NFS failure makes
> our storage suspect, meaning we want NFS configured to fail as rarely as
> possible.  Running PostgreSQL on NFS is itself risky, and allowing it to
> run on systems that will soft-fail on writes seems even worse.
>
--

Doug Royer                     | http://INET-Consulting.com
-------------------------------|-----------------------------

               We Do Standards - You Need Standards


Re: EINTR error in SunOS

From

Doug McNaught

Date:

01 January 2006, 22:02:24

Doug Royer <Doug@Royer.com> writes:

> The MOUNT options are opposite.
>
> Linux NFS mount   - defaults to no-intr
> Solaris NFS mount - defaults to intr

Oh, right--I didn't realize that was what you were talking about.

-Doug

Re: EINTR error in SunOS

From

Doug McNaught

Date:

01 January 2006, 22:09:46

Doug Royer <Doug@Royer.com> writes:

> Yes - if you assume that EINTR only happens on NFS mounts.
> My point is that independent of NFS, the error checking
> that I have found in the code is not complete even for
> non-NFS file systems.
>
>
> The read() and write() LINUX man pages do NOT specify that EINTR
> is an NFS-only error.
>
>       EINTR  The call was interrupted by a signal before any data was
>              read.

Right, but I think that's because read() and write() also work on
sockets and serial ports, which are always interruptible.  I have not
heard of local-disk filesystem code on any Unix I've seen ever giving
EINTR--a process waiting for disk is always in D state, which means
it's not interruptible by signals.  If I have the time maybe I'll
grovel through the Linux sources and verify this, but I'm pretty sure
of it. 

I'm not a PG internals expert by any means, but my $0.02 on this is
that we should:

a) recommend NOT using NFS for the database storage
b) if NFS must be used, recommend 'hard,nointr' mounts
c) treat EINTR as an I/O error (I don't know how easy this would be)
d) say "if you mount 'soft' and lose data, tough luck for you"

-Doug

Re: EINTR error in SunOS

From

Doug Royer

Date:

02 January 2006, 11:55:47


Doug McNaught wrote:

> c) treat EINTR as an I/O error (I don't know how easy this would be)

So then at this point - it is detected, so problem solved?

If a LOCAL hard drive fails to reply, you hang. Same with hard,intr
NFS file system.


    bytesRead = read(fd, buffer, requestedBytes);

    if (bytesRead < 0) {
        switch (errno) {

        case EAGAIN:
#ifdef USING_RECORD_LOCKING_OR_NON_BLOCKING_IO
            ...do the above read() again...
#else
        /*FALLTHRU*/
#endif
        default:
            ... log error and errno...
            break;
        }

    } else if (bytesRead == 0) {
        ...AT EOF...

    } else if (bytesRead < requestedBytes) {
        ...if you care, loop on read until
        remaining bytes are fetched
        or at EOF...
    }

    return(bytesRead);
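
A minimal, compilable sketch of the same pattern with an explicit EINTR case added. The name read_fully is hypothetical, and retrying on EINTR is an assumed policy, not something this thread has agreed on:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * read_fully: hypothetical helper, not PostgreSQL code.
 * Reads up to len bytes, retrying when read() fails with EINTR and
 * continuing after short reads. Returns the number of bytes read
 * (less than len only at EOF), or -1 with errno set on a real error.
 */
static ssize_t
read_fully(int fd, void *buf, size_t len)
{
    size_t      done = 0;

    while (done < len)
    {
        ssize_t     n = read(fd, (char *) buf + done, len - done);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before any data arrived: retry */
            return -1;          /* real error: caller logs errno */
        }
        if (n == 0)
            break;              /* EOF */
        done += (size_t) n;     /* short read: keep going */
    }
    return (ssize_t) done;
}
```

The caller compares the return value with the requested length to tell a complete read from EOF, and logs errno when the result is -1.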



> d) say "if you mount 'soft' and lose data, tough luck for you"

I seem to recall from my days at Sun, you should NOT use soft
mount for NFS writes at all. Soft mounts are for non-critical
disk resources. (Solaris admin  manual?)

--

Doug Royer                     | http://INET-Consulting.com
-------------------------------|-----------------------------

               We Do Standards - You Need Standards


Re: EINTR error in SunOS

From

Martijn van Oosterhout

Date:

02 January 2006, 14:58:36

On Mon, Jan 02, 2006 at 08:55:47AM -0700, Doug Royer wrote:
>
>
> Doug McNaught wrote:
>
> >c) treat EINTR as an I/O error (I don't know how easy this would be)
>
> So then at this point - it is detected, so problem solved?
>
> If a LOCAL hard drive fails to reply, you hang. Same with hard,intr
> NFS file system.

Not really. If a local hard drive fails to respond, the kernel times
out the request and returns EIO to the app. That's the most annoying
thing about NFS. At least even with reading bad floppies where the
kernel keeps retrying, eventually the read() returns and you can
cancel. With NFS, it never returns if the server never comes back.

The kernel is trying to be helpful by returning EINTR to say "ok, it
didn't complete. There's no error yet but it may yet work". With local
hard drives if they don't respond, you assume they're broken. When NFS
servers don't respond you assume someone has temporarily pulled a
cable and it will come back soon. Huh?

I would vote for the kernel, if the server didn't respond within 5
seconds, to simply return EIO. At least we know how to handle that...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: EINTR error in SunOS

From

Doug McNaught

Date:

02 January 2006, 19:37:47

Martijn van Oosterhout <kleptog@svana.org> writes:

> I would vote for the kernel, if the server didn't respond within 5
> seconds, to simply return EIO. At least we know how to handle that...

You can do this now by mounting 'soft' and setting the timeout
appropriately.  Whether it's really the best idea, well...

-Doug

Re: EINTR error in SunOS

From

Greg Stark

Date:

02 January 2006, 19:52:14

Martijn van Oosterhout <kleptog@svana.org> writes:

> The kernel is trying to be helpful by returning EINTR to say "ok, it
> didn't complete. There's no error yet but it may yet work". 

Well it only returns EINTR if a signal was received. 

> With local hard drives if they don't respond, you assume they're broken.
> When NFS servers don't respond you assume someone has temporarily pulled a
> cable and it will come back soon. Huh?

Well firstly with local hard drives you never get EINTR. Interrupts won't be
delivered until after the syscall returns. You don't get EINTR because in the
original BSD implementation it was more efficient to implement it that way and
since disk i/o was always extremely fast it didn't threaten to delay your
signals.

You're mixing up operations timing out with signals being received. The reason
you don't want NFS filesystem operations timing out (and you really don't) is
that it's *possible* it will come back later.

If you're the sysadmin and you're told your NFS server is down so you fix it
and it comes back up properly you should be able to expect that the world
returns to normal.

If you have the "soft" option enabled then you now have to run around
restarting every other service in your data center because you don't know
which ones might have received an error and crashed.

Worse, if any of those programs failed to notice the error (and they're not
wrong to, traditionally certain operations never signaled errors) then your
data is now corrupt. Some updates have been made but not others, and later
updates may be based on the incorrect data.

Now on the other hand the "intr" option is entirely reasonable to enable as
long as you know you don't have software that doesn't expect it. It only kicks
in if an actual signal is received, such as the user hitting C-c. Even if the
server comes back 20m later the user isn't going to be upset that his C-c got
handled. The only problem is that some software doesn't expect to get EINTR
and handles it poorly.

> I would vote for the kernel, if the server didn't respond within 5
> seconds, to simply return EIO. At least we know how to handle that...

How do you handle it? By having Postgres shut down? And then the NFS server
comes back and then what?

-- 
greg

Re: EINTR error in SunOS

From

Doug Royer

Date:

03 January 2006, 01:11:11


Greg Stark wrote:

>>I would vote for the kernel, if the server didn't respond within 5
>>seconds, to simply return EIO. At least we know how to handle that...
>
>
> How do you handle it? By having Postgres shut down? And then the NFS server
> comes back and then what?

Log the error if you can.
Refuse new connections - until it is back up.
Refuse or hang new queries - until it is back up.

Retry?

What should be done?

--

Doug Royer                     | http://INET-Consulting.com
-------------------------------|-----------------------------

               We Do Standards - You Need Standards