Thread: Patch to improve reliability of postgresql on linux nfs

Patch to improve reliability of postgresql on linux nfs

From
George Barnett
Date:
Hi Hackers,

I run a number of postgresql installations on NFS and on the whole I find this to be very reliable.  I have however
run into a few issues when there are concurrent writes from multiple processes.

I see errors such as the following:

2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection authorized: user=postgres database=postgres
2011-07-31 22:13:35 EST    ERROR:  could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
2011-07-31 22:13:35 EST    HINT:  Check free disk space.
2011-07-31 22:13:35 EST    CONTEXT:  writing block 1 of relation global/2671
2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: host=[local]

I have also seen similar errors coming out of the WAL writer, however they occur at the level PANIC, which is a
little more drastic.

After spending some time with debug logging turned on and even more time staring at strace, I believe this occurs
when one process was writing to a data file and it received a SIGINT from another process, eg:
(These logs are from another similar run)

[pid  1804] <... fsync resumed> )       = 0
[pid 10198] kill(1804, SIGINT <unfinished ...>
[pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
[pid  1804] write(3, "c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 <unfinished ...>
[pid 10198] <... kill resumed> )        = 0
[pid  1804] <... write resumed> )       = 4096
[pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
[pid  1804] rt_sigreturn(0x2)           = 4096
[pid  1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 260
[pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>
[pid  1802] <... select resumed> )      = 1 (in [5], left {0, 999000})
[pid  1804] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid  1804] tgkill(1804, 1804, SIGABRT) = 0
[pid  1802] read(5,  <unfinished ...>
[pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
Process 1804 detached

After finding this, I came up with the following test case which easily replicated our issue:

#!/bin/bash

name=$1
number=1
while true; do
  /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
  /usr/bin/createdb -E UNICODE -O $name$number $name$number
  if grep -q PANIC /data/postgresql/data/pg_log/*; then
    exit
  fi
  let number=$number+1
done

When I run a single copy of this script, I have no issues, however when I start up a few more copies to simultaneously
hit the DB, it crashes quite quickly - usually within 20 or 30 seconds.

After looking through the code I found that when postgres calls write() it doesn't retry.  In order to address the
issue with the PANIC in the WAL writer I set the sync method to o_sync which solved the issue in that part of the
code, however I was still seeing failures in other areas of the code (such as the FileWrite function).  Following
this, I spoke to an NFS guru who pointed out that writes under linux are not guaranteed to complete unless you open
up O_SYNC or similar on the file handle.  I had a look in the libc docs and found this:

http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html

----
The write function writes up to size bytes from buffer to the file with descriptor filedes. The data in buffer is not
necessarily a character string and a null character is output like any other character.

The return value is the number of bytes actually written. This may be size, but can always be smaller. Your program
should always call write in a loop, iterating until all the data is written.
----

After finding this, I checked a number of other pieces of software that we see no issues with on NFS (such as the JVM)
for their usage of write().  I confirmed they write in a while loop and set about patching the postgres source.
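The loop those packages use is short. As a rough sketch only (the helper name "write_all" is illustrative, not the function from the patch):

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Keep calling write() until the whole buffer is out, retrying on EINTR
 * and after short writes.  "write_all" is an illustrative name, not the
 * function from the patch. */
ssize_t
write_all(int fd, const void *buf, size_t amount)
{
    const char *p = buf;
    size_t      remaining = amount;

    while (remaining > 0)
    {
        ssize_t     ret = write(fd, p, remaining);

        if (ret < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before any progress: retry */
            return -1;          /* real error; errno was set by write() */
        }
        if (ret == 0)
            break;              /* no progress made; let the caller decide */
        p += ret;               /* short write: advance and keep going */
        remaining -= ret;
    }
    return (ssize_t) (amount - remaining);
}
```

On success the return value equals amount; anything smaller means the loop stopped without write() reporting an error through errno.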

I have made this patch against 8.4.8 and confirmed that it fixes the issue we see on our systems.  I have also checked
that make check still passes.

As my C is terrible, I would welcome any comments on the implementation of this patch.

Best regards,

George






Attachment

Re: Patch to improve reliability of postgresql on linux nfs

From
Josh Berkus
Date:
George,

I'm quoting you here because in the version of your email which got
posted to the list your whole explanation got put below the patch text,
making it hard to find the justification for the patch.  Follows:

> I run a number of postgresql installations on NFS and on the whole I find this to be very reliable.  I have however
> run into a few issues when there are concurrent writes from multiple processes.
> 
> I see errors such as the following:
> 
> 2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection authorized: user=postgres database=postgres
> 2011-07-31 22:13:35 EST    ERROR:  could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
> 2011-07-31 22:13:35 EST    HINT:  Check free disk space.
> 2011-07-31 22:13:35 EST    CONTEXT:  writing block 1 of relation global/2671
> 2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: host=[local]
> 
> I have also seen similar errors coming out of the WAL writer, however they occur at the level PANIC, which is a
> little more drastic.
> 
> After spending some time with debug logging turned on and even more time staring at strace, I believe this occurs
> when one process was writing to a data file and it received a SIGINT from another process, eg:
> (These logs are from another similar run)
> 
> [pid  1804] <... fsync resumed> )       = 0
> [pid 10198] kill(1804, SIGINT <unfinished ...>
> [pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
> [pid  1804] write(3, "c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 <unfinished ...>
> [pid 10198] <... kill resumed> )        = 0
> [pid  1804] <... write resumed> )       = 4096
> [pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
> [pid  1804] rt_sigreturn(0x2)           = 4096
> [pid  1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 260
> [pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>
> [pid  1802] <... select resumed> )      = 1 (in [5], left {0, 999000})
> [pid  1804] <... rt_sigprocmask resumed> NULL, 8) = 0
> [pid  1804] tgkill(1804, 1804, SIGABRT) = 0
> [pid  1802] read(5,  <unfinished ...>
> [pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
> Process 1804 detached
> 
> After finding this, I came up with the following test case which easily replicated our issue:
> 
> #!/bin/bash
> 
> name=$1
> number=1
> while true; do 
>   /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
>   /usr/bin/createdb -E UNICODE -O $name$number $name$number
>   if grep -q PANIC /data/postgresql/data/pg_log/*; then
>     exit
>   fi
>   let number=$number+1
> done
> 
> When I run a single copy of this script, I have no issues, however when I start up a few more copies to
> simultaneously hit the DB, it crashes quite quickly - usually within 20 or 30 seconds.
> 
> After looking through the code I found that when postgres calls write() it doesn't retry.  In order to address the
> issue with the PANIC in the WAL writer I set the sync method to o_sync which solved the issue in that part of the
> code, however I was still seeing failures in other areas of the code (such as the FileWrite function).  Following
> this, I spoke to an NFS guru who pointed out that writes under linux are not guaranteed to complete unless you open
> up O_SYNC or similar on the file handle.  I had a look in the libc docs and found this:
> 
> http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html
> 
> ----
> The write function writes up to size bytes from buffer to the file with descriptor filedes. The data in buffer is
> not necessarily a character string and a null character is output like any other character.
> 
> The return value is the number of bytes actually written. This may be size, but can always be smaller. Your program
> should always call write in a loop, iterating until all the data is written.
> ----
> 
> After finding this, I checked a number of other pieces of software that we see no issues with on NFS (such as the
> JVM) for their usage of write().  I confirmed they write in a while loop and set about patching the postgres source.
> 
> I have made this patch against 8.4.8 and confirmed that it fixes the issue we see on our systems.  I have also
> checked that make check still passes.
> 
> As my C is terrible, I would welcome any comments on the implementation of this patch.
> 
> Best regards,
> 
> George


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Patch to improve reliability of postgresql on linux nfs

From
Thom Brown
Date:
On 9 September 2011 01:04, George Barnett <gbarnett@atlassian.com> wrote:
> After looking through the code I found that when postgres calls write() it doesn't retry.  In order to address the
> issue with the PANIC in the WAL writer I set the sync method to o_sync which solved the issue in that part of the
> code, however I was still seeing failures in other areas of the code (such as the FileWrite function).  Following
> this, I spoke to an NFS guru who pointed out that writes under linux are not guaranteed to complete unless you open
> up O_SYNC or similar on the file handle.

Have you run the test with varying wal_sync_method values?  On Linux
the default is fdatasync because historically Linux hasn't supported
O_DSYNC (a wal_sync_method value of open_datasync).  But I believe as
of Kernel 2.6.33 it does support it.  Have you tried modifying this
parameter in your tests?  Are you even using Linux? (you haven't
specified)

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch to improve reliability of postgresql on linux nfs

From
Peter Eisentraut
Date:
On fre, 2011-09-09 at 10:04 +1000, George Barnett wrote:
> After looking through the code I found that when postgres calls
> write() it doesn't retry.  In order to address the issue with the
> PANIC in the WAL writer I set the sync method to o_sync which solved
> the issue in that part of the code, however I was still seeing
> failures in other areas of the code (such as the FileWrite function).
> Following this, I spoke to an NFS guru who pointed out that writes
> under linux are not guaranteed to complete unless you open up O_SYNC
> or similar on the file handle.

I've had this problem many years ago.  I recall that changing the mount
options for NFS also fixed it.  Could you post what mount options you
are using?

(We eventually moved away from NFS at that time, so I didn't pursue it
further, but my analysis back then matched yours.)




Re: Patch to improve reliability of postgresql on linux nfs

From
Tom Lane
Date:
George Barnett <gbarnett@atlassian.com> writes:
> [ patch to retry writes on NFS ]

I'm having a hard time getting excited about this idea, because IMO
NFS is insufficiently reliable to run a database on, and no patch like
this can fix that.  However, some concrete points:

1. If writes need to be retried, why not reads?  (No, that's not an
invitation to expand the scope of the patch; it's a question about NFS
implementation.)

2. What is the rationale for supposing that a retry a nanosecond later
will help?  If it will help, why didn't the kernel just do that?

3. What about EINTR?  If you believe that this is necessary, then the
kernel logically has to return EINTR when it would like you to retry but
it hasn't been able to write any bytes yet.  If you get a zero return
you have to assume that means out-of-disk-space.

4. As coded, the patch behaves incorrectly if you get a zero return on a
retry.  If we were going to do this, I think we'd need to absorb the
errno-munging currently done by callers into the writeAll function.

On the whole I think you'd be better off lobbying your NFS implementors
to provide something closer to the behavior of every other filesystem on
the planet.  Or checking to see if you need to adjust your NFS
configuration, as the other responders mentioned.
        regards, tom lane


Re: Patch to improve reliability of postgresql on linux nfs

From
Bernd Helmle
Date:

--On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> On the whole I think you'd be better off lobbying your NFS implementors
> to provide something closer to the behavior of every other filesystem on
> the planet.  Or checking to see if you need to adjust your NFS
> configuration, as the other responders mentioned.

You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, 
otherwise you are out of luck. Oracle and DB2 guys recommend those settings and 
without them any millisecond of network glitch could disturb things 
unreasonably.

-- 
Thanks
Bernd


Re: Patch to improve reliability of postgresql on linux nfs

From
Noah Misch
Date:
On Fri, Sep 09, 2011 at 10:27:22AM -0400, Tom Lane wrote:
> George Barnett <gbarnett@atlassian.com> writes:
> > [ patch to retry writes on NFS ]
> 
> I'm having a hard time getting excited about this idea, because IMO
> NFS is insufficiently reliable to run a database on, and no patch like
> this can fix that.  However, some concrete points:
> 
> 1. If writes need to be retried, why not reads?  (No, that's not an
> invitation to expand the scope of the patch; it's a question about NFS
> implementation.)

To support all POSIX:2008-conforming read() implementations, we must indeed
retry reads.  I suppose the OP never encountered this in practice, though.

> 2. What is the rationale for supposing that a retry a nanosecond later
> will help?  If it will help, why didn't the kernel just do that?

POSIX:2008 unconditionally permits[1] partial writes of requests exceeding 512
bytes (_POSIX_PIPE_BUF).  We shouldn't complain when a kernel provides a
conforming write(), even if it appears that the kernel achieved little by
using some freedom afforded it.

> 3. What about EINTR?  If you believe that this is necessary, then the
> kernel logically has to return EINTR when it would like you to retry but
> it hasn't been able to write any bytes yet.  If you get a zero return
> you have to assume that means out-of-disk-space.

Assuming conforming SA_RESTART for write()/read(), this will not happen.  The
call will restart and resume blocking until it writes something.

nm

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html


Re: Patch to improve reliability of postgresql on linux nfs

From
"Kevin Grittner"
Date:
Noah Misch <noah@leadboat.com> wrote:
> We shouldn't complain when a kernel provides a conforming write(),
> even if it appears that the kernel achieved little by using some
> freedom afforded it.
I remember we had some compiler warnings in the logging area because
we were making the assumption that no implementation would ever use
that freedom.  I suggested we replace the simple, unchecked calls to
write with a function which did what we expected through an API
conforming loop:
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01719.php
The response was that we could ignore the documented API because we
had "never seen nor heard of it being true for writes to disk
files".  I'm still uncomfortable with that.  Where I have seen
people code to implementation details rather than the documented
API, it has often not turned out well in the long run.
I'd still be willing to put together a patch for that if people buy
into it.
-Kevin


Re: Patch to improve reliability of postgresql on linux nfs

From
Bruce Momjian
Date:
Tom Lane wrote:
> George Barnett <gbarnett@atlassian.com> writes:
> > [ patch to retry writes on NFS ]
> 
> I'm having a hard time getting excited about this idea, because IMO
> NFS is insufficiently reliable to run a database on, and no patch like
> this can fix that.  However, some concrete points:
> 
> 1. If writes need to be retried, why not reads?  (No, that's not an
> invitation to expand the scope of the patch; it's a question about NFS
> implementation.)
> 
> 2. What is the rationale for supposing that a retry a nanosecond later
> will help?  If it will help, why didn't the kernel just do that?
> 
> 3. What about EINTR?  If you believe that this is necessary, then the
> kernel logically has to return EINTR when it would like you to retry but
> it hasn't been able to write any bytes yet.  If you get a zero return
> you have to assume that means out-of-disk-space.
> 
> 4. As coded, the patch behaves incorrectly if you get a zero return on a
> retry.  If we were going to do this, I think we'd need to absorb the
> errno-munging currently done by callers into the writeAll function.
> 
> On the whole I think you'd be better off lobbying your NFS implementors
> to provide something closer to the behavior of every other filesystem on
> the planet.  Or checking to see if you need to adjust your NFS
> configuration, as the other responders mentioned.

Can our NFS documentation be improved (section 17.2.1)?
http://www.postgresql.org/docs/9.1/static/creating-cluster.html

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Patch to improve reliability of postgresql on linux nfs

From
George Barnett
Date:
On 10/09/2011, at 1:30 AM, Bernd Helmle wrote:

> --On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> On the whole I think you'd be better off lobbying your NFS implementors
>> to provide something closer to the behavior of every other filesystem on
>> the planet.  Or checking to see if you need to adjust your NFS
>> configuration, as the other responders mentioned.
>
> You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, otherwise you are out of luck. Oracle
> and DB2 guys recommend those settings and without them any millisecond of network glitch could disturb things
> unreasonably.

Hi,

My mount options include hard and intr.

George

Re: Patch to improve reliability of postgresql on linux nfs

From
Florian Pflug
Date:
On Sep12, 2011, at 06:30 , George Barnett wrote:
> On 10/09/2011, at 1:30 AM, Bernd Helmle wrote:
>
>> --On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>>> On the whole I think you'd be better off lobbying your NFS implementors
>>> to provide something closer to the behavior of every other filesystem on
>>> the planet.  Or checking to see if you need to adjust your NFS
>>> configuration, as the other responders mentioned.
>>
>> You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, otherwise you are out of luck. Oracle
>> and DB2 guys recommend those settings and without them any millisecond of network glitch could disturb things
>> unreasonably.
>
> My mount options include hard and intr.

If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.

Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial
read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should
be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should
prevent us from ever seeing that.

best regards,
Florian Pflug



Re: Patch to improve reliability of postgresql on linux nfs

From
George Barnett
Date:
On 12/09/2011, at 3:59 PM, Florian Pflug wrote:

> If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
>
> Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial
> read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should
> be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should
> prevent us from ever seeing that.


Hi Florian,

You are indeed correct.  Setting nointr also resolves my issue.  I could swear I checked this, but obviously not.

It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will
consider a patch to resolve that.

Thanks in advance,

George

Re: Patch to improve reliability of postgresql on linux nfs

From
Peter Eisentraut
Date:
On mån, 2011-09-12 at 16:46 +1000, George Barnett wrote:
> On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
> 
> > If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
> > 
> > Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial
> > read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity
> > should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if
> > SA_RESTART should prevent us from ever seeing that.
> 
> 
> Hi Florian,
> 
> You are indeed correct.  Setting nointr also resolves my issue.  I could swear I checked this, but obviously not.
> 
> It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will
> consider a patch to resolve that.

We have signal handling configured so that system calls are not
interrupted.  So there is ordinarily no reason to do anything more
graceful.  The problem is that NFS is in this case not observing that
setting.  It's debatable whether it's worth supporting that; just saying
that the code is correct as it stands.




Re: Patch to improve reliability of postgresql on linux nfs

From
"ktm@rice.edu"
Date:
On Mon, Sep 12, 2011 at 04:46:53PM +1000, George Barnett wrote:
> On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
> 
> > If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
> > 
> > Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial
> > read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity
> > should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if
> > SA_RESTART should prevent us from ever seeing that.
> 
> 
> Hi Florian,
> 
> You are indeed correct.  Setting nointr also resolves my issue.  I could swear I checked this, but obviously not.
> 
> It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will
> consider a patch to resolve that.
> 
> Thanks in advance,
> 
> George

Hi George,

Many, many, many other software packages expect I/O usage to be the same on
an NFS volume and a local disk volume, including Oracle. Coding every application,
or more likely mis-coding, to handle this gives every application another chance
to get it wrong. If the OS does this, when it gets it right, all of the apps get
it right. I think you should be surprised when other software actually deals with
broken I/O semantics gracefully rather than concerned when one of a pantheon of
programs does not. My two cents.

Regards,
Ken


Re: Patch to improve reliability of postgresql on linux nfs

From
Florian Pflug
Date:
On Sep12, 2011, at 14:54 , Peter Eisentraut wrote:
> On mån, 2011-09-12 at 16:46 +1000, George Barnett wrote:
>> On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
>>> Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial
>>> read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity
>>> should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if
>>> SA_RESTART should prevent us from ever seeing that.
>>
>> It does still concern me that pgsql did not deal with this as gracefully as other software.  I hope the list will
>> consider a patch to resolve that.
>
> We have signal handling configured so that system calls are not
> interrupted.  So there is ordinarily no reason to do anything more
> graceful.  The problem is that NFS is in this case not observing that
> setting.  It's debatable whether it's worth supporting that; just saying
> that the code is correct as it stands.

SA_RESTART doesn't protect against partial reads/writes due to signal delivery,
it only removes the need to check for EINTR. In other words, it retries until
at least one byte has been written, not until all bytes have been written.

The GNU LibC documentation has this to say on the subject:
 "There is one situation where resumption never happens no matter which
 choice you make: when a data-transfer function such as read or write is
 interrupted by a signal after transferring part of the data. In this case,
 the function returns the number of bytes already transferred, indicating
 partial success."[1]

While it's true that reads and writes are by tradition non-interruptible, I
personally wouldn't bet that they'll stay that way forever. It all depends on
whether the timeouts involved in the communication with a disk are short enough
to mask the difference - once they get too long (or even infinite like in the
case of "hard" NFS mounts) you pay for non-interruptible primitives with
un-killable stuck processes. Since the current trend is to move storage further
away from processing, and to put non-deterministic networks like ethernet between
the two, I expect situations in which read/write primitives are interruptible
to increase, not decrease.

And BTW, the GNU LibC documentation doesn't seem to mention anything about
local reads and writes being non-interruptible. In fact, it even says:
 "A signal can arrive and be handled while an I/O primitive such as open
 or read is waiting for an I/O device. If the signal handler returns, the
 system faces the question: what should happen next?"[1]

Had the GNU people faith in local reads and writes being non-interruptible, they'd
probably have said "network device" or "remote device", not "I/O device".

best regards,
Florian Pflug

[1] http://www.gnu.org/s/hello/manual/libc/Interrupted-Primitives.html#Interrupted-Primitives

Re: Patch to improve reliability of postgresql on linux nfs

From
Florian Pflug
Date:
On Sep12, 2011, at 14:54 , ktm@rice.edu wrote:
> Many, many, many other software packages expect I/O usage to be the same on
> an NFS volume and a local disk volume, including Oracle. Coding every application,
> or more likely mis-coding, to handle this gives every application another chance
> to get it wrong. If the OS does this, when it gets it right, all of the apps get
> it right. I think you should be surprised when other software actually deals with
> broken I/O semantics gracefully rather than concerned when one of a pantheon of
> programs does not. My two cents.

I don't buy that. People seem to be perfectly able to code correct networking
applications (correct from a read/write API POV at least), yet those applications
need to deal with partial reads and writes too.

Really, it's not *that* hard to put a retry loop around "read" and "write".
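A matching loop for read() is just as small. A sketch under the same conventions (the helper name "read_all" is illustrative, not an existing PostgreSQL function):

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Keep calling read() until "amount" bytes arrive, end-of-file, or a
 * real error.  A zero return from read() means EOF, so the loop stops
 * there rather than treating it as a failure. */
ssize_t
read_all(int fd, void *buf, size_t amount)
{
    char   *p = buf;
    size_t  remaining = amount;

    while (remaining > 0)
    {
        ssize_t ret = read(fd, p, remaining);

        if (ret < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before any progress: retry */
            return -1;          /* real error; errno was set by read() */
        }
        if (ret == 0)
            break;              /* end of file */
        p += ret;               /* short read: advance and keep going */
        remaining -= ret;
    }
    return (ssize_t) (amount - remaining);
}
```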

Also, non-interruptible IO primitives are by no means "right". At best, they're
a compromise between complexity and functionality for I/O devices with rather
short (and bounded) communication timeouts - because in that case, processes are
only blocked un-interruptibly for a short while.

best regards,
Florian Pflug



Re: Patch to improve reliability of postgresql on linux nfs

From
Robert Haas
Date:
On Mon, Sep 12, 2011 at 9:39 AM, Florian Pflug <fgp@phlo.org> wrote:
> Really, it's not *that* hard to put a retry loop around "read" and "write".

+1.  I don't see what we're gaining by digging in our heels on this
one.  Retry loops around read() and write() are pretty routine, and I
wouldn't like to bet this is the only case where not having them could
cause an unnecessary failure.

Now, that having been said, I *really* think we could use some better
documentation on which mount options we believe to be safe, and not
just for NFS.  Right now we have practically nothing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Patch to improve reliability of postgresql on linux nfs

From
Florian Pflug
Date:
[CC'ing to the list again - I assume you omitted pgsql-hackers from the
recipient list by accident]

On Sep13, 2011, at 03:00 , George Barnett wrote:
> On 12/09/2011, at 11:39 PM, Florian Pflug wrote:
>> Also, non-interruptible IO primitives are by no means "right". At best, they're
>> a compromise between complexity and functionality for I/O devices with rather
>> short (and bounded) communication timeouts - because in that case, processes are
>> only blocked un-interruptibly for a short while.
>
> Just to expand on that - I'm now in the situation where I can run my nfs mounts
> 'nointr' and postgres will work, but that means if I lose a storage unit I have
> a number of stuck processes, effectively meaning I need to reboot all my frontend
> servers before I can fail over to backup nfs stores.
>
> However, if I run the mounts with intr, then if a storage unit fails, I can fail
> over to a backup node (taking a minor loss of data hit I'm willing to accept) but
> postgres breaks under a moderate insert load.
>
> With the patch I supplied though, I'm able to have most of my cake and eat it.
>
> I'd be very interested in moving this forward - is there something I can change
> in the patch to make it more acceptable for a merge?

Here are a few comments

Tom already remarked that if we do that for write()s, we ought to do it for read()s
also, which I agree with. All other primitives like lseek, close, ... should be taken
care of by SA_RESTART, but it'd be a good idea to verify that.

Also, I don't think that POSIX mandates that errno be reset to 0 if a function returns
successfully, making that "returnCode == 0 && errno == 0" check pretty dubious. I'm not
sure if this was what Tom was getting at with his remark about the ENOSPC handling being
wrong in the retry case.

And I also think that if we do this, we might as well handle EINTR correctly, even if
our use of SA_RESTART should prevent us from ever seeing that. The rules surrounding
EINTR and SA_RESTART for read/write are quite subtle...

If we retry, shouldn't we do CHECK_FOR_INTERRUPTS()? Otherwise, processes waiting for
a vanished NFS server would be killable only with SIGKILL, not SIGTERM or SIGINT.
But I'm not sure if it's safe to put that into a generic function like pg_write_nointr.

Finally, WriteAll() seems like a poor name for that function. How about pg_write_nointr()?

Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
(but obviously without the ENOSPC handling)

int pg_write_nointr(int fd, const void *bytes, Size amount)
{
    int written = 0;

    while (amount > 0)
    {
        int ret;

        ret = write(fd, bytes, amount);
        if ((ret < 0) && (errno == EINTR))
        {
            /* interrupted by signal before first byte was written. Retry */
            /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
            CHECK_FOR_INTERRUPTS();
            continue;
        }
        else if (ret < 0)
        {
            /* error occurred. Abort */
            return -1;
        }
        else if (ret == 0)
        {
            /* out of disk space. Abort */
            return written;
        }

        /* made progress */
        /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
        CHECK_FOR_INTERRUPTS();

        written += ret;
        amount -= ret;
        bytes = (const char *) bytes + ret;
    }

    return written;
}

best regards,
Florian Pflug




Re: Patch to improve reliability of postgresql on linux nfs

From
Florian Pflug
Date:
On Sep13, 2011, at 13:07 , Florian Pflug wrote:
> Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
> (but obviously without the ENOSPC handling)
>
> <wrong pg_write_nointr implementation snipped>

Sorry for the self-reply. I realized only after hitting send that I
got the ENOSPC handling wrong again - we probably ought to check for
ENOSPC as well as ret == 0. Also, it seems preferable to return the
number of bytes actually written instead of -1 if we hit an error during
retry.

With this version, any return value other than <amount> signals an
error, the number of actually written bytes is reported even in the
case of an error (to the best of pg_write_nointr's knowledge), and
errno always indicates the kind of error.

int pg_write_nointr(int fd, const void *bytes, Size amount)
{
  int written = 0;

  while (amount > 0)
  {
    int ret;

    ret = write(fd, bytes, amount);

    if ((ret < 0) && (errno == EINTR))
    {
      /* interrupted by signal before first byte was written. Retry */

      /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
      CHECK_FOR_INTERRUPTS();

      continue;
    }
    else if (ret < 1)
    {
      /* error occurred. Abort */

      if (ret == 0)
        /* out of disk space */
        errno = ENOSPC;

      if (written == 0)
        return -1;
      else
        return written;
    }

    /* made progress */
    written += ret;
    amount -= ret;
    bytes = (const char *) bytes + ret;

    /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
    CHECK_FOR_INTERRUPTS();
  }
}

best regards,
Florian Pflug



Re: Patch to improve reliability of postgresql on linux nfs

From
"ktm@rice.edu"
Date:
On Tue, Sep 13, 2011 at 01:30:34PM +0200, Florian Pflug wrote:
> On Sep13, 2011, at 13:07 , Florian Pflug wrote:
> > Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
> > (but obviously without the ENOSPC handling)
> > 
> > <wrong pg_write_nointr implementation snipped>
> 
> Sorry for the self-reply. I realized only after hitting send that I
> got the ENOSPC handling wrong again - we probably ought to check for
> ENOSPC as well as ret == 0. Also, it seems preferable to return the
> number of bytes actually written instead of -1 if we hit an error during
> retry.
> 
> With this version, any return value other than <amount> signals an
> error, the number of actually written bytes is reported even in the
> case of an error (to the best of pg_write_nointr's knowledge), and
> errno always indicates the kind of error.
> 
> int pg_write_nointr(int fd, const void *bytes, Size amount)
> {
>  int written = 0;
> 
>  while (amount > 0)
>  {
>    int ret;
> 
>    ret = write(fd, bytes, amount);
> 
>    if ((ret < 0) && (errno == EINTR))
>    {
>      /* interrupted by signal before first byte was written. Retry */
> 
>      /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
>      CHECK_FOR_INTERRUPTS();
> 
>      continue;
>    }
>    else if (ret < 1)
>    {
>      /* error occurred. Abort */
> 
>      if (ret == 0)
>        /* out of disk space */
>        errno = ENOSPC;
> 
>      if (written == 0)
>        return -1;
>      else
>        return written;
>    }
> 
>    /* made progress */
>    written += ret;
>    amount -= ret;
>    bytes = (const char *) bytes + ret;
>    
>    /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
>    CHECK_FOR_INTERRUPTS();
>  }
> }
> 
> best regards,
> Florian Pflug
> 

It will be interesting to see if there are any performance ramifications to
this new write function.

Regards,
Ken


Re: Patch to improve reliability of postgresql on linux nfs

From
Florian Pflug
Date:
On Sep13, 2011, at 14:58 , ktm@rice.edu wrote:
> It will be interesting to see if there are any performance ramifications to
> this new write function.

What would those be? For non-interruptible reads and writes, the overhead
comes down to an additional function call (if we don't make pg_write_nointr
inlined) and a few conditional jumps (which branch prediction should be
able to take care of). These are bound to disappear in the noise compared
to the cost of the actual syscall.

best regards,
Florian Pflug



Re: Patch to improve reliability of postgresql on linux nfs

From
"ktm@rice.edu"
Date:
On Tue, Sep 13, 2011 at 03:02:57PM +0200, Florian Pflug wrote:
> On Sep13, 2011, at 14:58 , ktm@rice.edu wrote:
> > It will be interesting to see if there are any performance ramifications to
> > this new write function.
> 
> What would those be? For non-interruptible reads and writes, the overhead
> comes down to an additional function call (if we don't make pg_write_nointr
> inlined) and a few conditional jumps (which branch prediction should be
> able to take care of). These are bound to disappear in the noise compared
> to the cost of the actual syscall.
> 
> best regards,
> Florian Pflug
> 
That would be my expectation too. It is just always nice to benchmark changes,
just in case. I have had similar simple changes blow out a cache and have a
much greater impact on performance than might be expected from inspection. :)

Regards,
Ken


Re: Patch to improve reliability of postgresql on linux nfs

From
Aidan Van Dyk
Date:
On Tue, Sep 13, 2011 at 7:30 AM, Florian Pflug <fgp@phlo.org> wrote:

>
> Sorry for the self-reply. I realized only after hitting send that I
> got the ENOSPC handling wrong again - we probably ought to check for
> ENOSPC as well as ret == 0. Also, it seems preferable to return the
> number of bytes actually written instead of -1 if we hit an error during
> retry.
>
> With this version, any return value other than <amount> signals an
> error, the number of actually written bytes is reported even in the
> case of an error (to the best of pg_write_nointr's knowledge), and
> errno always indicates the kind of error.

Personally, I'd think that's ripe for bugs. If the contract is that
ret != amount is the "error" case, then don't return -1 for an error
*sometimes*.

If you sometimes return -1 for an error, even though ret != amount is
the *real* test, I'm going to guess there will be lots of chances for
code to do: if (pg_write_nointr(...) < 0)  ...

which will only catch some of the errors, and happily continue with the rest...

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: Patch to improve reliability of postgresql on linux nfs

From
Florian Pflug
Date:
On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
> On Tue, Sep 13, 2011 at 7:30 AM, Florian Pflug <fgp@phlo.org> wrote:
>> Sorry for the self-reply. I realized only after hitting send that I
>> got the ENOSPC handling wrong again - we probably ought to check for
>> ENOSPC as well as ret == 0. Also, it seems preferable to return the
>> number of bytes actually written instead of -1 if we hit an error during
>> retry.
>>
>> With this version, any return value other than <amount> signals an
>> error, the number of actually written bytes is reported even in the
>> case of an error (to the best of pg_write_nointr's knowledge), and
>> errno always indicates the kind of error.
>
> Personally, I'ld think that's ripe for bugs.   If the contract is that
> ret != amount is the "error" case, then don't return -1 for an error
> *sometimes*.

Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
will return the number of bytes written, which may be less than the requested
number if there's not enough free space, or -1 in case of an error like
an invalid fd being passed.

> If you sometimes return -1 for an error, even though ret != amount is
> the *real* test, I'm going to guess there will be lots of chance for
> code to do:
>  if (pg_write_no_intr(...) < 0)
>   ...
>
> which will only catch some of the errors, and happily continue with the rest...

Yeah, but that's equally wrong for plain write(), so I'm not sure I share
your concern there. Also, I'm not sure how to improve that. We could always
return -1 in case of an error, and "amount" in case of success, but that makes
it impossible to determine how many bytes were actually written (and also feels
awkward). Or we could return 0 instead of -1 if there was an error and zero
bytes were written. But that feels awkward also...

One additional possibility would be to make the signature
 bool pg_write_nointr(int fd, const void *bytes, int len, int *written)

and simply return true on success and false on error. Callers who're interested
in the number of bytes actually written (in the case of an error) would need to
pass some non-NULL pointer for <written>, while all others would just pass NULL.

Thoughts?

best regards,
Florian Pflug




Re: Patch to improve reliability of postgresql on linux nfs

From
Tom Lane
Date:
Florian Pflug <fgp@phlo.org> writes:
> On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
>> Personally, I'ld think that's ripe for bugs.   If the contract is that
>> ret != amount is the "error" case, then don't return -1 for an error
>> *sometimes*.

> Hm, but isn't that how write() works also?

Yeah.  It's not possible to maintain the same error-reporting contract
that bare write() has got, unless you're willing to forget about actual
errors reported by a non-first write attempt.  Which might not be
totally unreasonable, because presumably something similar is going on
under the hood within write() itself.  Most of the errors one might
think are worth reporting would have had to occur on the first write
attempt anyway.

But if you do want to report such errors, I think you have to push the
error reporting logic into the subroutine, which seems a bit messy since
there's quite a variety of error message phrasings out there, all of
which require information that write() itself does not have.  Also, we
do *not* want e.g. gettext() to be invoked unless an error actually
occurs and has to be reported.
        regards, tom lane


Re: Patch to improve reliability of postgresql on linux nfs

From
Aidan Van Dyk
Date:
On Tue, Sep 13, 2011 at 10:14 AM, Florian Pflug <fgp@phlo.org> wrote:

>> Personally, I'ld think that's ripe for bugs.   If the contract is that
>> ret != amount is the "error" case, then don't return -1 for an error
>> *sometimes*.
>
> Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
> will return the number of bytes written, which may be less than the requested
> number if there's not enough free space, or -1 in case of an error like
> an invalid fd being passed.

Looking through the code, it appears as if all the write calls I've
seen are checking ret != amount, so it's probably not as big a deal
for PG as I fear...

But there is a subtle change in semantics: for the system write(),
ret != amount is not necessarily a real error (so errno is not set),
whereas for pg_write, ret != amount happens only after a "real error"
(errno should be set). That difference could yet lead to confusion.

a.


--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: Patch to improve reliability of postgresql on linux nfs

From
Florian Pflug
Date:
On Sep13, 2011, at 16:25 , Tom Lane wrote:
> Florian Pflug <fgp@phlo.org> writes:
>> On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
>>> Personally, I'ld think that's ripe for bugs.   If the contract is that
>>> ret != amount is the "error" case, then don't return -1 for an error
>>> *sometimes*.
> 
>> Hm, but isn't that how write() works also?
> 
> Yeah.  It's not possible to maintain the same error-reporting contract
> that bare write() has got, unless you're willing to forget about actual
> errors reported by a non-first write attempt.

Hm, yeah, but we're only replacing the exclusive or in "either sets errno
*or* returns >= 0 and < amount" by a non-exclusive one. Which, in practice,
doesn't make much difference for callers. They can (and should) continue to
check whether the correct amount of bytes has been written, and they may
still use errno to distinguish different kinds of errors. They should just
do so upon any error condition, not upon us returning -1.

The important thing, I believe, is that we don't withhold any information
from callers, which we don't. If write() sets errno, it must return -1,
so we'll abort and hence leave the errno in place to be inspected by the
caller. And we faithfully track the actual number of bytes written.

Or am I missing something?

> But if you do want to report such errors, I think you have to push the
> error reporting logic into the subroutine, which seems a bit messy since
> there's quite a variety of error message phrasings out there, all of
> which require information that write() itself does not have.  Also, we
> do *not* want e.g. gettext() to be invoked unless an error actually
> occurs and has to be reported.

Yeah, I had the same idea (moving the error reporting into the subroutine)
when I first looked at the OP's patch, but then figured it'd just complicate
the API for no good reason.

best regards,
Florian Pflug




Re: Patch to improve reliability of postgresql on linux nfs

From
Bruce Momjian
Date:
On Tue, Sep 13, 2011 at 10:32:01AM -0400, Aidan Van Dyk wrote:
> On Tue, Sep 13, 2011 at 10:14 AM, Florian Pflug <fgp@phlo.org> wrote:
> 
> >> Personally, I'ld think that's ripe for bugs.   If the contract is that
> >> ret != amount is the "error" case, then don't return -1 for an error
> >> *sometimes*.
> >
> > Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
> > will return the number of bytes written, which may be less than the requested
> > number if there's not enough free space, or -1 in case of an error like
> > an invalid fd being passed.
> 
> Looking through the code, it appears as if all the write calls I've
> seen are checking ret != amount, so it's probably not as big a deal
> for PG as I fear...
> 
> But the subtle change in semantics (from system write ret != amount
> not necessarily a real error, hence no errno set) of pg_write ret !=
> amount only happening after a "real error" (errno should be set) is
> one that could yet lead to confusion.

I assume there is no TODO here.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +