Thread: Patch to improve reliability of PostgreSQL on Linux NFS
Hi Hackers,

I run a number of PostgreSQL installations on NFS and on the whole I find this to be very reliable. I have however run into a few issues when there are concurrent writes from multiple processes. I see errors such as the following:

2011-07-31 22:13:35 EST postgres postgres [local] LOG: connection authorized: user=postgres database=postgres
2011-07-31 22:13:35 EST ERROR: could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
2011-07-31 22:13:35 EST HINT: Check free disk space.
2011-07-31 22:13:35 EST CONTEXT: writing block 1 of relation global/2671
2011-07-31 22:13:35 EST [unknown] [unknown] LOG: connection received: host=[local]

I have also seen similar errors coming out of the WAL writer, however they occur at the level PANIC, which is a little more drastic.

After spending some time with debug logging turned on and even more time staring at strace, I believe this occurs when one process was writing to a data file and it received a SIGINT from another process, eg:
(These logs are from another similar run)

[pid 1804] <... fsync resumed> ) = 0
[pid 10198] kill(1804, SIGINT <unfinished ...>
[pid 1804] lseek(3, 4915200, SEEK_SET) = 4915200
[pid 1804] write(3, "c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 <unfinished ...>
[pid 10198] <... kill resumed> ) = 0
[pid 1804] <... write resumed> ) = 4096
[pid 1804] --- SIGINT (Interrupt) @ 0 (0) ---
[pid 1804] rt_sigreturn(0x2) = 4096
[pid 1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 260
[pid 1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT], <unfinished ...>
[pid 1802] <... select resumed> ) = 1 (in [5], left {0, 999000})
[pid 1804] <... rt_sigprocmask resumed> NULL, 8) = 0
[pid 1804] tgkill(1804, 1804, SIGABRT) = 0
[pid 1802] read(5, <unfinished ...>
[pid 1804] --- SIGABRT (Aborted) @ 0 (0) ---
Process 1804 detached

Note that the interrupted write() returned only 4096 of the requested 32768 bytes.

After finding this, I came up with the following test case which easily replicated our issue:

#!/bin/bash

name=$1
number=1
while true; do
    /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
    /usr/bin/createdb -E UNICODE -O $name$number $name$number
    if grep -q PANIC /data/postgresql/data/pg_log/*; then
        exit
    fi
    let number=$number+1
done

When I run a single copy of this script I have no issues, however when I start up a few more copies to simultaneously hit the DB, it crashes quite quickly - usually within 20 or 30 seconds.

After looking through the code I found that when postgres calls write() it doesn't retry. In order to address the issue with the PANIC in the WAL writer I set the sync method to o_sync, which solved the issue in that part of the code; however, I was still seeing failures in other areas of the code (such as the FileWrite function). Following this, I spoke to an NFS guru who pointed out that writes under Linux are not guaranteed to complete unless you open up O_SYNC or similar on the file handle. I had a look in the libc docs and found this:

http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html

----
The write function writes up to size bytes from buffer to the file with descriptor filedes. The data in buffer is not necessarily a character string and a null character is output like any other character.

The return value is the number of bytes actually written. This may be size, but can always be smaller. Your program should always call write in a loop, iterating until all the data is written.
----

After finding this, I checked a number of other pieces of software that we see no issues with on NFS (such as the JVM) for their usage of write(). I confirmed they write in a while loop, and set about patching the postgres source.

I have made this patch against 8.4.8 and confirmed that it fixes the issue we see on our systems. I have also checked that make check still passes.

As my C is terrible, I would welcome any comments on the implementation of this patch.

Best regards,

George
Attachment
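The patch itself appears only as an attachment above; for readers of the archive, the idiom it implements is the one the quoted libc manual describes: call write() in a loop until all the data is written. Below is a minimal, hypothetical sketch of that idiom - an illustration only, not the attached patch text (the name write_all is invented here):

    /* Hypothetical sketch of the retry idiom from the libc manual quoted
     * above; an illustration, not the attached patch. Loops until all
     * bytes are written, retrying after partial writes and EINTR. */
    #include <errno.h>
    #include <unistd.h>

    static ssize_t
    write_all(int fd, const void *buf, size_t amount)
    {
        size_t  written = 0;

        while (written < amount)
        {
            ssize_t ret = write(fd, (const char *) buf + written,
                                amount - written);

            if (ret < 0)
            {
                if (errno == EINTR)
                    continue;   /* interrupted before any byte was written */
                return -1;      /* genuine error; errno set by write() */
            }
            if (ret == 0)
            {
                errno = ENOSPC; /* a zero return usually means out of space */
                return -1;
            }
            written += (size_t) ret;    /* partial write: advance and retry */
        }
        return (ssize_t) written;
    }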
George,

I'm quoting you here because in the version of your email which got posted to the list your whole explanation got put below the patch text, making it hard to find the justification for the patch. Follows:

> I run a number of PostgreSQL installations on NFS and on the whole I find this to be very reliable. I have however run into a few issues when there are concurrent writes from multiple processes.
>
> I see errors such as the following:
>
> 2011-07-31 22:13:35 EST postgres postgres [local] LOG: connection authorized: user=postgres database=postgres
> 2011-07-31 22:13:35 EST ERROR: could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
> 2011-07-31 22:13:35 EST HINT: Check free disk space.
> 2011-07-31 22:13:35 EST CONTEXT: writing block 1 of relation global/2671
> 2011-07-31 22:13:35 EST [unknown] [unknown] LOG: connection received: host=[local]
>
> I have also seen similar errors coming out of the WAL writer, however they occur at the level PANIC, which is a little more drastic.
>
> After spending some time with debug logging turned on and even more time staring at strace, I believe this occurs when one process was writing to a data file and it received a SIGINT from another process, eg:
> (These logs are from another similar run)
>
> [pid 1804] <... fsync resumed> ) = 0
> [pid 10198] kill(1804, SIGINT <unfinished ...>
> [pid 1804] lseek(3, 4915200, SEEK_SET) = 4915200
> [pid 1804] write(3, "c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 <unfinished ...>
> [pid 10198] <... kill resumed> ) = 0
> [pid 1804] <... write resumed> ) = 4096
> [pid 1804] --- SIGINT (Interrupt) @ 0 (0) ---
> [pid 1804] rt_sigreturn(0x2) = 4096
> [pid 1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 260
> [pid 1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT], <unfinished ...>
> [pid 1802] <... select resumed> ) = 1 (in [5], left {0, 999000})
> [pid 1804] <... rt_sigprocmask resumed> NULL, 8) = 0
> [pid 1804] tgkill(1804, 1804, SIGABRT) = 0
> [pid 1802] read(5, <unfinished ...>
> [pid 1804] --- SIGABRT (Aborted) @ 0 (0) ---
> Process 1804 detached
>
> After finding this, I came up with the following test case which easily replicated our issue:
>
> #!/bin/bash
>
> name=$1
> number=1
> while true; do
>     /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
>     /usr/bin/createdb -E UNICODE -O $name$number $name$number
>     if grep -q PANIC /data/postgresql/data/pg_log/*; then
>         exit
>     fi
>     let number=$number+1
> done
>
> When I run a single copy of this script I have no issues, however when I start up a few more copies to simultaneously hit the DB, it crashes quite quickly - usually within 20 or 30 seconds.
>
> After looking through the code I found that when postgres calls write() it doesn't retry. In order to address the issue with the PANIC in the WAL writer I set the sync method to o_sync, which solved the issue in that part of the code; however, I was still seeing failures in other areas of the code (such as the FileWrite function). Following this, I spoke to an NFS guru who pointed out that writes under Linux are not guaranteed to complete unless you open up O_SYNC or similar on the file handle. I had a look in the libc docs and found this:
>
> http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html
>
> ----
> The write function writes up to size bytes from buffer to the file with descriptor filedes. The data in buffer is not necessarily a character string and a null character is output like any other character.
>
> The return value is the number of bytes actually written. This may be size, but can always be smaller. Your program should always call write in a loop, iterating until all the data is written.
> ----
>
> After finding this, I checked a number of other pieces of software that we see no issues with on NFS (such as the JVM) for their usage of write(). I confirmed they write in a while loop, and set about patching the postgres source.
>
> I have made this patch against 8.4.8 and confirmed that it fixes the issue we see on our systems. I have also checked that make check still passes.
>
> As my C is terrible, I would welcome any comments on the implementation of this patch.
>
> Best regards,
>
> George

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 9 September 2011 01:04, George Barnett <gbarnett@atlassian.com> wrote:
> After looking through the code I found that when postgres calls write() it doesn't retry. In order to address the issue with the PANIC in the WAL writer I set the sync method to o_sync, which solved the issue in that part of the code; however, I was still seeing failures in other areas of the code (such as the FileWrite function). Following this, I spoke to an NFS guru who pointed out that writes under Linux are not guaranteed to complete unless you open up O_SYNC or similar on the file handle.

Have you run the test with varying wal_sync_method values? On Linux the default is fdatasync, because historically Linux hasn't supported O_DSYNC (a wal_sync_method value of open_datasync), but I believe it does support it as of kernel 2.6.33. Have you tried modifying this parameter in your tests? Are you even using Linux? (You haven't specified.)

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On fre, 2011-09-09 at 10:04 +1000, George Barnett wrote:
> After looking through the code I found that when postgres calls
> write() it doesn't retry. In order to address the issue with the
> PANIC in the WAL writer I set the sync method to o_sync which solved
> the issue in that part of the code, however I was still seeing
> failures in other areas of the code (such as the FileWrite function).
> Following this, I spoke to an NFS guru who pointed out that writes
> under Linux are not guaranteed to complete unless you open up O_SYNC
> or similar on the file handle.

I've had this problem many years ago. I recall that changing the mount options for NFS also fixed it. Could you post what mount options you are using?

(We eventually moved away from NFS at that time, so I didn't pursue it further, but my analysis back then matched yours.)
George Barnett <gbarnett@atlassian.com> writes:
> [ patch to retry writes on NFS ]

I'm having a hard time getting excited about this idea, because IMO NFS is insufficiently reliable to run a database on, and no patch like this can fix that. However, some concrete points:

1. If writes need to be retried, why not reads? (No, that's not an invitation to expand the scope of the patch; it's a question about NFS implementation.)

2. What is the rationale for supposing that a retry a nanosecond later will help? If it will help, why didn't the kernel just do that?

3. What about EINTR? If you believe that this is necessary, then the kernel logically has to return EINTR when it would like you to retry but it hasn't been able to write any bytes yet. If you get a zero return you have to assume that means out-of-disk-space.

4. As coded, the patch behaves incorrectly if you get a zero return on a retry. If we were going to do this, I think we'd need to absorb the errno-munging currently done by callers into the writeAll function.

On the whole I think you'd be better off lobbying your NFS implementors to provide something closer to the behavior of every other filesystem on the planet. Or checking to see if you need to adjust your NFS configuration, as the other responders mentioned.

regards, tom lane
--On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:

> On the whole I think you'd be better off lobbying your NFS implementors
> to provide something closer to the behavior of every other filesystem on
> the planet. Or checking to see if you need to adjust your NFS
> configuration, as the other responders mentioned.

You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, otherwise you are out of luck. Oracle and DB2 guys recommend those settings, and without them even a millisecond of network glitch could disturb things unreasonably.

--
Thanks

Bernd
On Fri, Sep 09, 2011 at 10:27:22AM -0400, Tom Lane wrote:
> George Barnett <gbarnett@atlassian.com> writes:
> > [ patch to retry writes on NFS ]
>
> I'm having a hard time getting excited about this idea, because IMO
> NFS is insufficiently reliable to run a database on, and no patch like
> this can fix that. However, some concrete points:
>
> 1. If writes need to be retried, why not reads? (No, that's not an
> invitation to expand the scope of the patch; it's a question about NFS
> implementation.)

To support all POSIX:2008-conforming read() implementations, we must indeed retry reads. I suppose the OP never encountered this in practice, though.

> 2. What is the rationale for supposing that a retry a nanosecond later
> will help? If it will help, why didn't the kernel just do that?

POSIX:2008 unconditionally permits[1] partial writes of requests exceeding 512 bytes (_POSIX_PIPE_BUF). We shouldn't complain when a kernel provides a conforming write(), even if it appears that the kernel achieved little by using some freedom afforded it.

> 3. What about EINTR? If you believe that this is necessary, then the
> kernel logically has to return EINTR when it would like you to retry but
> it hasn't been able to write any bytes yet. If you get a zero return
> you have to assume that means out-of-disk-space.

Assuming conforming SA_RESTART for write()/read(), this will not happen. The call will restart and resume blocking until it writes something.

nm

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
Noah Misch <noah@leadboat.com> wrote:

> We shouldn't complain when a kernel provides a conforming write(),
> even if it appears that the kernel achieved little by using some
> freedom afforded it.

I remember we had some compiler warnings in the logging area because we were making the assumption that no implementation would ever use that freedom. I suggested we replace the simple, unchecked calls to write with a function which did what we expected through an API-conforming loop:

http://archives.postgresql.org/pgsql-hackers/2011-02/msg01719.php

The response was that we could ignore the documented API because we had "never seen nor heard of it being true for writes to disk files". I'm still uncomfortable with that. Where I have seen people code to implementation details rather than the documented API, it has often not turned out well in the long run.

I'd still be willing to put together a patch for that if people buy into it.

-Kevin
Tom Lane wrote:
> George Barnett <gbarnett@atlassian.com> writes:
> > [ patch to retry writes on NFS ]
>
> I'm having a hard time getting excited about this idea, because IMO
> NFS is insufficiently reliable to run a database on, and no patch like
> this can fix that. However, some concrete points:
>
> 1. If writes need to be retried, why not reads? (No, that's not an
> invitation to expand the scope of the patch; it's a question about NFS
> implementation.)
>
> 2. What is the rationale for supposing that a retry a nanosecond later
> will help? If it will help, why didn't the kernel just do that?
>
> 3. What about EINTR? If you believe that this is necessary, then the
> kernel logically has to return EINTR when it would like you to retry but
> it hasn't been able to write any bytes yet. If you get a zero return
> you have to assume that means out-of-disk-space.
>
> 4. As coded, the patch behaves incorrectly if you get a zero return on a
> retry. If we were going to do this, I think we'd need to absorb the
> errno-munging currently done by callers into the writeAll function.
>
> On the whole I think you'd be better off lobbying your NFS implementors
> to provide something closer to the behavior of every other filesystem on
> the planet. Or checking to see if you need to adjust your NFS
> configuration, as the other responders mentioned.

Can our NFS documentation be improved (section 17.2.1)?

http://www.postgresql.org/docs/9.1/static/creating-cluster.html

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ It's impossible for everything to be true. +
On 10/09/2011, at 1:30 AM, Bernd Helmle wrote:

> --On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> On the whole I think you'd be better off lobbying your NFS implementors
>> to provide something closer to the behavior of every other filesystem on
>> the planet. Or checking to see if you need to adjust your NFS
>> configuration, as the other responders mentioned.
>
> You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, otherwise you are out of luck. Oracle and DB2 guys recommend those settings, and without them even a millisecond of network glitch could disturb things unreasonably.

Hi,

My mount options include hard and intr.

George
On Sep12, 2011, at 06:30 , George Barnett wrote:
> On 10/09/2011, at 1:30 AM, Bernd Helmle wrote:
>
>> --On 9. September 2011 10:27:22 -0400 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>>> On the whole I think you'd be better off lobbying your NFS implementors
>>> to provide something closer to the behavior of every other filesystem on
>>> the planet. Or checking to see if you need to adjust your NFS
>>> configuration, as the other responders mentioned.
>>
>> You really need at least mount options 'hard' _and_ 'nointr' on NFS mounts, otherwise you are out of luck. Oracle and DB2 guys recommend those settings, and without them even a millisecond of network glitch could disturb things unreasonably.
>
> My mount options include hard and intr.

If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.

Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.

best regards,
Florian Pflug
On 12/09/2011, at 3:59 PM, Florian Pflug wrote:

> If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
>
> Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.

Hi Florian,

You are indeed correct. Setting nointr also resolves my issue. I could swear I checked this, but obviously not.

It does still concern me that pgsql did not deal with this as gracefully as other software. I hope the list will consider a patch to resolve that.

Thanks in advance,

George
On mån, 2011-09-12 at 16:46 +1000, George Barnett wrote:
> On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
>
>> If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
>>
>> Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.
>
> You are indeed correct. Setting nointr also resolves my issue. I could swear I checked this, but obviously not.
>
> It does still concern me that pgsql did not deal with this as gracefully as other software. I hope the list will consider a patch to resolve that.

We have signal handling configured so that system calls are not interrupted. So there is ordinarily no reason to do anything more graceful. The problem is that NFS is in this case not observing that setting. It's debatable whether it's worth supporting that; just saying that the code is correct as it stands.
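For reference, the "signal handling configured so that system calls are not interrupted" is the SA_RESTART flag to sigaction(). The following is a minimal, hypothetical sketch of installing a handler that way; it mirrors the effect of PostgreSQL's signal setup but is not the actual source:

    /* Hypothetical sketch: install a SIGINT handler with SA_RESTART so that
     * slow system calls such as write() are restarted by the kernel instead
     * of failing with EINTR. Illustration only; not PostgreSQL source. */
    #include <signal.h>
    #include <stddef.h>

    static volatile sig_atomic_t got_sigint = 0;

    static void
    handle_sigint(int signo)
    {
        (void) signo;
        got_sigint = 1;             /* just set a flag; act on it later */
    }

    static int
    install_restarting_handler(void)
    {
        struct sigaction act;

        act.sa_handler = handle_sigint;
        sigemptyset(&act.sa_mask);
        act.sa_flags = SA_RESTART;  /* restart interrupted system calls */

        return sigaction(SIGINT, &act, NULL);
    }

Note, however, that as Florian explains below, SA_RESTART only suppresses the EINTR return for calls that had not yet transferred any data; a write() interrupted after transferring some bytes still returns a short count.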
On Mon, Sep 12, 2011 at 04:46:53PM +1000, George Barnett wrote:
> On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
>
>> If you really meant to say "intr" there (and not "nointr") then that probably explains the partial writes.
>>
>> Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.
>
> You are indeed correct. Setting nointr also resolves my issue. I could swear I checked this, but obviously not.
>
> It does still concern me that pgsql did not deal with this as gracefully as other software. I hope the list will consider a patch to resolve that.
>
> Thanks in advance,
>
> George

Hi George,

Many, many, many other software packages expect I/O usage to be the same on an NFS volume and a local disk volume, including Oracle. Coding every application, or more likely mis-coding, to handle this gives every application another chance to get it wrong. If the OS does this and gets it right, all of the apps get it right. I think you should be surprised when other software actually deals with broken I/O semantics gracefully, rather than concerned when one of a pantheon of programs does not. My two cents.

Regards,
Ken
On Sep12, 2011, at 14:54 , Peter Eisentraut wrote:
> On mån, 2011-09-12 at 16:46 +1000, George Barnett wrote:
>> On 12/09/2011, at 3:59 PM, Florian Pflug wrote:
>>> Still, I agree with Noah and Kevin that we ought to deal more gracefully with this, i.e. resubmit after a partial read() or write(). AFAICS there's nothing to be gained by not doing that, and the increase in code complexity should be negligible. If we do that, however, I believe we might as well handle EINTR correctly, even if SA_RESTART should prevent us from ever seeing that.
>>
>> It does still concern me that pgsql did not deal with this as gracefully as other software. I hope the list will consider a patch to resolve that.
>
> We have signal handling configured so that system calls are not
> interrupted. So there is ordinarily no reason to do anything more
> graceful. The problem is that NFS is in this case not observing that
> setting. It's debatable whether it's worth supporting that; just saying
> that the code is correct as it stands.

SA_RESTART doesn't protect against partial reads/writes due to signal delivery, it only removes the need to check for EINTR. In other words, it retries until at least one byte has been written, not until all bytes have been written.

The GNU libc documentation has this to say on the subject:

"There is one situation where resumption never happens no matter which choice you make: when a data-transfer function such as read or write is interrupted by a signal after transferring part of the data. In this case, the function returns the number of bytes already transferred, indicating partial success."[1]

While it's true that reads and writes are by tradition non-interruptible, I personally wouldn't bet that they'll stay that way forever. It all depends on whether the timeouts involved in the communication with a disk are short enough to mask the difference - once they get too long (or even infinite, as in the case of "hard" NFS mounts), you pay for non-interruptible primitives with un-killable stuck processes. Since the current trend is to move storage further away from processing, and to put non-deterministic networks like ethernet between the two, I expect situations in which read/write primitives are interruptible to increase, not decrease.

And BTW, the GNU libc documentation doesn't seem to mention anything about local reads and writes being non-interruptible. In fact, it even says:

"A signal can arrive and be handled while an I/O primitive such as open or read is waiting for an I/O device. If the signal handler returns, the system faces the question: what should happen next?"[1]

Had the GNU people faith in local reads and writes being non-interruptible, they'd probably have said "network device" or "remote device", not "I/O device".

best regards,
Florian Pflug

[1] http://www.gnu.org/s/hello/manual/libc/Interrupted-Primitives.html#Interrupted-Primitives
On Sep12, 2011, at 14:54 , ktm@rice.edu wrote:
> Many, many, many other software packages expect I/O usage to be the same on
> an NFS volume and a local disk volume, including Oracle. Coding every application,
> or more likely mis-coding, to handle this gives every application another chance
> to get it wrong. If the OS does this and gets it right, all of the apps get
> it right. I think you should be surprised when other software actually deals with
> broken I/O semantics gracefully, rather than concerned when one of a pantheon of
> programs does not. My two cents.

I don't buy that. People seem to be perfectly able to code correct networking applications (correct from a read/write API POV at least), yet those applications need to deal with partial reads and writes too. Really, it's not *that* hard to put a retry loop around "read" and "write".

Also, non-interruptible IO primitives are by no means "right". At best, they're a compromise between complexity and functionality for I/O devices with rather short (and bounded) communication timeouts - because in that case, processes are only blocked un-interruptibly for a short while.

best regards,
Florian Pflug
On Mon, Sep 12, 2011 at 9:39 AM, Florian Pflug <fgp@phlo.org> wrote:
> Really, it's not *that* hard to put a retry loop around "read" and "write".

+1. I don't see what we're gaining by digging in our heels on this one. Retry loops around read() and write() are pretty routine, and I wouldn't like to bet this is the only case where not having them could cause an unnecessary failure.

Now, that having been said, I *really* think we could use some better documentation on which mount options we believe to be safe, and not just for NFS. Right now we have practically nothing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[CC'ing the list again - I assume you omitted pgsql-hackers from the recipient list by accident]

On Sep13, 2011, at 03:00 , George Barnett wrote:
> On 12/09/2011, at 11:39 PM, Florian Pflug wrote:
>> Also, non-interruptible IO primitives are by no means "right". At best, they're
>> a compromise between complexity and functionality for I/O devices with rather
>> short (and bounded) communication timeouts - because in that case, processes are
>> only blocked un-interruptibly for a short while.
>
> Just to expand on that - I'm now in the situation where I can run my nfs mounts
> 'nointr' and postgres will work, but that means if I lose a storage unit I have
> a number of stuck processes, effectively meaning I need to reboot all my frontend
> servers before I can fail over to backup nfs stores.
>
> However, if I run the mounts with intr, then if a storage unit fails, I can fail
> over to a backup node (taking a minor loss of data hit I'm willing to accept) but
> postgres breaks under a moderate insert load.
>
> With the patch I supplied though, I'm able to have most of my cake and eat it.
>
> I'd be very interested in moving this forward - is there something I can change
> in the patch to make it more acceptable for a merge?

Here are a few comments.

Tom already remarked that if we do that for write()s, we ought to do it for read()s also, which I agree with. All other primitives like lseek, close, ... should be taken care of by SA_RESTART, but it'd be a good idea to verify that.

Also, I don't think that POSIX mandates that errno be reset to 0 if a function returns successfully, making that "returnCode == 0 && errno == 0" check pretty dubious. I'm not sure if this was what Tom was getting at with his remark about the ENOSPC handling being wrong in the retry case.

And I also think that if we do this, we might as well handle EINTR correctly, even if our use of SA_RESTART should prevent us from ever seeing that. The rules surrounding EINTR and SA_RESTART for read/write are quite subtle...

If we retry, shouldn't we do CHECK_FOR_INTERRUPTS? Otherwise, processes waiting for a vanished NFS server would be killable only with SIGKILL, not SIGTERM or SIGINT. But I'm not sure if it's safe to put that into a generic function like pg_write_nointr.

Finally, WriteAll() seems like a poor name for that function. How about pg_write_nointr()?

Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar (but obviously without the ENOSPC handling):

int
pg_write_nointr(int fd, const void *bytes, Size amount)
{
    int written = 0;

    while (amount > 0)
    {
        int ret;

        ret = write(fd, bytes, amount);
        if ((ret < 0) && (errno == EINTR))
        {
            /* interrupted by signal before first byte was written. Retry */

            /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
            CHECK_FOR_INTERRUPTS();

            continue;
        }
        else if (ret < 0)
        {
            /* error occurred. Abort */
            return -1;
        }
        else if (ret == 0)
        {
            /* out of disk space. Abort */
            return written;
        }

        /* made progress */

        /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
        CHECK_FOR_INTERRUPTS();

        written += ret;
        amount -= ret;
        bytes = (const char *) bytes + ret;
    }

    return written;     /* all bytes written */
}

best regards,
Florian Pflug
On Sep13, 2011, at 13:07 , Florian Pflug wrote:
> Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
> (but obviously without the ENOSPC handling)
>
> <wrong pg_write_nointr implementation snipped>

Sorry for the self-reply. I realized only after hitting send that I got the ENOSPC handling wrong again - we probably ought to check for ENOSPC as well as ret == 0. Also, it seems preferable to return the number of bytes actually written instead of -1 if we hit an error during retry.

With this version, any return value other than <amount> signals an error, the number of actually written bytes is reported even in the case of an error (to the best of pg_write_nointr's knowledge), and errno always indicates the kind of error.

int
pg_write_nointr(int fd, const void *bytes, Size amount)
{
    int written = 0;

    while (amount > 0)
    {
        int ret;

        ret = write(fd, bytes, amount);
        if ((ret < 0) && (errno == EINTR))
        {
            /* interrupted by signal before first byte was written. Retry */

            /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
            CHECK_FOR_INTERRUPTS();

            continue;
        }
        else if (ret < 1)
        {
            /* error occurred. Abort */
            if (ret == 0)
                /* out of disk space */
                errno = ENOSPC;

            if (written == 0)
                return -1;
            else
                return written;
        }

        /* made progress */
        written += ret;
        amount -= ret;
        bytes = (const char *) bytes + ret;

        /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
        CHECK_FOR_INTERRUPTS();
    }

    return written;     /* all bytes written */
}

best regards,
Florian Pflug
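For completeness, a hypothetical pg_read_nointr along the lines Florian describes might look like the sketch below. The name, the CHECK_FOR_INTERRUPTS placement, and the return contract are assumptions carried over from the write version above, not code posted to the thread; note that for read() a zero return means end-of-file rather than out of disk space:

int
pg_read_nointr(int fd, void *bytes, Size amount)
{
    /* Hypothetical read counterpart, following the conventions of the
     * pg_write_nointr sketch above; assumed, not committed code. */
    int nread = 0;

    while (amount > 0)
    {
        int ret;

        ret = read(fd, bytes, amount);
        if ((ret < 0) && (errno == EINTR))
        {
            /* interrupted by signal before first byte was read. Retry */
            CHECK_FOR_INTERRUPTS();
            continue;
        }
        else if (ret < 0)
        {
            /* error occurred; errno was set by read(). Abort */
            if (nread == 0)
                return -1;
            else
                return nread;
        }
        else if (ret == 0)
        {
            /* end of file: unlike for write(), a zero return is not an error */
            return nread;
        }

        /* made progress */
        nread += ret;
        amount -= ret;
        bytes = (char *) bytes + ret;

        CHECK_FOR_INTERRUPTS();
    }

    return nread;
}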
On Tue, Sep 13, 2011 at 01:30:34PM +0200, Florian Pflug wrote:
> On Sep13, 2011, at 13:07 , Florian Pflug wrote:
>> Here's my suggested implementation for pg_write_nointr. pg_read_nointr should be similar
>> (but obviously without the ENOSPC handling)
>>
>> <wrong pg_write_nointr implementation snipped>
>
> Sorry for the self-reply. I realized only after hitting send that I
> got the ENOSPC handling wrong again - we probably ought to check for
> ENOSPC as well as ret == 0. Also, it seems preferable to return the
> number of bytes actually written instead of -1 if we hit an error during
> retry.
>
> With this version, any return value other than <amount> signals an
> error, the number of actually written bytes is reported even in the
> case of an error (to the best of pg_write_nointr's knowledge), and
> errno always indicates the kind of error.
>
> int
> pg_write_nointr(int fd, const void *bytes, Size amount)
> {
>     int written = 0;
>
>     while (amount > 0)
>     {
>         int ret;
>
>         ret = write(fd, bytes, amount);
>         if ((ret < 0) && (errno == EINTR))
>         {
>             /* interrupted by signal before first byte was written. Retry */
>
>             /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
>             CHECK_FOR_INTERRUPTS();
>
>             continue;
>         }
>         else if (ret < 1)
>         {
>             /* error occurred. Abort */
>             if (ret == 0)
>                 /* out of disk space */
>                 errno = ENOSPC;
>
>             if (written == 0)
>                 return -1;
>             else
>                 return written;
>         }
>
>         /* made progress */
>         written += ret;
>         amount -= ret;
>         bytes = (const char *) bytes + ret;
>
>         /* XXX: Is it safe to call CHECK_FOR_INTERRUPTS here? */
>         CHECK_FOR_INTERRUPTS();
>     }
>
>     return written;
> }
>
> best regards,
> Florian Pflug

It will be interesting to see if there are any performance ramifications to this new write function.

Regards,
Ken
On Sep13, 2011, at 14:58 , ktm@rice.edu wrote:
> It will be interesting to see if there are any performance ramifications to
> this new write function.

What would those be? For non-interruptible reads and writes, the overhead comes down to an additional function call (if we don't make pg_write_nointr inlined) and a few conditional jumps (which branch prediction should be able to take care of). These are bound to disappear in the noise compared to the cost of the actual syscall.

best regards,
Florian Pflug
On Tue, Sep 13, 2011 at 03:02:57PM +0200, Florian Pflug wrote:
> On Sep13, 2011, at 14:58 , ktm@rice.edu wrote:
> > It will be interesting to see if there are any performance ramifications to
> > this new write function.
>
> What would those be? For non-interruptible reads and writes, the overhead
> comes down to an additional function call (if we don't make pg_write_nointr
> inlined) and a few conditional jumps (which branch prediction should be
> able to take care of). These are bound to disappear in the noise compared
> to the cost of the actual syscall.

That would be my expectation too. It is just always nice to benchmark changes, just in case. I have had similar simple changes blow out a cache and have a much greater impact on performance than might be expected from inspection. :)

Regards,
Ken
On Tue, Sep 13, 2011 at 7:30 AM, Florian Pflug <fgp@phlo.org> wrote:
>
> Sorry for the self-reply. I realized only after hitting send that I
> got the ENOSPC handling wrong again - we probably ought to check for
> ENOSPC as well as ret == 0. Also, it seems preferable to return the
> number of bytes actually written instead of -1 if we hit an error during
> retry.
>
> With this version, any return value other than <amount> signals an
> error, the number of actually written bytes is reported even in the
> case of an error (to the best of pg_write_nointr's knowledge), and
> errno always indicates the kind of error.

Personally, I'd think that's ripe for bugs. If the contract is that ret != amount is the "error" case, then don't return -1 for an error *sometimes*. If you sometimes return -1 for an error, even though ret != amount is the *real* test, I'm going to guess there will be lots of chances for code to do:

if (pg_write_nointr(...) < 0)
    ...

which will only catch some of the errors, and happily continue with the rest...

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
> On Tue, Sep 13, 2011 at 7:30 AM, Florian Pflug <fgp@phlo.org> wrote:
>> Sorry for the self-reply. I realized only after hitting send that I
>> got the ENOSPC handling wrong again - we probably ought to check for
>> ENOSPC as well as ret == 0. Also, it seems preferable to return the
>> number of bytes actually written instead of -1 if we hit an error during
>> retry.
>>
>> With this version, any return value other than <amount> signals an
>> error, the number of actually written bytes is reported even in the
>> case of an error (to the best of pg_write_nointr's knowledge), and
>> errno always indicates the kind of error.
>
> Personally, I'd think that's ripe for bugs. If the contract is that
> ret != amount is the "error" case, then don't return -1 for an error
> *sometimes*.

Hm, but isn't that how write() works also? AFAIK (non-interruptible) write() will return the number of bytes written, which may be less than the requested number if there's not enough free space, or -1 in case of an error like an invalid fd being passed.

> If you sometimes return -1 for an error, even though ret != amount is
> the *real* test, I'm going to guess there will be lots of chances for
> code to do:
> if (pg_write_nointr(...) < 0)
>     ...
>
> which will only catch some of the errors, and happily continue with the rest...

Yeah, but that's equally wrong for plain write(), so I'm not sure I share your concern there.

Also, I'm not sure how to improve that. We could always return -1 in case of an error, and "amount" in case of success, but that makes it impossible to determine how many bytes were actually written (and also feels awkward). Or we could return 0 instead of -1 if there was an error and zero bytes were written. But that feels awkward also...

One additional possibility would be to make the signature

boolean pg_write_nointr(int fd, const void *bytes, int len, int *written)

and simply return true on success and false on error. Callers who're interested in the number of bytes actually written (in the case of an error) would need to pass some non-NULL pointer for <written>, while all others would just pass NULL.

Thoughts?

best regards,
Florian Pflug
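To make the proposed contract concrete, a hypothetical sketch of that boolean variant follows; the retry logic is carried over from the versions above, and nothing here is committed code:

bool
pg_write_nointr(int fd, const void *bytes, int len, int *written)
{
    /* Hypothetical boolean variant with the signature proposed above:
     * returns true on full success, false on any error, and reports the
     * byte count through *written when it is non-NULL. Assumed code. */
    int done = 0;

    while (done < len)
    {
        int ret = write(fd, (const char *) bytes + done, len - done);

        if (ret < 0 && errno == EINTR)
            continue;               /* interrupted before any byte was written */

        if (ret <= 0)
        {
            if (ret == 0)
                errno = ENOSPC;     /* zero return taken as out of disk space */
            if (written != NULL)
                *written = done;    /* partial progress, for callers who care */
            return false;
        }

        done += ret;                /* partial write: advance and retry */
    }

    if (written != NULL)
        *written = done;
    return true;
}

With this shape, a caller that doesn't care about the partial count simply tests !pg_write_nointr(fd, buf, len, NULL) and reports the error from errno.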
Florian Pflug <fgp@phlo.org> writes:
> On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
>> Personally, I'd think that's ripe for bugs. If the contract is that
>> ret != amount is the "error" case, then don't return -1 for an error
>> *sometimes*.

> Hm, but isn't that how write() works also?

Yeah. It's not possible to maintain the same error-reporting contract that bare write() has got, unless you're willing to forget about actual errors reported by a non-first write attempt. Which might not be totally unreasonable, because presumably something similar is going on under the hood within write() itself. Most of the errors one might think are worth reporting would have had to occur on the first write attempt anyway.

But if you do want to report such errors, I think you have to push the error reporting logic into the subroutine, which seems a bit messy since there's quite a variety of error message phrasings out there, all of which require information that write() itself does not have. Also, we do *not* want e.g. gettext() to be invoked unless an error actually occurs and has to be reported.

regards, tom lane
On Tue, Sep 13, 2011 at 10:14 AM, Florian Pflug <fgp@phlo.org> wrote:

>> Personally, I'd think that's ripe for bugs. If the contract is that
>> ret != amount is the "error" case, then don't return -1 for an error
>> *sometimes*.
>
> Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
> will return the number of bytes written, which may be less than the requested
> number if there's not enough free space, or -1 in case of an error like
> an invalid fd being passed.

Looking through the code, it appears as if all the write calls I've seen are checking ret != amount, so it's probably not as big a deal for PG as I fear...

But there is a subtle change in semantics: with the system write(), ret != amount is not necessarily a real error (hence no errno set), while with pg_write, ret != amount only happens after a "real error" (errno should be set). That could yet lead to confusion.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
On Sep13, 2011, at 16:25 , Tom Lane wrote:
> Florian Pflug <fgp@phlo.org> writes:
>> On Sep13, 2011, at 15:05 , Aidan Van Dyk wrote:
>>> Personally, I'd think that's ripe for bugs. If the contract is that
>>> ret != amount is the "error" case, then don't return -1 for an error
>>> *sometimes*.
>
>> Hm, but isn't that how write() works also?
>
> Yeah. It's not possible to maintain the same error-reporting contract
> that bare write() has got, unless you're willing to forget about actual
> errors reported by a non-first write attempt.

Hm, yeah, but we're only replacing the exclusive or in "either sets errno *or* returns >= 0 and < amount" by a non-exclusive one. Which, in practice, doesn't make much difference for callers. They can (and should) continue to check whether the correct amount of bytes has been written, and they may still use errno to distinguish different kinds of errors. They should just do so upon any error condition, not merely upon us returning -1.

The important thing, I believe, is that we don't withhold any information from callers, which we don't. If write() sets errno, it must return -1, so we'll abort and hence leave the errno in place to be inspected by the caller. And we faithfully track the actual number of bytes written. Or am I missing something?

> But if you do want to report such errors, I think you have to push the
> error reporting logic into the subroutine, which seems a bit messy since
> there's quite a variety of error message phrasings out there, all of
> which require information that write() itself does not have. Also, we
> do *not* want e.g. gettext() to be invoked unless an error actually
> occurs and has to be reported.

Yeah, I had the same idea (moving the error reporting into the subroutine) when I first looked at the OP's patch, but then figured it'd just complicate the API for no good reason.

best regards,
Florian Pflug
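Under the contract Florian describes, a call site would check the byte count rather than the sign of the return value, consulting errno for the cause of any shortfall. A hypothetical sketch (the report_* helpers are invented for illustration, not PostgreSQL functions):

    /* Hypothetical call site for the int-returning pg_write_nointr above:
     * any result other than the requested amount indicates a problem, and
     * errno says what kind (ENOSPC for a zero-byte write). */
    if (pg_write_nointr(fd, buffer, amount) != amount)
    {
        if (errno == ENOSPC)
            report_out_of_space();      /* placeholder */
        else
            report_write_error(errno);  /* placeholder */
    }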
On Tue, Sep 13, 2011 at 10:32:01AM -0400, Aidan Van Dyk wrote:
> On Tue, Sep 13, 2011 at 10:14 AM, Florian Pflug <fgp@phlo.org> wrote:
>
>>> Personally, I'd think that's ripe for bugs. If the contract is that
>>> ret != amount is the "error" case, then don't return -1 for an error
>>> *sometimes*.
>>
>> Hm, but isn't that how write() works also? AFAIK (non-interruptible) write()
>> will return the number of bytes written, which may be less than the requested
>> number if there's not enough free space, or -1 in case of an error like
>> an invalid fd being passed.
>
> Looking through the code, it appears as if all the write calls I've
> seen are checking ret != amount, so it's probably not as big a deal
> for PG as I fear...
>
> But there is a subtle change in semantics: with the system write(),
> ret != amount is not necessarily a real error (hence no errno set),
> while with pg_write, ret != amount only happens after a "real error"
> (errno should be set). That could yet lead to confusion.

I assume there is no TODO here.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ It's impossible for everything to be true. +