Thread: "make check" fails over NFS or tmpfs
Hi,

We've encountered failures of "make check" when we put the PostgreSQL data
directory on an NFS filesystem or a tmpfs filesystem. It doesn't always fail,
but it fails occasionally. Is this expected behavior of PostgreSQL? If it's
expected, what is the reason for this symptom?

I grep'ed the source code of PostgreSQL, but it seems it doesn't use
operations that are problematic for NFS, like flock(2) or F_SETLK/F_SETLKW of
fcntl(2)... So, I guess (theoretically) it should work fine over NFS or
tmpfs. The only idea that strikes me is that there is some nasty bug in
Linux. ;-)

Of course, we are using a single instance of PostgreSQL on a single machine,
i.e. we are NOT accessing the data directory from either multiple machines or
multiple PostgreSQL instances.

To give an actual example: when we invoked the following shell script,

$ cat ~/regress-loop.sh
#!/bin/sh
loop=1
make clean
while true; do
	echo "############### loop = $loop ##################"
	make check
	ret=$?
	if [ $ret -ne 0 ]; then
		echo "error @ loop = $loop (return value = $ret)"
		exit $ret
	fi
	loop=`expr $loop + 1`
done

errors like the following happened sometimes:

$ sh ~/regress-loop.sh
 :
 :
make: *** [check] Error 2
error @ loop = 26 (return value = 2)

We observed this symptom under the following conditions:

1. putting PGDATA on an NFS (async) filesystem:
	NFS client:
		PostgreSQL version: 8.1.3
		OS version: Fedora Core 3 Linux
	NFS server:
		OS version: Fedora Core 3 Linux
		"async" is specified in /etc/exports, thus the server violates
		the NFS protocol and replies to requests before it stores
		changes to its disk.
	How many loops until it fails: 3000 loops or more

2. putting PGDATA on an NFS filesystem:
	NFS client:
		PostgreSQL version: 8.1.3
		OS version: Fedora Core 4 Linux
	NFS server:
		OS version: Fedora Core 5 Linux
	How many loops until it fails: approximately 300 loops

3. putting PGDATA on a tmpfs filesystem:
	PostgreSQL version: 8.1.3
	OS version: Fedora Core 5 Linux
	How many loops until it fails: approximately 100 loops

This symptom never happens over ext3fs, as far as we can see.

I attached the diff between expected results and actual results to this mail.
Any ideas appreciated, except using a local filesystem. ;-)
-- 
SODA Noriyuki

*** ./expected/tablespace.out	Tue May 16 13:03:24 2006
--- ./results/tablespace.out	Fri May 19 21:04:30 2006
***************
*** 35,37 ****
--- 35,38 ----
  NOTICE:  drop cascades to table testschema.foo
  -- Should succeed
  DROP TABLESPACE testspace;
+ ERROR:  tablespace "testspace" is not empty

======================================================================

*** ./expected/tablespace.out	Fri May 19 15:28:32 2006
--- ./results/tablespace.out	Sat May 20 06:13:18 2006
***************
*** 35,37 ****
--- 35,38 ----
  NOTICE:  drop cascades to table testschema.foo
  -- Should succeed
  DROP TABLESPACE testspace;
+ ERROR:  tablespace "testspace" is not empty

======================================================================

*** ./expected/sanity_check.out	Fri Sep  9 05:07:42 2005
--- ./results/sanity_check.out	Fri May 19 16:31:37 2006
***************
*** 17,22 ****
--- 17,24 ----
  circle_tbl | t
  fast_emp4000 | t
  func_index_heap | t
+ gcircle_tbl | t
+ gpolygon_tbl | t
  hash_f8_heap | t
  hash_i4_heap | t
  hash_name_heap | t
***************
*** 68,74 ****
  shighway | t
  tenk1 | t
  tenk2 | t
! (58 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have
--- 70,76 ----
  shighway | t
  tenk1 | t
  tenk2 | t
! (60 rows)
  
  --
  -- another sanity check: every system catalog that has OIDs should have

======================================================================
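For reference, the "async" behaviour described in case 1 above is controlled
per export in the server's /etc/exports. The entries below are only
hypothetical illustrations; the host name and export path are placeholders:

    # NFS-compliant behaviour: the server commits changes before replying
    /export/pgdata   client.example.com(rw,sync)

    # the shortcut used in case 1, which violates the protocol
    /export/pgdata   client.example.com(rw,async)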
SODA Noriyuki <soda@sra.co.jp> writes:
> We've encountered failures of "make check" when we put the PostgreSQL
> data directory on an NFS filesystem or a tmpfs filesystem.
> It doesn't always fail, but fails occasionally.

Is the NFS filesystem mounted fail-soft?

As a rule, database people will tell you to just go away if you admit to
running a database over NFS. Its idea of reliability is at least an order of
magnitude worse than what we consider acceptable. But hard mounting helps.

			regards, tom lane
Thanks for the reply.

>>>>> On Mon, 22 May 2006 01:05:38 -0400, Tom Lane <tgl@sss.pgh.pa.us> said:

> SODA Noriyuki <soda@sra.co.jp> writes:
>> We've encountered failures of "make check" when we put the PostgreSQL
>> data directory on an NFS filesystem or a tmpfs filesystem.
>> It doesn't always fail, but fails occasionally.

> Is the NFS filesystem mounted fail-soft?

It's mounted with the "hard" option.

> As a rule, database people will tell you to just go away if you
> admit to running a database over NFS. Its idea of reliability
> is at least an order of magnitude worse than what we consider
> acceptable.

Yeah ;) But weren't you surprised by the fact that tmpfs doesn't work either,
and fails more quickly than NFS?

Anyway, thanks for the reply.
-- 
SODA Noriyuki
SODA Noriyuki <soda@sra.co.jp> writes:

> NFS server:
>     OS version: Fedora Core 3 Linux
>     "async" is specified in /etc/exports, thus the server violates
>     the NFS protocol and replies to requests before it stores
>     changes to its disk.

The reason the protocol is specified the way it is is that it's the only way
to guarantee the semantics match the traditional Unix semantics of a local
filesystem.

That said, I would have expected a good NFS server to still live up to
everything important as long as the server doesn't actually crash or get shut
down at any point.

I certainly would have expected tmpfs to live up to the traditional Unix
filesystem semantics.

> *** 35,37 ****
> --- 35,38 ----
>   NOTICE:  drop cascades to table testschema.foo
>   -- Should succeed
>   DROP TABLESPACE testspace;
> + ERROR:  tablespace "testspace" is not empty

This one looks like the unlink is returning before it completes, and then
subsequent operations (perhaps only if they come from other processes?) are
allowed to see the old filesystem state. That really ought not ever happen,
even with async, and certainly not on tmpfs.

This might bear some further testing. Can you send the exact commands you
used to set up the tmpfs filesystem? Also, it might be worth checking whether
Fedora Core 3 has any relevant known bugs.

> ======================================================================
>
> *** ./expected/sanity_check.out	Fri Sep  9 05:07:42 2005
> --- ./results/sanity_check.out	Fri May 19 16:31:37 2006
> ***************
> *** 17,22 ****
> --- 17,24 ----
>   circle_tbl | t
>   fast_emp4000 | t
>   func_index_heap | t
> + gcircle_tbl | t
> + gpolygon_tbl | t
>   hash_f8_heap | t
>   hash_i4_heap | t
>   hash_name_heap | t

This seems pretty mystifying. Perhaps it's leftover stuff from the tablespace
that failed to get dropped?

-- 
greg
On Mon, 2006-05-22 at 06:15, SODA Noriyuki wrote:

Hei

> We've encountered failures of "make check" when we put the PostgreSQL
> data directory on an NFS filesystem or a tmpfs filesystem.
> It doesn't always fail, but fails occasionally.

Having a database on an NFS filesystem is a disaster waiting to happen,
especially if the NFS server is running on Linux, not to mention the
performance penalty of running the database via NFS.

I would say that anything is better than NFS for running a database. But if
you absolutely have to use NFS, run NFS via TCP, not UDP, use hard mounts and
turn off all caching. On the server side we are talking about the 'sync' and
'no_wdelay' parameters, and on the client about 'bg', 'hard', 'intr', 'noac'
and 'tcp'; throughput will probably improve by increasing rsize and wsize to
16384 or even 32768.

regards,
-- 
Rafael Martinez, <r.m.guerrero@usit.uio.no>
Center for Information Technology Services
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/
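To make the client-side suggestions above concrete, a mount invocation could
look like the following. This is only an illustration: the server name,
export path and mount point are placeholders, and the option values should be
tuned to the actual environment:

    mount -t nfs -o bg,hard,intr,noac,tcp,rsize=32768,wsize=32768 \
        nfsserver:/export/pgdata /var/lib/pgsql/data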
>>>>> On 22 May 2006 03:00:55 -0400, Greg Stark <gsstark@mit.edu> said:

> That said, I would have expected a good NFS server to still live up to
> everything important as long as the server doesn't actually crash or get
> shut down at any point.
> I certainly would have expected tmpfs to live up to the traditional Unix
> filesystem semantics.

Yeah. That was what I hoped, although it was too optimistic.

>> *** 35,37 ****
>> --- 35,38 ----
>>   NOTICE:  drop cascades to table testschema.foo
>>   -- Should succeed
>>   DROP TABLESPACE testspace;
>> + ERROR:  tablespace "testspace" is not empty

> This one looks like the unlink is returning before it completes, and then
> subsequent operations (perhaps only if they come from other processes?) are
> allowed to see the old filesystem state. That really ought not ever happen,
> even with async, and certainly not on tmpfs.

> This might bear some further testing. Can you send the exact
> commands you used to set up the tmpfs filesystem?

These two "tablespace not empty" results both came from NFS mounts.
(I should have said that explicitly, sorry.)
So, does this mean the REMOVE RPC may sometimes overtake other RPCs? Hmm...

FWIW, no options were specified when mounting the tmpfs.

>> *** ./expected/sanity_check.out	Fri Sep  9 05:07:42 2005
>> --- ./results/sanity_check.out	Fri May 19 16:31:37 2006
>> ***************
>> *** 17,22 ****
>> --- 17,24 ----
>>   circle_tbl | t
>>   fast_emp4000 | t
>>   func_index_heap | t
>> + gcircle_tbl | t
>> + gpolygon_tbl | t
>>   hash_f8_heap | t
>>   hash_i4_heap | t
>>   hash_name_heap | t

> This seems pretty mystifying. Perhaps it's leftover stuff from the
> tablespace that failed to get dropped?

No. This is a result from tmpfs, and before this failure "make check" had
passed almost 100 times on this tmpfs.

It seems your explanation can describe what's happening in the NFS case,
though. Thanks!
-- 
soda
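For reference, a tmpfs mount with no special options, as described above, is
typically created along these lines; the mount point and the optional size
limit are placeholders:

    mount -t tmpfs tmpfs /mnt/pgtest
    # or, with an explicit size cap:
    mount -t tmpfs -o size=1g tmpfs /mnt/pgtest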
"Rafael Martinez, Guerrero" <r.m.guerrero@usit.uio.no> writes: > I would say that anything is better than NFS for running a database. But > if you absolutely have to use NFS, run NFS via TCP not UDP, use hard and > turn off all cache ..... In the server side we are talking about 'sync' > and 'no_wdelay' parameters and in the client about > 'bg','hard','intr','noac' and 'tcp', probably the throughput will > improve by increasing rsize and wsize to 16384 or even 32768. Using TCP with NFS is only really helpful when you have a high latency high bandwidth link which isn't going to be a terribly positive environment for postgres. I'm not sure about all your other recommendations either, they strike me as a bit cargo-cultish. Certainly not mounting your filesystem soft will protect against unknowingly losing data if your server crashes, and boosting wsize and rsize will help though the optimal values will depend on your specific environment. But the others shouldn't be terribly relevant -- hell, bg only affects the actual mount operation. While I'm leery about recommending any network filesystem for anything that depends on the filesystem as heavily as a database, of all the network filesystems NFS takes the most care to maintain solid semantics. The main problem is that people are always looking for new and interesting ways to defeat those semantics with options like soft mounts. Certainly I can't agree with "anything is better than NFS", what would you recommend, samba? Now that I've read up on what "async" does it seems like the errors are a pretty predictable consequence. Making directory operations asynchronous is going to break a LOT of things. Most Unix mail servers, for example, also depend on directory operations being synchronous. I would expect "async" to cause Postgres errors on any filesystem that supports it. "async" "intr" and "soft" seem like the real foot-guns here. -- greg
On Monday, 22 May 2006 at 09:17, Rafael Martinez, Guerrero wrote:

> Having a database on an NFS filesystem is a disaster waiting to happen,
> especially if the NFS server is running on Linux, not to mention the
> performance penalty of running the database via NFS.

With all due respect, this is a bunch of FUD. When used properly, NFS is
perfectly safe and well-performing for a PostgreSQL database or any other
application.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/
On Mon, 2006-05-22 at 10:25, Greg Stark wrote:

> Using TCP with NFS is only really helpful when you have a high-latency,
> high-bandwidth link, which isn't going to be a terribly positive environment
> for Postgres anyway.

Well, having a protocol that by definition says datagrams may arrive out of
order or go missing without notice does not sound like a good thing to have a
database running on.

[.......]

> environment. But the others shouldn't be terribly relevant -- hell, bg only
> affects the actual mount operation.

The result of using cut & paste ;) Not directly relevant to Postgres, but
nice to have when mounting the NFS directory.

> While I'm leery about recommending any network filesystem for anything that
> depends on the filesystem as heavily as a database, of all the network
> filesystems NFS takes the most care to maintain solid semantics. The main
> problem is that people are always looking for new and interesting ways to
> defeat those semantics with options like soft mounts. Certainly I can't
> agree with "anything is better than NFS" -- what would you recommend, Samba?

Samba? :-) Not at all. It was a way of saying how bad an idea it is to run a
database via NFS if you want reliability and performance. Not everybody
agrees with this, but well, they can do what they want with their data.

> "async", "intr" and "soft" seem like the real foot-guns here.

Why do you think 'intr' is a bad thing? From the man pages:

    "........ If an NFS file operation has a major timeout and it is
    hard mounted, then allow signals to interrupt the file operation and
    cause it to return EINTR to the calling program. The default is to not
    allow file operations to be interrupted ....."

This will be like an error reported by the filesystem; the program will get
the information and can take care of the problem, instead of waiting
indefinitely for a response that is not coming and probably leaving the
database in an inconsistent state.

With 'noac' I was thinking about two processes trying to access the same file
at the same time; better not to have a cache in our way that hides the real
state of the file from other processes.

-- 
Rafael Martinez, <r.m.guerrero@usit.uio.no>
Center for Information Technology Services
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/
On Mon, 2006-05-22 at 11:00, Peter Eisentraut wrote:

> On Monday, 22 May 2006 at 09:17, Rafael Martinez, Guerrero wrote:
> > Having a database on an NFS filesystem is a disaster waiting to happen,
> > especially if the NFS server is running on Linux, not to mention the
> > performance penalty of running the database via NFS.
>
> With all due respect, this is a bunch of FUD. When used properly, NFS is
> perfectly safe and well-performing for a PostgreSQL database or any other
> application.

Well, I do not agree. NFS, especially if the server is running Linux, is not
perfectly safe and well-performing for a database system, particularly a busy
one.

Our experience with NFS on Linux is that it does not always work as it is
supposed to. Yes, it works OK in many cases, and yes, it works OK for some
types of applications, but it does not deliver the level of reliability and
integrity we want in our systems/databases.

regards,
-- 
Rafael Martinez, <r.m.guerrero@usit.uio.no>
Center for Information Technology Services
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/
On Mon, May 22, 2006 at 04:34:34PM +0900, SODA Noriyuki wrote:

> These two "tablespace not empty" results both came from NFS mounts.
> (I should have said that explicitly, sorry.)
> So, does this mean the REMOVE RPC may sometimes overtake other RPCs?
> Hmm...

Maybe an explicit fsync() is needed on the directory before trying to remove
it. Still, this seems like something the protocol should deal with.

There is one possibility, considering it's NFS. On normal Unix filesystems,
if you delete a file which is still open, the directory entry goes away and
you can rmdir the directory. However, due to the way NFS works, the file
can't really be deleted on the server: because NFS is stateless, the
connection and all open files should in principle survive a server restart.
Perhaps another PostgreSQL process still has a file open, and that is
preventing the rmdir from succeeding.

In that case, the issue should also manifest itself on Windows, since it too
does not permit the deletion of an open file (or maybe it does now).

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
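Spelling out the first suggestion: syncing a directory means opening the
directory itself and fsync()ing that file descriptor before the rmdir. The
sketch below is generic C, not code from the PostgreSQL tree; the path and
the error handling are illustrative only:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Flush pending directory operations to stable storage, then remove
     * the (now empty) directory.  Sketch only: the path is a placeholder. */
    static int
    remove_dir_synced(const char *path)
    {
        int fd = open(path, O_RDONLY);  /* directories can be opened read-only */

        if (fd < 0)
        {
            perror("open");
            return -1;
        }
        if (fsync(fd) != 0)             /* push the unlink()s of former contents */
            perror("fsync");            /* some filesystems reject fsync on a dir */
        close(fd);

        if (rmdir(path) != 0)
        {
            perror("rmdir");
            return -1;
        }
        return 0;
    }

    int
    main(void)
    {
        return remove_dir_synced("/tmp/testspace") == 0 ? 0 : 1;
    }

Whether the NFS client would actually wait for the server's reply to the
queued REMOVE RPCs at that point is exactly the open question in this thread.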
On Monday, 22 May 2006 at 11:30, Rafael Martinez, Guerrero wrote:

> Our experience with NFS on Linux is that it does not always work as it is
> supposed to. Yes, it works OK in many cases, and yes, it works OK for some
> types of applications, but it does not deliver the level of reliability and
> integrity we want in our systems/databases.

I have no experience with running NFS on Linux in a database environment, so
I grant you that this may not be a good choice. But that is a problem of the
server implementation, not of the protocol or the file system.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/
SODA Noriyuki <soda@sra.co.jp> writes:
> On 22 May 2006 03:00:55 -0400, Greg Stark <gsstark@mit.edu> said:
> + gcircle_tbl | t
> + gpolygon_tbl | t
>> This seems pretty mystifying. Perhaps it's leftover stuff from the
>> tablespace that failed to get dropped?
> No. This is a result from tmpfs, and before this failure "make check" had
> passed almost 100 times on this tmpfs.

It looks to me like this is just a possible result of sufficiently weird
timing. gcircle_tbl and gpolygon_tbl are temp tables made in the create_index
test, which runs just before sanity_check. If the backend running
create_index hadn't finished deleting its temp tables yet, they could still
be present when sanity_check looks in pg_class.

Curious that we've never seen this on any other platform, though.

			regards, tom lane
On Mon, 2006-05-22 at 04:12, Rafael Martinez, Guerrero wrote:
> On Mon, 2006-05-22 at 10:25, Greg Stark wrote:
> > Using TCP with NFS is only really helpful when you have a high-latency,
> > high-bandwidth link, which isn't going to be a terribly positive
> > environment for Postgres anyway.
>
> Well, having a protocol that by definition says datagrams may arrive out of
> order or go missing without notice does not sound like a good thing to have
> a database running on.
>
> [.......]
> > environment. But the others shouldn't be terribly relevant -- hell, bg
> > only affects the actual mount operation.
>
> The result of using cut & paste ;) Not directly relevant to Postgres, but
> nice to have when mounting the NFS directory.
>
> > While I'm leery about recommending any network filesystem for anything
> > that depends on the filesystem as heavily as a database, of all the
> > network filesystems NFS takes the most care to maintain solid semantics.
> > The main problem is that people are always looking for new and
> > interesting ways to defeat those semantics with options like soft mounts.
> > Certainly I can't agree with "anything is better than NFS" -- what would
> > you recommend, Samba?
>
> Samba? :-) Not at all. It was a way of saying how bad an idea it is to run
> a database via NFS if you want reliability and performance. Not everybody
> agrees with this, but well, they can do what they want with their data.

Given my experiences with Linux, NFS, and Samba in the past, I would say
Samba is a MUCH better choice for network file systems under Linux than NFS,
especially if you're using different kernel versions on the systems and
whatnot.

It seems that if you find the right kernel on both sides of a Linux-to-Linux
NFS setup, then it can be very stable. But that's only a small percentage of
the time. Most of the time I've had serious issues with Linux and NFS, and
I'm a big proponent of Linux in general. But the NFS implementation has
serious issues.
"Rafael Martinez, Guerrero" <r.m.guerrero@usit.uio.no> writes: > Why do you think 'intr' is a bad thing, from man pages: > " ........ If an NFS file operation has a major timeout and it is > hard mounted, then allow signals to interupt the file operation and > cause it to return EINTR to the calling program. The default is to not > allow file operations to be interrupted ....." > > This will be like an error reported by the filesystem, the program will > get the information and will take care of the problem instead of waiting > indefinitely for a respons not comming and having the database probably > in a nonconsistent state. Traditional file systems guaranteed it never happened, so older applications do not expect to have filesystem operations interrupted. Many do not check for it or do not handle it properly. I recall a conversation a while back about Postgres in particular not checking for it. > With 'noac' I was thinking about two processes trying to access the same > file at the same time, better not to have some cache in our way that > alter the real state of the file to other processes. The description of the option gave me the impression that this would only be an issue if your processes were on two different clients. -- greg
On Mon, May 22, 2006 at 12:52:33PM -0400, Greg Stark wrote:

> "Rafael Martinez, Guerrero" <r.m.guerrero@usit.uio.no> writes:
>
> > Why do you think 'intr' is a bad thing? From the man pages:
> >     "........ If an NFS file operation has a major timeout and it is
> >     hard mounted, then allow signals to interrupt the file operation and
> >     cause it to return EINTR to the calling program. The default is to
> >     not allow file operations to be interrupted ....."
>
> Traditional file systems guaranteed that never happened, so older
> applications do not expect to have filesystem operations interrupted. Many
> do not check for it, or do not handle it properly. I recall a conversation
> a while back about Postgres in particular not checking for it.

I've occasionally wondered if this is a SysV vs. BSD thing. Under SysV signal
semantics, any signal would cause the current system call to return EINTR.
The list of system calls that could be interrupted is long, and includes just
about anything filesystem-related. So programs with any kind of signal
handling would handle the broken-NFS case automatically.

BSD signal semantics (what Postgres uses) make all system calls restart
across signals. Thus, a system call can never return EINTR unless you have
non-blocking I/O enabled. These programs would be confused by unexpected
EINTRs.

Postgres doesn't check for EINTR on every filesystem system call and thus
would be susceptible to the above problem.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
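The SysV-versus-BSD behaviour described above maps onto the SA_RESTART flag
of sigaction(): with the flag, a blocking call is transparently restarted
after the handler returns; without it, the call fails with EINTR. A small
generic demonstration, not taken from the PostgreSQL sources (the choice of
signal and handler is illustrative only):

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static void
    handler(int signo)
    {
        (void) signo;           /* nothing to do; we only want the interruption */
    }

    int
    main(void)
    {
        struct sigaction sa;
        char buf[64];
        ssize_t n;

        sa.sa_handler = handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;        /* SysV-style: blocking calls fail with EINTR  */
        /* sa.sa_flags = SA_RESTART;  BSD-style: the read() below would simply */
        /*                            resume waiting after the handler returns */
        sigaction(SIGALRM, &sa, NULL);

        alarm(2);               /* deliver SIGALRM while read() is blocked     */
        n = read(STDIN_FILENO, buf, sizeof(buf));
        if (n < 0 && errno == EINTR)
            fprintf(stderr, "read() was interrupted: the caller must handle EINTR\n");
        else
            fprintf(stderr, "read() returned %zd\n", n);
        return 0;
    }

With sa_flags set to 0, the read() returns about two seconds in with errno
set to EINTR, which is exactly the case an unprepared program mishandles.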
Martijn van Oosterhout wrote:
> On Mon, May 22, 2006 at 12:52:33PM -0400, Greg Stark wrote:
>> "Rafael Martinez, Guerrero" <r.m.guerrero@usit.uio.no> writes:
>>
>>> Why do you think 'intr' is a bad thing? From the man pages:
>>>     "........ If an NFS file operation has a major timeout and it is
>>>     hard mounted, then allow signals to interrupt the file operation and
>>>     cause it to return EINTR to the calling program. The default is to
>>>     not allow file operations to be interrupted ....."
>>
>> Traditional file systems guaranteed that never happened, so older
>> applications do not expect to have filesystem operations interrupted. Many
>> do not check for it, or do not handle it properly. I recall a conversation
>> a while back about Postgres in particular not checking for it.
>
> I've occasionally wondered if this is a SysV vs. BSD thing. Under SysV
> signal semantics, any signal would cause the current system call to
> return EINTR. The list of system calls that could be interrupted is
> long, and includes just about anything filesystem-related. So programs
> with any kind of signal handling would handle the broken-NFS case
> automatically.
>
> BSD signal semantics (what Postgres uses) make all system calls
> restart across signals. Thus, a system call can never return EINTR
> unless you have non-blocking I/O enabled. These programs would be
> confused by unexpected EINTRs.

AFAIK, Linux actually aborts syscalls when a signal arrives, and it's just
the libc that restarts them automatically. So, actually, a retry loop like

    do {
        ret = syscall(args);
    } while (ret == -1 && errno == EINTR);

in your code should be equivalent to telling the libc to provide BSD
semantics and just doing

    ret = syscall(args);

> Postgres doesn't check for EINTR on every filesystem system call and thus
> would be susceptible to the above problem.

Even if Postgres checked for EINTR, what could it possibly do in that case?
Just retrying won't have any advantage over simply mounting with "nointr" --
it would still just hang when the NFS server dies.

greetings, Florian Pflug
On Wed, May 24, 2006 at 12:16:13AM +0200, Florian G. Pflug wrote:

>> BSD signal semantics (what Postgres uses) make all system calls
>> restart across signals. Thus, a system call can never return EINTR
>> unless you have non-blocking I/O enabled. These programs would be
>> confused by unexpected EINTRs.
>
> AFAIK, Linux actually aborts syscalls when a signal arrives, and it's
> just the libc that restarts them automatically.

All Unix OSes do something similar. After all, if you define a signal
handler, the kernel has to return to user space to execute your handler. All
BSD did was always restart the syscall (your loop, though probably just by
fiddling with the instruction pointer), whereas SysV never did. Nowadays you
can choose which way you want it using sigaction().

I think the real lesson is that you can emulate BSD semantics if you have
SysV semantics, but not vice versa.

>> Postgres doesn't check for EINTR on every filesystem system call and thus
>> would be susceptible to the above problem.
>
> Even if Postgres checked for EINTR, what could it possibly do in that case?
> Just retrying won't have any advantage over simply mounting with "nointr" --
> it would still just hang when the NFS server dies.

Well, it could check whether statement_timeout has passed and return an error
rather than hanging...

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
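A rough sketch of that last idea in generic C, not actual PostgreSQL code:
retry the interrupted call, but give up once a caller-supplied deadline has
passed. The deadline argument stands in for something like statement_timeout,
and the function name is made up for illustration:

    #include <errno.h>
    #include <time.h>
    #include <unistd.h>

    /* Retry a read() across EINTR, but stop once 'deadline' has passed.
     * Sketch only: in a real server the deadline would come from a setting
     * such as statement_timeout, and the error path would report properly. */
    static ssize_t
    read_with_deadline(int fd, void *buf, size_t len, time_t deadline)
    {
        for (;;)
        {
            ssize_t n = read(fd, buf, len);

            if (n >= 0)
                return n;               /* success (or EOF)                    */
            if (errno != EINTR)
                return -1;              /* a real error: let the caller report */
            if (time(NULL) >= deadline)
            {
                errno = ETIMEDOUT;      /* interrupted and out of time         */
                return -1;
            }
            /* interrupted but still within the deadline: retry */
        }
    }

Whether such a loop buys anything over mounting with 'nointr' is exactly the
trade-off debated above: without 'intr', the call never returns EINTR and
simply keeps waiting.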