Re: Patch to improve reliability of postgresql on linux nfs - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: Patch to improve reliability of postgresql on linux nfs
Date
Msg-id 4E695AFB.90606@agliodbs.com
In response to Patch to improve reliability of postgresql on linux nfs  (George Barnett <gbarnett@atlassian.com>)
List pgsql-hackers
George,

I'm quoting you here because in the version of your email which got
posted to the list your whole explanation got put below the patch text,
making it hard to find the justification for the patch.  Follows:

> I run a number of postgresql installations on NFS and on the whole I
> find this to be very reliable.  I have however run into a few issues
> when there are concurrent writes from multiple processes.
> 
> I see errors such as the following:
> 
> 2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection authorized: user=postgres database=postgres
> 2011-07-31 22:13:35 EST    ERROR:  could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
> 2011-07-31 22:13:35 EST    HINT:  Check free disk space.
> 2011-07-31 22:13:35 EST    CONTEXT:  writing block 1 of relation global/2671
> 2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: host=[local]
> 
> I have also seen similar errors coming out of the WAL writer, however
> they occur at the level PANIC, which is a little more drastic.
> 
> After spending some time with debug logging turned on and even more
> time staring at strace, I believe this occurs when one process was
> writing to a data file and it received a SIGINT from another process, eg:
> 
> (These logs are from another similar run)
> 
> [pid  1804] <... fsync resumed> )       = 0
> [pid 10198] kill(1804, SIGINT <unfinished ...>
> [pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
> [pid  1804] write(3, "c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0"..., 32768 <unfinished ...>
> [pid 10198] <... kill resumed> )        = 0
> [pid  1804] <... write resumed> )       = 4096
> [pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
> [pid  1804] rt_sigreturn(0x2)           = 4096
> [pid  1804] write(2, "\0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999"..., 260) = 260
> [pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  <unfinished ...>
> [pid  1802] <... select resumed> )      = 1 (in [5], left {0, 999000})
> [pid  1804] <... rt_sigprocmask resumed> NULL, 8) = 0
> [pid  1804] tgkill(1804, 1804, SIGABRT) = 0
> [pid  1802] read(5,  <unfinished ...>
> [pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
> Process 1804 detached
> 
> After finding this, I came up with the following test case which easily replicated our issue:
> 
> #!/bin/bash
> 
> name=$1
> number=1
> while true; do 
>   /usr/bin/psql -c "CREATE USER \"$name$number\" WITH NOSUPERUSER INHERIT NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';"
>   /usr/bin/createdb -E UNICODE -O $name$number $name$number
>   if grep -q PANIC /data/postgresql/data/pg_log/*; then
>     exit
>   fi
>   let number=$number+1
> done
> 
> When I run a single copy of this script, I have no issues, however when
> I start up a few more copies to simultaneously hit the DB, it crashes
> quite quickly - usually within 20 or 30 seconds.
> 
> After looking through the code I found that when postgres calls write()
> it doesn't retry.  In order to address the issue with the PANIC in the
> WAL writer I set the sync method to o_sync which solved the issue in
> that part of the code, however I was still seeing failures in other
> areas of the code (such as the FileWrite function).  Following this, I
> spoke to an NFS guru who pointed out that writes under linux are not
> guaranteed to complete unless you open up O_SYNC or similar on the file
> handle.  I had a look in the libc docs and found this:
> 
> http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html
> 
> ----
> The write function writes up to size bytes from buffer to the file with
> descriptor filedes. The data in buffer is not necessarily a character
> string and a null character is output like any other character.
> 
> The return value is the number of bytes actually written. This may be
> size, but can always be smaller. Your program should always call write
> in a loop, iterating until all the data is written.
> ----
> 
> After finding this, I checked a number of other pieces of software that
> we see no issues with on NFS (such as the JVM) for their usage of
> write().  I confirmed they write in a while loop and set about patching
> the postgres source.
> 
> I have made this patch against 8.4.8 and confirmed that it fixes the
> issue we see on our systems.  I have also checked that make check still
> passes.
> 
> As my C is terrible, I would welcome any comments on the implementation of this patch.
> 
> Best regards,
> 
> George
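
For anyone skimming the thread, the "call write in a loop" advice from
the glibc manual quoted above can be sketched roughly as follows.  This
is only an illustration of the technique, not the code from George's
patch: write_all is a hypothetical helper name, and mapping a zero-byte
return to ENOSPC is an assumption made here so the caller always gets an
errno to report.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * write_all: hypothetical helper (not taken from the patch under
 * discussion).  Calls write() repeatedly until all `len` bytes have been
 * accepted or a real error occurs.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int
write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0)
    {
        ssize_t n = write(fd, p, len);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before any byte was written */
            return -1;          /* genuine error; errno was set by write() */
        }
        if (n == 0)
        {
            /* Assumption: treat "no progress" as out of space. */
            errno = ENOSPC;
            return -1;
        }

        p += n;                 /* short write: skip what was accepted */
        len -= (size_t) n;
    }
    return 0;
}
```

With a loop like this, the 4096-of-32768-byte write in the strace output
above would simply be followed by another write() for the remaining
bytes instead of being reported as an error.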


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

