I've been working closely with Black Duck Software, and their customer,
to get to the bottom of $subject, and we have just declared success.
Here is a summary of the problem and solution for the archives.
The end customer has a fairly beefy server with lots of RAM and CPUs,
and is running an I/O-intensive application that takes hours to complete.
PGDATA resides on an NFS-mounted NetApp. The OS is x86_64 RHEL, and was
up to date with the latest kernel.
They had been experiencing the "data beyond EOF" ERROR fairly often, but
sporadically, for some time. We installed an instrumented version of
Postgres that pointed toward a kernel bug -- specifically
lseek(...,SEEK_END) returning the wrong result, consistent with postgres
source code comments (although those were written in reference to a much
older kernel).
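
For anyone who wants to see what that symptom looks like from user
space, here is a minimal sketch of the kind of check involved. It is
not the instrumentation we actually used. The function names and the
probe file are made up for illustration, and BLCKSZ is assumed to be
the default 8192. It extends a file one block at a time and compares
what lseek(..., SEEK_END) reports against the size that was just
written. The failures we saw were sporadic and only showed up under
heavy I/O, so a simple loop like this may well report nothing even on
an affected kernel.

```python
import os
import sys

BLCKSZ = 8192  # assumed: the default PostgreSQL block size


def size_via_lseek(fd):
    """Return the file size as reported by lseek(fd, 0, SEEK_END)."""
    return os.lseek(fd, 0, os.SEEK_END)


def check_extend(path, nblocks):
    """Extend a file block by block and collect size mismatches.

    Returns a list of (expected, reported) tuples, one for each block
    after which lseek(SEEK_END) disagreed with the bytes written so far.
    """
    mismatches = []
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for blkno in range(nblocks):
            # Write one zero-filled block at its offset, extending the file.
            os.pwrite(fd, b"\0" * BLCKSZ, blkno * BLCKSZ)
            expected = (blkno + 1) * BLCKSZ
            reported = size_via_lseek(fd)
            if reported != expected:
                mismatches.append((expected, reported))
    finally:
        os.close(fd)
    return mismatches


if __name__ == "__main__":
    # Hypothetical usage: pass a scratch file path on the NFS mount.
    probe = sys.argv[1] if len(sys.argv) > 1 else "lseek_probe.tmp"
    bad = check_extend(probe, 1024)
    print("mismatches:", bad)
    os.unlink(probe)
```

On a healthy filesystem the list comes back empty. Point it at a
scratch file, never at anything inside PGDATA.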
Fortunately for us, the kernel NFS client maintainer, Trond Myklebust,
happened to be involved in this investigation, and he was able to find
and fix the kernel bugs at the root of this problem. We have now
finished sufficient testing to convince ourselves that Trond's patch
indeed solves the problem at hand, at least in the form we have been
experiencing.
Part of his patch was a backport from newer kernels, but part was
completely new. He has submitted (or will submit) the new part upstream
for inclusion in future kernels, and has filed a bug report with Red Hat
so that the patch will hopefully be included in updated kernel RPMs from
Red Hat.
For reference, the bug report can be found here:
https://bugzilla.redhat.com/show_bug.cgi?id=672981
Trond was also kind enough to provide a patched version of the current
RHEL kernel. An x86_64 RPM and a source RPM are available here in case
someone has an immediate need:
http://www.joeconway.com/rpms/kernel-2.6.18-238.el5.nfslseekfixv2.x86_64.rpm
http://www.joeconway.com/rpms/kernel-2.6.18-238.el5.nfslseekfixv2.src.rpm
HTH,
Joe
--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support