Some thoughts on NFS - Mailing list pgsql-hackers

From Thomas Munro
Subject Some thoughts on NFS
Msg-id CA+hUKGKa-HtBHJaBUJuZHsKwvVkxW2nE0W8BqRVOhNr5yNgiDA@mail.gmail.com
List pgsql-hackers
Hello hackers,

As discussed in various threads, PostgreSQL-on-NFS is viewed with
suspicion.  Perhaps others knew this already, but I first learned of
the specific mechanism (or at least one of them) for corruption from
Craig Ringer's writing[1] about fsync() on Linux.

The problem is that, on NFS, close() and fsync() could report ENOSPC,
indicating that your dirty data has been dropped from the Linux page
cache, and then future fsync() operations could succeed as if nothing
had happened.  It's easy to see that happening[2].

Since Craig's report, we committed a patch based on his PANIC
proposal: we now panic on fsync() and close() failure.  Recovering
from the WAL may or may not be possible, but at no point will we allow
a checkpoint to retry and bogusly succeed.
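
To make that policy concrete, here is a minimal sketch of the idea.  It
is not the committed patch: the names panic_on_io_error(),
fsync_or_panic() and close_or_panic() are hypothetical, and abort()
merely stands in for a PANIC-level error report.  The point it
illustrates is that a failed fsync() or close() is never retried,
because a later call might report success for data the kernel has
already dropped.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for a PANIC-level report: log and abort. */
void
panic_on_io_error(const char *what, int saved_errno)
{
    fprintf(stderr, "PANIC: %s failed: %s\n", what, strerror(saved_errno));
    abort();
}

/* Flush fd; on any failure, abort instead of retrying. */
void
fsync_or_panic(int fd)
{
    assert(fd >= 0);
    if (fsync(fd) != 0)
        panic_on_io_error("fsync", errno);
}

/* Close fd; on NFS this can be where a write-back error surfaces. */
void
close_or_panic(int fd)
{
    assert(fd >= 0);
    if (close(fd) != 0)
        panic_on_io_error("close", errno);
}
```

After an abort, the only safe way forward is crash recovery from the
WAL, which replays the writes rather than trusting the page cache.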

So, does this mean we fixed the problems with NFS?  Not sure, but I do
see a couple of problems (and they're problems Craig raised in his
thread):

The first is practical.  Running out of disk space (or quota) is not
all that rare (much more common than EIO from a dying disk, I'd
guess), and definitely recoverable by an administrator: just create
more space.  It would be really nice to avoid panicking for an
*expected* condition.

To do that, I think we'd need to move the ENOSPC error back to relation
extension time (when we call pwrite()), as happens for local
filesystems.  Luckily NFS 4.2 provides a mechanism to do that: the NFS
4.2 ALLOCATE[3] command.  To make this work, I think there are two
subproblems to solve:

1.  Figure out how to get the ALLOCATE command all the way through the
stack from PostgreSQL to the remote NFS server, and know for sure that
it really happened.  On the Debian buster Linux 4.18 system I checked,
fallocate() fails with EOPNOTSUPP, and posix_fallocate()
appears to succeed but it doesn't really do anything at all (though I
understand that some versions sometimes write zeros to simulate
allocation, which in this case would be equally useless as it doesn't
reserve anything on an NFS server).  We need the server and NFS client
and libc to be of the right version and cooperate and tell us that
they have really truly reserved space, but there isn't currently a way
as far as I can tell.  How can we achieve that, without writing our
own NFS client?

2.  Deal with the resulting performance suckage.  Extending 8kB at a
time with synchronous network round trips won't fly.
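
The two subproblems above can be sketched together as follows.  This is
only an illustration under assumptions: it is Linux-specific (glibc's
fallocate() needs _GNU_SOURCE), the names reserve_blocks(),
ReserveResult, SKETCH_BLOCK_SIZE and SKETCH_EXTEND_BLOCKS are
hypothetical, and the batch of 64 blocks is an arbitrary number chosen
to amortize round trips.  It deliberately avoids posix_fallocate(),
since a libc fallback could hide the fact that nothing was reserved.
Whether a successful fallocate() on an NFS mount really means the
server reserved space is exactly the open question from point 1; the
sketch only shows how the caller could tell "no space" apart from "not
supported".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

#define SKETCH_BLOCK_SIZE    8192   /* one 8kB block */
#define SKETCH_EXTEND_BLOCKS 64     /* hypothetical batch size */

typedef enum
{
    RESERVE_OK,             /* the kernel reported the range allocated */
    RESERVE_NO_SPACE,       /* ENOSPC or EDQUOT: an expected condition */
    RESERVE_UNSUPPORTED,    /* EOPNOTSUPP: no real reservation possible */
    RESERVE_ERROR           /* anything else; errno is left set */
} ReserveResult;

/*
 * Try to reserve nblocks blocks starting at offset, in one call rather
 * than one call per block.  Mode 0 also extends the file size.
 */
ReserveResult
reserve_blocks(int fd, off_t offset, int nblocks)
{
    off_t       len;

    assert(offset >= 0 && nblocks > 0);
    len = (off_t) nblocks * SKETCH_BLOCK_SIZE;

    for (;;)
    {
        if (fallocate(fd, 0, offset, len) == 0)
            return RESERVE_OK;
        if (errno == EINTR)
            continue;       /* interrupted by a signal: try again */
        if (errno == ENOSPC || errno == EDQUOT)
            return RESERVE_NO_SPACE;
        if (errno == EOPNOTSUPP)
            return RESERVE_UNSUPPORTED;
        return RESERVE_ERROR;
    }
}
```

A caller extending a relation could call reserve_blocks(fd, end_offset,
SKETCH_EXTEND_BLOCKS), raise an ordinary error on RESERVE_NO_SPACE, and
on RESERVE_UNSUPPORTED fall back to today's behaviour while knowing that
ENOSPC may still show up later at fsync() or close() time.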

A theoretical question I thought of is whether there are any
interleavings of operations that allow a checkpoint to complete
bogusly, while a concurrent close() in a regular backend fails with
EIO for data that was included in the checkpoint, and panics.  I
*suspect* the answer is that every interleaving is safe for 4.16+
kernels that report IO errors to every descriptor.  In older kernels I
wonder if there could be a schedule where an arbitrary backend eats
the error while closing, then the checkpointer calls fsync()
successfully and then logs a checkpoint, and then the arbitrary
backend panics (too late).  I suspect EIO on close() doesn't happen in
practice on regular local filesystems, which is why I mention it in
the context of NFS, but I could be wrong about that.

Everything I said above about NFS may also apply to CIFS, I dunno.

[1] https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAEepm%3D1FGo%3DACPKRmAxvb53mBwyVC%3DTDwTE0DMzkWjdbAYw7sw%40mail.gmail.com
[3] https://tools.ietf.org/html/rfc7862#page-64

-- 
Thomas Munro
https://enterprisedb.com

