Some thoughts on NFS - Mailing list pgsql-hackers

From:      Thomas Munro
Subject:   Some thoughts on NFS
Msg-id:    CA+hUKGKa-HtBHJaBUJuZHsKwvVkxW2nE0W8BqRVOhNr5yNgiDA@mail.gmail.com
Responses: Re: Some thoughts on NFS
List:      pgsql-hackers
Hello hackers,

As discussed in various threads, PostgreSQL-on-NFS is viewed with suspicion. Perhaps others knew this already, but I first learned of the specific mechanism (or at least one of them) for corruption from Craig Ringer's writing[1] about fsync() on Linux. The problem is that close() and fsync() could report ENOSPC, indicating that your dirty data has been dropped from the Linux page cache, and then future fsync() operations could succeed as if nothing had happened. It's easy to see that happening[2].

Since Craig's report, we committed a patch based on his PANIC proposal: we now panic on fsync() and close() failure. Recovering from the WAL may or may not be possible, but at no point will we allow a checkpoint to retry and bogusly succeed.

So, does this mean we fixed the problems with NFS? Not sure, but I do see a couple of problems (and they're problems Craig raised in his thread):

The first is practical. Running out of disk space (or quota) is not all that rare (much more common than EIO from a dying disk, I'd guess), and definitely recoverable by an administrator: just create more space. It would be really nice to avoid panicking for an *expected* condition. To do that, I think we'd need to move the ENOSPC error back to relation extension time (when we call pwrite()), as happens for local filesystems. Luckily NFS 4.2 provides a mechanism to do that: the NFS 4.2 ALLOCATE[3] command. To make this work, I think there are two subproblems to solve:

1. Figure out how to get the ALLOCATE command all the way through the stack from PostgreSQL to the remote NFS server, and know for sure that it really happened. On the Debian buster Linux 4.18 system I checked, fallocate() reports EOPNOTSUPP, and posix_fallocate() appears to succeed but doesn't really do anything at all (though I understand that some versions sometimes write zeros to simulate allocation, which in this case would be equally useless as it doesn't reserve anything on an NFS server).
We need the server, the NFS client, and libc to be of the right versions, to cooperate, and to tell us that they have really, truly reserved space, but as far as I can tell there isn't currently a way to know that. How can we achieve it, without writing our own NFS client?

2. Deal with the resulting performance suckage. Extending 8kB at a time with synchronous network round trips won't fly.

A theoretical question I thought of is whether there are any interleavings of operations that allow a checkpoint to complete bogusly, while a concurrent close() in a regular backend fails with EIO for data that was included in the checkpoint, and panics. I *suspect* the answer is that every interleaving is safe for 4.16+ kernels that report IO errors to every descriptor. In older kernels I wonder if there could be a schedule where an arbitrary backend eats the error while closing, then the checkpointer calls fsync() successfully and logs a checkpoint, and then the arbitrary backend panics (too late). I suspect EIO on close() doesn't happen in practice on regular local filesystems, which is why I mention it in the context of NFS, but I could be wrong about that.

Everything I said above about NFS may also apply to CIFS, I dunno.

[1] https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAEepm%3D1FGo%3DACPKRmAxvb53mBwyVC%3DTDwTE0DMzkWjdbAYw7sw%40mail.gmail.com
[3] https://tools.ietf.org/html/rfc7862#page-64

--
Thomas Munro
https://enterprisedb.com