Thread: Re: Strange issue with NFS mounted PGDATA on ugreen NAS
On 12/31/2024 12:22 pm, Larry Rosenman wrote:
> When I try to drop a database, PostgreSQL leaves files in the directory
> and does not even try to delete them.
>
> PostgreSQL 16.6, FreeBSD 14.2, PGDATA mounted NFS from UGreen NAS.
>
> Truss of the create/delete attached.
>
> It does NOT seem to happen PG < 16.
>
> Ideas?

As a follow-up, 17.2 exhibits the same behavior, but 15.10 and below
work just fine on the same VM and share.

--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ler@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010

Larry Rosenman <ler@lerctr.org> writes:
> On 12/31/2024 12:22 pm, Larry Rosenman wrote:
>> When I try to drop a database, PostgreSQL leaves files in the directory
>> and does not even try to delete them.
>> PostgreSQL 16.6, FreeBSD 14.2, PGDATA mounted NFS from UGreen NAS.

FWIW, I couldn't reproduce such a problem with PG running on macOS
Sequoia and using an NFS mount from a RHEL8 machine. (I tried with
current master code, but I don't believe that we fixed any related
bugs recently.) Can you try some other combinations of host OS and
NFS source, and see if you can identify what's needed to cause the
failure?

What it looks like is that readdir() is skipping over some files,
which would almost certainly be an NFS server bug. But that
theory could be wrong.

regards, tom lane

On Wed, Jan 1, 2025 at 11:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Larry Rosenman <ler@lerctr.org> writes:
> > On 12/31/2024 12:22 pm, Larry Rosenman wrote:
> >> When I try to drop a database, PostgreSQL leaves files in the directory
> >> and does not even try to delete them.
> >> PostgreSQL 16.6, FreeBSD 14.2, PGDATA mounted NFS from UGreen NAS.
>
> What it looks like is that readdir() is skipping over some files,
> which would almost certainly be an NFS server bug. But that
> theory could be wrong.

Hmm. So in 15 and earlier, rmtree() would read all the file names
into memory and then unlink them, but 54e72b66 changed that and
started unlinking in the loop. Perhaps that wasn't a great idea, but
it's hardly the only place that directory contents might change across
readdir() calls (it's just a place that does that itself).

As has been analysed by the pgbackrest guys, readdir() can be flaky
over changing directories on some NFS (and maybe other network FSs)
implementations (unresolved AFAIK and different from this problem,
they were missing files during backups due to concurrent changes).
The implementation-specific cookie scheme for encoding a sort of
cursor position across readdir() calls has various different problems
on various different OSes, NFS implementations and underlying local
file systems (I looked into this quite a bit when that last discussion
happened, it's a mess). I wonder what a UGreen NAS is running.

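For readers who want the two rmtree() strategies from the message above side by side, here is a minimal sketch in Python rather than PostgreSQL's C. It is an illustration only, not the code from 54e72b66 or from the older branches: the function names and the make_dir() helper are hypothetical, and it handles a flat directory of regular files only.

```python
import os
import tempfile


def rmtree_unlink_in_loop(path):
    """Unlink each entry as soon as the directory scan returns it
    (the shape described for PG 16 and later)."""
    with os.scandir(path) as scan:
        for entry in scan:
            os.unlink(entry.path)
    # Raises OSError if the scan skipped any entries.
    os.rmdir(path)


def rmtree_buffer_then_unlink(path):
    """Read every name into memory first, then unlink
    (the shape described for PG 15 and earlier)."""
    names = os.listdir(path)
    for name in names:
        os.unlink(os.path.join(path, name))
    os.rmdir(path)


def make_dir(count):
    """Hypothetical helper: create a temp directory holding `count` empty files."""
    path = tempfile.mkdtemp()
    for i in range(count):
        open(os.path.join(path, f"{i}_fsm"), "w").close()
    return path
```

On a well-behaved local file system both functions should leave nothing behind. The difference only matters when the directory scan itself misbehaves while entries are being removed, because only the first variant modifies the directory between reads.
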
On 12/31/2024 5:24 pm, Thomas Munro wrote:
> On Wed, Jan 1, 2025 at 11:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Larry Rosenman <ler@lerctr.org> writes:
>> > On 12/31/2024 12:22 pm, Larry Rosenman wrote:
>> >> When I try to drop a database, PostgreSQL leaves files in the directory
>> >> and does not even try to delete them.
>> >> PostgreSQL 16.6, FreeBSD 14.2, PGDATA mounted NFS from UGreen NAS.
>
>> What it looks like is that readdir() is skipping over some files,
>> which would almost certainly be an NFS server bug. But that
>> theory could be wrong.
>
> Hmm. So in 15 and earlier, rmtree() would read all the file names
> into memory and then unlink them, but 54e72b66 changed that and
> started unlinking in the loop. Perhaps that wasn't a great idea, but
> it's hardly the only place that directory contents might change across
> readdir() calls (it's just a place that does that itself).
>
> As has been analysed by the pgbackrest guys, readdir() can be flaky
> over changing directories on some NFS (and maybe other network FSs)
> implementations (unresolved AFAIK and different from this problem,
> they were missing files during backups due to concurrent changes).
> The implementation-specific cookie scheme for encoding a sort of
> cursor position across readdir() calls has various different problems
> on various different OSes, NFS implementations and underlying local
> file systems (I looked into this quite a bit when that last discussion
> happened, it's a mess). I wonder what a UGreen NAS is running.

UGreen is running a (modified) Debian Bookworm Linux.

--
Larry Rosenman

Larry Rosenman <ler@lerctr.org> writes:
> On 12/31/2024 5:24 pm, Thomas Munro wrote:
>> The implementation-specific cookie scheme for encoding a sort of
>> cursor position across readdir() calls has various different problems
>> on various different OSes, NFS implementations and underlying local
>> file systems (I looked into this quite a bit when that last discussion
>> happened, it's a mess). I wonder what a UGreen NAS is running.

> UGreen is running a (modified) Debian Bookworm Linux.

Do you know what its underlying file system is?

regards, tom lane

On 12/31/2024 5:37 pm, Tom Lane wrote:
> Larry Rosenman <ler@lerctr.org> writes:
>> On 12/31/2024 5:24 pm, Thomas Munro wrote:
>>> The implementation-specific cookie scheme for encoding a sort of
>>> cursor position across readdir() calls has various different problems
>>> on various different OSes, NFS implementations and underlying local
>>> file systems (I looked into this quite a bit when that last discussion
>>> happened, it's a mess). I wonder what a UGreen NAS is running.
>
>> UGreen is running a (modified) Debian Bookworm Linux.
>
> Do you know what its underlying file system is?
>
> regards, tom lane

btrfs

--
Larry Rosenman

Larry Rosenman <ler@lerctr.org> writes:
> On 12/31/2024 5:37 pm, Tom Lane wrote:
>> Do you know what its underlying file system is?

> btrfs

OK. My test was with XFS underneath the NFS service.

regards, tom lane

On Wed, Jan 1, 2025 at 1:20 PM Kenneth Marshall <ktm@rice.edu> wrote:
> On Tue, Dec 31, 2024 at 06:58:14PM -0500, Tom Lane wrote:
> > Larry Rosenman <ler@lerctr.org> writes:
> > > On 12/31/2024 5:37 pm, Tom Lane wrote:
> > >> Do you know what its underlying file system is?
> >
> > > btrfs
>
> Maybe there are some btrfs or nfs options that can be used to mitigate
> this effect. Otherwise, a bug report to Debian would be in order, I guess.

Mount option readdirsize on the client side should hide the problem up
to some size you choose, but you can't set it large enough for high
numbers of relations/forks/segments.

Guessing what is happening here: I suspect BTRFS might have positional
offsets 1, 2, 3, ... for directory entries' d_off (the value visible
in struct dirent, used for telldir(), seekdir(), and NFS's
behind-the-curtain paging scheme), and they might slide when you
unlink stuff. Perhaps not immediately, but when the directory fd is
closed on the NFS server (nearly immediately, I guess, given the
stateless nature of NFS; it doesn't matter that the client has its
directory fd open). That would explain how you finished up with so
many missed files. I think XFS's d_off points to the next entry in a
btree leaf page scan, which sounds a lot more stable... until someone
else unlinks the next item underneath you and/or the system decides to
compact stuff, who knows... And other systems have other schemes
based on hashes or raw offsets, with different degrees of stability
and anomalies (cf ELOOP for hash collisions).

NFS is at least supposed to tell the client that its cookie has been
invalidated with a cookie-invalidation-cookie called cookieverf. But
there isn't any specified way to recover. FreeBSD's client looks like
it might try to, but I'm not sure whether Linux's server even
implements it.

Anyway, I'll write a patch to change rmtree() to buffer the names in
memory. In theory there could be hundreds of gigabytes of pathnames,
so perhaps I should do it in batches; I'll look into that.

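To make the guess in the message above concrete, here is a toy model in Python. It is a sketch of the hypothesis only: ToyServer, its position-based cookies and unlink_while_paging() are invented for illustration, and nothing here shows that BTRFS or any NFS server actually behaves this way.

```python
class ToyServer:
    """Directory whose cookie is just a position in the current listing
    (an assumed, deliberately naive scheme)."""

    def __init__(self, names):
        self.names = list(names)

    def readdir(self, cookie, count):
        """Return up to `count` names starting at position `cookie`."""
        page = self.names[cookie:cookie + count]
        next_cookie = cookie + len(page)
        eof = next_cookie >= len(self.names)
        return page, next_cookie, eof

    def remove(self, name):
        # Removing an entry shifts every later entry down one position.
        self.names.remove(name)


def unlink_while_paging(server, page_size):
    """Remove each page of names before asking for the next page.
    Returns whatever the server still holds afterwards."""
    cookie, eof = 0, False
    while not eof:
        page, cookie, eof = server.readdir(cookie, page_size)
        for name in page:
            server.remove(name)
    return server.names
```

With ten entries and a page size of three, the second request lands three positions too far along and the scan ends early, leaving entries behind. With a page size that covers the whole directory, nothing is missed, which matches the remark that a large enough readdirsize would hide the problem up to some directory size.
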
Thomas Munro <thomas.munro@gmail.com> writes:
> NFS is at least supposed to tell the client that its cookie has been
> invalidated with a cookie-invalidation-cookie called cookieverf. But
> there isn't any specified way to recover. FreeBSD's client looks like
> it might try to, but I'm not sure whether Linux's server even
> implements it.

ISTM we used to disclaim responsibility for data integrity if you
try to put PGDATA on NFS. I looked at the current wording about
NFS in runtime.sgml and was frankly shocked at how optimistic it is.
Shouldn't we be saying something closer to "if it breaks you
get to keep both pieces"?

> Anyway, I'll write a patch to change rmtree() to buffer the names in
> memory. In theory there could be hundreds of gigabytes of pathnames,
> so perhaps I should do it in batches; I'll look into that.

This feels a lot like putting lipstick on a pig.

regards, tom lane

On Wed, Jan 1, 2025 at 6:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ISTM we used to disclaim responsibility for data integrity if you
> try to put PGDATA on NFS. I looked at the current wording about
> NFS in runtime.sgml and was frankly shocked at how optimistic it is.
> Shouldn't we be saying something closer to "if it breaks you
> get to keep both pieces"?

I now suspect this specific readdir() problem is in FreeBSD's NFS
client. See below. There have also been reports of missed files from
(IIRC) Linux clients without much analysis, but that doesn't seem too
actionable from here unless someone can come up with a repro or at
least some solid details to investigate; those involved unspecified
(possibly appliance/cloud) NFS and CIFS file servers.

The other issue I know of personally is NFS ENOSPC, which has some
exciting disappearing-committed-data failure modes caused by lazy
allocation on Linux's implementation (and possibly others), that I've
written about before. But actually that one is not strictly an
NFS-only issue, it's just really easy to hit that way, and I have a
patch to fix it on our side, which I hope to re-post soon.
Independently of this, really, as it's tangled up with quite a few
other things...

> > Anyway, I'll write a patch to change rmtree() to buffer the names in
> > memory. In theory there could be hundreds of gigabytes of pathnames,
> > so perhaps I should do it in batches; I'll look into that.
>
> This feels a lot like putting lipstick on a pig.

Hehe. Yeah. Abandoned.

I see this issue here with a FreeBSD client talking to a Debian server
exporting BTRFS or XFS, even with readdirsize set high so that
multi-request paging is not expected. Looking at Wireshark and the
NFS spec (disclaimer: I have never studied NFS at this level before,
addito salis grano), what I see is a READDIR request with cookie=0
(good), which receives a response containing the whole directory
listing and a final entry marker eof=1 (good), but then FreeBSD
unexpectedly (to me) sends *another* READDIR request with cookie=662,
which is a real cookie that was received somewhere in the middle of
the first response on the entry for "13816_fsm", and that entry was
followed by an entry for "13816_vm". The second request gets a
response that begins at "13816_vm" (correct on the server's part).
Then the client sends REMOVE (unlink) requests for some but not all of
the files, including "13816_fsm" but not "13816_vm". Then it sends
yet another READDIR request with cookie=0 (meaning go from the top),
and gets a non-empty directory listing, but immediately sends RMDIR,
which unsurprisingly fails with NFS3ERR_NOTEMPTY. So my best guess so
far is that FreeBSD's NFS client must be corrupting its directory
cache when files are unlinked, and it's not the server's fault. I
don't see any obvious problem with the way the cookies work. Seems
like material for a minimised bug report elsewhere, and not our issue.

Thomas Munro <thomas.munro@gmail.com> writes:
> I now suspect this specific readdir() problem is in FreeBSD's NFS
> client. See below. There have also been reports of missed files from
> (IIRC) Linux clients without much analysis, but that doesn't seem too
> actionable from here unless someone can come up with a repro or at
> least some solid details to investigate; those involved unspecified
> (possibly appliance/cloud) NFS and CIFS file servers.

I forgot to report back, but yesterday I spent time unsuccessfully
trying to reproduce the problem with macOS client and NFS server
using btrfs (a Synology NAS running some no-name version of Linux).
So that lends additional weight to your conclusion that it isn't
specifically a btrfs bug.

> I see this issue here with a FreeBSD client talking to a Debian server
> exporting BTRFS or XFS, even with readdirsize set high so that
> multi-request paging is not expected. Looking at Wireshark and the
> NFS spec (disclaimer: I have never studied NFS at this level before,
> addito salis grano), what I see is a READDIR request with cookie=0
> (good), and which receives a response containing the whole directory
> listing and a final entry marker eof=1 (good), but then FreeBSD
> unexpectedly (to me) sends *another* READDIR request with cookie=662,
> which is a real cookie that was received somewhere in the middle of
> the first response on the entry for "13816_fsm", and that entry was
> followed by an entry for "13816_vm". The second request gets a
> response that begins at "13816_vm" (correct on the server's part).
> Then the client sends REMOVE (unlink) requests for some but not all of
> the files, including "13816_fsm" but not "13816_vm". Then it sends
> yet another READDIR request with cookie=0 (meaning go from the top),
> and gets a non-empty directory listing, but immediately sends RMDIR,
> which unsurprisingly fails NFS3ERR_NOTEMPTY. So my best guess so far
> is that FreeBSD's NFS client must be corrupting its directory cache
> when files are unlinked, and it's not the server's fault. I don't see
> any obvious problem with the way the cookies work. Seems like
> material for a minimised bug report elsewhere, and not our issue.

Yeah, that seems pretty damning. Thanks for looking into it.

regards, tom lane

I wrote:
> I forgot to report back, but yesterday I spent time unsuccessfully
> trying to reproduce the problem with macOS client and NFS server
> using btrfs (a Synology NAS running some no-name version of Linux).

Also, I *can* reproduce it using the same NFS server and a FreeBSD 14.2
client. At least with this pair of machines, the behavior seems
deterministic: "createdb foo" followed by "dropdb foo" leaves the
same set of not-dropped files behind each time.

I added a bit of debug logging to rmtree.c and verified that it's not
seeing anything odd happening, except that readdir never returns
anything about the missed files.

I will file a bug report, unless you already did?

regards, tom lane
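Anyone repeating the createdb/dropdb experiment above may want a quick way to see what was left behind. The following Python helper is a sketch under assumptions: leftover_files() is a hypothetical name, and it relies only on the standard layout in which a database's files live under PGDATA/base/<database OID>. The OID has to be looked up before dropping the database, for example with SELECT oid FROM pg_database WHERE datname = 'foo'.

```python
import os


def leftover_files(pgdata, db_oid):
    """Return the sorted names still present under PGDATA/base/<db_oid>.

    An empty list means the directory is empty or was removed entirely,
    which is what a successful DROP DATABASE should leave behind.
    """
    path = os.path.join(pgdata, "base", str(db_oid))
    try:
        return sorted(os.listdir(path))
    except FileNotFoundError:
        return []
```

Running it after each dropdb and comparing the returned lists is one way to confirm the observation that the same set of files is missed every time.
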