Thread: Fwd: Re: A new look at old NFS readdir() problems?
@Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on my issue. -------- Original Message -------- Subject: Re: A new look at old NFS readdir() problems? Date: 01/02/2025 10:08 am From: Rick Macklem <rick.macklem@gmail.com> To: Thomas Munro <tmunro@freebsd.org> Cc: Rick Macklem <rmacklem@freebsd.org>, Larry Rosenman <ler@lerctr.org> On Thu, Jan 2, 2025 at 2:50 AM Thomas Munro <tmunro@freebsd.org> wrote: > > Hi Rick > CC: ler > > I hope you don't mind me reaching out directly, I just didn't really > want to spam existing bug reports without sufficient understanding to > actually help yet... but I figured I should get in touch and see if > you have any clues or words of warning, since you've worked on so much > of the NFS code. I'm a minor FBSD contributor and interested in file > systems, but not knowledgeable about NFS; I run into/debug/report a > lot of file system bugs on a lot of systems in my day job on > databases. I'm interested to see if I can help with this problem. > Existing ancient report and interesting email: > > https://lists.freebsd.org/pipermail/freebsd-fs/2014-October/020155.html > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=57696 > > What we ran into is not the "bad cookie" state, which doesn't really > seem to be recoverable in general, from what I understand (though the > FreeBSD code apparently would try, huh). It's a simple case where the > NFS client requests a whole directory with a large READDIR request, > and then tries to unlink all the files in a traditional > while-readdir()-unlink() loop that works on other systems. In general, NFS is not a POSIX compliant file system, due to its protocol design. The above is one example.
The only "safe" way is to opendir() or rewinddir() after every removal. The above usually works (and always worked for UFS long ago) because the directory offset cookies for subsequent entries in the directory after the entry unlinked happened to "still be valid". That is no longer true for FreeBSD's UFS nor for many other file systems that can be exported.

If the client reads the entire directory in one READDIR, then it is fine, since it has no need for the directory offset cookies. However, there is a limit to how much a single READDIR can do (these days for NFSv4.1/4.2, it could be raised to just over 1Mbyte, however FreeBSD limits it to 8K at the moment).

Another way to work around the problem is to read the entire directory into the client via READDIRs before starting to do the unlinks. The opendir()/readdir() code in libc could be hacked to do that, but I have never tried to push such a patch into FreeBSD. (It would be limited by how much memory can be malloc()'d, which is pretty generous compared to even large directories with tens of thousands of entries.)

The above is true for all versions of NFS up to NFSv4.2, which is the current one, and unless some future version of NFS does READDIR differently (I won't live long enough to see this;-), it will always be the case.

If my comment above was not clear, the following encoding is the "safe" way to remove all entries in a directory:

    do {
        dir = opendir("X");
        dp = readdir(dir);
        if (dp != NULL)
            unlink(dp->d_name);
        closedir(dir);
    } while (dp != NULL);

In theory, the directory_offset_cookie was supposed to handle this, but it has never worked correctly, for a couple of reasons:

1 - RFC1813 (the NFSv3 one) did not describe the cookie verifier correctly. It should only change when cookies for extant entries change. The description suggested it should change whenever an entry is deleted, since that cookie is no longer valid.
2 - #1 only works if directory offset cookies for other entries in the directory do not change when an entry is deleted. This used to be the case for UFS, but was broken in FreeBSD when a commit many years ago optimized ufs_readdir() to compress out invalid entries. Doing this changes the directory offset cookies every time an entry is deleted at the beginning of a directory block. rick > On FreeBSD > it seems to clobber its own directory cache, make extra unnecessary > READDIR requests, and skip some of the files. Or maybe I have no idea > what's going on and this is a hopelessly naive question and mission > :-) > > Here's what we learned so far starting from Larry's report: > > https://www.postgresql.org/message-id/flat/04f95c3c13d4a9db87b3ac082a9f4877%40lerctr.org > > Note that this issue has nothing to do with "bad cookie" errors (I > doubt the server I'm talking to even implements that -- instead it > tries to have cookies that are persistent/stable). > > Also, while looking into this and initially suspecting cookie > stability bugs (incorrectly), I checked a bunch of local file systems > to understand how their cookies work, and I think I found a related > problem when FreeBSD exports UFS, too. I didn't repro this with NFS > but it's clearly visible from d_off locally with certain touch, rm > sequences. First, let me state what I think the cookie should be > trying to achieve, on a system that doesn't implement "bad cookie" but > instead wants cookies that are persistent/always valid: if you make a > series of READDIR requests using the cookie from the final entry of > the previous response, it should be impossible to miss any entry that > existed before your first call to readdir(), and impossible to see any > entry twice. It is left undefined whether entries created after that > time are visible, since anything else would require unbounded time or > space via locks or multi-version magic (= isolation problems from > database-land). 
> > Going back to the early 80s, Sun UFS looks good (based on illumos > source code) because it doesn't seem to move entries after they are > created. That must have been the only file system when they invented > VFS and NFS. Various other systems since have been either complex but > apparently good (ZFS/ZAP cursors can tolerate up to 2^16 hash > collisions which I think we can call statistically impossible, XFS > claims to be completely stable though I didn't understand fully why, > BTRFS assigns incrementing numbers that will hopefully not wrap, ...), > or nearly-good-enough-but-ugh (ext4 uses hashes like ZFS but > apparently fails with ELOOP on hash collisions?). I was expecting > FreeBSD UFS to be like Sun UFS but I don't think it is! In the UFS > code since at least 4.3BSD (but apparently not in the Sun version, > forked before or removed later?), inserting a new entry can compact a > directory page, which moves the offset of a directory entry lower. > AFAICS we can't move an entry lower, or we risk skipping it in NFS > readdir(), and we can't move it higher, or we risk double-reporting it > in readdir(). Or am I missing something? > > Thanks for reading and happy new year, > > Thomas Munro -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: ler@lerctr.org US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010
Larry Rosenman <ler@lerctr.org> writes: > @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on > my issue. Thanks for reaching out to him. So if I'm reading this correctly, there's little point in filing a FreeBSD bug because it'll be dismissed as unfixable. This leaves us in rather a nasty position. Sure, we could rewrite rmtree() as Thomas suggested upthread, but I'm still of the opinion that that's smearing lipstick on a pig. rmtree() is the least of our worries: it doesn't need to expect that anybody else will be modifying the target directory, plus it can easily restart its scan without complicated bookkeeping. I doubt we can make such an assumption for all our uses of readdir(), or that it's okay to miss or double-process files in every one of them. I'm still of the opinion that the best thing to do is disclaim safety of storing a database on NFS. regards, tom lane
On Thu, Jan 2, 2025 at 03:48:53PM -0500, Tom Lane wrote: > Larry Rosenman <ler@lerctr.org> writes: > > @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on > > my issue. > > Thanks for reaching out to him. So if I'm reading this correctly, > there's little point in filing a FreeBSD bug because it'll be > dismissed as unfixable. > > This leaves us in rather a nasty position. Sure, we could rewrite > rmtree() as Thomas suggested upthread, but I'm still of the opinion > that that's smearing lipstick on a pig. rmtree() is the least of > our worries: it doesn't need to expect that anybody else will be > modifying the target directory, plus it can easily restart its scan > without complicated bookkeeping. I doubt we can make such an > assumption for all our uses of readdir(), or that it's okay to > miss or double-process files in every one of them. > > I'm still of the opinion that the best thing to do is disclaim > safety of storing a database on NFS. It would be nice to have some details on the docs about why NFS can cause problems so we have something to point to when people ask. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
On 01/02/2025 2:50 pm, Bruce Momjian wrote: > On Thu, Jan 2, 2025 at 03:48:53PM -0500, Tom Lane wrote: >> Larry Rosenman <ler@lerctr.org> writes: >> > @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on >> > my issue. >> >> Thanks for reaching out to him. So if I'm reading this correctly, >> there's little point in filing a FreeBSD bug because it'll be >> dismissed as unfixable. >> >> This leaves us in rather a nasty position. Sure, we could rewrite >> rmtree() as Thomas suggested upthread, but I'm still of the opinion >> that that's smearing lipstick on a pig. rmtree() is the least of >> our worries: it doesn't need to expect that anybody else will be >> modifying the target directory, plus it can easily restart its scan >> without complicated bookkeeping. I doubt we can make such an >> assumption for all our uses of readdir(), or that it's okay to >> miss or double-process files in every one of them. >> >> I'm still of the opinion that the best thing to do is disclaim >> safety of storing a database on NFS. > > It would be nice to have some details on the docs about why NFS can > cause problems so we have something to point to when people ask.

What about doing what Rick suggests?

    do {
        dir = opendir("X");
        dp = readdir(dir);
        if (dp != NULL)
            unlink(dp->d_name);
        closedir(dir);
    } while (dp != NULL);

?
-- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: ler@lerctr.org US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010
On Thu, Jan 2, 2025 at 3:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I'm still of the opinion that the best thing to do is disclaim > safety of storing a database on NFS. If we're going to disclaim support for NFS, it would certainly be better to do that clearly and with reasons than otherwise. However, I suspect a substantial number of users will still use NFS and then blame us when it doesn't work; or alternatively say that our software sucks because it doesn't work on NFS. Neither of which are very nice outcomes. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jan 2, 2025 at 3:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I'm still of the opinion that the best thing to do is disclaim >> safety of storing a database on NFS. > If we're going to disclaim support for NFS, it would certainly be > better to do that clearly and with reasons than otherwise. However, I > suspect a substantial number of users will still use NFS and then > blame us when it doesn't work; or alternatively say that our software > sucks because it doesn't work on NFS. Neither of which are very nice > outcomes. Are you prepared to buy into "we will make every bit of code that uses readdir() proof against arbitrary lies from readdir()"? I'm not: I cannot see how to do that in any but the simplest cases -- rmtree() being about the simplest. Even if we had a bulletproof design in mind, it's pretty hard to believe that future patches would apply it every time. I think it's inevitable that we'd have bugs like "sometimes not every file gets fsync'd", which would be impossible to find until someone's data gets eaten, and maybe still impossible to find afterwards. (To be clear: if this is how FreeBSD acts, then I'm afraid we already do have such bugs. The rmtree case is just easier to observe than a missed fsync.) regards, tom lane
On 01/02/2025 3:42 pm, Thomas Munro wrote: > On Fri, Jan 3, 2025 at 10:16 AM Larry Rosenman <ler@lerctr.org> wrote: >> What about doing what Rick suggests? >> do { >> dir = opendir("X"); >> dp = readdir(dir); >> if (dp != NULL) >> unlink(dp->d_name); >> closedir(dir); >> } while (dp != NULL); >> ? > > That only works for unlink loops where we expect no concurrent > modifications. DROP DATABASE qualifies, but we use readdir() on > directories that might be changing in various other places, including > backups. They'd silently fail to do their job, and can't use that > technique as they aren't unlinking as they go and so they would never > make progress. Which leads to the lipstick/pig conclusion: we'd have > other, worse, buggy behaviour still. > > It might be possible to make our own readdir snapshot thing that > checks a shmem counter of all links/unlinks before and after and > retries. But ugh... Would it make sense for ONLY drop database to have the above loop? -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: ler@lerctr.org US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010
Larry Rosenman <ler@lerctr.org> writes: > Would it make sense for ONLY drop database to have the above loop? Not really. We'd just be papering over the most-easily-observable consequence of readdir's malfeasance. There'd still be problems like basebackups omitting files, missed fsyncs potentially leading to data loss, etc. I am sort of wondering though why we've not heard reports of this many times before. Did FreeBSD recently change something in this area? Also, if as they argue it's a fundamental problem in the NFS protocol, why are we failing to reproduce it with other clients? regards, tom lane
On 01/02/2025 4:56 pm, Tom Lane wrote: > Larry Rosenman <ler@lerctr.org> writes: >> Would it make sense for ONLY drop database to have the above loop? > > Not really. We'd just be papering over the most-easily-observable > consequence of readdir's malfeasance. There'd still be problems > like basebackups omitting files, missed fsyncs potentially leading > to data loss, etc. > > I am sort of wondering though why we've not heard reports of this > many times before. Did FreeBSD recently change something in this > area? Also, if as they argue it's a fundamental problem in the NFS > protocol, why are we failing to reproduce it with other clients? > > regards, tom lane Pre version 16, we worked just fine. There was a change in 16. -- Larry Rosenman http://www.lerctr.org/~ler Phone: +1 214-642-9640 E-Mail: ler@lerctr.org US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010
Larry Rosenman <ler@lerctr.org> writes: > On 01/02/2025 4:56 pm, Tom Lane wrote: >> I am sort of wondering though why we've not heard reports of this >> many times before. Did FreeBSD recently change something in this >> area? Also, if as they argue it's a fundamental problem in the NFS >> protocol, why are we failing to reproduce it with other clients? > Pre version 16, we worked just fine. There was a change in 16. Well, as Thomas already mentioned, rmtree() used to work by reading the directory and making an in-memory copy of the whole file list before it started to delete anything. But reverting that wouldn't fix all of the other problems that we now know exist. (In other words, "we worked just fine" only for small values of "fine".) regards, tom lane
On Fri, Jan 3, 2025 at 10:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > (To be clear: if this is how FreeBSD acts, then I'm afraid we already > do have such bugs. The rmtree case is just easier to observe than a > missed fsync.) For what little it's worth, I'm not quite convinced yet that FreeBSD's client isn't more broken than it needs to be. Lots of systems I looked at have stable cookies in practice (as NFS 4 recommends), including the one used in this report, so it seems like a more basic problem. At the risk of being wrong on the internet, I don't see any fundamental reason why a readdir() scan can't have no-skip, no-duplicate, no-fail semantics for stable-cookie, no-verification servers. And this case works perfectly with a couple of other NFS client implementations that you and I tried. As for systems that don't have stable cookies, well then they should implement the cookie verification scheme and AFAICS the readdir() scan should then fail if it can't recover, or it would expose isolation anomalies offensive to database hacker sensibilities. I think it should be theoretically possible to recover in some happy cases (maybe: when enough state is still around in cache to deduplicate). But that shouldn't be necessary on filers using eg ZFS or BTRFS whose cookies are intended to be stable. A server could also do MVCC magic, and from a quick google, I guess NetApp WAFL might do that as it has the concept of "READDIR expired", which smells a bit like ORA-01555: snapshot too old.
Thomas Munro <thomas.munro@gmail.com> writes: > For what little it's worth, I'm not quite convinced yet that FreeBSD's > client isn't more broken than it needs to be. I'm suspicious of that too. The wireshark trace you described is hard to read any other way than that FreeBSD went out of its way to deliver incorrect information. I'm prepared to believe that we can't work correctly on NFS servers that don't do the stable-cookie thing, but why isn't it succeeding on ones that do? regards, tom lane
I wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> For what little it's worth, I'm not quite convinced yet that FreeBSD's
>> client isn't more broken than it needs to be.

> I'm suspicious of that too.

I poked at this a little further. I made the attached stand-alone test case (you don't need any more than "cc -o rmtree rmtree.c" to build it, then point the script at some NFS-mounted directory). This fails with my NAS at least as far back as FreeBSD 11.0. I also tried it on NetBSD 9.2 which seems fine.

			regards, tom lane

#! /bin/sh

set -e

TESTDIR="$1"

mkdir "$TESTDIR"

i=0
while [ $i -lt 1000 ]
do
    touch "$TESTDIR/$i"
    i=`expr $i + 1`
done

./rmtree "$TESTDIR"

/*-------------------------------------------------------------------------
 *
 * rmtree.c
 *
 * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *	  src/common/rmtree.c
 *
 *-------------------------------------------------------------------------
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h>

typedef enum PGFileType
{
	PGFILETYPE_ERROR,
	PGFILETYPE_UNKNOWN,
	PGFILETYPE_REG,
	PGFILETYPE_DIR,
	PGFILETYPE_LNK,
} PGFileType;

static void *
palloc(size_t size)
{
	void	   *tmp;

	/* Avoid unportable behavior of malloc(0) */
	if (size == 0)
		size = 1;
	tmp = malloc(size);
	if (tmp == NULL)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return tmp;
}

static void *
repalloc(void *ptr, size_t size)
{
	void	   *tmp;

	/* Avoid unportable behavior of realloc(NULL, 0) */
	if (ptr == NULL && size == 0)
		size = 1;
	tmp = realloc(ptr, size);
	if (!tmp)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return tmp;
}

static char *
pstrdup(const char *in)
{
	char	   *tmp;

	if (!in)
	{
		fprintf(stderr, "cannot duplicate null pointer (internal error)\n");
		exit(1);
	}
	tmp = strdup(in);
	if (!tmp)
	{
		fprintf(stderr, "out of memory\n");
		exit(1);
	}
	return tmp;
}

/*
 * Return the type of a directory entry.
 */
static PGFileType
get_dirent_type(const char *path,
				const struct dirent *de,
				bool look_through_symlinks)
{
	PGFileType	result;

	/*
	 * Some systems tell us the type directly in the dirent struct, but
	 * that's a BSD and Linux extension not required by POSIX.  Even when the
	 * interface is present, sometimes the type is unknown, depending on the
	 * filesystem.
	 */
#if defined(DT_REG) && defined(DT_DIR) && defined(DT_LNK)
	if (de->d_type == DT_REG)
		result = PGFILETYPE_REG;
	else if (de->d_type == DT_DIR)
		result = PGFILETYPE_DIR;
	else if (de->d_type == DT_LNK && !look_through_symlinks)
		result = PGFILETYPE_LNK;
	else
		result = PGFILETYPE_UNKNOWN;
#else
	result = PGFILETYPE_UNKNOWN;
#endif

	if (result == PGFILETYPE_UNKNOWN)
	{
		struct stat fst;
		int			sret;

		if (look_through_symlinks)
			sret = stat(path, &fst);
		else
			sret = lstat(path, &fst);

		if (sret < 0)
		{
			result = PGFILETYPE_ERROR;
			fprintf(stderr, "could not stat file \"%s\": %m\n", path);
		}
		else if (S_ISREG(fst.st_mode))
			result = PGFILETYPE_REG;
		else if (S_ISDIR(fst.st_mode))
			result = PGFILETYPE_DIR;
		else if (S_ISLNK(fst.st_mode))
			result = PGFILETYPE_LNK;
	}

	return result;
}

/*
 *	rmtree
 *
 *	Delete a directory tree recursively.
 *	Assumes path points to a valid directory.
 *	Deletes everything under path.
 *	If rmtopdir is true deletes the directory too.
 *	Returns true if successful, false if there was any problem.
 *	(The details of the problem are reported already, so caller
 *	doesn't really have to say anything more, but most do.)
 */
static bool
rmtree(const char *path, bool rmtopdir)
{
	char		pathbuf[8192];
	DIR		   *dir;
	struct dirent *de;
	bool		result = true;
	size_t		dirnames_size = 0;
	size_t		dirnames_capacity = 8;
	char	  **dirnames;

	dir = opendir(path);
	if (dir == NULL)
	{
		fprintf(stderr, "could not open directory \"%s\": %m\n", path);
		return false;
	}

	dirnames = (char **) palloc(sizeof(char *) * dirnames_capacity);

	while (errno = 0, (de = readdir(dir)))
	{
		if (strcmp(de->d_name, ".") == 0 ||
			strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(pathbuf, sizeof(pathbuf), "%s/%s", path, de->d_name);
		switch (get_dirent_type(pathbuf, de, false))
		{
			case PGFILETYPE_ERROR:
				/* already logged, press on */
				break;
			case PGFILETYPE_DIR:

				/*
				 * Defer recursion until after we've closed this directory,
				 * to avoid using more than one file descriptor at a time.
				 */
				if (dirnames_size == dirnames_capacity)
				{
					dirnames = repalloc(dirnames,
										sizeof(char *) * dirnames_capacity * 2);
					dirnames_capacity *= 2;
				}
				dirnames[dirnames_size++] = pstrdup(pathbuf);
				break;
			default:
				if (unlink(pathbuf) != 0 && errno != ENOENT)
				{
					fprintf(stderr, "could not remove file \"%s\": %m\n",
							pathbuf);
					result = false;
				}
				break;
		}
	}

	if (errno != 0)
	{
		fprintf(stderr, "could not read directory \"%s\": %m\n", path);
		result = false;
	}

	closedir(dir);

	/* Now recurse into the subdirectories we found. */
	for (size_t i = 0; i < dirnames_size; ++i)
	{
		if (!rmtree(dirnames[i], true))
			result = false;
		free(dirnames[i]);
	}

	if (rmtopdir)
	{
		if (rmdir(path) != 0)
		{
			fprintf(stderr, "could not remove directory \"%s\": %m\n", path);
			result = false;
		}
	}

	free(dirnames);

	return result;
}

int
main(int argc, char **argv)
{
	if (argc != 2)
	{
		fprintf(stderr, "usage: %s target-directory\n", argv[0]);
		exit(1);
	}
	if (!rmtree(argv[1], true))
	{
		fprintf(stderr, "rmtree failed\n");
		exit(1);
	}
	return 0;
}
On Thu, Jan 2, 2025 at 4:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Are you prepared to buy into "we will make every bit of code that uses > readdir() proof against arbitrary lies from readdir()"? I'm not: > I cannot see how to do that in any but the simplest cases -- rmtree() > being about the simplest. Even if we had a bulletproof design in > mind, it's pretty hard to believe that future patches would apply > it every time. I think it's inevitable that we'd have bugs like > "sometimes not every file gets fsync'd", which would be impossible > to find until someone's data gets eaten, and maybe still impossible > to find afterwards. No. We obviously cannot guarantee correct output in the face of arbitrarily incorrect input. What I'm doing is being practical about how the real world works. People use NFS. This is not that different from the fsync debacle a few years ago -- the fsync semantics on Linux did (and AFAIK, still do) make it impossible to write correct programs, and then on top of that our code had bugs in it, too. We didn't just say "oh, well, I guess you shouldn't use Linux". We tried to make our code as robust as it could be in the face of kernel code that behaved in a manner that was fairly ridiculous relative to our needs. This case doesn't seem that different to me. I certainly think it's reasonable to debate whether any particular mitigations that someone might propose adding to our code are unreasonably difficult. But the complaint as presented is "the code you used to have worked better than the code you have now," which doesn't necessarily support the idea that the requested changes are unreasonable. Of course, if FreeBSD is buggier than it needs to be, then I hope that also gets fixed. But even if it doesn't, surely the right thing to do would be to clearly document that we don't work *with NFS on FreeBSD* rather than that we don't work with *any NFS*. 
Unless of course we really (and unavoidably) don't work with any NFS, and then we should document that, with reasons. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Jan 3, 2025 at 8:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> We tried to make our code as robust as it could be in the face of
> kernel code that behaved in a manner that was fairly ridiculous
> relative to our needs. This case doesn't seem that different to me.

+1. Seems a shame that freebsd chooses such "optimizations", but making our code do various workarounds and jump through hoops to support various OS quirks (hello Win32 fans!) seems a burden we agreed to take on a long time ago.

Cheers,
Greg
On 1/3/25 09:47, Greg Sabino Mullane wrote: > On Fri, Jan 3, 2025 at 8:33 AM Robert Haas <robertmhaas@gmail.com > <mailto:robertmhaas@gmail.com>> wrote: > > We tried to make our code as robust as it could be in the face of > kernel code that behaved in a manner that was fairly ridiculous > relative to our needs. This case doesn't seem that different to me. > > > +1. Seems a shame that freebsd chooses such "optimizations", but making > our code do various workarounds and jump through hoops to support > various OS quirks (hello Win32 fans!) seems a burden we agreed to take > on a long time ago. FWIW, we observed this issue in pgBackRest a few years ago -- as you can imagine we do a lot of scanning so readdir gets a real workout. We had one issue reported [1] involving Alpine Linux and CIFS and another [2] with SLES and NFS. We also had at least one internal report that involved RHEL and a proprietary storage appliance. I'm not certain if we received any reports for FreeBSD but it kind of rings a bell. Over some time and various reports it seemed that any storage was potentially a problem. I resisted the notion that we would have to work around something that seemed to be an obvious kernel bug but in the end I capitulated. We fixed this by making a snapshot of each directory before performing any operations on that directory (as has been suggested upthread). One advantage we have is that our storage is very centralized since we deal with a number of storage types so there are no readdirs in the general code base. It was still a pretty major patch [3] but a lot of it was removing the callbacks that we had used previously and adding optimizations to reduce memory consumption. One more thing to note -- we are still assuming that Postgres is running on storage that is not subject to this issue. Even with our new methodology if Postgres is deleting files while we are trying to build a backup manifest that could cause us (and base backup) problems. 
The only solution I came up with for that problem was to keep reading the directory until we get two snapshots that match -- not very attractive but probably workable for pgBackRest. I doubt the same could be said for Postgres. Regards, -David --- [1] https://github.com/pgbackrest/pgbackrest/issues/1754 [2] https://github.com/pgbackrest/pgbackrest/issues/1423 [3] https://github.com/pgbackrest/pgbackrest/commit/75623d45
On 03.01.25 02:58, Tom Lane wrote: > I wrote: >> Thomas Munro <thomas.munro@gmail.com> writes: >>> For what little it's worth, I'm not quite convinced yet that FreeBSD's >>> client isn't more broken than it needs to be. > >> I'm suspicious of that too. > > I poked at this a little further. I made the attached stand-alone > test case (you don't need any more than "cc -o rmtree rmtree.c" > to build it, then point the script at some NFS-mounted directory). > This fails with my NAS at least as far back as FreeBSD 11.0. > I also tried it on NetBSD 9.2 which seems fine. If you use some GUI file manager, point at a directory, and select "remove this directory with everything in it", what does it do internally? Surely it runs a readdir() loop and unlinks the files as it gets to them? Or does it indeed slurp the whole directory tree into memory before starting the deletion? It seems like we're probably not the first to have this use case, so there should be some information or problem reports about this somewhere out there. Or if not, then I'd be tempted to think, someone broke it recently, even if the NFS hacker argues that it cannot possibly work and never should have worked.
Peter Eisentraut <peter@eisentraut.org> writes: > On 03.01.25 02:58, Tom Lane wrote: >> I poked at this a little further. I made the attached stand-alone >> test case (you don't need any more than "cc -o rmtree rmtree.c" >> to build it, then point the script at some NFS-mounted directory). >> This fails with my NAS at least as far back as FreeBSD 11.0. >> I also tried it on NetBSD 9.2 which seems fine. > If you use some GUI file manager, point at a directory, and select > "remove this directory with everything in it", what does it do > internally? Surely it runs a readdir() loop and unlinks the files as it > gets to them? Or does it indeed slurp the whole directory tree into > memory before starting the deletion? One thing I noticed while testing yesterday is that "rm -rf foo" worked even in cases where "rmtree foo" didn't. I did not look into FreeBSD's rm code, but I'll bet it has the sort of retry logic that was recommended to us upthread. I agree with your point though that it's hard to believe that everyone does that in every case where it would matter. As for pre-existing bug reports, I found https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=57696 but there's basically no new information there, other than the not-so-surprising fact that renaming directory entries triggers the failure as efficiently as unlinking does. regards, tom lane
On Sat, Jan 4, 2025 at 7:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > One thing I noticed while testing yesterday is that "rm -rf foo" > worked even in cases where "rmtree foo" didn't. I did not look > into FreeBSD's rm code, but I'll bet it has the sort of retry > logic that was recommended to us upthread. It punts the directory scan to the fts_read() etc functions (directory traversal helpers widely inherited from ancestral BSD but not in any standard), which buffer the scan and probably hide this issue: https://github.com/freebsd/freebsd-src/blob/main/bin/rm/rm.c https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/fts.c I doubt that hides all potential problems though, if I have understood the vague outline of this bug correctly: perhaps if you ran a large enough rm -r, and you unlinked a file concurrently with that loop, you could break it, that is, cause it to skip innocent files other than the one you unlinked? glibc also has an implementation of the BSD fts functions, and uses them in its rm: https://github.com/coreutils/coreutils/blob/master/src/rm.c https://github.com/coreutils/coreutils/blob/master/src/remove.c > As for pre-existing bug reports, I found > > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=57696 > > but there's basically no new information there, other than > the not-so-surprising fact that renaming directory entries > triggers the failure as efficiently as unlinking does. FWIW I am discussing this off-list with Rick, I *think* we can distinguish between "gripes about NFS that we can't do much about" and "extra thing going wrong here". The fog of superstition around NFS is thick because so many investigations end at a vendor boundary/black box, but here we can understand the cookie scheme and trace through all the layers here...
Thomas Munro <thomas.munro@gmail.com> writes: > I doubt that hides all potential problems though, if I have understood > the vague outline of this bug correctly: perhaps if you ran large > enough rm -r, and you unlinked a file concurrently with that loop, you > could break it, that is, cause it to skip innocent files other than > the one you unlinked? Yeah. The thing that makes this untenably bad for us is that it's not only your own process's actions that can break readdir(), it's the actions of some unrelated process. So even "buffer the whole directory contents" isn't a reliable fix, because someone else could unlink a file while you're reading the directory. One way to "fix" this from our side is to institute some system-wide interlock that prevents file deletions (and renames, and maybe even creations?) while any process is executing a readdir loop. That's several steps beyond awful from a performance standpoint, and what's worse is that it's still not reliable. It only takes one bit of code that isn't on board with the locking protocol to put you right back at square one. It might not even be code that's part of Postgres proper, so long as it has access to PGDATA. As an example: it'd be unsafe to modify postgresql.conf with emacs, or any other editor that likes to make a backup file. So IMV that's no fix at all. The only other thing I can think of is to read and buffer the whole directory (no matter how large), and then do it *again*, and repeat till we get the same results twice in a row. That's likewise just horrid from a performance standpoint. Worse, I'm not entirely convinced that it's a 100% fix: it seems to rely on rather debatable assumptions about the possible consequences of concurrent directory mods. > FWIW I am discussing this off-list with Rick, I *think* we can > distinguish between "gripes abouts NFS that we can't do much about" > and "extra thing going wrong here". 
> The fog of superstition around > NFS is thick because so many investigations end at a vendor > boundary/black box, but here we can understand the cookie scheme and > trace through all the layers here... I wouldn't have any problem with saying that we don't support NFS implementations that don't have stable cookies. But so far I haven't found any supported platform except FreeBSD that fails the rmtree test against my Synology NAS. I think the circumstantial evidence that FreeBSD is doing something wrong, or wronger than necessary, is strong. regards, tom lane
I wrote: > I wouldn't have any problem with saying that we don't support NFS > implementations that don't have stable cookies. But so far I haven't > found any supported platform except FreeBSD that fails the rmtree test > against my Synology NAS. To expand on that: I've now found that Linux, macOS, NetBSD, OpenBSD, and illumos/OpenIndiana[1] pass that test. I don't believe that Windows does NFS. That means that FreeBSD is the only one of our supported platforms that fails. I think we should simply document that NFS on FreeBSD is broken, and await somebody getting annoyed enough to fix that brokenness. regards, tom lane [1] Getting OpenIndiana to work under qemu is not a task for those easily discouraged.
On Sat, Jan 4, 2025 at 5:48 AM David Steele <david@pgbackrest.org> wrote: > We had one issue reported [1] involving Alpine Linux and CIFS and Not directly relevant for pgbackrest probably, but I noticed that Alpine comes up in a few reports of failing rm -r on CIFS. I think it might be because BSD and GNU rm use fts to buffer pathnames in user space (narrow window), while Alpine uses busybox rm, which has a classic readdir()/unlink() loop: https://github.com/brgl/busybox/blob/master/coreutils/rm.c https://github.com/brgl/busybox/blob/master/libbb/remove_file.c As for CIFS, there are lots of reports of this sort of thing from Linux CIFS clients. I am suspicious of the 32-bit monotonic resume_key apparently being used to seek to a starting position. I don't plan to investigate myself, but ... is that even trying to avoid skips and duplicates?
On Sat, Jan 4, 2025 at 3:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > To expand on that: I've now found that Linux, macOS, NetBSD, OpenBSD, > and illumos/OpenIndiana[1] pass that test. I don't believe that Windows > does NFS. That means that FreeBSD is the only one of our supported > platforms that fails. I think we should simply document that NFS on > FreeBSD is broken, and await somebody getting annoyed enough to > fix that brokenness. Yeah, that seems like very strong evidence against FreeBSD, but I think Thomas Munro's point about CIFS is worth considering. That is rather widely used, and if the same workaround would help both that and FreeBSD's NFS, we might want to adopt it even if it's not a complete fix. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > Yeah, that seems like very strong evidence against FreeBSD, but I > think Thomas Munro's point about CIFS is worth considering. That is > rather widely used, and if the same workarounds would help both that > and FreeBSD's NFS, we might want to adopt it even if it's not a > complete fix. TBH, I am happy that PG is now failing in a fairly visible way on filesystems that are broken in this fashion. I think that is better than silent data corruption in obscure circumstances, which is where we were before and would be again if we band-aid rmtree() and do nothing else. Nor do I think it's worth the effort to try to become fully bulletproof on the point. I think we should document that CIFS is unsupported. The docs could say something like: Storing databases on filesystems with unreliable readdir() is not supported and can lead to data corruption. If you observe warnings like "directory is not empty" when trying to drop a database, that is strong evidence that the filesystem has unreliable readdir(). Filesystems known to have this problem include NFS on FreeBSD and CIFS on all platforms. We could s/FreeBSD/older FreeBSD/ if something comes of Thomas' efforts to fix that situation. regards, tom lane
On 1/4/25 11:07, Thomas Munro wrote: > On Sat, Jan 4, 2025 at 5:48 AM David Steele <david@pgbackrest.org> wrote: >> We had one issue reported [1] involving Alpine Linux and CIFS and > > Not directly relevant for pgbackrest probably, but I noticed that > Alpine comes up in a few reports of failing rm -r on CIFS. I think it > might be because BSD and GNU rm use fts to buffer pathnames in user > space (narrow window), while Alpine uses busybox rm which has a > classic readdir()/unlink() loop: Yeah, this doesn't affect pgBackRest because we have our own rmtree that uses snapshots (for the last few years, at least). > As for CIFS, there are lots of reports of this sort of thing from > Linux CIFS clients. There may be users running Postgres on CIFS but my guess is that is rare -- at least I have never seen anyone doing it. I'm more concerned about the report we saw on SUSE/NFS [1]. If that report is accurate it indicates this may not be something we can just document and move on from -- unless we are willing to entirely drop support for NFS. Regards, -David [1] https://github.com/pgbackrest/pgbackrest/issues/1423
David Steele <david@pgbackrest.org> writes: > On 1/4/25 11:07, Thomas Munro wrote: >> As for CIFS, there are lots of reports of this sort of thing from >> Linux CIFS clients. > There may be users running Postgres on CIFS but my guess is that is rare > -- at least I have never seen anyone doing it. It'd be news to me too. I wondered if I could test it locally, but while my NAS knows half a dozen such protocols it's never heard of CIFS. > I'm more concerned about the report we saw on SUSE/NFS [1]. If that > report is accurate it indicates this may not be something we can just > document and move on from -- unless we are willing to entirely drop > support for NFS. > [1] https://github.com/pgbackrest/pgbackrest/issues/1423 I installed an up-to-date OpenSUSE image (Leap 15.6) and it passes my "rmtree" test just fine with my NAS. The report you cite doesn't have any details on what the NFS server was, but I'd be inclined to guess that that server's filesystem lacked support for stable NFS cookies. regards, tom lane
On 1/6/25 19:08, Tom Lane wrote: > David Steele <david@pgbackrest.org> writes: >> On 1/4/25 11:07, Thomas Munro wrote: >>> As for CIFS, there are lots of reports of this sort of thing from >>> Linux CIFS clients. > >> There may be users running Postgres on CIFS but my guess is that is rare >> -- at least I have never seen anyone doing it. > > It'd be news to me too. I wondered if I could test it locally, but > while my NAS knows half a dozen such protocols it's never heard of > CIFS. You may also know it as SMB or Samba. pgBackRest skips directory fsyncs if the repository type is set to CIFS so I think we'd know if anybody was running on CIFS as I'm fairly certain a directory fsync will return a hard error. That may be implementation dependent, though. >> I'm more concerned about the report we saw on SUSE/NFS [1]. If that >> report is accurate it indicates this may not be something we can just >> document and move on from -- unless we are willing to entirely drop >> support for NFS. >> [1] https://github.com/pgbackrest/pgbackrest/issues/1423 > > I installed an up-to-date OpenSUSE image (Leap 15.6) and it passes > my "rmtree" test just fine with my NAS. The report you cite > doesn't have any details on what the NFS server was, but I'd be > inclined to guess that that server's filesystem lacked support > for stable NFS cookies. The internal report we received might have had a similar cause. Sure seems like a minefield for any user trying to figure out if their setup is compliant, though. In many setups (especially production) a drop database is rare. Regards, -David
On Tue, Jan 7, 2025 at 01:03:05AM +0000, David Steele wrote: > > > I'm more concerned about the report we saw on SUSE/NFS [1]. If that > > > report is accurate it indicates this may not be something we can just > > > document and move on from -- unless we are willing to entirely drop > > > support for NFS. > > > [1] https://github.com/pgbackrest/pgbackrest/issues/1423 > > > > I installed an up-to-date OpenSUSE image (Leap 15.6) and it passes > > my "rmtree" test just fine with my NAS. The report you cite > > doesn't have any details on what the NFS server was, but I'd be > > inclined to guess that that server's filesystem lacked support > > for stable NFS cookies. > > The internal report we received might have had a similar cause. Sure seems > like a minefield for any user trying to figure out if their setup is > compliant, though. In many setups (especially production) a drop database is > rare. Will people now always get a clear error on failure? Crazy idea, but could we have initdb or postmaster start test this? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Do not let urgent matters crowd out time for investment in the future.
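A probe along the lines Bruce suggests could look like the sketch below. This is purely hypothetical (nothing like it exists in initdb; readdir_probe and PROBE_FILES are invented names): create a known number of files in a scratch directory on the target filesystem, run the readdir()/unlink() loop, and verify every entry was seen and the directory really emptied.

```c
/*
 * Hypothetical filesystem probe (no such check exists in initdb):
 * create PROBE_FILES files in a scratch directory, run a classic
 * readdir()/unlink() loop, and verify that every file was seen and
 * the directory really emptied.  A well-behaved filesystem passes;
 * a client with unstable directory cookies may skip entries and
 * fail the final rmdir().
 */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PROBE_FILES 200			/* enough to span several readdir batches */

/* Returns 1 if the loop behaved, 0 if entries were skipped. */
static int
readdir_probe(const char *scratch)
{
	char		buf[4096];
	int			seen = 0;

	for (int i = 0; i < PROBE_FILES; i++)
	{
		FILE	   *fp;

		snprintf(buf, sizeof(buf), "%s/probe.%d", scratch, i);
		if ((fp = fopen(buf, "w")) == NULL)
			return 0;
		fclose(fp);
	}

	DIR		   *dir = opendir(scratch);
	struct dirent *de;

	if (dir == NULL)
		return 0;
	while ((de = readdir(dir)) != NULL)
	{
		if (strncmp(de->d_name, "probe.", 6) != 0)
			continue;
		snprintf(buf, sizeof(buf), "%s/%s", scratch, de->d_name);
		if (unlink(buf) == 0)
			seen++;
	}
	closedir(dir);
	/* All files seen, and the directory is now removable? */
	return seen == PROBE_FILES && rmdir(scratch) == 0;
}
```

One caveat with any such probe: the bug depends on server-side behavior and directory size, so a passing probe at initdb time would not prove the setup safe, only that this particular pattern happened to survive.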