Thread: Fwd: Re: A new look at old NFS readdir() problems?

Fwd: Re: A new look at old NFS readdir() problems?

From
Larry Rosenman
Date:
@Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on 
my issue.


-------- Original Message --------
Subject: Re: A new look at old NFS readdir() problems?
Date: 01/02/2025 10:08 am
From: Rick Macklem <rick.macklem@gmail.com>
To: Thomas Munro <tmunro@freebsd.org>
Cc: Rick Macklem <rmacklem@freebsd.org>, Larry Rosenman <ler@lerctr.org>

On Thu, Jan 2, 2025 at 2:50 AM Thomas Munro <tmunro@freebsd.org> wrote:
> 
> Hi Rick
> CC: ler
> 
> I hope you don't mind me reaching out directly, I just didn't really
> want to spam existing bug reports without sufficient understanding to
> actually help yet...  but I figured I should get in touch and see if
> you have any clues or words of warning, since you've worked on so much
> of the NFS code.  I'm a minor FBSD contributor and interested in file
> systems, but not knowledgeable about NFS; I run into/debug/report a
> lot of file system bugs on a lot of systems in my day job on
> databases.  I'm interested to see if I can help with this problem.
> Existing ancient report and interesting email:
> 
> https://lists.freebsd.org/pipermail/freebsd-fs/2014-October/020155.html
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=57696
> 
> What we ran into is not the "bad cookie" state, which doesn't really
> seem to be recoverable in general, from what I understand (though the
> FreeBSD code apparently would try, huh).  It's a simple case where the
> NFS client requests a whole directory with a large READDIR request,
> and then tries to unlink all the files in a traditional
> while-readdir()-unlink() loop that works on other systems.
In general, NFS is not a POSIX compliant file system, due to its protocol
design. The above is one example. The only "safe" way is to opendir() or
rewinddir() after every removal.

The above usually works (and always worked for UFS long ago) because the
directory offset cookies for subsequent entries in the directory after the
unlinked entry happened to "still be valid". That is no longer true for
FreeBSD's UFS nor for many other file systems that can be exported.

If the client reads the entire directory in one READDIR, then it is fine,
since it has no need for the directory offset cookies. However, there is
a limit to how much a single READDIR can do (these days for NFSv4.1/4.2,
it could be raised to just over 1Mbyte; however, FreeBSD limits it to 8K
at the moment).

Another way to work around the problem is to read the entire directory
into the client via READDIRs before starting to do the unlinks.
The opendir()/readdir() code in libc could be hacked to do that,
but I have never tried to push such a patch into FreeBSD.
(It would be limited by how much memory can be malloc()'d, which is
pretty generous compared to even large directories with tens of
thousands of entries.)

The above is true for all versions of NFS up to NFSv4.2, which is the
current one, and unless some future version of NFS does READDIR
differently (I won't live long enough to see this;-), it will always
be the case.

If my comment above was not clear, the following encoding is the "safe"
way to remove all entries in a directory.

do {
        dir = opendir("X");
        dp = readdir(dir);
        if (dp != NULL)
                unlink(dp->d_name);
        closedir(dir);
} while (dp != NULL);
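
(A fleshed-out version of that fragment, for anyone who wants to try it;
the "." and ".." check, the full-path handling, and the error returns are
illustrative additions that the sketch above leaves implicit. Each removal
costs a fresh opendir() and typically another READDIR round trip, so it is
expensive, but it never resumes a scan in a directory it has modified.)

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Remove every entry in "dirpath", restarting the scan after each
 * unlink() so that no stale directory offset cookie is ever reused.
 */
static int
remove_all_restarting(const char *dirpath)
{
    struct dirent *dp;
    char        path[4096];
    DIR        *dir;

    do
    {
        dir = opendir(dirpath);
        if (dir == NULL)
            return -1;
        /* skip "." and ".." so the loop terminates once the directory is empty */
        while ((dp = readdir(dir)) != NULL &&
               (strcmp(dp->d_name, ".") == 0 || strcmp(dp->d_name, "..") == 0))
            ;
        if (dp != NULL)
        {
            snprintf(path, sizeof(path), "%s/%s", dirpath, dp->d_name);
            if (unlink(path) != 0)
            {
                /* stop rather than loop forever on an entry we cannot remove */
                closedir(dir);
                return -1;
            }
        }
        closedir(dir);
    } while (dp != NULL);

    return 0;
}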

In theory, the directory_offset_cookie was supposed to handle this, but
it has never worked correctly, for a couple of reasons.
1 - RFC1813 (the NFSv3 one) did not describe the cookie verifier
     correctly. It should only change when cookies for extant entries
     change. The description suggested it should change whenever an
     entry is deleted, since that cookie is no longer valid.
2 - #1 only works if directory offset cookies for other entries in the
     directory do not change when an entry is deleted. This used to be
     the case for UFS, but was broken in FreeBSD when a commit many
     years ago optimized ufs_readdir() to compress out invalid entries.
     Doing this changes the directory offset cookies every time an
     entry is deleted at the beginning of a directory block.

rick
>  On FreeBSD
> it seems to clobber its own directory cache, make extra unnecessary
> READDIR requests, and skip some of the files.  Or maybe I have no idea
> what's going on and this is a hopelessly naive question and mission
> :-)
> 
> Here's what we learned so far starting from Larry's report:
> 
> https://www.postgresql.org/message-id/flat/04f95c3c13d4a9db87b3ac082a9f4877%40lerctr.org
> 
> Note that this issue has nothing to do with "bad cookie" errors (I
> doubt the server I'm talking to even implements that -- instead it
> tries to have cookies that are persistent/stable).
> 
> Also, while looking into this and initially suspecting cookie
> stability bugs (incorrectly), I checked a bunch of local file systems
> to understand how their cookies work, and I think I found a related
> problem when FreeBSD exports UFS, too.  I didn't repro this with NFS
> but it's clearly visible from d_off locally with certain touch, rm
> sequences.  First, let me state what I think the cookie should be
> trying to achieve, on a system that doesn't implement "bad cookie" but
> instead wants cookies that are persistent/always valid:  if you make a
> series of READDIR requests using the cookie from the final entry of
> the previous response, it should be impossible to miss any entry that
> existed before your first call to readdir(), and impossible to see any
> entry twice.  It is left undefined whether entries created after that
> time are visible, since anything else would require unbounded time or
> space via locks or multi-version magic (= isolation problems from
> database-land).
> 
> Going back to the early 80s, Sun UFS looks good (based on illumos
> source code) because it doesn't seem to move entries after they are
> created.  That must have been the only file system when they invented
> VFS and NFS.  Various other systems since have been either complex but
> apparently good (ZFS/ZAP cursors can tolerate up to 2^16 hash
> collisions which I think we can call statistically impossible, XFS
> claims to be completely stable though I didn't understand fully why,
> BTRFS assigns incrementing numbers that will hopefully not wrap, ...),
> or nearly-good-enough-but-ugh (ext4 uses hashes like ZFS but
> apparently fails with ELOOP on hash collisions?).  I was expecting
> FreeBSD UFS to be like Sun UFS but I don't think it is!  In the UFS
> code since at least 4.3BSD (but apparently not in the Sun version,
> forked before or removed later?), inserting a new entry can compact a
> directory page, which moves the offset of a directory entry lower.
> AFAICS we can't move an entry lower, or we risk skipping it in NFS
> readdir(), and we can't move it higher, or we risk double-reporting it
> in readdir().  Or am I missing something?
> 
> Thanks for reading and happy new year,
> 
> Thomas Munro

-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Tom Lane
Date:
Larry Rosenman <ler@lerctr.org> writes:
> @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on 
> my issue.

Thanks for reaching out to him.  So if I'm reading this correctly,
there's little point in filing a FreeBSD bug because it'll be
dismissed as unfixable.

This leaves us in rather a nasty position.  Sure, we could rewrite
rmtree() as Thomas suggested upthread, but I'm still of the opinion
that that's smearing lipstick on a pig.  rmtree() is the least of
our worries: it doesn't need to expect that anybody else will be
modifying the target directory, plus it can easily restart its scan
without complicated bookkeeping.  I doubt we can make such an
assumption for all our uses of readdir(), or that it's okay to
miss or double-process files in every one of them.

I'm still of the opinion that the best thing to do is disclaim
safety of storing a database on NFS.

            regards, tom lane



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Bruce Momjian
Date:
On Thu, Jan  2, 2025 at 03:48:53PM -0500, Tom Lane wrote:
> Larry Rosenman <ler@lerctr.org> writes:
> > @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on 
> > my issue.
> 
> Thanks for reaching out to him.  So if I'm reading this correctly,
> there's little point in filing a FreeBSD bug because it'll be
> dismissed as unfixable.
> 
> This leaves us in rather a nasty position.  Sure, we could rewrite
> rmtree() as Thomas suggested upthread, but I'm still of the opinion
> that that's smearing lipstick on a pig.  rmtree() is the least of
> our worries: it doesn't need to expect that anybody else will be
> modifying the target directory, plus it can easily restart its scan
> without complicated bookkeeping.  I doubt we can make such an
> assumption for all our uses of readdir(), or that it's okay to
> miss or double-process files in every one of them.
> 
> I'm still of the opinion that the best thing to do is disclaim
> safety of storing a database on NFS.

It would be nice to have some details on the docs about why NFS can
cause problems so we have something to point to when people ask.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.





Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Larry Rosenman
Date:
On 01/02/2025 2:50 pm, Bruce Momjian wrote:
> On Thu, Jan  2, 2025 at 03:48:53PM -0500, Tom Lane wrote:
>> Larry Rosenman <ler@lerctr.org> writes:
>> > @Tom Lane: This is what Rick Macklem (NFS dev on FreeBSD) has to say on
>> > my issue.
>> 
>> Thanks for reaching out to him.  So if I'm reading this correctly,
>> there's little point in filing a FreeBSD bug because it'll be
>> dismissed as unfixable.
>> 
>> This leaves us in rather a nasty position.  Sure, we could rewrite
>> rmtree() as Thomas suggested upthread, but I'm still of the opinion
>> that that's smearing lipstick on a pig.  rmtree() is the least of
>> our worries: it doesn't need to expect that anybody else will be
>> modifying the target directory, plus it can easily restart its scan
>> without complicated bookkeeping.  I doubt we can make such an
>> assumption for all our uses of readdir(), or that it's okay to
>> miss or double-process files in every one of them.
>> 
>> I'm still of the opinion that the best thing to do is disclaim
>> safety of storing a database on NFS.
> 
> It would be nice to have some details on the docs about why NFS can
> cause problems so we have something to point to when people ask.

What about doing what Rick suggests?
do {
        dir = opendir("X");
        dp = readdir(dir);
        if (dp != NULL)
                unlink(dp->d_name);
        closedir(dir);
} while (dp != NULL);
?
-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Robert Haas
Date:
On Thu, Jan 2, 2025 at 3:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm still of the opinion that the best thing to do is disclaim
> safety of storing a database on NFS.

If we're going to disclaim support for NFS, it would certainly be
better to do that clearly and with reasons than otherwise. However, I
suspect a substantial number of users will still use NFS and then
blame us when it doesn't work; or alternatively say that our software
sucks because it doesn't work on NFS. Neither of which are very nice
outcomes.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jan 2, 2025 at 3:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm still of the opinion that the best thing to do is disclaim
>> safety of storing a database on NFS.

> If we're going to disclaim support for NFS, it would certainly be
> better to do that clearly and with reasons than otherwise. However, I
> suspect a substantial number of users will still use NFS and then
> blame us when it doesn't work; or alternatively say that our software
> sucks because it doesn't work on NFS. Neither of which are very nice
> outcomes.

Are you prepared to buy into "we will make every bit of code that uses
readdir() proof against arbitrary lies from readdir()"?  I'm not:
I cannot see how to do that in any but the simplest cases -- rmtree()
being about the simplest.  Even if we had a bulletproof design in
mind, it's pretty hard to believe that future patches would apply
it every time.  I think it's inevitable that we'd have bugs like
"sometimes not every file gets fsync'd", which would be impossible
to find until someone's data gets eaten, and maybe still impossible
to find afterwards.

(To be clear: if this is how FreeBSD acts, then I'm afraid we already
do have such bugs.  The rmtree case is just easier to observe than a
missed fsync.)

            regards, tom lane



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Larry Rosenman
Date:
On 01/02/2025 3:42 pm, Thomas Munro wrote:
> On Fri, Jan 3, 2025 at 10:16 AM Larry Rosenman <ler@lerctr.org> wrote:
>> What about doing what Rick suggests?
>> do {
>>         dir = opendir("X");
>>         dp = readdir(dir);
>>         if (dp != NULL)
>>                 unlink(dp->d_name);
>>         closedir(dir);
>> } while (dp != NULL);
>> ?
> 
> That only works for unlink loops where we expect no concurrent
> modifications.  DROP DATABASE qualifies, but we use readdir() on
> directories that might be changing in various other places, including
> backups.  They'd silently fail to do their job, and can't use that
> technique as they aren't unlinking as they go and so they would never
> make progress.  Which leads to the lipstick/pig conclusion: we'd have
> other, worse, buggy behaviour still.
> 
> It might be possible to make our own readdir snapshot thing that
> checks a shmem counter of all links/unlinks before and after and
> retries.  But ugh...

Would it make sense for ONLY drop database to have the above loop?

-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Tom Lane
Date:
Larry Rosenman <ler@lerctr.org> writes:
> Would it make sense for ONLY drop database to have the above loop?

Not really.  We'd just be papering over the most-easily-observable
consequence of readdir's malfeasance.  There'd still be problems
like basebackups omitting files, missed fsyncs potentially leading
to data loss, etc.

I am sort of wondering though why we've not heard reports of this
many times before.  Did FreeBSD recently change something in this
area?  Also, if as they argue it's a fundamental problem in the NFS
protocol, why are we failing to reproduce it with other clients?

            regards, tom lane



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Larry Rosenman
Date:
On 01/02/2025 4:56 pm, Tom Lane wrote:
> Larry Rosenman <ler@lerctr.org> writes:
>> Would it make sense for ONLY drop database to have the above loop?
> 
> Not really.  We'd just be papering over the most-easily-observable
> consequence of readdir's malfeasance.  There'd still be problems
> like basebackups omitting files, missed fsyncs potentially leading
> to data loss, etc.
> 
> I am sort of wondering though why we've not heard reports of this
> many times before.  Did FreeBSD recently change something in this
> area?  Also, if as they argue it's a fundamental problem in the NFS
> protocol, why are we failing to reproduce it with other clients?
> 
>             regards, tom lane
Pre version 16, we worked just fine.  There was a change in 16.

-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Tom Lane
Date:
Larry Rosenman <ler@lerctr.org> writes:
> On 01/02/2025 4:56 pm, Tom Lane wrote:
>> I am sort of wondering though why we've not heard reports of this
>> many times before.  Did FreeBSD recently change something in this
>> area?  Also, if as they argue it's a fundamental problem in the NFS
>> protocol, why are we failing to reproduce it with other clients?

> Pre version 16, we worked just fine.  There was a change in 16.

Well, as Thomas already mentioned, rmtree() used to work by reading
the directory and making an in-memory copy of the whole file list
before it started to delete anything.  But reverting that wouldn't
fix all of the other problems that we now know exist.  (In other
words, "we worked just fine" only for small values of "fine".)

            regards, tom lane



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Thomas Munro
Date:
On Fri, Jan 3, 2025 at 10:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (To be clear: if this is how FreeBSD acts, then I'm afraid we already
> do have such bugs.  The rmtree case is just easier to observe than a
> missed fsync.)

For what little it's worth, I'm not quite convinced yet that FreeBSD's
client isn't more broken than it needs to be.  Lots of systems I
looked at have stable cookies in practice (as NFS 4 recommends),
including the one used in this report, so it seems like a more basic
problem.  At the risk of being wrong on the internet, I don't see any
fundamental reason why a readdir() scan can't have no-skip,
no-duplicate, no-fail semantics for stable-cookie, no-verification
servers.  And this case works perfectly with a couple of other NFS
client implementations that you and I tried.

As for systems that don't have stable cookies, well then they should
implement the cookie verification scheme and AFAICS the readdir() scan
should then fail if it can't recover, or it would expose isolation
anomalies offensive to database hacker sensibilities.  I think it
should be theoretically possible to recover in some happy cases
(maybe: when enough state is still around in cache to deduplicate).
But that shouldn't be necessary on filers using eg ZFS or BTRFS whose
cookies are intended to be stable.  A server could also do MVCC magic,
and from a quick google, I guess NetApp WAFL might do that as it has
the concept of "READDIR expired", which smells a bit like ORA-01555:
snapshot too old.



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> For what little it's worth, I'm not quite convinced yet that FreeBSD's
> client isn't more broken than it needs to be.

I'm suspicious of that too.  The wireshark trace you described is hard
to read any other way than that FreeBSD went out of its way to deliver
incorrect information.  I'm prepared to believe that we can't work
correctly on NFS servers that don't do the stable-cookie thing, but
why isn't it succeeding on ones that do?

            regards, tom lane



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Tom Lane
Date:
I wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> For what little it's worth, I'm not quite convinced yet that FreeBSD's
>> client isn't more broken than it needs to be.

> I'm suspicious of that too.

I poked at this a little further.  I made the attached stand-alone
test case (you don't need any more than "cc -o rmtree rmtree.c"
to build it, then point the script at some NFS-mounted directory).
This fails with my NAS at least as far back as FreeBSD 11.0.
I also tried it on NetBSD 9.2 which seems fine.

            regards, tom lane

#! /bin/sh

set -e

TESTDIR="$1"

mkdir "$TESTDIR"

i=0
while [ $i -lt 1000 ]
do
  touch "$TESTDIR/$i"
  i=`expr $i + 1`
done

./rmtree "$TESTDIR"
/*-------------------------------------------------------------------------
 *
 * rmtree.c
 *
 * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * IDENTIFICATION
 *      src/common/rmtree.c
 *
 *-------------------------------------------------------------------------
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <dirent.h>


typedef enum PGFileType
{
    PGFILETYPE_ERROR,
    PGFILETYPE_UNKNOWN,
    PGFILETYPE_REG,
    PGFILETYPE_DIR,
    PGFILETYPE_LNK,
} PGFileType;


static void *
palloc(size_t size)
{
    void       *tmp;

    /* Avoid unportable behavior of malloc(0) */
    if (size == 0)
        size = 1;
    tmp = malloc(size);
    if (tmp == NULL)
    {
        fprintf(stderr, "out of memory\n");
        exit(1);
    }
    return tmp;
}

static void *
repalloc(void *ptr, size_t size)
{
    void       *tmp;

    /* Avoid unportable behavior of realloc(NULL, 0) */
    if (ptr == NULL && size == 0)
        size = 1;
    tmp = realloc(ptr, size);
    if (!tmp)
    {
        fprintf(stderr, "out of memory\n");
        exit(1);
    }
    return tmp;
}

static char *
pstrdup(const char *in)
{
    char       *tmp;

    if (!in)
    {
        fprintf(stderr,
                "cannot duplicate null pointer (internal error)\n");
        exit(1);
    }
    tmp = strdup(in);
    if (!tmp)
    {
        fprintf(stderr, "out of memory\n");
        exit(1);
    }
    return tmp;
}

/*
 * Return the type of a directory entry.
 */
static PGFileType
get_dirent_type(const char *path,
                const struct dirent *de,
                bool look_through_symlinks)
{
    PGFileType    result;

    /*
     * Some systems tell us the type directly in the dirent struct, but that's
     * a BSD and Linux extension not required by POSIX.  Even when the
     * interface is present, sometimes the type is unknown, depending on the
     * filesystem.
     */
#if defined(DT_REG) && defined(DT_DIR) && defined(DT_LNK)
    if (de->d_type == DT_REG)
        result = PGFILETYPE_REG;
    else if (de->d_type == DT_DIR)
        result = PGFILETYPE_DIR;
    else if (de->d_type == DT_LNK && !look_through_symlinks)
        result = PGFILETYPE_LNK;
    else
        result = PGFILETYPE_UNKNOWN;
#else
    result = PGFILETYPE_UNKNOWN;
#endif

    if (result == PGFILETYPE_UNKNOWN)
    {
        struct stat fst;
        int            sret;


        if (look_through_symlinks)
            sret = stat(path, &fst);
        else
            sret = lstat(path, &fst);

        if (sret < 0)
        {
            result = PGFILETYPE_ERROR;
            fprintf(stderr, "could not stat file \"%s\": %m\n", path);
        }
        else if (S_ISREG(fst.st_mode))
            result = PGFILETYPE_REG;
        else if (S_ISDIR(fst.st_mode))
            result = PGFILETYPE_DIR;
        else if (S_ISLNK(fst.st_mode))
            result = PGFILETYPE_LNK;
    }

    return result;
}


/*
 *    rmtree
 *
 *    Delete a directory tree recursively.
 *    Assumes path points to a valid directory.
 *    Deletes everything under path.
 *    If rmtopdir is true deletes the directory too.
 *    Returns true if successful, false if there was any problem.
 *    (The details of the problem are reported already, so caller
 *    doesn't really have to say anything more, but most do.)
 */
static bool
rmtree(const char *path, bool rmtopdir)
{
    char        pathbuf[8192];
    DIR           *dir;
    struct dirent *de;
    bool        result = true;
    size_t        dirnames_size = 0;
    size_t        dirnames_capacity = 8;
    char      **dirnames;

    dir = opendir(path);
    if (dir == NULL)
    {
        fprintf(stderr, "could not open directory \"%s\": %m\n", path);
        return false;
    }

    dirnames = (char **) palloc(sizeof(char *) * dirnames_capacity);

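    /*
     * Clear errno before each readdir() call: readdir() returns NULL both
     * at end-of-directory and on error, and only a changed errno (checked
     * after the loop) tells us which one happened.
     */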
    while (errno = 0, (de = readdir(dir)))
    {
        if (strcmp(de->d_name, ".") == 0 ||
            strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(pathbuf, sizeof(pathbuf), "%s/%s", path, de->d_name);
        switch (get_dirent_type(pathbuf, de, false))
        {
            case PGFILETYPE_ERROR:
                /* already logged, press on */
                break;
            case PGFILETYPE_DIR:

                /*
                 * Defer recursion until after we've closed this directory, to
                 * avoid using more than one file descriptor at a time.
                 */
                if (dirnames_size == dirnames_capacity)
                {
                    dirnames = repalloc(dirnames,
                                        sizeof(char *) * dirnames_capacity * 2);
                    dirnames_capacity *= 2;
                }
                dirnames[dirnames_size++] = pstrdup(pathbuf);
                break;
            default:
                if (unlink(pathbuf) != 0 && errno != ENOENT)
                {
                    fprintf(stderr, "could not remove file \"%s\": %m\n", pathbuf);
                    result = false;
                }
                break;
        }
    }

    if (errno != 0)
    {
        fprintf(stderr, "could not read directory \"%s\": %m\n", path);
        result = false;
    }

    closedir(dir);

    /* Now recurse into the subdirectories we found. */
    for (size_t i = 0; i < dirnames_size; ++i)
    {
        if (!rmtree(dirnames[i], true))
            result = false;
        free(dirnames[i]);
    }

    if (rmtopdir)
    {
        if (rmdir(path) != 0)
        {
            fprintf(stderr, "could not remove directory \"%s\": %m\n", path);
            result = false;
        }
    }

    free(dirnames);

    return result;
}

int
main(int argc, char **argv)
{
    if (argc != 2)
    {
        fprintf(stderr, "usage: %s target-directory\n", argv[0]);
        exit(1);
    }

    if (!rmtree(argv[1], true))
    {
        fprintf(stderr, "rmtree failed\n");
        exit(1);
    }

    return 0;
}

Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Robert Haas
Date:
On Thu, Jan 2, 2025 at 4:52 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Are you prepared to buy into "we will make every bit of code that uses
> readdir() proof against arbitrary lies from readdir()"?  I'm not:
> I cannot see how to do that in any but the simplest cases -- rmtree()
> being about the simplest.  Even if we had a bulletproof design in
> mind, it's pretty hard to believe that future patches would apply
> it every time.  I think it's inevitable that we'd have bugs like
> "sometimes not every file gets fsync'd", which would be impossible
> to find until someone's data gets eaten, and maybe still impossible
> to find afterwards.

No. We obviously cannot guarantee correct output in the face of
arbitrarily incorrect input. What I'm doing is being practical about
how the real world works. People use NFS. This is not that different
from the fsync debacle a few years ago -- the fsync semantics on Linux
did (and AFAIK, still do) make it impossible to write correct
programs, and then on top of that our code had bugs in it, too. We
didn't just say "oh, well, I guess you shouldn't use Linux". We tried
to make our code as robust as it could be in the face of kernel code
that behaved in a manner that was fairly ridiculous relative to our
needs. This case doesn't seem that different to me.

I certainly think it's reasonable to debate whether any particular
mitigations that someone might propose adding to our code are
unreasonably difficult. But the complaint as presented is "the code
you used to have worked better than the code you have now," which
doesn't necessarily support the idea that the requested changes are
unreasonable. Of course, if FreeBSD is buggier than it needs to be,
then I hope that also gets fixed. But even if it doesn't, surely the
right thing to do would be to clearly document that we don't work
*with NFS on FreeBSD* rather than that we don't work with *any NFS*.
Unless of course we really (and unavoidably) don't work with any NFS,
and then we should document that, with reasons.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Greg Sabino Mullane
Date:
On Fri, Jan 3, 2025 at 8:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> We tried to make our code as robust as it could be in the face of kernel
> code that behaved in a manner that was fairly ridiculous relative to our
> needs. This case doesn't seem that different to me.

+1. Seems a shame that freebsd chooses such "optimizations", but making our code do various workarounds and jump through hoops to support various OS quirks (hello Win32 fans!) seems a burden we agreed to take on a long time ago.

Cheers,
Greg

Re: Fwd: Re: A new look at old NFS readdir() problems?

From
David Steele
Date:
On 1/3/25 09:47, Greg Sabino Mullane wrote:
> On Fri, Jan 3, 2025 at 8:33 AM Robert Haas <robertmhaas@gmail.com 
> <mailto:robertmhaas@gmail.com>> wrote:
> 
>     We tried to make our code as robust as it could be in the face of
>     kernel code that behaved in a manner that was fairly ridiculous
>     relative to our needs. This case doesn't seem that different to me.
> 
> 
> +1. Seems a shame that freebsd chooses such "optimizations", but making 
> our code do various workarounds and jump through hoops to support 
> various OS quirks (hello Win32 fans!) seems a burden we agreed to take 
> on a long time ago.

FWIW, we observed this issue in pgBackRest a few years ago -- as you can 
imagine we do a lot of scanning so readdir gets a real workout.

We had one issue reported [1] involving Alpine Linux and CIFS and 
another [2] with SLES and NFS. We also had at least one internal report 
that involved RHEL and a proprietary storage appliance. I'm not certain 
if we received any reports for FreeBSD but it kind of rings a bell.

Over some time and various reports it seemed that any storage was 
potentially a problem. I resisted the notion that we would have to work 
around something that seemed to be an obvious kernel bug but in the end 
I capitulated.

We fixed this by making a snapshot of each directory before performing 
any operations on that directory (as has been suggested upthread). One 
advantage we have is that our storage is very centralized since we deal 
with a number of storage types so there are no readdirs in the general 
code base. It was still a pretty major patch [3] but a lot of it was 
removing the callbacks that we had used previously and adding 
optimizations to reduce memory consumption.
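
Roughly, the shape of that approach looks like the sketch below, shown
here for the unlink case discussed upthread (this is illustrative only,
not pgBackRest's actual code -- see [3] for that; the helper name and the
minimal error handling are placeholders):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Read every name in "path" before operating on anything, so all the
 * READDIRs are issued against an unmodified directory and no later
 * request has to resume from a cookie that a removal may have
 * invalidated.
 */
static int
unlink_all_snapshot(const char *path)
{
    DIR        *dir = opendir(path);
    struct dirent *de;
    char      **names = NULL;
    size_t      n = 0, cap = 0;
    char        buf[4096];
    int         rc = 0;

    if (dir == NULL)
        return -1;
    while ((de = readdir(dir)) != NULL)
    {
        char       *name;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        if (n == cap)
        {
            char      **tmp = realloc(names, (cap = cap ? cap * 2 : 64) * sizeof(char *));

            if (tmp == NULL)
                break;          /* out of memory; a real version would report this */
            names = tmp;
        }
        if ((name = strdup(de->d_name)) == NULL)
            break;
        names[n++] = name;
    }
    closedir(dir);              /* the whole directory is in memory before any unlink */

    for (size_t i = 0; i < n; i++)
    {
        snprintf(buf, sizeof(buf), "%s/%s", path, names[i]);
        if (unlink(buf) != 0)
            rc = -1;
        free(names[i]);
    }
    free(names);
    return rc;
}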

One more thing to note -- we are still assuming that Postgres is running 
on storage that is not subject to this issue. Even with our new 
methodology if Postgres is deleting files while we are trying to build a 
backup manifest that could cause us (and base backup) problems. The only 
solution I came up with for that problem was to keep reading the 
directory until we get two snapshots that match -- not very attractive 
but probably workable for pgBackRest. I doubt the same could be said for 
Postgres.

Regards,
-David

---

[1] https://github.com/pgbackrest/pgbackrest/issues/1754
[2] https://github.com/pgbackrest/pgbackrest/issues/1423
[3] https://github.com/pgbackrest/pgbackrest/commit/75623d45




Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Peter Eisentraut
Date:
On 03.01.25 02:58, Tom Lane wrote:
> I wrote:
>> Thomas Munro <thomas.munro@gmail.com> writes:
>>> For what little it's worth, I'm not quite convinced yet that FreeBSD's
>>> client isn't more broken than it needs to be.
> 
>> I'm suspicious of that too.
> 
> I poked at this a little further.  I made the attached stand-alone
> test case (you don't need any more than "cc -o rmtree rmtree.c"
> to build it, then point the script at some NFS-mounted directory).
> This fails with my NAS at least as far back as FreeBSD 11.0.
> I also tried it on NetBSD 9.2 which seems fine.

If you use some GUI file manager, point at a directory, and select 
"remove this directory with everything in it", what does it do 
internally?  Surely it runs a readdir() loop and unlinks the files as it 
gets to them?  Or does it indeed slurp the whole directory tree into 
memory before starting the deletion?  It seems like we're probably not 
the first to have this use case, so there should be some information or 
problem reports about this somewhere out there.  Or if not, then I'd be 
tempted to think, someone broke it recently, even if the NFS hacker 
argues that it cannot possibly work and never should have worked.




Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Tom Lane
Date:
Peter Eisentraut <peter@eisentraut.org> writes:
> On 03.01.25 02:58, Tom Lane wrote:
>> I poked at this a little further.  I made the attached stand-alone
>> test case (you don't need any more than "cc -o rmtree rmtree.c"
>> to build it, then point the script at some NFS-mounted directory).
>> This fails with my NAS at least as far back as FreeBSD 11.0.
>> I also tried it on NetBSD 9.2 which seems fine.

> If you use some GUI file manager, point at a directory, and select 
> "remove this directory with everything in it", what does it do 
> internally?  Surely it runs a readdir() loop and unlinks the files as it 
> gets to them?  Or does it indeed slurp the whole directory tree into 
> memory before starting the deletion?

One thing I noticed while testing yesterday is that "rm -rf foo"
worked even in cases where "rmtree foo" didn't.  I did not look
into FreeBSD's rm code, but I'll bet it has the sort of retry
logic that was recommended to us upthread.  I agree with your
point though that it's hard to believe that everyone does that
in every case where it would matter.

As for pre-existing bug reports, I found

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=57696

but there's basically no new information there, other than
the not-so-surprising fact that renaming directory entries
triggers the failure as efficiently as unlinking does.

            regards, tom lane



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Thomas Munro
Date:
On Sat, Jan 4, 2025 at 7:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> One thing I noticed while testing yesterday is that "rm -rf foo"
> worked even in cases where "rmtree foo" didn't.  I did not look
> into FreeBSD's rm code, but I'll bet it has the sort of retry
> logic that was recommended to us upthread.

It punts the directory scan to the fts_read() etc. functions (directory
traversal helpers widely inherited from ancestral BSD but not in any
standard), which buffer the scan and probably hide this issue:

https://github.com/freebsd/freebsd-src/blob/main/bin/rm/rm.c
https://github.com/freebsd/freebsd-src/blob/main/lib/libc/gen/fts.c

I doubt that hides all potential problems though, if I have understood
the vague outline of this bug correctly: perhaps if you ran a large
enough rm -r, and you unlinked a file concurrently with that loop, you
could break it, that is, cause it to skip innocent files other than
the one you unlinked?

glibc also has an implementation of the BSD fts functions, and GNU
coreutils' rm uses the same interface:

https://github.com/coreutils/coreutils/blob/master/src/rm.c
https://github.com/coreutils/coreutils/blob/master/src/remove.c
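
For a concrete picture of that buffering, a minimal fts-based removal
loop looks roughly like this (a sketch of the pattern rm follows, not
the actual rm.c from either project; error handling is cut to a minimum):

#include <fts.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Remove a tree the way BSD/GNU rm does: fts_read() has already pulled a
 * whole directory's entries into user-space memory before the caller
 * starts unlinking them, which narrows (but does not close) the window
 * for unstable readdir() results.
 */
static int
rmtree_fts(char *root)
{
    char *const paths[] = {root, NULL};
    FTS        *fts = fts_open(paths, FTS_PHYSICAL, NULL);
    FTSENT     *ent;
    int         rc = 0;

    if (fts == NULL)
        return -1;
    while ((ent = fts_read(fts)) != NULL)
    {
        switch (ent->fts_info)
        {
            case FTS_D:         /* directory, pre-order: children come first */
                break;
            case FTS_DP:        /* directory, post-order: should be empty now */
                if (rmdir(ent->fts_accpath) != 0)
                    rc = -1;
                break;
            case FTS_DNR:
            case FTS_ERR:
            case FTS_NS:        /* could not read or stat this entry */
                fprintf(stderr, "%s: %s\n", ent->fts_path,
                        strerror(ent->fts_errno));
                rc = -1;
                break;
            default:            /* regular file, symlink, etc. */
                if (unlink(ent->fts_accpath) != 0)
                    rc = -1;
                break;
        }
    }
    fts_close(fts);
    return rc;
}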

> As for pre-existing bug reports, I found
>
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=57696
>
> but there's basically no new information there, other than
> the not-so-surprising fact that renaming directory entries
> triggers the failure as efficiently as unlinking does.

FWIW I am discussing this off-list with Rick, and I *think* we can
distinguish between "gripes about NFS that we can't do much about"
and "extra thing going wrong here".  The fog of superstition around
NFS is thick because so many investigations end at a vendor
boundary/black box, but here we can understand the cookie scheme and
trace through all the layers...



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> I doubt that hides all potential problems though, if I have understood
> the vague outline of this bug correctly: perhaps if you ran a large
> enough rm -r, and you unlinked a file concurrently with that loop, you
> could break it, that is, cause it to skip innocent files other than
> the one you unlinked?

Yeah.  The thing that makes this untenably bad for us is that it's not
only your own process's actions that can break readdir(), it's the
actions of some unrelated process.  So even "buffer the whole
directory contents" isn't a reliable fix, because someone else could
unlink a file while you're reading the directory.

One way to "fix" this from our side is to institute some system-wide
interlock that prevents file deletions (and renames, and maybe even
creations?) while any process is executing a readdir loop.  That's
several steps beyond awful from a performance standpoint, and what's
worse is that it's still not reliable.  It only takes one bit of
code that isn't on board with the locking protocol to put you right
back at square one.  It might not even be code that's part of Postgres
proper, so long as it has access to PGDATA.  As an example: it'd be
unsafe to modify postgresql.conf with emacs, or any other editor that
likes to make a backup file.  So IMV that's no fix at all.

The only other thing I can think of is to read and buffer the whole
directory (no matter how large), and then do it *again*, and repeat
till we get the same results twice in a row.  That's likewise just
horrid from a performance standpoint.  Worse, I'm not entirely
convinced that it's a 100% fix: it seems to rely on rather debatable
assumptions about the possible consequences of concurrent directory
mods.
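
For what it's worth, the "same results twice in a row" idea would look
something like the sketch below (illustrative only, not proposed code;
the function names are placeholders and allocation failures are not
handled):

#include <dirent.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

static int
cmpname(const void *a, const void *b)
{
    return strcmp(*(char *const *) a, *(char *const *) b);
}

/* Read every name in "path" into a sorted, malloc'd array; -1 on failure. */
static ssize_t
read_names(const char *path, char ***out)
{
    DIR        *dir = opendir(path);
    struct dirent *de;
    char      **names = NULL;
    size_t      n = 0, cap = 0;

    if (dir == NULL)
        return -1;
    while ((de = readdir(dir)) != NULL)
    {
        if (n == cap)
            names = realloc(names, (cap = cap ? cap * 2 : 64) * sizeof(char *));
        names[n++] = strdup(de->d_name);
    }
    closedir(dir);
    qsort(names, n, sizeof(char *), cmpname);
    *out = names;
    return (ssize_t) n;
}

static void
free_names(char **names, ssize_t n)
{
    while (n > 0)
        free(names[--n]);
    free(names);
}

/* Re-read the directory until two consecutive scans return the same set. */
static ssize_t
read_names_stable(const char *path, char ***out)
{
    char      **prev;
    char      **cur;
    ssize_t     nprev;
    ssize_t     ncur;

    if ((nprev = read_names(path, &prev)) < 0)
        return -1;
    for (;;)
    {
        if ((ncur = read_names(path, &cur)) < 0)
        {
            free_names(prev, nprev);
            return -1;
        }
        if (ncur == nprev)
        {
            ssize_t     i;

            for (i = 0; i < ncur && strcmp(prev[i], cur[i]) == 0; i++)
                ;
            if (i == ncur)      /* two scans in a row agree */
            {
                free_names(prev, nprev);
                *out = cur;
                return ncur;
            }
        }
        free_names(prev, nprev);
        prev = cur;
        nprev = ncur;
    }
}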

> FWIW I am discussing this off-list with Rick, I *think* we can
> distinguish between "gripes abouts NFS that we can't do much about"
> and "extra thing going wrong here".  The fog of superstition around
> NFS is thick because so many investigations end at a vendor
> boundary/black box, but here we can understand the cookie scheme and
> trace through all the layers here...

I wouldn't have any problem with saying that we don't support NFS
implementations that don't have stable cookies.  But so far I haven't
found any supported platform except FreeBSD that fails the rmtree test
against my Synology NAS.  I think the circumstantial evidence that
FreeBSD is doing something wrong, or wronger than necessary, is
strong.

            regards, tom lane



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Tom Lane
Date:
I wrote:
> I wouldn't have any problem with saying that we don't support NFS
> implementations that don't have stable cookies.  But so far I haven't
> found any supported platform except FreeBSD that fails the rmtree test
> against my Synology NAS.

To expand on that: I've now found that Linux, macOS, NetBSD, OpenBSD,
and illumos/OpenIndiana[1] pass that test.  I don't believe that Windows
does NFS.  That means that FreeBSD is the only one of our supported
platforms that fails.  I think we should simply document that NFS on
FreeBSD is broken, and await somebody getting annoyed enough to
fix that brokenness.

            regards, tom lane

[1] Getting OpenIndiana to work under qemu is not a task for those
easily discouraged.



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Thomas Munro
Date:
On Sat, Jan 4, 2025 at 5:48 AM David Steele <david@pgbackrest.org> wrote:
> We had one issue reported [1] involving Alpine Linux and CIFS and

Not directly relevant for pgbackrest probably, but I noticed that
Alpine comes up in a few reports of failing rm -r on CIFS.  I think it
might be because BSD and GNU rm use fts to buffer pathnames in user
space (narrow window), while Alpine uses busybox rm which has a
classic readdir()/unlink() loop:

https://github.com/brgl/busybox/blob/master/coreutils/rm.c
https://github.com/brgl/busybox/blob/master/libbb/remove_file.c

As for CIFS, there are lots of reports of this sort of thing from
Linux CIFS clients.  I am suspicious of the 32 bit monotonic
resume_key apparently being used to seek to a starting position.  I
don't plan to investigate myself, but ... is that even trying to avoid
skips and duplicates?



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Robert Haas
Date:
On Sat, Jan 4, 2025 at 3:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> To expand on that: I've now found that Linux, macOS, NetBSD, OpenBSD,
> and illumos/OpenIndiana[1] pass that test.  I don't believe that Windows
> does NFS.  That means that FreeBSD is the only one of our supported
> platforms that fails.  I think we should simply document that NFS on
> FreeBSD is broken, and await somebody getting annoyed enough to
> fix that brokenness.

Yeah, that seems like very strong evidence against FreeBSD, but I
think Thomas Munro's point about CIFS is worth considering. That is
rather widely used, and if the same workarounds would help both that
and FreeBSD's NFS, we might want to adopt it even if it's not a
complete fix.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Yeah, that seems like very strong evidence against FreeBSD, but I
> think Thomas Munro's point about CIFS is worth considering. That is
> rather widely used, and if the same workarounds would help both that
> and FreeBSD's NFS, we might want to adopt it even if it's not a
> complete fix.

TBH, I am happy that PG is now failing in a fairly visible way on
filesystems that are broken in this fashion.  I think that is better
than silent data corruption in obscure circumstances, which is where
we were before and would be again if we band-aid rmtree() and do
nothing else.  Nor do I think it's worth the effort to try to become
fully bulletproof on the point.

I think we should document that CIFS is unsupported.  The docs
could say something like:

    Storing databases on filesystems with unreliable readdir() is
    not supported and can lead to data corruption.  If you observe
    warnings like "directory is not empty" when trying to drop a
    database, that is strong evidence that the filesystem has
    unreliable readdir().  Filesystems known to have this problem
    include NFS on FreeBSD and CIFS on all platforms.

We could s/FreeBSD/older FreeBSD/ if something comes of Thomas'
efforts to fix that situation.

            regards, tom lane



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
David Steele
Date:
On 1/4/25 11:07, Thomas Munro wrote:
> On Sat, Jan 4, 2025 at 5:48 AM David Steele <david@pgbackrest.org> wrote:
>> We had one issue reported [1] involving Alpine Linux and CIFS and
> 
> Not directly relevant for pgbackrest probably, but I noticed that
> Alpine comes up in a few reports of failing rm -r on CIFS.  I think it
> might be because BSD and GNU rm use fts to buffer pathnames in user
> space (narrow window), while Alpine uses busybox rm which has a
> classic readdir()/unlink() loop:

Yeah, this doesn't affect pgBackRest because we have our own rmtree that 
uses snapshots (for the last few years, at least).

> As for CIFS, there are lots of reports of this sort of thing from
> Linux CIFS clients.

There may be users running Postgres on CIFS but my guess is that is rare 
-- at least I have never seen anyone doing it.

I'm more concerned about the report we saw on SUSE/NFS [1]. If that 
report is accurate it indicates this may not be something we can just 
document and move on from -- unless we are willing to entirely drop 
support for NFS.

Regards,
-David

[1] https://github.com/pgbackrest/pgbackrest/issues/1423



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Tom Lane
Date:
David Steele <david@pgbackrest.org> writes:
> On 1/4/25 11:07, Thomas Munro wrote:
>> As for CIFS, there are lots of reports of this sort of thing from
>> Linux CIFS clients.

> There may be users running Postgres on CIFS but my guess is that is rare 
> -- at least I have never seen anyone doing it.

It'd be news to me too.  I wondered if I could test it locally, but
while my NAS knows half a dozen such protocols it's never heard of
CIFS.

> I'm more concerned about the report we saw on SUSE/NFS [1]. If that 
> report is accurate it indicates this may not be something we can just 
> document and move on from -- unless we are willing to entirely drop 
> support for NFS.
> [1] https://github.com/pgbackrest/pgbackrest/issues/1423

I installed an up-to-date OpenSUSE image (Leap 15.6) and it passes
my "rmtree" test just fine with my NAS.  The report you cite
doesn't have any details on what the NFS server was, but I'd be
inclined to guess that that server's filesystem lacked support
for stable NFS cookies.

            regards, tom lane



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
David Steele
Date:
On 1/6/25 19:08, Tom Lane wrote:
> David Steele <david@pgbackrest.org> writes:
>> On 1/4/25 11:07, Thomas Munro wrote:
>>> As for CIFS, there are lots of reports of this sort of thing from
>>> Linux CIFS clients.
> 
>> There may be users running Postgres on CIFS but my guess is that is rare
>> -- at least I have never seen anyone doing it.
> 
> It'd be news to me too.  I wondered if I could test it locally, but
> while my NAS knows half a dozen such protocols it's never heard of
> CIFS.

You may also know it as SMB or Samba. pgBackRest skips directory fsyncs 
if the repository type is set to CIFS, so I think we'd know if anybody 
was running on CIFS, as I'm fairly certain a directory fsync will return 
a hard error. That may be implementation dependent, though.

>> I'm more concerned about the report we saw on SUSE/NFS [1]. If that
>> report is accurate it indicates this may not be something we can just
>> document and move on from -- unless we are willing to entirely drop
>> support for NFS.
>> [1] https://github.com/pgbackrest/pgbackrest/issues/1423
> 
> I installed an up-to-date OpenSUSE image (Leap 15.6) and it passes
> my "rmtree" test just fine with my NAS.  The report you cite
> doesn't have any details on what the NFS server was, but I'd be
> inclined to guess that that server's filesystem lacked support
> for stable NFS cookies.

The internal report we received might have had a similar cause. Sure 
seems like a minefield for any user trying to figure out if their setup 
is compliant, though. In many setups (especially production) a drop 
database is rare.

Regards,
-David



Re: Fwd: Re: A new look at old NFS readdir() problems?

From
Bruce Momjian
Date:
On Tue, Jan  7, 2025 at 01:03:05AM +0000, David Steele wrote:
> > > I'm more concerned about the report we saw on SUSE/NFS [1]. If that
> > > report is accurate it indicates this may not be something we can just
> > > document and move on from -- unless we are willing to entirely drop
> > > support for NFS.
> > > [1] https://github.com/pgbackrest/pgbackrest/issues/1423
> > 
> > I installed an up-to-date OpenSUSE image (Leap 15.6) and it passes
> > my "rmtree" test just fine with my NAS.  The report you cite
> > doesn't have any details on what the NFS server was, but I'd be
> > inclined to guess that that server's filesystem lacked support
> > for stable NFS cookies.
> 
> The internal report we received might have had a similar cause. Sure seems
> like a minefield for any user trying to figure out if their setup is
> compliant, though. In many setups (especially production) a drop database is
> rare.

Will people now always get a clear error on failure?  Crazy idea, but
could we have initdb or postmaster start test this? 

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.