Re: Strange issue with NFS mounted PGDATA on ugreen NAS - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Strange issue with NFS mounted PGDATA on ugreen NAS
Msg-id CA+hUKGKo3Q=gjqdvfxn2Pd0RNBd5Xd66hBjiv=5pWXfdWs=jAQ@mail.gmail.com
In response to Re: Strange issue with NFS mounted PGDATA on ugreen NAS  (Larry Rosenman <ler@lerctr.org>)
Responses Re: Strange issue with NFS mounted PGDATA on ugreen NAS
List pgsql-hackers
On Wed, Jan 1, 2025 at 1:20 PM Kenneth Marshall <ktm@rice.edu> wrote:
> On Tue, Dec 31, 2024 at 06:58:14PM -0500, Tom Lane wrote:
> > Larry Rosenman <ler@lerctr.org> writes:
> > > On 12/31/2024 5:37 pm, Tom Lane wrote:
> > >> Do you know what its underlying file system is?
> >
> > > btrfs

> Maybe there are some btrfs or nfs options that can be used to mitigate
> this effect. Otherwise, a bug report to Debian would be in order, I guess.

Mount option readdirsize on the client side should hide the problem up
to some size you choose, but you can't set it large enough for high
numbers of relations/forks/segments.

Guessing what is happening here:  I suspect BTRFS might have
positional offsets 1, 2, 3, ... for directory entries' d_off (the
value visible in struct dirent, used for telldir(), seekdir(), and
NFS's behind-the-curtain paging scheme), and they might slide when you
unlink stuff.  Perhaps not immediately, but when the directory fd is
closed on the NFS server (nearly immediately, I guess: given the
stateless nature of NFS, it doesn't matter that the client still has
its directory fd open).  That would explain how you finished up with so
many missed files.
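
To make that guess concrete, here is a toy model in C.  It is only a
sketch of the hypothesis, not of how BTRFS or any NFS server is
actually implemented: the "directory" is an array, an entry's cookie
is assumed to be its position, and unlinking closes the gap.  All
names (toy_dir, toy_unlink, toy_readdir_page, toy_remove_interleaved)
are made up for illustration.

```c
#include <assert.h>

#define TOY_MAX 1024

struct toy_dir
{
    int names[TOY_MAX];   /* file "names" are just integers here */
    int n;                /* number of live entries */
};

/* Remove one name and close the gap, so later entries slide down by one. */
static void
toy_unlink(struct toy_dir *d, int name)
{
    for (int i = 0; i < d->n; i++)
    {
        if (d->names[i] == name)
        {
            for (int j = i; j + 1 < d->n; j++)
                d->names[j] = d->names[j + 1];
            d->n--;
            return;
        }
    }
}

/* Read up to "page" names starting at position "cookie"; return how many. */
static int
toy_readdir_page(const struct toy_dir *d, int cookie, int page, int *out)
{
    int got = 0;

    while (got < page && cookie + got < d->n)
    {
        out[got] = d->names[cookie + got];
        got++;
    }
    return got;
}

/*
 * Delete everything page by page, unlinking each page before asking for
 * the next one at the saved cookie.  Returns the number of files missed.
 */
int
toy_remove_interleaved(int nfiles, int page)
{
    struct toy_dir d;
    int batch[TOY_MAX];
    int cookie = 0;
    int got;

    assert(nfiles >= 0 && nfiles <= TOY_MAX);
    assert(page >= 1 && page <= TOY_MAX);

    d.n = nfiles;
    for (int i = 0; i < nfiles; i++)
        d.names[i] = i;

    while ((got = toy_readdir_page(&d, cookie, page, batch)) > 0)
    {
        cookie += got;                  /* where the client will resume */
        for (int i = 0; i < got; i++)
            toy_unlink(&d, batch[i]);   /* surviving entries slide down */
    }
    return d.n;
}
```

In this model, 10 files read 4 at a time leave 4 files behind, while
a page size of at least the number of files leaves none, which is the
same shape as the readdirsize workaround above.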

I think XFS's d_off points to the next entry in a btree leaf page
scan, which sounds a lot more stable... until someone else unlinks the
next item underneath you and/or the system decides to compact stuff,
who knows...  And other systems have other schemes based on hashes or
raw offsets, with different degrees of stability and anomalies (cf
ELOOP for hash collisions).

NFS is at least supposed to tell the client that its cookie has been
invalidated with a cookie-invalidation-cookie called cookieverf.  But
there isn't any specified way to recover.  FreeBSD's client looks like
it might try to, but I'm not sure if Linux's server even implements
it.

Anyway, I'll write a patch to change rmtree() to buffer the names in
memory.  In theory there could be hundreds of gigabytes of pathnames,
so perhaps I should do it in batches; I'll look into that.


