Thread: Re: Strange issue with NFS mounted PGDATA on ugreen NAS

Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Larry Rosenman
On 12/31/2024 12:22 pm, Larry Rosenman wrote:
> When I try to drop a database, PostgreSQL leaves files in the directory 
> and does not even try to delete them.
> 
> PostgreSQL 16.6, FreeBSD 14.2, PGDATA mounted NFS from UGreen NAS.
> 
> Truss of the create/delete attached.
> 
> It does NOT seem to happen on PG < 16.
> 
> Ideas?
As a follow-up: 17.2 exhibits the same behavior, but 15.10 and below work
just fine on the same VM and share.


-- 
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 214-642-9640                 E-Mail: ler@lerctr.org
US Mail: 13425 Ranch Road 620 N, Apt 718, Austin, TX 78717-1010



Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Tom Lane
Larry Rosenman <ler@lerctr.org> writes:
> On 12/31/2024 12:22 pm, Larry Rosenman wrote:
>> When I try to drop a database, PostgreSQL leaves files in the directory 
>> and does not even try to delete them.
>> PostgreSQL 16.6, FreeBSD 14.2, PGDATA mounted NFS from UGreen NAS.

FWIW, I couldn't reproduce such a problem with PG running on macOS
Sequoia and using an NFS mount from a RHEL8 machine.  (I tried with
current master code, but I don't believe that we fixed any related
bugs recently.)

Can you try some other combinations of host OS and NFS source,
and see if you can identify what's needed to cause the failure?

What it looks like is that readdir() is skipping over some files,
which would almost certainly be an NFS server bug.  But that
theory could be wrong.

            regards, tom lane



Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Thomas Munro
On Wed, Jan 1, 2025 at 11:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Larry Rosenman <ler@lerctr.org> writes:
> > On 12/31/2024 12:22 pm, Larry Rosenman wrote:
> >> When I try to drop a database, PostgreSQL leaves files in the directory
> >> and does not even try to delete them.
> >> PostgreSQL 16.6, FreeBSD 14.2, PGDATA mounted NFS from UGreen NAS.

> What it looks like is that readdir() is skipping over some files,
> which would almost certainly be an NFS server bug.  But that
> theory could be wrong.

Hmm.  So in 15 and earlier, rmtree() would read all the file names
into memory and then unlink them, but 54e72b66 changed that and
started unlinking in the loop.  Perhaps that wasn't a great idea, but
it's hardly the only place that directory contents might change across
readdir() calls (it's just a place that does that itself).
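To sketch the difference (illustrative Python, not the actual C in our rmtree(); subdirectories and error handling omitted):

```python
import os

def rmtree_buffered(path):
    # Pre-16 style: snapshot the whole listing into memory first, then
    # unlink, so the directory never changes between readdir() calls.
    names = os.listdir(path)
    for name in names:
        os.unlink(os.path.join(path, name))
    os.rmdir(path)

def rmtree_streaming(path):
    # 54e72b66 style: unlink while still scanning.  If the file
    # system's readdir cookies slide when entries are removed, the
    # scan can skip files and the final rmdir fails with ENOTEMPTY.
    with os.scandir(path) as it:
        for entry in it:
            os.unlink(entry.path)
    os.rmdir(path)
```

On a well-behaved file system both succeed; the failure mode only appears when the cookie scheme is unstable under concurrent unlinks.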

As has been analysed by the pgbackrest guys, readdir() can be flaky
over changing directories on some NFS (and maybe other network FSs)
implementations (unresolved AFAIK and different from this problem,
they were missing files during backups due to concurrent changes).
The implementation-specific cookie scheme for encoding a sort of
cursor position across readdir() calls has various different problems
on various different OSes, NFS implementations and underlying local
file systems (I looked into this quite a bit when that last discussion
happened, it's a mess).  I wonder what a UGreen NAS is running.



Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Larry Rosenman
On 12/31/2024 5:24 pm, Thomas Munro wrote:
> On Wed, Jan 1, 2025 at 11:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Larry Rosenman <ler@lerctr.org> writes:
>> > On 12/31/2024 12:22 pm, Larry Rosenman wrote:
>> >> When I try to drop a database, PostgreSQL leaves files in the directory
>> >> and does not even try to delete them.
>> >> PostgreSQL 16.6, FreeBSD 14.2, PGDATA mounted NFS from UGreen NAS.
> 
>> What it looks like is that readdir() is skipping over some files,
>> which would almost certainly be an NFS server bug.  But that
>> theory could be wrong.
> 
> Hmm.  So in 15 and earlier, rmtree() would read all the file names
> into memory and then unlink them, but 54e72b66 changed that and
> started unlinking in the loop.  Perhaps that wasn't a great idea, but
> it's hardly the only place that directory contents might change across
> readdir() calls (it's just a place that does that itself).
> 
> As has been analysed by the pgbackrest guys, readdir() can be flaky
> over changing directories on some NFS (and maybe other network FSs)
> implementations (unresolved AFAIK and different from this problem,
> they were missing files during backups due to concurrent changes).
> The implementation-specific cookie scheme for encoding a sort of
> cursor position across readdir() calls has various different problems
> on various different OSes, NFS implementations and underlying local
> file systems (I looked into this quite a bit when that last discussion
> happened, it's a mess).  I wonder what a UGreen NAS is running.

UGreen is running a (modified) Debian Bookworm Linux.



Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Tom Lane
Larry Rosenman <ler@lerctr.org> writes:
> On 12/31/2024 5:24 pm, Thomas Munro wrote:
>> The implementation-specific cookie scheme for encoding a sort of
>> cursor position across readdir() calls has various different problems
>> on various different OSes, NFS implementations and underlying local
>> file systems (I looked into this quite a bit when that last discussion
>> happened, it's a mess).  I wonder what a UGreen NAS is running.

> UGreen is running a (modified) Debian Bookworm Linux.

Do you know what its underlying file system is?

            regards, tom lane



Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Larry Rosenman
On 12/31/2024 5:37 pm, Tom Lane wrote:
> Larry Rosenman <ler@lerctr.org> writes:
>> On 12/31/2024 5:24 pm, Thomas Munro wrote:
>>> The implementation-specific cookie scheme for encoding a sort of
>>> cursor position across readdir() calls has various different problems
>>> on various different OSes, NFS implementations and underlying local
>>> file systems (I looked into this quite a bit when that last 
>>> discussion
>>> happened, it's a mess).  I wonder what a UGreen NAS is running.
> 
>> UGreen is running a (modified) Debian Bookworm Linux.
> 
> Do you know what its underlying file system is?
> 
>             regards, tom lane
btrfs




Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Tom Lane
Larry Rosenman <ler@lerctr.org> writes:
> On 12/31/2024 5:37 pm, Tom Lane wrote:
>> Do you know what its underlying file system is?

> btrfs

OK.  My test was with XFS underneath the NFS service.

            regards, tom lane



Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Thomas Munro
On Wed, Jan 1, 2025 at 1:20 PM Kenneth Marshall <ktm@rice.edu> wrote:
> On Tue, Dec 31, 2024 at 06:58:14PM -0500, Tom Lane wrote:
> > Larry Rosenman <ler@lerctr.org> writes:
> > > On 12/31/2024 5:37 pm, Tom Lane wrote:
> > >> Do you know what its underlying file system is?
> >
> > > btrfs

> Maybe there are some btrfs or nfs options that can be used to mitigate
> this effect. Otherwise, a bug report to Debian would be in order, I guess.

Mount option readdirsize on the client side should hide the problem up
to some size you choose, but you can't set it large enough for high
numbers of relations/forks/segments.
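For example (hypothetical server name and export path; FreeBSD mount_nfs syntax):

```shell
# Ask the client to fetch directory listings in 64 kB chunks, so that
# small directories come back in a single READDIR and no intermediate
# cookie ever has to be reused.
mount -t nfs -o readdirsize=65536 nas.example.com:/export/pgdata /mnt/pgdata
```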

Guessing what is happening here:  I suspect BTRFS might have
positional offsets 1, 2, 3, ... for directory entries' d_off (the
value visible in struct dirent, used for telldir(), seekdir(), and
NFS's behind-the-curtain paging scheme), and they might slide when you
unlink stuff.  Perhaps not immediately, but when the directory fd is
closed on the NFS server (nearly immediately I guess given the
stateless nature of NFS, it doesn't matter that the client has its
directory fd open).  That would explain how you finished up with so
many missed files.
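To make that concrete, here is a toy simulation (purely illustrative, not real NFS traffic) of a positional-cookie directory where each unlink slides later entries toward the reader's cursor:

```python
def scan_and_unlink(names):
    """Simulate a client that reads one entry per 'readdir', unlinks
    it, and advances its cookie, against a server whose cookies are
    bare positions (1, 2, 3, ...) that slide left on unlink."""
    directory = list(names)
    removed = []
    cookie = 0
    while cookie < len(directory):
        entry = directory[cookie]   # READDIR: entry at the saved cookie
        cookie += 1                 # client advances past it
        directory.remove(entry)     # REMOVE: later entries slide left
        removed.append(entry)
    return removed, directory       # second item = files the scan missed

# The slide and the cookie advance cancel out, so every other file
# is skipped:
scan_and_unlink(["a", "b", "c", "d"])   # (['a', 'c'], ['b', 'd'])
```

A real client reads entries in batches rather than one at a time, but the mechanism is the same: any cookie saved before an unlink now points one entry too far.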

I think XFS's d_off points to the next entry in a btree leaf page
scan, which sounds a lot more stable... until someone else unlinks the
next item underneath you and/or the system decides to compact stuff,
who knows...  And other systems have other schemes based on hashes or
raw offsets, with different degrees of stability and anomalies (cf
ELOOP for hash collisions).

NFS is at least supposed to tell the client that its cookie has been
invalidated with a cookie-invalidation-cookie called cookieverf.  But
there isn't any specified way to recover.  FreeBSD's client looks like
it might try to, but I'm not sure that Linux's server even
implements it.

Anyway, I'll write a patch to change rmtree() to buffer the names in
memory.  In theory there could be hundreds of gigabytes of pathnames,
so perhaps I should do it in batches; I'll look into that.



Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Tom Lane
Thomas Munro <thomas.munro@gmail.com> writes:
> NFS is at least supposed to tell the client that its cookie has been
> invalidated with a cookie-invalidation-cookie called cookieverf.  But
> there isn't any specified way to recover.  FreeBSD's client looks like
> it might try to, but I'm not sure that Linux's server even
> implements it.

ISTM we used to disclaim responsibility for data integrity if you
try to put PGDATA on NFS.  I looked at the current wording about
NFS in runtime.sgml and was frankly shocked at how optimistic it is.
Shouldn't we be saying something closer to "if it breaks you
get to keep both pieces"?

> Anyway, I'll write a patch to change rmtree() to buffer the names in
> memory.  In theory there could be hundreds of gigabytes of pathnames,
> so perhaps I should do it in batches; I'll look into that.

This feels a lot like putting lipstick on a pig.

            regards, tom lane



Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Thomas Munro
On Wed, Jan 1, 2025 at 6:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> ISTM we used to disclaim responsibility for data integrity if you
> try to put PGDATA on NFS.  I looked at the current wording about
> NFS in runtime.sgml and was frankly shocked at how optimistic it is.
> Shouldn't we be saying something closer to "if it breaks you
> get to keep both pieces"?

I now suspect this specific readdir() problem is in FreeBSD's NFS
client.  See below.  There have also been reports of missed files from
(IIRC) Linux clients without much analysis, but that doesn't seem too
actionable from here unless someone can come up with a repro or at
least some solid details to investigate; those involved unspecified
(possibly appliance/cloud) NFS and CIFS file servers.

The other issue I know of personally is NFS ENOSPC, which has some
exciting disappearing-committed-data failure modes caused by lazy
allocation on Linux's implementation (and possibly others), that I've
written about before.  But actually that one is not strictly an
NFS-only issue, it's just really easy to hit that way, and I have a
patch to fix it on our side, which I hope to re-post soon, though
independently of this thread, as it's tangled up with quite a few
other things.

> > Anyway, I'll write a patch to change rmtree() to buffer the names in
> > memory.  In theory there could be hundreds of gigabytes of pathnames,
> > so perhaps I should do it in batches; I'll look into that.
>
> This feels a lot like putting lipstick on a pig.

Hehe.  Yeah.  Abandoned.

I see this issue here with a FreeBSD client talking to a Debian server
exporting BTRFS or XFS, even with readdirsize set high so that
multi-request paging is not expected.  Looking at Wireshark and the
NFS spec (disclaimer: I have never studied NFS at this level before,
addito salis grano), what I see is a READDIR request with cookie=0
(good), and which receives a response containing the whole directory
listing and a final entry marker eof=1 (good), but then FreeBSD
unexpectedly (to me) sends *another* READDIR request with cookie=662,
which is a real cookie that was received somewhere in the middle of
the first response on the entry for "13816_fsm", and that entry was
followed by an entry for "13816_vm".  The second request gets a
response that begins at "13816_vm" (correct on the server's part).
Then the client sends REMOVE (unlink) requests for some but not all of
the files, including "13816_fsm" but not "13816_vm".  Then it sends
yet another READDIR request with cookie=0 (meaning go from the top),
and gets a non-empty directory listing, but immediately sends RMDIR,
which unsurprisingly fails NFS3ERR_NOTEMPTY.  So my best guess so far
is that FreeBSD's NFS client must be corrupting its directory cache
when files are unlinked, and it's not the server's fault.  I don't see
any obvious problem with the way the cookies work.  Seems like
material for a minimised bug report elsewhere, and not our issue.



Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Tom Lane
Thomas Munro <thomas.munro@gmail.com> writes:
> I now suspect this specific readdir() problem is in FreeBSD's NFS
> client.  See below.  There have also been reports of missed files from
> (IIRC) Linux clients without much analysis, but that doesn't seem too
> actionable from here unless someone can come up with a repro or at
> least some solid details to investigate; those involved unspecified
> (possibly appliance/cloud) NFS and CIFS file servers.

I forgot to report back, but yesterday I spent time unsuccessfully
trying to reproduce the problem with macOS client and NFS server
using btrfs (a Synology NAS running some no-name version of Linux).
So that lends additional weight to your conclusion that it isn't
specifically a btrfs bug.

> I see this issue here with a FreeBSD client talking to a Debian server
> exporting BTRFS or XFS, even with readdirsize set high so that
> multi-request paging is not expected.  Looking at Wireshark and the
> NFS spec (disclaimer: I have never studied NFS at this level before,
> addito salis grano), what I see is a READDIR request with cookie=0
> (good), and which receives a response containing the whole directory
> listing and a final entry marker eof=1 (good), but then FreeBSD
> unexpectedly (to me) sends *another* READDIR request with cookie=662,
> which is a real cookie that was received somewhere in the middle of
> the first response on the entry for "13816_fsm", and that entry was
> followed by an entry for "13816_vm".  The second request gets a
> response that begins at "13816_vm" (correct on the server's part).
> Then the client sends REMOVE (unlink) requests for some but not all of
> the files, including "13816_fsm" but not "13816_vm".  Then it sends
> yet another READDIR request with cookie=0 (meaning go from the top),
> and gets a non-empty directory listing, but immediately sends RMDIR,
> which unsurprisingly fails NFS3ERR_NOTEMPTY.  So my best guess so far
> is that FreeBSD's NFS client must be corrupting its directory cache
> when files are unlinked, and it's not the server's fault.  I don't see
> any obvious problem with the way the cookies work.  Seems like
> material for a minimised bug report elsewhere, and not our issue.

Yeah, that seems pretty damning.  Thanks for looking into it.

            regards, tom lane



Re: Strange issue with NFS mounted PGDATA on ugreen NAS

From: Tom Lane
I wrote:
> I forgot to report back, but yesterday I spent time unsuccessfully
> trying to reproduce the problem with macOS client and NFS server
> using btrfs (a Synology NAS running some no-name version of Linux).

Also, I *can* reproduce it using the same NFS server and a FreeBSD
14.2 client.  At least with this pair of machines, the behavior seems
deterministic: "createdb foo" followed by "dropdb foo" leaves the
same set of not-dropped files behind each time.  I added a bit of
debug logging to rmtree.c and verified that it's not seeing anything
odd happening, except that readdir never returns anything about the
missed files.

I will file a bug report, unless you already did?

            regards, tom lane