Re: Fwd: Re: A new look at old NFS readdir() problems? - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Fwd: Re: A new look at old NFS readdir() problems?
Msg-id 520100.1735952472@sss.pgh.pa.us
In response to Re: Fwd: Re: A new look at old NFS readdir() problems?  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Fwd: Re: A new look at old NFS readdir() problems?
List pgsql-hackers
Thomas Munro <thomas.munro@gmail.com> writes:
> I doubt that hides all potential problems though, if I have understood
> the vague outline of this bug correctly: perhaps if you ran large
> enough rm -r, and you unlinked a file concurrently with that loop, you
> could break it, that is, cause it to skip innocent files other than
> the one you unlinked?

Yeah.  The thing that makes this untenably bad for us is that it's not
only your own process's actions that can break readdir(), it's the
actions of some unrelated process.  So even "buffer the whole
directory contents" isn't a reliable fix, because someone else could
unlink a file while you're reading the directory.
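
For illustration only, here is a minimal sketch of the "buffer the whole
directory contents" idea, written in Python rather than the C that
rmtree actually uses.  The function name remove_files_buffered is
hypothetical and is not PostgreSQL code.  It collects every name before
unlinking anything, so the process's own deletions cannot disturb its
scan; as noted above, it does nothing about deletions made by other
processes while the scan is in progress.

```python
import os


def remove_files_buffered(path):
    """Delete the non-directory entries of `path`, buffering all names first.

    Hypothetical helper for illustration. Returns the names it removed.
    """
    # Phase 1: read the complete directory listing before changing anything.
    with os.scandir(path) as entries:
        names = [e.name for e in entries if not e.is_dir(follow_symlinks=False)]

    # Phase 2: unlink from the buffered list, not from a live directory scan.
    removed = []
    for name in names:
        try:
            os.unlink(os.path.join(path, name))
            removed.append(name)
        except FileNotFoundError:
            # Another process removed the file between the scan and the unlink.
            pass
    return removed
```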

One way to "fix" this from our side is to institute some system-wide
interlock that prevents file deletions (and renames, and maybe even
creations?) while any process is executing a readdir loop.  That's
several steps beyond awful from a performance standpoint, and what's
worse is that it's still not reliable.  It only takes one bit of
code that isn't on board with the locking protocol to put you right
back at square one.  It might not even be code that's part of Postgres
proper, so long as it has access to PGDATA.  As an example: it'd be
unsafe to modify postgresql.conf with emacs, or any other editor that
likes to make a backup file.  So IMV that's no fix at all.

The only other thing I can think of is to read and buffer the whole
directory (no matter how large), and then do it *again*, and repeat
till we get the same results twice in a row.  That's likewise just
horrid from a performance standpoint.  Worse, I'm not entirely
convinced that it's a 100% fix: it seems to rely on rather debatable
assumptions about the possible consequences of concurrent directory
mods.
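
A rough sketch of that read-until-stable loop, again in Python and
purely illustrative: the name read_dir_until_stable and the max_attempts
cap are assumptions added here, not anything proposed in the thread.
It also inherits the caveat above, since two identical scans do not
prove that neither scan skipped an entry.

```python
import os


def read_dir_until_stable(path, max_attempts=10):
    """Return the sorted names in `path` once two consecutive scans agree.

    Hypothetical helper for illustration. Raises RuntimeError if no two
    consecutive scans match within `max_attempts` scans.
    """
    previous = None
    for _ in range(max_attempts):
        # Buffer the entire directory, however large, before comparing.
        with os.scandir(path) as entries:
            current = sorted(entry.name for entry in entries)
        if current == previous:
            return current
        previous = current
    raise RuntimeError(
        f"directory {path!r} did not stabilize after {max_attempts} scans"
    )
```

Every successful call costs at least two full scans, which is the
performance problem described above.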

> FWIW I am discussing this off-list with Rick, I *think* we can
> distinguish between "gripes about NFS that we can't do much about"
> and "extra thing going wrong here".  The fog of superstition around
> NFS is thick because so many investigations end at a vendor
> boundary/black box, but here we can understand the cookie scheme and
> trace through all the layers here...

I wouldn't have any problem with saying that we don't support NFS
implementations that don't have stable cookies.  But so far I haven't
found any supported platform except FreeBSD that fails the rmtree test
against my Synology NAS.  I think the circumstantial evidence that
FreeBSD is doing something wrong, or wronger than necessary, is
strong.

            regards, tom lane


