Re: Checking for missing heap/index files - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Checking for missing heap/index files
Date
Msg-id CA+TgmoYK5pv0HKvQTSZNkE9iSisc0JYtGtrqnG36brgvYW7p1w@mail.gmail.com
In response to Re: Checking for missing heap/index files  (Stephen Frost <sfrost@snowman.net>)
Responses Re: Checking for missing heap/index files  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Fri, Jun 17, 2022 at 6:31 PM Stephen Frost <sfrost@snowman.net> wrote:
>> Hmm, this sounds pretty bad, and I agree that a workaround should be put
>> in place.  But where is pg_basebackup looping around readdir()?  I
>> couldn't find it.  There's a call to readdir() in FindStreamingStart(),
>> but that doesn't seem to match what you describe.
>
> It’s the server side that does it in basebackup.c when it’s building the tarball for the data dir and each
> tablespace and sending it to the client. It’s not done by src/bin/pg_basebackup. Sorry for not being clear.
> Technically this would be beyond just pg_basebackup but would impact, potentially, anything using BASE_BACKUP
> from the replication protocol (in addition to other backup tools which operate against the data directory with
> readdir, of course).

Specifically, sendDir() can either recurse back into sendDir(), or it
can call sendFile(). So in theory if your directory contains
some/stupid/long/path/to/a/file, you could have 6 directory scans open
all at the same time, and then a file being read concurrently with
that. That provides a lot of time for things to change concurrently,
especially at the outer levels. Before the scan of the outermost
directory moves to the next file, it will have to completely finish
reading and sending every file in the directory tree that was rooted
at the directory entry we last read.
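
To illustrate the shape of that recursion, here is a simplified sketch -
not the actual basebackup.c code, and with hypothetical send_dir() and
send_file() names standing in for the real sendDir()/sendFile() logic.
The point is just that the parent directory's scan sits parked
mid-readdir() while everything underneath it is processed:

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

static void
send_file(const char *path)
{
    /* placeholder: the real code reads the file and appends it to the tar stream */
    printf("sending %s\n", path);
}

static void
send_dir(const char *path)
{
    DIR        *dir = opendir(path);
    struct dirent *de;

    if (dir == NULL)
        return;

    while ((de = readdir(dir)) != NULL)
    {
        char        child[4096];
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;

        snprintf(child, sizeof(child), "%s/%s", path, de->d_name);
        if (lstat(child, &st) != 0)
            continue;

        if (S_ISDIR(st.st_mode))
            send_dir(child);    /* the parent's scan stays open while we recurse */
        else if (S_ISREG(st.st_mode))
            send_file(child);   /* ... and while the file is read and sent */
    }

    closedir(dir);
}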

I think this could be changed pretty easily. We could change sendDir()
to read all of the directory entries into an in-memory buffer and then
close the directory and iterate over the buffer. I see two potential
disadvantages of that approach. Number one, we could encounter a
directory with a vast number of relations. There are probably
subdirectories of PostgreSQL data directories that contain tens of
millions of files, possibly hundreds of millions of files. So,
remembering all that data in memory would potentially take gigabytes
of memory. I'm not sure that's a huge problem, because if you have
hundreds of millions of tables in a single database, you should
probably have enough memory in the system that spending a few GB of RAM
to take a backup is no big deal. However, people don't always have the memory
that they should have, and many users do in fact run systems at a
level of load that pushes the boundaries of their hardware.
Nonetheless, we could choose to take the position that caching the
list of filenames is worth it to avoid this risk.
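
For comparison, here is a rough sketch of the buffering approach - again
hypothetical code rather than an actual patch to sendDir(), with a made-up
slurp_dir() helper and error handling trimmed for brevity. The readdir()
loop finishes as quickly as possible, at the cost of keeping one copy of
every filename in memory:

#include <dirent.h>
#include <stdlib.h>
#include <string.h>

static char **
slurp_dir(const char *path, int *nentries)
{
    DIR        *dir = opendir(path);
    struct dirent *de;
    char      **names = NULL;
    int         n = 0;
    int         alloced = 0;

    *nentries = 0;
    if (dir == NULL)
        return NULL;

    while ((de = readdir(dir)) != NULL)
    {
        if (n >= alloced)
        {
            alloced = (alloced > 0) ? alloced * 2 : 64;
            names = realloc(names, alloced * sizeof(char *));
            if (names == NULL)
                exit(1);        /* sketch: real code would report the error properly */
        }
        names[n++] = strdup(de->d_name);
    }

    closedir(dir);              /* the directory scan ends before any file is sent */

    *nentries = n;
    return names;               /* caller iterates over the saved names, then frees them */
}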

The other consideration here is that this is not a complete remedy. It
makes the race condition narrower, I suppose, but it does not remove
it. Ideally, we would like to do better than "our new code will
corrupt your backup less often." However, I don't quite see how to get
there. We either need the OS to deliver us a reliable list of what
files exist - and I don't see how to make it do that if readdir
doesn't - or we need a way to know what files are supposed to exist
without reference to the OS - which would require some way of reading
the list of relfilenodes from a database to which we're not connected.
So maybe corrupting your backup less often is the best we can do. I do
wonder how often this actually happens though, and on which
filesystems. The provided links seem to suggest that this is mostly a
problem with network filesystems, especially CIFS, and maybe also NFS.

I'd be really interested in knowing whether this happens on a
mainstream, non-networked filesystem. It's not an irrelevant concern
even if it happens only on networked filesystems, but a lot more
people will be at risk if it also happens on ext4 or xfs. It does seem
a little bit surprising if no filesystem has a way of preventing this.
I mean, does open() also randomly but with low probability fail to
find a file that exists, due to a concurrent directory modification on
some directory in the pathname? I assume that would be unacceptable,
and if the file system has a way of preventing that from happening, then
it has some way of ensuring a stable read of a directory, at least
over a short period.

--
Robert Haas
EDB: http://www.enterprisedb.com


