Thread: Checking for missing heap/index files

Checking for missing heap/index files

From
Bruce Momjian
Date:
We currently can check for missing heap/index files by comparing
pg_class with the database directory files.  However, I am not clear if
this is safe during concurrent DDL.  I assume we create the file before
the update to pg_class is visible, but do we always delete the file
after the update to pg_class is visible?  I assume any external checking
tool would need to lock the relation to prevent concurrent DDL.
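
As a rough illustration of the kind of external check I mean (the connection string, database OID, and data
directory below are just placeholders, and it only looks at the first segment of relations in the default
tablespace), something like this libpq program would do the comparison:

/*
 * Rough sketch only: verify that the first segment file exists for every
 * pg_class entry that has storage.  Mapped relations (relfilenode = 0) and
 * temporary relations are skipped.  No locks are taken, so this is subject
 * to the concurrent-DDL questions above.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <libpq-fe.h>

int
main(void)
{
    const char *datadir = "/path/to/pgdata";            /* placeholder */
    const char *dboid = "16384";                        /* placeholder db OID */
    PGconn     *conn = PQconnectdb("dbname=postgres");  /* placeholder */
    PGresult   *res;
    int         missing = 0;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    res = PQexec(conn,
                 "SELECT relfilenode, oid::regclass"
                 "  FROM pg_class"
                 " WHERE relfilenode <> 0"
                 "   AND relpersistence <> 't'"
                 "   AND relkind IN ('r', 'i', 't', 'm', 'S')");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        return 1;
    }

    for (int i = 0; i < PQntuples(res); i++)
    {
        char        path[1024];
        struct stat st;

        snprintf(path, sizeof(path), "%s/base/%s/%s",
                 datadir, dboid, PQgetvalue(res, i, 0));
        if (stat(path, &st) != 0)
        {
            printf("missing: %s (%s)\n", path, PQgetvalue(res, i, 1));
            missing++;
        }
    }

    PQclear(res);
    PQfinish(conn);
    return missing ? 1 : 0;
}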

Also, how would it check if the number of extents is correct?  Seems we
would need this value to be in pg_class, and have the same update
protections outlined above.  Seems that would require heavier locking.

Is this something anyone has even needed or had requested?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: Checking for missing heap/index files

From
Mark Dilger
Date:

> On Jun 8, 2022, at 5:45 AM, Bruce Momjian <bruce@momjian.us> wrote:
>
> Is this something anyone has even needed or had requested?

I might have put this in amcheck's verify_heapam() had there been an interface for it.  I vaguely recall wanting
something like this, yes.

As it stands, verify_heapam() may trigger mdread()'s "could not open file" or "could not read block" error, in the
course of verifying the table.  There isn't an option in amcheck to just verify that the underlying files exist.  If
struct f_smgr had a function for validating that all segment files exist, I may have added an option to amcheck (and
the pg_amcheck frontend tool) to quickly look for missing files.

Looking at smgr/md.c, it seems mdnblocks() is close to what we want, but it skips already opened segments "to avoid
redundant seeks".  Perhaps we'd want to add a function to f_smgr, say "smgr_allexist", to check for all segment files?
I'm not sure how heavy-handed the corresponding mdallexist() function should be.  Should it close all open segments,
then reopen and check the size of all of them by calling mdnblocks()?  That seems safer than merely asking the
filesystem if the file exists without verifying that it can be opened.
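
To make the shape of that concrete, here is only a very rough sketch, written against plain POSIX calls rather than
the real md.c machinery (the function name, the RELSEG_SIZE handling, and the path construction are all hand-waved):

/*
 * Very rough sketch of an "all segment files exist" check.  A real
 * mdallexist() would work through the MdfdVec machinery in md.c; this just
 * illustrates the idea: given the relation's base path and its expected
 * length in blocks, verify that every segment file can actually be opened,
 * not merely that it shows up in a directory listing.
 */
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define RELSEG_SIZE 131072      /* blocks per segment in a default build */

static bool
segments_all_openable(const char *relpath, unsigned int nblocks)
{
    /* include the trailing, possibly zero-length, segment */
    unsigned int nsegs = nblocks / RELSEG_SIZE + 1;

    for (unsigned int segno = 0; segno < nsegs; segno++)
    {
        char    path[1024];
        int     fd;

        if (segno == 0)
            snprintf(path, sizeof(path), "%s", relpath);
        else
            snprintf(path, sizeof(path), "%s.%u", relpath, segno);

        fd = open(path, O_RDONLY);
        if (fd < 0)
        {
            fprintf(stderr, "missing or unopenable segment: %s\n", path);
            return false;
        }
        close(fd);
    }
    return true;
}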

If we made these changes, and added corresponding quick check options to amcheck and pg_amcheck, would that meet your
current needs?  The downside to using amcheck for this sort of thing is that we did not (and likely will not) back
port it.  I have had several occasions to want this functionality recently, but the customers were on pre-v14
servers, so these tools were not available anyway.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Checking for missing heap/index files

From
Bruce Momjian
Date:
On Thu, Jun 9, 2022 at 09:46:51AM -0700, Mark Dilger wrote:
>
>
> > On Jun 8, 2022, at 5:45 AM, Bruce Momjian <bruce@momjian.us> wrote:
> >
> > Is this something anyone has even needed or had requested?
>
> I might have put this in amcheck's verify_heapam() had there been an
> interface for it.  I vaguely recall wanting something like this, yes.
>
> As it stands, verify_heapam() may trigger mdread()'s "could not open
> file" or "could not read block" error, in the course of verifying
> the table.  There isn't an option in amcheck to just verify that
> the underlying files exist.  If struct f_smgr had a function for
> validating that all segment files exist, I may have added an option to
> amcheck (and the pg_amcheck frontend tool) to quickly look for missing
> files.

Well, how do we know what files should exist?  We know the first segment
should probably exist, but what about +1GB segments?

> Looking at smgr/md.c, it seems mdnblocks() is close to what we want,
> but it skips already opened segments "to avoid redundant seeks".
> Perhaps we'd want to add a function to f_smgr, say "smgr_allexist",
> to check for all segment files?  I'm not sure how heavy-handed the
> corresponding mdallexist() function should be.  Should it close
> all open segments, then reopen and check the size of all of them
> by calling mdnblocks()?  That seems safer than merely asking the
> filesystem if the file exists without verifying that it can be opened.

Yes.

> If we made these changes, and added corresponding quick check options
> to amcheck and pg_amcheck, would that meet your current needs?  The
> downside to using amcheck for this sort of thing is that we did not
> (and likely will not) back port it.  I have had several occasions to
> want this functionality recently, but the customers were on pre-v14
> servers, so these tools were not available anyway.

I don't have a need for it --- I was just wondering why we have
something that checks the relation contents, but not the file existence?

It seems like pg_amcheck would be a nice place to add this checking
ability.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: Checking for missing heap/index files

From
Peter Geoghegan
Date:
On Thu, Jun 9, 2022 at 11:46 AM Bruce Momjian <bruce@momjian.us> wrote:
> I don't have a need for it --- I was just wondering why we have
> something that checks the relation contents, but not the file existence?

We do this for B-tree indexes within amcheck. They must always have
storage, if only to store the index metapage. (Actually unlogged
indexes that run on a standby don't, but that's accounted for
directly.)

-- 
Peter Geoghegan



Re: Checking for missing heap/index files

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> We currently can check for missing heap/index files by comparing
> pg_class with the database directory files.  However, I am not clear if
> this is safe during concurrent DDL.  I assume we create the file before
> the update to pg_class is visible, but do we always delete the file
> after the update to pg_class is visible?  I assume any external checking
> tool would need to lock the relation to prevent concurrent DDL.

It'd sure be nice if an external tool (such as one trying to back up the
database..) could get this full list *without* having to run around and
lock everything.  This is because of some fun discoveries that have been
made around readdir() not always being entirely honest.  Take a look at:

https://github.com/pgbackrest/pgbackrest/issues/1754

and

https://gitlab.alpinelinux.org/alpine/aports/-/issues/10960

TL;DR: if you're removing files from a directory that you've got an
active readdir() running through, you might not actually get all of the
*existing* files.  Given that PG is happy to remove files from PGDATA
while a backup is running, in theory this could lead to a backup utility
like pgbackrest or pg_basebackup not actually backing up all the files.

Now, pgbackrest runs the readdir() very quickly to build a manifest of
all of the files to backup, minimizing the window for this to possibly
happen, but pg_basebackup keeps a readdir() open during the entire
backup, making this more possible.
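
For anyone who wants to poke at a particular filesystem, a crude harness along these lines (the scratch directory
name is arbitrary, and whether it misbehaves at all is entirely filesystem-dependent) exercises the pattern:

/*
 * Crude harness for the readdir()-vs-unlink behaviour described above.  It
 * creates 100 "keep_*" and 100 "del_*" files in a scratch directory, starts
 * a readdir() scan, deletes all of the del_* files mid-scan, and then
 * reports whether every keep_* file (which existed for the whole scan) was
 * actually returned.  It does not clean up after itself.
 */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

int
main(void)
{
    const char *dir = "readdir_scratch";
    char        path[256];
    DIR        *d;
    struct dirent *de;
    int         seen_keep = 0;
    int         nread = 0;

    mkdir(dir, 0700);
    for (int i = 0; i < 100; i++)
    {
        snprintf(path, sizeof(path), "%s/keep_%03d", dir, i);
        close(open(path, O_CREAT | O_WRONLY, 0600));
        snprintf(path, sizeof(path), "%s/del_%03d", dir, i);
        close(open(path, O_CREAT | O_WRONLY, 0600));
    }

    d = opendir(dir);
    while ((de = readdir(d)) != NULL)
    {
        if (strncmp(de->d_name, "keep_", 5) == 0)
            seen_keep++;

        /* after a few entries, remove the del_* files mid-scan */
        if (++nread == 5)
        {
            for (int i = 0; i < 100; i++)
            {
                snprintf(path, sizeof(path), "%s/del_%03d", dir, i);
                unlink(path);
            }
        }
    }
    closedir(d);

    printf("keep_* files returned: %d of 100%s\n", seen_keep,
           seen_keep == 100 ? "" : "  <-- anomaly");
    return 0;
}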

> Also, how would it check if the number of extents is correct?  Seems we
> would need this value to be in pg_class, and have the same update
> protections outlined above.  Seems that would require heavier locking.

Would be nice to have but also would be expensive to maintain..

Thanks,

Stephen


Re: Checking for missing heap/index files

From
Robert Haas
Date:
On Wed, Jun 8, 2022 at 8:46 AM Bruce Momjian <bruce@momjian.us> wrote:
> We currently can check for missing heap/index files by comparing
> pg_class with the database directory files.  However, I am not clear if
> this is safe during concurrent DDL.  I assume we create the file before
> the update to pg_class is visible, but do we always delete the file
> after the update to pg_class is visible?  I assume any external checking
> tool would need to lock the relation to prevent concurrent DDL.

If you see an entry in pg_class, then there should definitely be a
file present on disk. The reverse is not true: just because you don't
see an entry in pg_class for a file that's on disk doesn't mean it's
safe to remove that file.

> Also, how would it check if the number of extents is correct?  Seems we
> would need this value to be in pg_class, and have the same update
> protections outlined above.  Seems that would require heavier locking.

Yeah, and it's not just the number of extents but the length of the
last one. If the last extent is supposed to be 700MB and it gets
truncated to 200MB, it would be nice if we could notice that.

One idea might be for each heap table to have a metapage and store the
length - or an upper bound on the length - in the metapage. That'd
probably be cheaper than updating pg_class, but might still be
expensive in some scenarios, and it's a fairly large amount of
engineering.
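
Purely as a sketch of the idea (none of these names exist today), the metapage payload wouldn't need to hold much
more than an upper bound that can be advanced lazily, in batches, rather than on every extension:

/*
 * Hypothetical sketch of a heap metapage payload; nothing like this exists
 * in the tree today.  The stored value only ever needs to be an upper
 * bound on the relation's length, so it can be bumped in large increments
 * and most relation extensions never have to touch it.
 */
#include <stdint.h>

#define HEAP_META_MAGIC 0x48504D31      /* arbitrary magic for the sketch */

typedef struct HeapMetaPageData
{
    uint32_t    hm_magic;           /* identifies the page as a heap metapage */
    uint32_t    hm_version;         /* layout version */
    uint32_t    hm_nblocks_upper;   /* upper bound on relation length in
                                     * blocks; advanced lazily in batches */
} HeapMetaPageData;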

> Is this something anyone has even needed or had requested?

Definitely. And also the reverse: figuring out which files on disk are
old garbage that can be safely nuked.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Checking for missing heap/index files

From
Bruce Momjian
Date:
On Mon, Jun 13, 2022 at 04:06:12PM -0400, Robert Haas wrote:
> One idea might be for each heap table to have a metapage and store the
> length - or an upper bound on the length - in the metapage. That'd
> probably be cheaper than updating pg_class, but might still be
> expensive in some scenarios, and it's a fairly large amount of
> engineering.

I agree --- it would be nice, but might be hard to justify the
engineering and overhead involved.  I guess I was just checking that I
wasn't missing something obvious.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: Checking for missing heap/index files

From
Peter Geoghegan
Date:
On Mon, Jun 13, 2022 at 4:15 PM Bruce Momjian <bruce@momjian.us> wrote:
> I agree --- it would be nice, but might be hard to justify the
> engineering and overhead involved.  I guess I was just checking that I
> wasn't missing something obvious.

I suspect that the cost of being sloppy about this kind of thing
outweighs any benefit -- it's a false economy.

I believe we ought to eventually have crash-safe relation extension
and file allocation. Right now we're held back by concerns about
leaking a large number of empty pages (at least until the next
VACUUM). If leaking space was simply not possible in the first place,
we could afford to be more aggressive in code like
RelationAddExtraBlocks() -- it currently has a conservative cap of 512
pages per extension. This would require work in the FSM of
the kind I've been working on, on and off.

Each relation extension is bound to be more expensive when the process
is made crash safe, obviously -- but only if no other factor changes.
With larger batch sizes per relation extension, it could be very
different. Once you factor in lock contention, then having fewer
individual relation extensions for a fixed number of pages may make
all the difference.

-- 
Peter Geoghegan



Re: Checking for missing heap/index files

From
Alvaro Herrera
Date:
On 2022-Jun-09, Stephen Frost wrote:

> TL;DR: if you're removing files from a directory that you've got an
> active readdir() running through, you might not actually get all of the
> *existing* files.  Given that PG is happy to remove files from PGDATA
> while a backup is running, in theory this could lead to a backup utility
> like pgbackrest or pg_basebackup not actually backing up all the files.
> 
> Now, pgbackrest runs the readdir() very quickly to build a manifest of
> all of the files to backup, minimizing the window for this to possibly
> happen, but pg_basebackup keeps a readdir() open during the entire
> backup, making this more possible.

Hmm, this sounds pretty bad, and I agree that a workaround should be put
in place.  But where is pg_basebackup looping around readdir()?  I
couldn't find it.  There's a call to readdir() in FindStreamingStart(),
but that doesn't seem to match what you describe.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: Checking for missing heap/index files

From
Stephen Frost
Date:
Greetings,

On Fri, Jun 17, 2022 at 14:32 Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2022-Jun-09, Stephen Frost wrote:

> TL;DR: if you're removing files from a directory that you've got an
> active readdir() running through, you might not actually get all of the
> *existing* files.  Given that PG is happy to remove files from PGDATA
> while a backup is running, in theory this could lead to a backup utility
> like pgbackrest or pg_basebackup not actually backing up all the files.
>
> Now, pgbackrest runs the readdir() very quickly to build a manifest of
> all of the files to backup, minimizing the window for this to possibly
> happen, but pg_basebackup keeps a readdir() open during the entire
> backup, making this more possible.

Hmm, this sounds pretty bad, and I agree that a workaround should be put
in place.  But where is pg_basebackup looping around readdir()?  I
couldn't find it.  There's a call to readdir() in FindStreamingStart(),
but that doesn't seem to match what you describe.

It’s the server side that does it in basebackup.c when it’s building the tarball for the data dir and each table space and sending it to the client. It’s not done by src/bin/pg_basebackup. Sorry for not being clear. Technically this would be beyond just pg_basebackup but would impact, potentially, anything using BASE_BACKUP from the replication protocol (in addition to other backup tools which operate against the data directory with readdir, of course).

Thanks,

Stephen

Re: Checking for missing heap/index files

From
Robert Haas
Date:
On Fri, Jun 17, 2022 at 6:31 PM Stephen Frost <sfrost@snowman.net> wrote:
>> Hmm, this sounds pretty bad, and I agree that a workaround should be put
>> in place.  But where is pg_basebackup looping around readdir()?  I
>> couldn't find it.  There's a call to readdir() in FindStreamingStart(),
>> but that doesn't seem to match what you describe.
>
> It’s the server side that does it in basebackup.c when it’s building the tarball for the data dir and each table
> space and sending it to the client. It’s not done by src/bin/pg_basebackup. Sorry for not being clear. Technically
> this would be beyond just pg_basebackup but would impact, potentially, anything using BASE_BACKUP from the
> replication protocol (in addition to other backup tools which operate against the data directory with readdir, of
> course).

Specifically, sendDir() can either recurse back into sendDir(), or it
can call sendFile(). So in theory if your directory contains
some/stupid/long/path/to/a/file, you could have 6 directory scans open
all at the same time, and then a file being read concurrently with
that. That provides a lot of time for things to change concurrently,
especially at the outer levels. Before the scan of the outermost
directory moves to the next file, it will have to completely finish
reading and sending every file in the directory tree that was rooted
at the directory entry we last read.

I think this could be changed pretty easily. We could change sendDir()
to read all of the directory entries into an in-memory buffer and then
close the directory and iterate over the buffer. I see two potential
disadvantages of that approach. Number one, we could encounter a
directory with a vast number of relations. There are probably
subdirectories of PostgreSQL data directories that contain tens of
millions of files, possibly hundreds of millions of files. So,
remembering all that data in memory would potentially take gigabytes
of memory. I'm not sure that's a huge problem, because if you have
hundreds of millions of tables in a single database, you should
probably have enough memory in the system that a few GB of RAM to take
a backup is no big deal. However, people don't always have the memory
that they should have, and many users do in fact run systems at a
level of load that pushes the boundaries of their hardware.
Nonetheless, we could choose to take the position that caching the
list of filenames is worth it to avoid this risk.
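
In rough C, the pattern I have in mind is something like this -- not the actual basebackup.c change, just the shape
of it:

/*
 * Sketch of the "snapshot the directory first" pattern: slurp every entry
 * name into memory, close the directory immediately, and only then go back
 * and process the files.  This narrows the window during which concurrent
 * unlink() activity can confuse the readdir() scan, at the cost of holding
 * all of the names in memory.  (Error handling mostly omitted.)
 */
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

static char **
snapshot_directory(const char *path, size_t *nentries)
{
    DIR        *dir = opendir(path);
    struct dirent *de;
    char      **names = NULL;
    size_t      n = 0,
                cap = 0;

    if (dir == NULL)
        return NULL;

    while ((de = readdir(dir)) != NULL)
    {
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        if (n == cap)
        {
            cap = cap ? cap * 2 : 64;
            names = realloc(names, cap * sizeof(char *));
        }
        names[n++] = strdup(de->d_name);
    }
    closedir(dir);              /* directory handle is gone before any real work */

    *nentries = n;
    return names;
}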

The other consideration here is that this is not a complete remedy. It
makes the race condition narrower, I suppose, but it does not remove
it. Ideally, we would like to do better than "our new code will
corrupt your backup less often." However, I don't quite see how to get
there. We either need the OS to deliver us a reliable list of what
files exist - and I don't see how to make it do that if readdir
doesn't - or we need a way to know what files are supposed to exist
without reference to the OS - which would require some way of reading
the list of relfilenodes from a database to which we're not connected.
So maybe corrupting your backup less often is the best we can do. I do
wonder how often this actually happens though, and on which
filesystems. The provided links seem to suggest that this is mostly a
problem with network filesystems, especially CIFS, and maybe also NFS.

I'd be really interested in knowing whether this happens on a
mainstream, non-networked filesystem. It's not an irrelevant concern
even if it happens only on networked filesystems, but a lot more
people will be at risk if it also happens on ext4 or xfs. It does seem
a little bit surprising if no filesystem has a way of preventing this.
I mean, does open() also randomly but with low probability fail to
find a file that exists, due to a concurrent directory modification on
some directory in the pathname? I assume that would be unacceptable,
and if the file system has a way of preventing that from happening, then
it has some way of ensuring a stable read of a directory, at least
over a short period.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Checking for missing heap/index files

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I'd be really interested in knowing whether this happens on a
> mainstream, non-networked filesystem. It's not an irrelevant concern
> even if it happens only on networked filesystems, but a lot more
> people will be at risk if it also happens on ext4 or xfs. It does seem
> a little bit surprising if no filesystem has a way of preventing this.
> I mean, does open() also randomly but with low probability fail to
> find a file that exists, due to a concurrent directory modification on
> some directory in the pathname? I assume that would be unacceptable,
> and if the file system has a way of preventing that from happening, then
> it has some way of ensuring a stable read of a directory, at least
> over a short period.

The POSIX spec for readdir(3) has a little bit of info:

    The type DIR, which is defined in the <dirent.h> header, represents a
    directory stream, which is an ordered sequence of all the directory
    entries in a particular directory. Directory entries represent files;
    files may be removed from a directory or added to a directory
    asynchronously to the operation of readdir().

    If a file is removed from or added to the directory after the most
    recent call to opendir() or rewinddir(), whether a subsequent call to
    readdir() returns an entry for that file is unspecified.

There is no text suggesting that it's okay to miss, or to double-return,
an entry that is present throughout the scan.  So I'd interpret the case
you're worried about as "forbidden by POSIX".  Of course, it's known that
NFS fails to provide POSIX semantics in all cases --- but I don't know
if this is one of them.

            regards, tom lane



Re: Checking for missing heap/index files

From
Robert Haas
Date:
On Tue, Oct 18, 2022 at 12:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There is no text suggesting that it's okay to miss, or to double-return,
> an entry that is present throughout the scan.  So I'd interpret the case
> you're worried about as "forbidden by POSIX".  Of course, it's known that
> NFS fails to provide POSIX semantics in all cases --- but I don't know
> if this is one of them.

Yeah, me neither. One problem I see is that, even if the behavior is
forbidden by POSIX, if it happens in practice on systems people
actually use, then it's an issue. We even have documentation saying
that it's OK to use NFS, and a lot of people do -- which IMHO is
unfortunate, but it's also not clear what the realistic alternatives
are. It's pretty hard to tell people in 2022 that they are only
allowed to use PostgreSQL with local storage.

But to put my cards on the table, it's not so much that I am worried
about this problem myself as that I want to know whether we're going
to do anything about it as a project, and if so, what, because it
intersects a patch that I'm working on. So if we want to readdir() in
one fell swoop and cache the results, I'm going to go write a patch
for that. If we don't, then I'd like to know whether (a) we think that
would be theoretically acceptable but not justified by the evidence
presently available or (b) would be unacceptable due to (b1) the
potential for increased memory usage or (b2) some other reason.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Checking for missing heap/index files

From
Stephen Frost
Date:
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Tue, Oct 18, 2022 at 12:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > There is no text suggesting that it's okay to miss, or to double-return,
> > an entry that is present throughout the scan.  So I'd interpret the case
> > you're worried about as "forbidden by POSIX".  Of course, it's known that
> > NFS fails to provide POSIX semantics in all cases --- but I don't know
> > if this is one of them.
>
> Yeah, me neither. One problem I see is that, even if the behavior is
> forbidden by POSIX, if it happens in practice on systems people
> actually use, then it's an issue. We even have documentation saying
> that it's OK to use NFS, and a lot of people do -- which IMHO is
> unfortunate, but it's also not clear what the realistic alternatives
> are. It's pretty hard to tell people in 2022 that they are only
> allowed to use PostgreSQL with local storage.
>
> But to put my cards on the table, it's not so much that I am worried
> about this problem myself as that I want to know whether we're going
> to do anything about it as a project, and if so, what, because it
> intersects a patch that I'm working on. So if we want to readdir() in
> one fell swoop and cache the results, I'm going to go write a patch
> for that. If we don't, then I'd like to know whether (a) we think that
> would be theoretically acceptable but not justified by the evidence
> presently available or (b) would be unacceptable due to (b1) the
> potential for increased memory usage or (b2) some other reason.

While I don't think it's really something that should be happening, it's
definitely something that's been seen with some networked filesystems,
as reported.  I also strongly suspect that on local filesystems there's
something that prevents this from happening, but as mentioned that
doesn't cover all PG use cases.

In pgbackrest, we moved to doing a scan and caching all of the results
in memory to reduce the risk when reading from the PG data dir.  We also
reworked our expire code (which removes an older backup from the backup
repository) to also do a complete scan before removing files.

I don't see it as likely to be acceptable, but arranging to not add or
remove files while the scan is happening would presumably eliminate the
risk entirely.  We've not seen this issue recur in the expire command
since the change to first completely scan the directory and then go and
remove the files from it.  Perhaps just not removing files during the
scan would be sufficient which might be more reasonable to do.

Thanks,

Stephen


Re: Checking for missing heap/index files

From
Robert Haas
Date:
On Tue, Oct 18, 2022 at 2:37 PM Stephen Frost <sfrost@snowman.net> wrote:
> While I don't think it's really something that should be happening, it's
> definitely something that's been seen with some networked filesystems,
> as reported.

Do you have clear and convincing evidence of this happening on
anything other than CIFS?

> I don't see it as likely to be acceptable, but arranging to not add or
> remove files while the scan is happening would presumably eliminate the
> risk entirely.  We've not seen this issue recur in the expire command
> since the change to first completely scan the directory and then go and
> remove the files from it.  Perhaps just not removing files during the
> scan would be sufficient which might be more reasonable to do.

I don't think that's a complete non-starter, but I do think it would
be somewhat expensive in some workloads. I hate to make everyone pay
that much for insurance against a shouldn't-happen case. We could make
it optional, but then we're asking users to decide whether or not they
need insurance. Since we don't even know which filesystems are
potentially affected, how is anyone else supposed to know? Worse
still, if you have a corruption event, you're still not going to know
for sure whether this would have fixed it, so you still don't know
whether you should turn on the feature for next time. And if you do
turn it on and don't get corruption again, you don't know whether you
would have had a problem if you hadn't used the feature. It all just
seems like a lot of guesswork that will end up being frustrating to
both users and developers.

Just deciding to cache the results of readdir() in memory is much
cheaper insurance. I think I'd probably be willing to foist that
overhead onto everyone, all the time. As I mentioned before, it could
still hose someone who is right on the brink of a memory disaster, but
that's a much narrower blast radius than putting locking around all
operations that create or remove a file in the same directory as a
relation file. But it's also not a complete fix, which sucks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Checking for missing heap/index files

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Oct 18, 2022 at 2:37 PM Stephen Frost <sfrost@snowman.net> wrote:
>> I don't see it as likely to be acceptable, but arranging to not add or
>> remove files while the scan is happening would presumably eliminate the
>> risk entirely.  We've not seen this issue recur in the expire command
>> since the change to first completely scan the directory and then go and
>> remove the files from it.  Perhaps just not removing files during the
>> scan would be sufficient which might be more reasonable to do.

> Just deciding to cache the results of readdir() in memory is much
> cheaper insurance. I think I'd probably be willing to foist that
> overhead onto everyone, all the time. As I mentioned before, it could
> still hose someone who is right on the brink of a memory disaster, but
> that's a much narrower blast radius than putting locking around all
> operations that create or remove a file in the same directory as a
> relation file. But it's also not a complete fix, which sucks.

Yeah, that.  I'm not sure if we need to do anything about this, but
if we do, I don't think that's it.  Agreed that the memory-consumption
objection is pretty weak; the one that seems compelling is that by
itself, this does nothing to fix the problem beyond narrowing the
window some.

Isn't it already the case (or could be made so) that relation file
removal happens only in the checkpointer?  I wonder if we could
get to a situation where we can interlock file removal just by
commanding the checkpointer to not do it for awhile.  Then combining
that with caching readdir results (to narrow the window in which we
have to stop the checkpointer) might yield a solution that has some
credibility.  This scheme doesn't attempt to prevent file creation
concurrently with a readdir, but you'd have to make some really
adverse assumptions to believe that file creation would cause a
pre-existing entry to get missed (as opposed to getting scanned
twice).  So it might be an acceptable answer.

            regards, tom lane



Re: Checking for missing heap/index files

From
Robert Haas
Date:
On Tue, Oct 18, 2022 at 3:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Isn't it already the case (or could be made so) that relation file
> removal happens only in the checkpointer?  I wonder if we could
> get to a situation where we can interlock file removal just by
> commanding the checkpointer to not do it for awhile.  Then combining
> that with caching readdir results (to narrow the window in which we
> have to stop the checkpointer) might yield a solution that has some
> credibility.  This scheme doesn't attempt to prevent file creation
> concurrently with a readdir, but you'd have to make some really
> adverse assumptions to believe that file creation would cause a
> pre-existing entry to get missed (as opposed to getting scanned
> twice).  So it might be an acceptable answer.

I believe that individual backends directly remove all relation forks
other than the main fork and all segments other than the first one.
The discussion on various other threads has been in the direction of
trying to standardize on moving that last case out of the checkpointer
- i.e. getting rid of what Thomas dubbed "tombstone" files - which is
pretty much the exact opposite of this proposal. But even apart from
that, I don't think this would be that easy to implement. If you
removed a large relation, you'd have to tell the checkpointer to
remove many files instead of just 1. That sounds kinda painful: it
would be more IPC, and it would delay file removal just so that we can
tell the checkpointer to delay it some more.

And I don't think we really need to do any of that. We could invent a
new kind of lock tag for <dboid/tsoid> combination. Take a share lock
to create or remove files. Take an exclusive lock to scan the
directory. I think that accomplishes the same thing as your proposal,
but more directly, and with less overhead. It's still substantially
more than NO overhead, though.
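
To sketch what I mean (SET_LOCKTAG_DATABASE_DIR and the underlying lock tag type are invented here; they don't exist
today):

/*
 * Sketch only: a hypothetical lock tag covering one <dboid, tsoid>
 * directory.  File creation/removal would take it in ShareLock, a
 * directory scanner would take it in ExclusiveLock, so scans wait out
 * in-flight changes while ordinary file operations don't block each other.
 */
#include "postgres.h"
#include "storage/lock.h"

/* hypothetical: lock the <dboid, tsoid> directory before creating/removing a file */
static void
lock_dbdir_for_file_change(Oid dboid, Oid tsoid)
{
    LOCKTAG     tag;

    SET_LOCKTAG_DATABASE_DIR(tag, dboid, tsoid);    /* does not exist today */
    LockAcquire(&tag, ShareLock, false, false);
}

/* hypothetical: lock the same directory exclusively before scanning it */
static void
lock_dbdir_for_scan(Oid dboid, Oid tsoid)
{
    LOCKTAG     tag;

    SET_LOCKTAG_DATABASE_DIR(tag, dboid, tsoid);    /* does not exist today */
    LockAcquire(&tag, ExclusiveLock, false, false);
}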

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Checking for missing heap/index files

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Oct 18, 2022 at 3:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Isn't it already the case (or could be made so) that relation file
>> removal happens only in the checkpointer?

> I believe that individual backends directly remove all relation forks
> other than the main fork and all segments other than the first one.

Yeah, obviously some changes would need to be made to that, but ISTM
we could just treat all the forks as we now treat the first one.

> The discussion on various other threads has been in the direction of
> trying to standardize on moving that last case out of the checkpointer
> - i.e. getting rid of what Thomas dubbed "tombstone" files - which is
> pretty much the exact opposite of this proposal.

Yeah, we'd have to give up on that.  If that goes anywhere then
it kills this idea.

> But even apart from
> that, I don't think this would be that easy to implement. If you
> removed a large relation, you'd have to tell the checkpointer to
> remove many files instead of just 1.

The backends just implement this by deleting files until they don't
find the next one in sequence.  I fail to see how it'd be any
harder for the checkpointer to do that.

> And I don't think we really need to do any of that. We could invent a
> new kind of lock tag for <dboid/tsoid> combination. Take a share lock
> to create or remove files. Take an exclusive lock to scan the
> directory. I think that accomplishes the same thing as your proposal,
> but more directly, and with less overhead. It's still substantially
> more than NO overhead, though.

My concern about that is that it implies touching a whole lot of
places, and if you miss even one then you've lost whatever guarantee
you thought you were getting.  More, there's no easy way to find
all the relevant places (some will be in extensions, no doubt).
So I have approximately zero faith that it could be made reliable.
Funneling things through the checkpointer would make that a lot
more centralized.  I concede that cowboy unlink() calls could still
be a problem ... but I doubt there's any solution that's totally
free of that hazard.

            regards, tom lane



Re: Checking for missing heap/index files

From
Robert Haas
Date:
On Tue, Oct 18, 2022 at 5:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> My concern about that is that it implies touching a whole lot of
> places, and if you miss even one then you've lost whatever guarantee
> you thought you were getting.  More, there's no easy way to find
> all the relevant places (some will be in extensions, no doubt).

*scratches head*

I must be missing something. It seems to me that it requires touching
exactly the same set of places as your idea. And it also seems to me
like it's not that many places. Am I being really dumb here?

It's still somewhat unclear to me whether we should be doing anything
at all here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com