Thread: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

Hi,

Currently the server is erroring out when unable to remove/parse a
logical rewrite file in CheckPointLogicalRewriteHeap wasting the
amount of work the checkpoint has done and preventing the checkpoint
from finishing. This is unlike CheckPointSnapBuild does for snapshot
files i.e. it just emits a message at LOG level and continues if it is
unable to parse or remove the file. Attaching a small patch applying
the same idea to the mapping files.

Thoughts?

Regards,
Bharath Rupireddy.

Attachment
On 12/31/21, 4:44 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> Currently the server is erroring out when unable to remove/parse a
> logical rewrite file in CheckPointLogicalRewriteHeap wasting the
> amount of work the checkpoint has done and preventing the checkpoint
> from finishing. This is unlike CheckPointSnapBuild does for snapshot
> files i.e. it just emits a message at LOG level and continues if it is
> unable to parse or remove the file. Attaching a small patch applying
> the same idea to the mapping files.

This seems reasonable to me.  AFAICT moving on to other files after an
error shouldn't cause any problems.  In fact, it's probably beneficial
to try to clean up as much as possible so that the files do not
continue to build up.

The only feedback I have for the patch is that I don't think the new
comments are necessary.

Nathan


On Thu, Jan 13, 2022 at 3:47 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> On 12/31/21, 4:44 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> > Currently the server is erroring out when unable to remove/parse a
> > logical rewrite file in CheckPointLogicalRewriteHeap wasting the
> > amount of work the checkpoint has done and preventing the checkpoint
> > from finishing. This is unlike CheckPointSnapBuild does for snapshot
> > files i.e. it just emits a message at LOG level and continues if it is
> > unable to parse or remove the file. Attaching a small patch applying
> > the same idea to the mapping files.
>
> This seems reasonable to me.  AFAICT moving on to other files after an
> error shouldn't cause any problems.  In fact, it's probably beneficial
> to try to clean up as much as possible so that the files do not
> continue to build up.

Thanks for the review Nathan!

> The only feedback I have for the patch is that I don't think the new
> comments are necessary.

I borrowed the comments as-is from the CheckPointSnapBuild introduced
by the commit b89e15105. IMO, let the comments be there as they
explain why we are not emitting ERRORs, however I will leave it to the
committer to decide on that.

Regards,
Bharath Rupireddy.



On 1/12/22, 10:03 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> On Thu, Jan 13, 2022 at 3:47 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>> The only feedback I have for the patch is that I don't think the new
>> comments are necessary.
>
> I borrowed the comments as-is from the CheckPointSnapBuild introduced
> by the commit b89e15105. IMO, let the comments be there as they
> explain why we are not emitting ERRORs, however I will leave it to the
> committer to decide on that.

Huh, somehow I missed that when I looked at it yesterday.  I'm going
to bump this one to ready-for-committer, then.

Nathan


Hi,

On 2021-12-31 18:12:37 +0530, Bharath Rupireddy wrote:
> Currently the server is erroring out when unable to remove/parse a
> logical rewrite file in CheckPointLogicalRewriteHeap wasting the
> amount of work the checkpoint has done and preventing the checkpoint
> from finishing.

This seems like it'd make failures to remove the files practically
invisible. Which'd have it's own set of problems?

What motivated proposing this change?

Greetings,

Andres Freund



On Fri, Jan 14, 2022 at 1:08 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-12-31 18:12:37 +0530, Bharath Rupireddy wrote:
> > Currently the server is erroring out when unable to remove/parse a
> > logical rewrite file in CheckPointLogicalRewriteHeap wasting the
> > amount of work the checkpoint has done and preventing the checkpoint
> > from finishing.
>
> This seems like it'd make failures to remove the files practically
> invisible. Which'd have it's own set of problems?
>
> What motivated proposing this change?

We had an issue where there were many mapping files generated during
the crash recovery and end-of-recovery checkpoint was taking a lot of
time. We had to manually intervene and delete some of the mapping
files (although it may not sound sensible) to make end-of-recovery
checkpoint faster. Because of the race condition between manual
deletion and checkpoint deletion, the unlink error occurred which
crashed the server and the server entered the recovery again wasting
the entire earlier recovery work.

In summary, with the changes (emitting LOG-only messages for unlink
failures and continuing with the other files) proposed for
CheckPointLogicalRewriteHeap in this thread and the existing code in
CheckPointSnapBuild, I'm sure it will help not waste the recovery
that's has been done in case  unlink fails for any reasons.

Regards,
Bharath Rupireddy.



Hi,

On Sat, Jan 15, 2022 at 02:04:12PM +0530, Bharath Rupireddy wrote:
> 
> We had an issue where there were many mapping files generated during
> the crash recovery and end-of-recovery checkpoint was taking a lot of
> time. We had to manually intervene and delete some of the mapping
> files (although it may not sound sensible) to make end-of-recovery
> checkpoint faster. Because of the race condition between manual
> deletion and checkpoint deletion, the unlink error occurred which
> crashed the server and the server entered the recovery again wasting
> the entire earlier recovery work.

Maybe I'm missing something but wouldn't
https://commitfest.postgresql.org/36/3448/ better solve the problem?



On Sat, Jan 15, 2022 at 2:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Hi,
>
> On Sat, Jan 15, 2022 at 02:04:12PM +0530, Bharath Rupireddy wrote:
> >
> > We had an issue where there were many mapping files generated during
> > the crash recovery and end-of-recovery checkpoint was taking a lot of
> > time. We had to manually intervene and delete some of the mapping
> > files (although it may not sound sensible) to make end-of-recovery
> > checkpoint faster. Because of the race condition between manual
> > deletion and checkpoint deletion, the unlink error occurred which
> > crashed the server and the server entered the recovery again wasting
> > the entire earlier recovery work.
>
> Maybe I'm missing something but wouldn't
> https://commitfest.postgresql.org/36/3448/ better solve the problem?

The error can cause the new background process proposed there in that
thread to restart, which is again costly. Since we have LOG-only and
continue behavior in CheckPointSnapBuild already, having the same
behavior for CheckPointLogicalRewriteHeap helps a lot.

Regards,
Bharath Rupireddy.



Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
> On Sat, Jan 15, 2022 at 2:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>> Maybe I'm missing something but wouldn't
>> https://commitfest.postgresql.org/36/3448/ better solve the problem?

> The error can cause the new background process proposed there in that
> thread to restart, which is again costly. Since we have LOG-only and
> continue behavior in CheckPointSnapBuild already, having the same
> behavior for CheckPointLogicalRewriteHeap helps a lot.

[ stares at CheckPointLogicalRewriteHeap for awhile ... ]

This code has got more problems than that.  It took me awhile to
absorb it, but we don't actually care about the contents of any of
those files; all of the information is encoded in the file *names*.
(This strikes me as probably not a very efficient design, compared
to putting the same data into a single text file; but for now I'll
assume we're not up for a complete rewrite.)  That being the case,
I wonder what it is we expect fsync'ing the surviving files to do
exactly.  We should be fsync'ing the directory not the files
themselves, no?

Other things that seem poorly thought out:

* Why is the check for "map-" prefix after, rather than before,
the lstat?

* Why is it okay to ignore lstat failure?  Seems like we might
as well not even have the lstat.

* The sscanf on the file name would not notice trailing junk,
such as an editor backup marker.  Is that okay?

As far as the patch itself goes, I agree that failure to unlink
is noncritical, because such a file would have no further effect
and we can just ignore it.  I think I also agree that failure
of the sscanf is noncritical, because the implication of that
is that the file name doesn't conform to our expectations, which
means it's basically just like the check that causes us to
ignore names not starting with "map-".  (Actually, isn't the
separate check for "map-" useless, given that sscanf will make
the equivalent check?)

I started out wondering why the patch didn't also change the loop's
other ERROR conditions to LOG.  But we do want to ERROR if we're
unable to sync transient state down to disk, and that is what
the other steps (think they) are doing.  It might be worth a
comment to point that out though, before someone decides that
if these errors are just LOG level then the others can be too.

Anyway, I think possibly we can drop the bottom half of the loop
(the part trying to fsync non-removed files) in favor of fsync'ing
the directory afterwards.  Thoughts?

            regards, tom lane



I wrote:
> Anyway, I think possibly we can drop the bottom half of the loop
> (the part trying to fsync non-removed files) in favor of fsync'ing
> the directory afterwards.  Thoughts?

Oh, scratch that --- *this* loop doesn't care about the file
contents, but other code does.  However, don't we need a directory
fsync too?  Or is that handled someplace else?

            regards, tom lane



Hi,

On 2022-01-19 13:34:21 -0500, Tom Lane wrote:
> Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
> > On Sat, Jan 15, 2022 at 2:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> >> Maybe I'm missing something but wouldn't
> >> https://commitfest.postgresql.org/36/3448/ better solve the problem?
>
> > The error can cause the new background process proposed there in that
> > thread to restart, which is again costly. Since we have LOG-only and
> > continue behavior in CheckPointSnapBuild already, having the same
> > behavior for CheckPointLogicalRewriteHeap helps a lot.
>
> [ stares at CheckPointLogicalRewriteHeap for awhile ... ]
>
> This code has got more problems than that.  It took me awhile to
> absorb it, but we don't actually care about the contents of any of
> those files; all of the information is encoded in the file *names*.

I'm not following - we *do* need the contents of the files? They're applied
as-needed in ApplyLogicalMappingFile().


> (This strikes me as probably not a very efficient design, compared
> to putting the same data into a single text file; but for now I'll
> assume we're not up for a complete rewrite.)  That being the case,
> I wonder what it is we expect fsync'ing the surviving files to do
> exactly.  We should be fsync'ing the directory not the files
> themselves, no?

Fsyncing the directory doesn't guarantee anything about the contents of
files. But, you're right, we need an fsync of the directory too.


> Other things that seem poorly thought out:
>
> * Why is the check for "map-" prefix after, rather than before,
> the lstat?

It doesn't seem to matter much - there shouldn't be a meaningful amount of
other files in there.


> * Why is it okay to ignore lstat failure?  Seems like we might
> as well not even have the lstat.

Yea, that seems odd, not sure why that ended up this way. I guess the aim
might have been to tolerate random files we don't have permissions for or
such?


> * The sscanf on the file name would not notice trailing junk,
> such as an editor backup marker.  Is that okay?

I don't really see a problem with it - there shouldn't be other files matching
the pattern - but it couldn't hurt to check the pattern matches exhaustively.


> As far as the patch itself goes, I agree that failure to unlink
> is noncritical, because such a file would have no further effect
> and we can just ignore it.

I don't agree. We iterate through the directory regularly on systems with
catalog changes + logical decoding. An ever increasing list of gunk will make
that more and more expensive.  And I haven't heard a meaningful reason why we
would have map-* files that we can't remove.

Ignoring failures like this just makes problems much harder to debug and they
tend to bite harder for it.


> I think I also agree that failure of the sscanf is noncritical, because the
> implication of that is that the file name doesn't conform to our
> expectations, which means it's basically just like the check that causes us
> to ignore names not starting with "map-".  (Actually, isn't the separate
> check for "map-" useless, given that sscanf will make the equivalent check?)

Well, this way only files starting with "map-" are expected to conform to a
strict format, the rest is ignored?


> Anyway, I think possibly we can drop the bottom half of the loop
> (the part trying to fsync non-removed files) in favor of fsync'ing
> the directory afterwards.  Thoughts?

I don't think that'd be correct.


In short: We should add a directory fsync, I'm fine with improving the error
checking, but the rest seems like a net-negative with no convincing reasoning.

Greetings,

Andres Freund



On 1/19/22, 11:08 AM, "Andres Freund" <andres@anarazel.de> wrote:
> On 2022-01-19 13:34:21 -0500, Tom Lane wrote:
>> As far as the patch itself goes, I agree that failure to unlink
>> is noncritical, because such a file would have no further effect
>> and we can just ignore it.
>
> I don't agree. We iterate through the directory regularly on systems with
> catalog changes + logical decoding. An ever increasing list of gunk will make
> that more and more expensive.  And I haven't heard a meaningful reason why we
> would have map-* files that we can't remove.

I think the other side of this is that we don't want checkpointing to
continually fail because of a noncritical failure.  That could also
lead to problems down the road.

> Ignoring failures like this just makes problems much harder to debug and they
> tend to bite harder for it.

If such noncritical failures happened regularly, the server logs will
likely become filled with messages about it.  Perhaps users may not
notice for a while, but I don't think the proposed patch would make
debugging excessively difficult.

Nathan


"Bossart, Nathan" <bossartn@amazon.com> writes:
> I think the other side of this is that we don't want checkpointing to
> continually fail because of a noncritical failure.  That could also
> lead to problems down the road.

Yeah, a persistent failure to complete checkpoints is very nasty.
Your disk will soon fill with unrecyclable WAL.  I don't see how
that's better than a somewhat hypothetical performance issue.

            regards, tom lane



I took the liberty of adjusting Bharath's patch based on the latest
feedback.

On 1/19/22, 10:35 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> We should be fsync'ing the directory not the files
> themselves, no?

I added a directory sync at the end of CheckPointLogicalRewriteHeap(),
which IIUC is enough.

> * Why is the check for "map-" prefix after, rather than before,
> the lstat?

I swapped these checks.  I stopped short of moving the sscanf() before
the lstat(), though, as 1) I don't think it will help very much and 2)
it seemed weird to start emitting "could not parse filename" logs for
non-regular files we presently skip silently.

> * Why is it okay to ignore lstat failure?  Seems like we might
> as well not even have the lstat.

I added error checking for lstat().

> * The sscanf on the file name would not notice trailing junk,
> such as an editor backup marker.  Is that okay?

I think this is okay.  The absolute worst that would happen would be
that the extra file would be deleted.  This might eventually become a
problem if files with the same prefix format were created by the
server.  However, CheckPointSnapBuild() already has this problem with
temporary files, and it claims not to need any extra handling:

         * temporary filenames from SnapBuildSerialize() include the LSN and
         * everything but are postfixed by .$pid.tmp. We can just remove them
         * the same as other files because there can be none that are
         * currently being written that are older than cutoff.

> (Actually, isn't the
> separate check for "map-" useless, given that sscanf will make
> the equivalent check?)

The only benefit I see from the extra "map-" check is that it'll avoid
"could not parse filename" logs for files that clearly aren't related
to the task at hand.  I don't know if this is expected during normal
operation at the moment.  I've left the "map-" check for now.

> I started out wondering why the patch didn't also change the loop's
> other ERROR conditions to LOG.  But we do want to ERROR if we're
> unable to sync transient state down to disk, and that is what
> the other steps (think they) are doing.  It might be worth a
> comment to point that out though, before someone decides that
> if these errors are just LOG level then the others can be too.

I added such a comment.

I also updated CheckPointSnapBuild() and UpdateLogicalMappings() with
similar adjustments.

Nathan


Attachment
Hi,

On 2022-01-20 19:15:08 +0000, Bossart, Nathan wrote:
> > * Why is it okay to ignore lstat failure?  Seems like we might
> > as well not even have the lstat.
> 
> I added error checking for lstat().

It seems odd to change a bunch of things to not be errors anymore, but then
add new sources of errors. If we try to deal with concurrent deletions or
permission issues - otherwise what's the point of making unlink() not an error
anymore - why do we expect to be able to lstat()?


> I also updated CheckPointSnapBuild() and UpdateLogicalMappings() with
> similar adjustments.

FWIW, I still think the ERROR->LOG changes are bad idea. The whole thing of
"oh, let's just ignore stuff that we don't expect and soldier on" has bitten
us over and over again. It makes us less robust, not more robust.

It's also just about impossible to monitor for problems that emit LOG.


I'd be more on board accepting some selective errors. E.g. not erroring on
ENOENT, but continuing to error on others (most likely ENOACCESS). I think we
should *not* change the


> +        /*
> +         * We just log a message if a file doesn't fit the pattern, it's
> +         * probably some editor's lock/state file or similar...
> +         */

An editor's lock file that starts with map- would presumably be the whole
filename plus an additional file-ending. But this check won't catch those.


> +             * Unlike failures to unlink() old logical rewrite files, we must
> +             * ERROR if we're unable to sync transient state down to disk.  This
> +             * allows replay to assume that everything written out before
> +             * checkpoint start is persisted.
>               */

Hm, not quite happy with the second bit. Checkpointed state being durable
isn't about replay directly, it's about the basic property of a checkpoint
being fulfilled?


>              pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC);
>              if (pg_fsync(fd) != 0)
> @@ -1282,4 +1305,7 @@ CheckPointLogicalRewriteHeap(void)
>          }
>      }
>      FreeDir(mappings_dir);
> +
> +    /* persist directory entries to disk */
> +    fsync_fname("pg_logical/mappings", true);
>  }

This is probably worth backpatching, if you split it out I'll do so.

Greetings,

Andres Freund



Thanks for your feedback.

On 1/20/22, 11:47 AM, "Andres Freund" <andres@anarazel.de> wrote:
> It seems odd to change a bunch of things to not be errors anymore, but then
> add new sources of errors. If we try to deal with concurrent deletions or
> permission issues - otherwise what's the point of making unlink() not an error
> anymore - why do we expect to be able to lstat()?

My reasoning for making lstat() an ERROR was that there's a chance we
need to fsync() the file, and if we can't fsync() a file for whatever
reason, we definitely want to ERROR.  I suppose we could conditionally
ERROR based on the file name, but I don't know if that's really worth
the complexity.

> I'd be more on board accepting some selective errors. E.g. not erroring on
> ENOENT, but continuing to error on others (most likely ENOACCESS). I think we
> should *not* change the

I think this approach would work for the use-case Bharath mentioned
upthread.  In any case, if deleting a file fails because the file was
already deleted, there's no point in ERROR-ing.  I think filtering
errors is a bit trickier for lstat().  If we would've fsync'd the file
but lstat() gives us ENOENT, we may have a problem.  (However, there's
also a good chance we wouldn't notice such problems if the race didn't
occur.)  I'll play around with it.

>> +             /*
>> +              * We just log a message if a file doesn't fit the pattern, it's
>> +              * probably some editor's lock/state file or similar...
>> +              */
>
> An editor's lock file that starts with map- would presumably be the whole
> filename plus an additional file-ending. But this check won't catch those.

Right, it will either fsync() or unlink() those.  I'll work on the
comment.  Or do you think it's worth validating that there are no
trailing characters?  I looked into that a bit earlier, and the code
felt excessive to me, but I don't have a strong opinion here.

>> +                      * Unlike failures to unlink() old logical rewrite files, we must
>> +                      * ERROR if we're unable to sync transient state down to disk.  This
>> +                      * allows replay to assume that everything written out before
>> +                      * checkpoint start is persisted.
>>                        */
>
> Hm, not quite happy with the second bit. Checkpointed state being durable
> isn't about replay directly, it's about the basic property of a checkpoint
> being fulfilled?

I'll work on this.  FWIW I modeled this based on the comment above the
function.  Do you think that should be adjusted as well?

>> +     /* persist directory entries to disk */
>> +     fsync_fname("pg_logical/mappings", true);
>
> This is probably worth backpatching, if you split it out I'll do so.

Sure thing, will do shortly.

Nathan


On 1/20/22, 12:24 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 1/20/22, 11:47 AM, "Andres Freund" <andres@anarazel.de> wrote:
>>> +     /* persist directory entries to disk */
>>> +     fsync_fname("pg_logical/mappings", true);
>>
>> This is probably worth backpatching, if you split it out I'll do so.
>
> Sure thing, will do shortly.

Here's this part.

Nathan


Attachment
On 2022-01-20 20:41:16 +0000, Bossart, Nathan wrote:
> Here's this part.

And pushed to all branches. Thanks.



On Fri, Jan 21, 2022 at 11:49:56AM -0800, Andres Freund wrote:
> On 2022-01-20 20:41:16 +0000, Bossart, Nathan wrote:
>> Here's this part.
> 
> And pushed to all branches. Thanks.

Thanks!

I spent some time thinking about the right way to proceed here, and I came
up with the attached patches.  The first patch just adds error checking for
various lstat() calls in the replication code.  If lstat() fails, then it
probably doesn't make sense to try to continue processing the file.

The second patch changes some nearby calls to ereport() to ERROR.  If these
failures are truly unexpected, and we don't intend to support use-cases
like concurrent manual deletion, then failing might be the right way to go.
I think it's a shame that such failures could cause checkpointing to
continually fail, but that topic is already being discussed elsewhere [0].

[0] https://postgr.es/m/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com/

Attachment
On Thu, Jan 27, 2022 at 6:31 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> I spent some time thinking about the right way to proceed here, and I came
> up with the attached patches.  The first patch just adds error checking for
> various lstat() calls in the replication code.  If lstat() fails, then it
> probably doesn't make sense to try to continue processing the file.
>
> The second patch changes some nearby calls to ereport() to ERROR.  If these
> failures are truly unexpected, and we don't intend to support use-cases
> like concurrent manual deletion, then failing might be the right way to go.
> I think it's a shame that such failures could cause checkpointing to
> continually fail, but that topic is already being discussed elsewhere [0].
>
> [0] https://postgr.es/m/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com

After an off-list discussion with Andreas, proposing here a patch that
basically replaces ReadDir call with ReadDirExtended and gets rid of
lstat entirely. With this chance, the checkpoint will only care about
the snapshot and mapping files and not fail if it finds other files in
the directories. Removing lstat enables us to make things faster as we
avoid a bunch of extra system calls - one lstat call per each mapping
or snapshot file.

This patch also includes, converting "could not parse filename" and
"could not remove file" errors to LOG messages in
CheckPointLogicalRewriteHeap. This will enable checkpoint not to waste
the amount of work that it has done in case it is unable to
parse/remove the file, for whatever reasons.

Regards,
Bharath Rupireddy.

Attachment
On Mon, Jan 31, 2022 at 10:42:54AM +0530, Bharath Rupireddy wrote:
> After an off-list discussion with Andreas, proposing here a patch that
> basically replaces ReadDir call with ReadDirExtended and gets rid of
> lstat entirely. With this chance, the checkpoint will only care about
> the snapshot and mapping files and not fail if it finds other files in
> the directories. Removing lstat enables us to make things faster as we
> avoid a bunch of extra system calls - one lstat call per each mapping
> or snapshot file.

I think removing the lstat() is probably reasonable.  We currently aren't
doing proper error checking, and the chances of a non-regular file matching
the prefix are likely pretty low.  In the worst case, we'll LOG or ERROR
when unlinking or fsyncing fails.

However, I'm not sure about the change to ReadDirExtended().  That might be
okay for CheckPointSnapBuild(), which is just trying to remove old files,
but CheckPointLogicalRewriteHeap() is responsible for ensuring that files
are flushed to disk for the checkpoint.  If we stop reading the directory
after an error and let the checkpoint continue, isn't it possible that some
mappings files won't be persisted to disk?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Wed, Feb 2, 2022 at 5:25 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Jan 31, 2022 at 10:42:54AM +0530, Bharath Rupireddy wrote:
> > After an off-list discussion with Andreas, proposing here a patch that
> > basically replaces ReadDir call with ReadDirExtended and gets rid of
> > lstat entirely. With this chance, the checkpoint will only care about
> > the snapshot and mapping files and not fail if it finds other files in
> > the directories. Removing lstat enables us to make things faster as we
> > avoid a bunch of extra system calls - one lstat call per each mapping
> > or snapshot file.
>
> I think removing the lstat() is probably reasonable.  We currently aren't
> doing proper error checking, and the chances of a non-regular file matching
> the prefix are likely pretty low.  In the worst case, we'll LOG or ERROR
> when unlinking or fsyncing fails.
>
> However, I'm not sure about the change to ReadDirExtended().  That might be
> okay for CheckPointSnapBuild(), which is just trying to remove old files,
> but CheckPointLogicalRewriteHeap() is responsible for ensuring that files
> are flushed to disk for the checkpoint.  If we stop reading the directory
> after an error and let the checkpoint continue, isn't it possible that some
> mappings files won't be persisted to disk?

Unless I mis-read your above statement, with LOG level in
ReadDirExtended, I don't think we stop reading the files in
CheckPointLogicalRewriteHeap. Am I missing something here?

Since, we also continue in CheckPointLogicalRewriteHeap if we can't
parse/delete some files with the change of "could not parse
filename"/"could not remove file" messages to LOG level

I'm attaching v6, just changed elog(LOG, to ereport(LOG in
CheckPointLogicalRewriteHeap, other things remain the same.

Regards,
Bharath Rupireddy.

Attachment
On Wed, Feb 02, 2022 at 05:19:26PM +0530, Bharath Rupireddy wrote:
> On Wed, Feb 2, 2022 at 5:25 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> However, I'm not sure about the change to ReadDirExtended().  That might be
>> okay for CheckPointSnapBuild(), which is just trying to remove old files,
>> but CheckPointLogicalRewriteHeap() is responsible for ensuring that files
>> are flushed to disk for the checkpoint.  If we stop reading the directory
>> after an error and let the checkpoint continue, isn't it possible that some
>> mappings files won't be persisted to disk?
> 
> Unless I mis-read your above statement, with LOG level in
> ReadDirExtended, I don't think we stop reading the files in
> CheckPointLogicalRewriteHeap. Am I missing something here?

ReadDirExtended() has the following comment:

 * If elevel < ERROR, returns NULL after any error.  With the normal coding
 * pattern, this will result in falling out of the loop immediately as
 * though the directory contained no (more) entries.

If there is a problem reading the directory, we will LOG and then exit the
loop.  If we didn't scan through all the entries in the directory, there is
a chance that we didn't fsync() all the files that need it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Thu, Feb 3, 2022 at 12:07 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Feb 02, 2022 at 05:19:26PM +0530, Bharath Rupireddy wrote:
> > On Wed, Feb 2, 2022 at 5:25 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> However, I'm not sure about the change to ReadDirExtended().  That might be
> >> okay for CheckPointSnapBuild(), which is just trying to remove old files,
> >> but CheckPointLogicalRewriteHeap() is responsible for ensuring that files
> >> are flushed to disk for the checkpoint.  If we stop reading the directory
> >> after an error and let the checkpoint continue, isn't it possible that some
> >> mappings files won't be persisted to disk?
> >
> > Unless I mis-read your above statement, with LOG level in
> > ReadDirExtended, I don't think we stop reading the files in
> > CheckPointLogicalRewriteHeap. Am I missing something here?
>
> ReadDirExtended() has the following comment:
>
>  * If elevel < ERROR, returns NULL after any error.  With the normal coding
>  * pattern, this will result in falling out of the loop immediately as
>  * though the directory contained no (more) entries.
>
> If there is a problem reading the directory, we will LOG and then exit the
> loop.  If we didn't scan through all the entries in the directory, there is
> a chance that we didn't fsync() all the files that need it.

Thanks. I get it. For syncing map files, we don't want to tolerate any
errors, whereas removal of the old map files (lesser than cutoff LSN)
can be tolerated in CheckPointLogicalRewriteHeap.

Here's the v7 version using ReadDir for CheckPointLogicalRewriteHeap.

Regards,
Bharath Rupireddy.

Attachment
On Thu, Feb 03, 2022 at 09:45:08AM +0530, Bharath Rupireddy wrote:
> On Thu, Feb 3, 2022 at 12:07 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> If there is a problem reading the directory, we will LOG and then exit the
>> loop.  If we didn't scan through all the entries in the directory, there is
>> a chance that we didn't fsync() all the files that need it.
> 
> Thanks. I get it. For syncing map files, we don't want to tolerate any
> errors, whereas removal of the old map files (lesser than cutoff LSN)
> can be tolerated in CheckPointLogicalRewriteHeap.

LGTM.  Andres noted upthread [0] that the comment above sscanf() about
skipping editors' lock files might not be accurate.  I don't think it's a
huge problem if sscanf() matches those files, but perhaps we can improve
the comment.

[0] https://postgr.es/m/20220120194618.hmfd4kxkng2cgryh%40alap3.anarazel.de

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Fri, Feb 4, 2022 at 5:33 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> > Thanks. I get it. For syncing map files, we don't want to tolerate any
> > errors, whereas removal of the old map files (lesser than cutoff LSN)
> > can be tolerated in CheckPointLogicalRewriteHeap.
>
> LGTM.  Andres noted upthread [0] that the comment above sscanf() about
> skipping editors' lock files might not be accurate.  I don't think it's a
> huge problem if sscanf() matches those files, but perhaps we can improve
> the comment.
>
> [0] https://postgr.es/m/20220120194618.hmfd4kxkng2cgryh%40alap3.anarazel.de

Andres comment from [0]:

> An editor's lock file that starts with map- would presumably be the whole
> filename plus an additional file-ending. But this check won't catch those.

Agreed. sscanf checks can't detect the files named "whole filename
plus an additional file-ending". I just checked with vi editor lock
state file .0-14ED3B8.snap.swp [1], the log generated is [2]. I'm not
sure exactly which editor would create a lockfile like "whole filename
plus an additional file-ending".

In any case, let's remove the editor's lock/state file from those
comments and have just only "We just log a message if a file doesn't
fit the pattern". Attached v8 patch with that change.

[1]
-rw------- 1 bharath bharath 12288 Feb 10 15:48 .0-14ED3B8.snap.swp
-rw------- 1 bharath bharath   128 Feb 10 15:48 0-14ED518.snap
-rw------- 1 bharath bharath   128 Feb 10 15:49 0-14ED518.snap.lockfile
-rw------- 1 bharath bharath   128 Feb 10 15:49 0-14ED550.snap
-rw------- 1 bharath bharath   128 Feb 10 15:49 0-14ED600.snap

[2]
2022-02-10 15:48:47.938 UTC [1121678] LOG:  could not parse file name
"pg_logical/snapshots/.0-14ED3B8.snap.swp"

Regards,
Bharath Rupireddy.

Attachment
On Thu, Feb 10, 2022 at 09:30:45PM +0530, Bharath Rupireddy wrote:
> In any case, let's remove the editor's lock/state file from those
> comments and have just only "We just log a message if a file doesn't
> fit the pattern". Attached v8 patch with that change.

I've moved this one to ready-for-committer.  I was under the impression
that Andres was firmly against this approach, but you did mention there was
an off-list discussion.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Hi,

On 2022-02-10 21:30:45 +0530, Bharath Rupireddy wrote:
> From 4801ff2c3b1e7bc7076205b676d4e3bc4a4ed308 Mon Sep 17 00:00:00 2001
> From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
> Date: Thu, 10 Feb 2022 15:58:58 +0000
> Subject: [PATCH v8] Replace ReadDir with ReadDirExtended
> 
> Replace ReadDir with ReadDirExtended (in CheckPointSnapBuild) and
> get rid of lstat entirely.

I think this might be based on a slight misunderstanding / bad phrasing on my
part.  We can use get_dirent_type() to optimize away the lstat on most
platforms, ReadDirExtended itself doesn't do that automatically. I was trying
to reference removing lstat calls by using get_dirent_type() in more places...


> We still use ReadDir in CheckPointLogicalRewriteHeap
> because unable to read directory would result a NULL from
> ReadDirExtended and we may miss to fsync the remaining map files,
> so here let's error out with ReadDir.

Then why is this skipping the lstat?


> Also, convert "could not parse filename" and "could not remove file"
> errors to LOG messages in  CheckPointLogicalRewriteHeap. This will
> enable checkpoint not to waste the amount of work that it had done.

I still doubt this is a good idea.

Greetings,

Andres Freund



On Tue, Feb 15, 2022 at 09:09:52AM -0800, Andres Freund wrote:
> On 2022-02-10 21:30:45 +0530, Bharath Rupireddy wrote:
>> Replace ReadDir with ReadDirExtended (in CheckPointSnapBuild) and
>> get rid of lstat entirely.
> 
> I think this might be based on a slight misunderstanding / bad phrasing on my
> part.  We can use get_dirent_type() to optimize away the lstat on most
> platforms, ReadDirExtended itself doesn't do that automatically. I was trying
> to reference removing lstat calls by using get_dirent_type() in more places...
> 
> 
>> We still use ReadDir in CheckPointLogicalRewriteHeap
>> because unable to read directory would result a NULL from
>> ReadDirExtended and we may miss to fsync the remaining map files,
>> so here let's error out with ReadDir.
> 
> Then why is this skipping the lstat?
> 
> 
>> Also, convert "could not parse filename" and "could not remove file"
>> errors to LOG messages in  CheckPointLogicalRewriteHeap. This will
>> enable checkpoint not to waste the amount of work that it had done.
> 
> I still doubt this is a good idea.

IIUC you are advocating for something more like the attached patches.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment
Hi,

On 2022-02-15 09:57:53 -0800, Nathan Bossart wrote:
> IIUC you are advocating for something more like the attached patches.

At least for patch 1 yes. Don't have the cycles just now to look at the
others.

I generally think it'd be a good exercise to go through an use
get_dirent_type() in nearly all ReadDir() style loops - it's a nice efficiency
win in general, and IIRC a particularly big one on windows.

It'd probably be good to add a reference to get_dirent_type() to
ReadDir[Extended]()'s docs.

Greetings,

Andres Freund



On Tue, Feb 15, 2022 at 10:10:34AM -0800, Andres Freund wrote:
> I generally think it'd be a good exercise to go through an use
> get_dirent_type() in nearly all ReadDir() style loops - it's a nice efficiency
> win in general, and IIRC a particularly big one on windows.
> 
> It'd probably be good to add a reference to get_dirent_type() to
> ReadDir[Extended]()'s docs.

Agreed.  I might give this a try.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Tue, Feb 15, 2022 at 10:37:58AM -0800, Nathan Bossart wrote:
> On Tue, Feb 15, 2022 at 10:10:34AM -0800, Andres Freund wrote:
>> I generally think it'd be a good exercise to go through an use
>> get_dirent_type() in nearly all ReadDir() style loops - it's a nice efficiency
>> win in general, and IIRC a particularly big one on windows.
>> 
>> It'd probably be good to add a reference to get_dirent_type() to
>> ReadDir[Extended]()'s docs.
> 
> Agreed.  I might give this a try.

Alright, here is a new patch set where I've tried to replace as many
stat()/lstat() calls as possible with get_dirent_type().  0002 and 0003 are
the same as v9.  I noticed a few remaining stat()/lstat() calls that don't
appear to be doing proper error checking, but I haven't had a chance to try
fixing those yet.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment
On Wed, Feb 16, 2022 at 12:11 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> On Tue, Feb 15, 2022 at 10:37:58AM -0800, Nathan Bossart wrote:
> > On Tue, Feb 15, 2022 at 10:10:34AM -0800, Andres Freund wrote:
> >> I generally think it'd be a good exercise to go through an use
> >> get_dirent_type() in nearly all ReadDir() style loops - it's a nice efficiency
> >> win in general, and IIRC a particularly big one on windows.
> >>
> >> It'd probably be good to add a reference to get_dirent_type() to
> >> ReadDir[Extended]()'s docs.
> >
> > Agreed.  I might give this a try.
>
> Alright, here is a new patch set where I've tried to replace as many
> stat()/lstat() calls as possible with get_dirent_type().  0002 and 0003 are
> the same as v9.  I noticed a few remaining stat()/lstat() calls that don't
> appear to be doing proper error checking, but I haven't had a chance to try
> fixing those yet.

0001:  These get_dirent_type() conversions are nice.  LGTM.

0002:

     /* we're only handling directories here, skip if it's not ours */
-    if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
+    if (lstat(path, &statbuf) != 0)
+        ereport(ERROR,
+                (errcode_for_file_access(),
+                 errmsg("could not stat file \"%s\": %m", path)));
+    else if (!S_ISDIR(statbuf.st_mode))
         return;

Why is this a good place to silently ignore non-directories?
StartupReorderBuffer() is already in charge of skipping random
detritus found in the directory, so would it be better to do "if
(get_dirent_type(...) != PGFILETYPE_DIR) continue" there, and then
drop the lstat() stanza from ReorderBufferCleanupSeralizedTXNs()
completely?  Then perhaps its ReadDirExtended() shoud be using ERROR
instead of INFO, so that missing/non-dir/b0rked directories raise an
error.  I don't understand why it's reporting readdir() errors at INFO
but unlink() errors at ERROR, and as far as I can see the other paths
that reach this code shouldn't be sending in paths to non-directories
here unless something is seriously busted and that's ERROR-worthy.

0003:  I haven't studied this cure vs disease thing... no opinion today.



Thanks for taking a look!

On Thu, Mar 24, 2022 at 01:17:01PM +1300, Thomas Munro wrote:
>      /* we're only handling directories here, skip if it's not ours */
> -    if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
> +    if (lstat(path, &statbuf) != 0)
> +        ereport(ERROR,
> +                (errcode_for_file_access(),
> +                 errmsg("could not stat file \"%s\": %m", path)));
> +    else if (!S_ISDIR(statbuf.st_mode))
>          return;
> 
> Why is this a good place to silently ignore non-directories?
> StartupReorderBuffer() is already in charge of skipping random
> detritus found in the directory, so would it be better to do "if
> (get_dirent_type(...) != PGFILETYPE_DIR) continue" there, and then
> drop the lstat() stanza from ReorderBufferCleanupSeralizedTXNs()
> completely?  Then perhaps its ReadDirExtended() shoud be using ERROR
> instead of INFO, so that missing/non-dir/b0rked directories raise an
> error.

My guess is that this was done because ReorderBufferCleanupSerializedTXNs()
is also called from ReorderBufferAllocate() and ReorderBufferFree().
However, it is odd that we just silently return if the slot path isn't a
directory in those cases.  I think we could use get_dirent_type() in
StartupReorderBuffer() as you suggested, and then we could let ReadDir()
ERROR for non-directories for the other callers of
ReorderBufferCleanupSerializedTXNs().  WDYT?

> I don't understand why it's reporting readdir() errors at INFO
> but unlink() errors at ERROR, and as far as I can see the other paths
> that reach this code shouldn't be sending in paths to non-directories
> here unless something is seriously busted and that's ERROR-worthy.

I agree.  I'll switch it to ReadDir() in the next revision so that we ERROR
instead of INFO.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Tue, Mar 29, 2022 at 03:48:32PM -0700, Nathan Bossart wrote:
> On Thu, Mar 24, 2022 at 01:17:01PM +1300, Thomas Munro wrote:
>>      /* we're only handling directories here, skip if it's not ours */
>> -    if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
>> +    if (lstat(path, &statbuf) != 0)
>> +        ereport(ERROR,
>> +                (errcode_for_file_access(),
>> +                 errmsg("could not stat file \"%s\": %m", path)));
>> +    else if (!S_ISDIR(statbuf.st_mode))
>>          return;
>> 
>> Why is this a good place to silently ignore non-directories?
>> StartupReorderBuffer() is already in charge of skipping random
>> detritus found in the directory, so would it be better to do "if
>> (get_dirent_type(...) != PGFILETYPE_DIR) continue" there, and then
>> drop the lstat() stanza from ReorderBufferCleanupSeralizedTXNs()
>> completely?  Then perhaps its ReadDirExtended() shoud be using ERROR
>> instead of INFO, so that missing/non-dir/b0rked directories raise an
>> error.
> 
> My guess is that this was done because ReorderBufferCleanupSerializedTXNs()
> is also called from ReorderBufferAllocate() and ReorderBufferFree().
> However, it is odd that we just silently return if the slot path isn't a
> directory in those cases.  I think we could use get_dirent_type() in
> StartupReorderBuffer() as you suggested, and then we could let ReadDir()
> ERROR for non-directories for the other callers of
> ReorderBufferCleanupSerializedTXNs().  WDYT?
> 
>> I don't understand why it's reporting readdir() errors at INFO
>> but unlink() errors at ERROR, and as far as I can see the other paths
>> that reach this code shouldn't be sending in paths to non-directories
>> here unless something is seriously busted and that's ERROR-worthy.
> 
> I agree.  I'll switch it to ReadDir() in the next revision so that we ERROR
> instead of INFO.

Here is an updated patch set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment
On Wed, Mar 30, 2022 at 09:21:30AM -0700, Nathan Bossart wrote:
> Here is an updated patch set.

rebased

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment
On Sat, Apr 9, 2022 at 1:49 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Mar 30, 2022 at 09:21:30AM -0700, Nathan Bossart wrote:
> > Here is an updated patch set.
>
> rebased

Thanks.

0001 - there are many places where lstat/stat is being used - don't we
need to replace all or most of them with get_dirent_type?

0002 - I'm not quite happy with this patch, with the change,
checkpoint errors out, if it can't remove just a file - the comments
there says it all. Is there any strong reason for this change?

- /*
- * It's not particularly harmful, though strange, if we can't
- * remove the file here. Don't prevent the checkpoint from
- * completing, that'd be a cure worse than the disease.
- */
  if (unlink(path) < 0)
- {
- ereport(LOG,
+ ereport(ERROR,
  (errcode_for_file_access(),
  errmsg("could not remove file \"%s\": %m",
  path)));
- continue;
- }

Regards,
Bharath Rupireddy.



On Fri, Jul 08, 2022 at 09:39:10PM +0530, Bharath Rupireddy wrote:
> 0001 - there are many places where lstat/stat is being used - don't we
> need to replace all or most of them with get_dirent_type?

It's been a while since I wrote this one, but I believe my intent was to
replace as many [l]stat() calls in ReadDir()-style loops as possible with
get_dirent_type().  Are there any that I've missed?

> 0002 - I'm not quite happy with this patch, with the change,
> checkpoint errors out, if it can't remove just a file - the comments
> there says it all. Is there any strong reason for this change?

Andres noted several concerns upthread.  In short, ignoring unexpected
errors makes them harder to debug and likely masks bugs.

FWIW I agree that it is unfortunate that a relatively non-critical error
here leads to checkpoint failures, which can cause much worse problems down
the road.  I think this is one of the reasons for moving tasks like this
out of the checkpointer, as I'm trying to do with the proposed custodian
process [0].

[0] https://commitfest.postgresql.org/38/3448/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Hi,

On 2022-04-08 13:18:57 -0700, Nathan Bossart wrote:
> @@ -1035,32 +1036,9 @@ ParseConfigDirectory(const char *includedir,
>  
>          join_path_components(filename, directory, de->d_name);
>          canonicalize_path(filename);
> -        if (stat(filename, &st) == 0)
> +        de_type = get_dirent_type(filename, de, true, elevel);
> +        if (de_type == PGFILETYPE_ERROR)
>          {
> -            if (!S_ISDIR(st.st_mode))
> -            {
> -                /* Add file to array, increasing its size in blocks of 32 */
> -                if (num_filenames >= size_filenames)
> -                {
> -                    size_filenames += 32;
> -                    filenames = (char **) repalloc(filenames,
> -                                            size_filenames * sizeof(char *));
> -                }
> -                filenames[num_filenames] = pstrdup(filename);
> -                num_filenames++;
> -            }
> -        }
> -        else
> -        {
> -            /*
> -             * stat does not care about permissions, so the most likely reason
> -             * a file can't be accessed now is if it was removed between the
> -             * directory listing and now.
> -             */
> -            ereport(elevel,
> -                    (errcode_for_file_access(),
> -                     errmsg("could not stat file \"%s\": %m",
> -                            filename)));
>              record_config_file_error(psprintf("could not stat file \"%s\"",
>                                                filename),
>                                       calling_file, calling_lineno,
> @@ -1068,6 +1046,18 @@ ParseConfigDirectory(const char *includedir,
>              status = false;
>              goto cleanup;
>          }
> +        else if (de_type != PGFILETYPE_DIR)
> +        {
> +            /* Add file to array, increasing its size in blocks of 32 */
> +            if (num_filenames >= size_filenames)
> +            {
> +                size_filenames += 32;
> +                filenames = (char **) repalloc(filenames,
> +                                               size_filenames * sizeof(char *));
> +            }
> +            filenames[num_filenames] = pstrdup(filename);
> +            num_filenames++;
> +        }
>      }
>  
>      if (num_filenames > 0)

Seems like the diff would be easier to read if it didn't move code around as
much?

Looks pretty reasonable, I'd be happy to commit it, I think.

Greetings,

Andres Freund



On Mon, Jul 11, 2022 at 01:25:33PM -0700, Andres Freund wrote:
> On 2022-04-08 13:18:57 -0700, Nathan Bossart wrote:
>> @@ -1035,32 +1036,9 @@ ParseConfigDirectory(const char *includedir,
>>  
>>          join_path_components(filename, directory, de->d_name);
>>          canonicalize_path(filename);
>> -        if (stat(filename, &st) == 0)
>> +        de_type = get_dirent_type(filename, de, true, elevel);
>> +        if (de_type == PGFILETYPE_ERROR)
>>          {
>> -            if (!S_ISDIR(st.st_mode))
>> -            {
>> -                /* Add file to array, increasing its size in blocks of 32 */
>> -                if (num_filenames >= size_filenames)
>> -                {
>> -                    size_filenames += 32;
>> -                    filenames = (char **) repalloc(filenames,
>> -                                            size_filenames * sizeof(char *));
>> -                }
>> -                filenames[num_filenames] = pstrdup(filename);
>> -                num_filenames++;
>> -            }
>> -        }
>> -        else
>> -        {
>> -            /*
>> -             * stat does not care about permissions, so the most likely reason
>> -             * a file can't be accessed now is if it was removed between the
>> -             * directory listing and now.
>> -             */
>> -            ereport(elevel,
>> -                    (errcode_for_file_access(),
>> -                     errmsg("could not stat file \"%s\": %m",
>> -                            filename)));
>>              record_config_file_error(psprintf("could not stat file \"%s\"",
>>                                                filename),
>>                                       calling_file, calling_lineno,
>> @@ -1068,6 +1046,18 @@ ParseConfigDirectory(const char *includedir,
>>              status = false;
>>              goto cleanup;
>>          }
>> +        else if (de_type != PGFILETYPE_DIR)
>> +        {
>> +            /* Add file to array, increasing its size in blocks of 32 */
>> +            if (num_filenames >= size_filenames)
>> +            {
>> +                size_filenames += 32;
>> +                filenames = (char **) repalloc(filenames,
>> +                                               size_filenames * sizeof(char *));
>> +            }
>> +            filenames[num_filenames] = pstrdup(filename);
>> +            num_filenames++;
>> +        }
>>      }
>>  
>>      if (num_filenames > 0)
> 
> Seems like the diff would be easier to read if it didn't move code around as
> much?

I opted to reorganize in order save an indent and simplify the conditions a
bit.  The alternative is something like this:

    if (de_type != PGFILETYPE_ERROR)
    {
        if (de_type != PGTFILETYPE_DIR)
        {
            ...
        }
    }
    else
    {
        ...
    }

I don't feel strongly one way or another, and I'll change it if you think
it's worth it to simplify the diff.

> Looks pretty reasonable, I'd be happy to commit it, I think.

Appreciate it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Fri, Jul 8, 2022 at 10:44 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> > 0002 - I'm not quite happy with this patch, with the change,
> > checkpoint errors out, if it can't remove just a file - the comments
> > there says it all. Is there any strong reason for this change?
>
> Andres noted several concerns upthread.  In short, ignoring unexpected
> errors makes them harder to debug and likely masks bugs.
>
> FWIW I agree that it is unfortunate that a relatively non-critical error
> here leads to checkpoint failures, which can cause much worse problems down
> the road.  I think this is one of the reasons for moving tasks like this
> out of the checkpointer, as I'm trying to do with the proposed custodian
> process [0].
>
> [0] https://commitfest.postgresql.org/38/3448/

IMHO, we can keep it as-is and if required can be changed as part of
the patch set [0], as this change without [0] can cause a checkpoint
to fail. Furthermore, I would like it if we convert "could not parse
filename" and "could not remove file" ERRORs of
CheckPointLogicalRewriteHeap to LOGs until [0] gets in - others may
have different opinions though.

Just wondering - do we ever have a problem if we can't remove the
snapshot or mapping file?

May be unrelated, RemoveTempXlogFiles doesn't even bother to check if
the temp wal file is removed.

Regards,
Bharath Rupireddy.



On Mon, Jul 18, 2022 at 04:53:18PM +0530, Bharath Rupireddy wrote:
> Just wondering - do we ever have a problem if we can't remove the
> snapshot or mapping file?

Besides running out of disk space, there appears to be a transaction ID
wraparound risk with the mappings files.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Working on get_dirent_type() reminded me of this thread.  I was going
to commit the first of these patches, but then I noticed Andres said
he was planning to, so I'll wait another day.  Here they are, with
commit messages but otherwise unchanged from Nathan's v12 except for a
slight comment tweak:

-               /* we're only handling directories here, skip if it's
not ours */
+               /* we're only handling directories here, skip if it's not one */

The only hunk I'm having second thoughts about is the following, which
makes unexpected stray files break checkpoints:

-                * We just log a message if a file doesn't fit the pattern, it's
-                * probably some editors lock/state file or similar...
                 */
                if (sscanf(snap_de->d_name, "%X-%X.snap", &hi, &lo) != 2)
-               {
-                       ereport(LOG,
+                       ereport(ERROR,
                                        (errmsg("could not parse file
name \"%s\"", path)));

Bharath mentioned other places that loop over stat(), but I think
those are places that want stuff we don't already have, like st_size.

Attachment
Thomas Munro <thomas.munro@gmail.com> writes:
> The only hunk I'm having second thoughts about is the following, which
> makes unexpected stray files break checkpoints:

Sounds like a pretty bad idea.  What's the upside?

            regards, tom lane



I wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> The only hunk I'm having second thoughts about is the following, which
>> makes unexpected stray files break checkpoints:

> Sounds like a pretty bad idea.  What's the upside?

Actually, having now read the patch, I don't think there is any
part of 0002 that is a good idea.  It's blithely removing the
comments that explain why the existing coding is the way it is,
and not providing a shred of justification for making checkpoints
more brittle.

I have not tried to analyze the error-handling properties of 0001,
but if it's being equally cavalier then it shouldn't be committed
either.  Most of this behavior is the result of decades of hard-won
experience; discarding it because it doesn't fit conveniently
into some refactoring plan isn't smart.

            regards, tom lane



On Tue, Aug 9, 2022 at 3:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Actually, having now read the patch, I don't think there is any
> part of 0002 that is a good idea.  It's blithely removing the
> comments that explain why the existing coding is the way it is,
> and not providing a shred of justification for making checkpoints
> more brittle.

0002 also contradicts the original $SUBJECT and goal of this thread,
which is possibly why it was kept separate.  I was only thinking of
committing 0001 myself, which is the one I'd reviewed an earlier
version of.

> I have not tried to analyze the error-handling properties of 0001,
> but if it's being equally cavalier then it shouldn't be committed
> either.  Most of this behavior is the result of decades of hard-won
> experience; discarding it because it doesn't fit conveniently
> into some refactoring plan isn't smart.

0001 does introduce new errors, as mentioned in the commit message, in
the form of elevel ERROR passed into get_dirent_type(), which might be
thrown if your OS has no d_type and lstat() fails (also if you asked
to follow symlinks, but in those cases errors were already thrown).
But in those cases, it seems at least a little fishy that we ignored
errors from the existing lstat().  I wondered if that was because they
expected that any failure meant ENOENT and they wanted to tolerate
that, but that does not seem to be the case, so I considered the error
to be an improvement.



Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Aug 9, 2022 at 3:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I have not tried to analyze the error-handling properties of 0001,
>> but if it's being equally cavalier then it shouldn't be committed
>> either.

> 0001 does introduce new errors, as mentioned in the commit message, in
> the form of elevel ERROR passed into get_dirent_type(), which might be
> thrown if your OS has no d_type and lstat() fails (also if you asked
> to follow symlinks, but in those cases errors were already thrown).
> But in those cases, it seems at least a little fishy that we ignored
> errors from the existing lstat().

Hmmm ... I'll grant that ignoring lstat errors altogether isn't great.
But should the replacement behavior be elog-LOG-and-press-on,
or elog-ERROR-and-fail-the-surrounding-operation?  I'm not in any
hurry to believe that the latter is more appropriate without some
analysis of what the callers are doing.

The bottom line here is that I'm distrustful of behavioral changes
introduced to simplify refactoring rather than to solve a live
problem.

            regards, tom lane



On Tue, Aug 9, 2022 at 9:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Tue, Aug 9, 2022 at 3:27 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I have not tried to analyze the error-handling properties of 0001,
> >> but if it's being equally cavalier then it shouldn't be committed
> >> either.
>
> > 0001 does introduce new errors, as mentioned in the commit message, in
> > the form of elevel ERROR passed into get_dirent_type(), which might be
> > thrown if your OS has no d_type and lstat() fails (also if you asked
> > to follow symlinks, but in those cases errors were already thrown).
> > But in those cases, it seems at least a little fishy that we ignored
> > errors from the existing lstat().
>
> Hmmm ... I'll grant that ignoring lstat errors altogether isn't great.
> But should the replacement behavior be elog-LOG-and-press-on,
> or elog-ERROR-and-fail-the-surrounding-operation?  I'm not in any
> hurry to believe that the latter is more appropriate without some
> analysis of what the callers are doing.
>
> The bottom line here is that I'm distrustful of behavioral changes
> introduced to simplify refactoring rather than to solve a live
> problem.

+1. I agree with Tom not to change elog-LOG to elog-ERROR and fail the
checkpoint operation. Because the checkpoint is more important than
why a single snapshot file (out thousands or even million files) isn't
removed at that moment. Also, I originally proposed to change
elog-ERROR to elog-LOG in CheckPointLogicalRewriteHeap for unlink()
failures for the same reason.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/



On 2022-08-09 15:10:41 +1200, Thomas Munro wrote:
> I was going to commit the first of these patches, but then I noticed Andres
> said he was planning to, so I'll wait another day.

I'd be happy for you to take this on...



On Tue, Aug 09, 2022 at 09:29:16AM +0530, Bharath Rupireddy wrote:
> On Tue, Aug 9, 2022 at 9:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmmm ... I'll grant that ignoring lstat errors altogether isn't great.
>> But should the replacement behavior be elog-LOG-and-press-on,
>> or elog-ERROR-and-fail-the-surrounding-operation?  I'm not in any
>> hurry to believe that the latter is more appropriate without some
>> analysis of what the callers are doing.
>>
>> The bottom line here is that I'm distrustful of behavioral changes
>> introduced to simplify refactoring rather than to solve a live
>> problem.
> 
> +1. I agree with Tom not to change elog-LOG to elog-ERROR and fail the
> checkpoint operation. Because the checkpoint is more important than
> why a single snapshot file (out thousands or even million files) isn't
> removed at that moment. Also, I originally proposed to change
> elog-ERROR to elog-LOG in CheckPointLogicalRewriteHeap for unlink()
> failures for the same reason.

This was my initial instinct as well, but this thread has received
contradictory feedback during the months since.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Tue, Aug 9, 2022 at 4:28 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Tue, Aug 09, 2022 at 09:29:16AM +0530, Bharath Rupireddy wrote:
> > On Tue, Aug 9, 2022 at 9:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Hmmm ... I'll grant that ignoring lstat errors altogether isn't great.
> >> But should the replacement behavior be elog-LOG-and-press-on,
> >> or elog-ERROR-and-fail-the-surrounding-operation?  I'm not in any
> >> hurry to believe that the latter is more appropriate without some
> >> analysis of what the callers are doing.
> >>
> >> The bottom line here is that I'm distrustful of behavioral changes
> >> introduced to simplify refactoring rather than to solve a live
> >> problem.
> >
> > +1. I agree with Tom not to change elog-LOG to elog-ERROR and fail the
> > checkpoint operation. Because the checkpoint is more important than

The changes were not from elog-LOG to elog-ERROR, they were changes
from silently eating errors, but I take your (plural) general point.

> > why a single snapshot file (out thousands or even million files) isn't
> > removed at that moment. Also, I originally proposed to change
> > elog-ERROR to elog-LOG in CheckPointLogicalRewriteHeap for unlink()
> > failures for the same reason.
>
> This was my initial instinct as well, but this thread has received
> contradictory feedback during the months since.

OK, so there aren't many places in 0001 that error out where we
previously did not.

For the one in CheckPointLogicalRewriteHeap(), if it fails, you don't
just want to skip -- it might be in the range that we need to fsync().
So what if we just deleted that get_dirent_type() != PGFILETYPE_REG
check altogether?  Depending on the name range check, we'd either
attempt to unlink() and LOG if that fails, or we'd attempt to
open()-or-ERROR and then fsync()-or-PANIC.  What functionality have we
lost without the DT_REG check?  Well, now if someone ill-advisedly
puts an important character device, socket, symlink (ie any kind of
non-DT_REG) into that directory with a name conforming to the
unlinkable range, we'll try to unlink it and possibly succeed, where
previously we'd have skipped, and if it's in the checkpointable range,
we'd try to fsync() it and likely fail, which is good because this is
a supposed to be a checkpoint.

That's already what happens if lstat() fails in master.  We fall back
to treating it like a DT_REG anyway:

               if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
                        continue;

For the one in CheckSnapBuild(), it's similar but unlink only.

For the one in StartupReplicationSlots(), you could possibly take the
same line: if it's not a directory, we'll try an rmdir() it anyway and
then emit a WARNING if that fails.  Again, that's exactly what master
would do if that lstat() failed.

In other words, if we're going to LOG and press on, maybe we should
just press on anyway.  Would the error message be any less
informative?  Would the risk of unlinking a character device that you
accidentally stored in
/clusters/pg16-main/pgdata/pg_logical/mappings/map-1234-5678 be a
problem for anyone?

Separately from that topic, it would be nice to have a comment in each
place where we tolerate unlink() failure to explain why that is
harmless except for wasted disk.



On Tue, Aug 09, 2022 at 05:26:39PM +1200, Thomas Munro wrote:
> The changes were not from elog-LOG to elog-ERROR, they were changes
> from silently eating errors, but I take your (plural) general point.

I think this was in reference to 0002.  My comment was, at least.

> OK, so there aren't many places in 0001 that error out where we
> previously did not.
> 
> For the one in CheckPointLogicalRewriteHeap(), if it fails, you don't
> just want to skip -- it might be in the range that we need to fsync().
> So what if we just deleted that get_dirent_type() != PGFILETYPE_REG
> check altogether?  Depending on the name range check, we'd either
> attempt to unlink() and LOG if that fails, or we'd attempt to
> open()-or-ERROR and then fsync()-or-PANIC.

It looks like CheckPointLogicalRewriteHeap() presently ERRORs when unlink()
fails.  However, CheckPointSnapBuild() only LOGs when unlink() fails.
Since mappings files contain transaction IDs, persistent unlink() failures
might lead to wraparound, but I don't think failing checkpoints will help
improve matters.

Bharath's original patch changed CheckPointLogicalRewriteHeap() to LOG on
sscanf() and unlink() failures.  IIUC things would work the way you suggest
if that was applied.

> What functionality have we
> lost without the DT_REG check?  Well, now if someone ill-advisedly
> puts an important character device, socket, symlink (ie any kind of
> non-DT_REG) into that directory with a name conforming to the
> unlinkable range, we'll try to unlink it and possibly succeed, where
> previously we'd have skipped, and if it's in the checkpointable range,
> we'd try to fsync() it and likely fail, which is good because this is
> a supposed to be a checkpoint.

I don't know that I agree that the last part is good.  Presumably we don't
want checkpointing to fail forever because there's a random non-regular
file that can't be fsync'd.  But it's not clear why such files would exist
in the first place.  Presumably this isn't the sort of thing that Postgres
should be expected to support.

> In other words, if we're going to LOG and press on, maybe we should
> just press on anyway.  Would the error message be any less
> informative?  Would the risk of unlinking a character device that you
> accidentally stored in
> /clusters/pg16-main/pgdata/pg_logical/mappings/map-1234-5678 be a
> problem for anyone?

Probably not.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Tue, Aug 9, 2022 at 10:57 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> OK, so there aren't many places in 0001 that error out where we
> previously did not.

Well, that's not true I believe. The 0001 patch introduces
get_dirent_type() with elevel ERROR which means that lstat failures
are now reported as ERRORs with ereport(ERROR, as opposed to
continuing on the HEAD.

-        if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+        if (get_dirent_type(path, mapping_de, false, ERROR) != PGFILETYPE_REG)
             continue;

-        if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+        if (get_dirent_type(path, snap_de, false, ERROR) != PGFILETYPE_REG)
         {

so on.

The main idea of using get_dirent_type() instead of lstat or stat is
to benefit from avoiding system calls on platforms where the directory
entry type is stored in dirent structure, but not to cause errors for
lstat or stat system calls failures where we previously didn't. I
think 0001 must be just replacing lstat or stat with get_dirent_type()
with elevel ERROR if lstat or stat failure is treated as ERROR
previously, otherwise with elevel LOG. I modified it as attached. Once
this patch gets committed, then we can continue the discussion as to
whether we make elog-LOG to elog-ERROR or vice-versa.

I'm tempted to use get_dirent_type() in RemoveXlogFile() but that
requires us to pass struct dirent * from RemoveOldXlogFiles() (as
attached 0002 for now), If done, this avoids one lstat() system call
per WAL file. IMO, that's okay to pass an extra function parameter to
RemoveXlogFile() as it is a static function and there can be many
(thousands of) WAL files at times, so saving system calls (lstat() *
number of WAL files) is definitely an advantage.

Thoughts?

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachment
On Wed, Aug 10, 2022 at 7:15 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> The main idea of using get_dirent_type() instead of lstat or stat is
> to benefit from avoiding system calls on platforms where the directory
> entry type is stored in dirent structure, but not to cause errors for
> lstat or stat system calls failures where we previously didn't.

Yeah, I get it... I'm OK with separating mechanical changes like
lstat() -> get_dirent_type() from policy changes on error handling.  I
initially thought the ERROR was surely better than what we're doing
now (making extra system calls to avoid unlinking unexpected things,
for which our only strategy on failure is to try to unlink the thing
anyway), but it's better to discuss that separately.

> I
> think 0001 must be just replacing lstat or stat with get_dirent_type()
> with elevel ERROR if lstat or stat failure is treated as ERROR
> previously, otherwise with elevel LOG. I modified it as attached. Once
> this patch gets committed, then we can continue the discussion as to
> whether we make elog-LOG to elog-ERROR or vice-versa.

Agreed, will look at this version tomorrow.

> I'm tempted to use get_dirent_type() in RemoveXlogFile() but that
> requires us to pass struct dirent * from RemoveOldXlogFiles() (as
> attached 0002 for now), If done, this avoids one lstat() system call
> per WAL file. IMO, that's okay to pass an extra function parameter to
> RemoveXlogFile() as it is a static function and there can be many
> (thousands of) WAL files at times, so saving system calls (lstat() *
> number of WAL files) is definitely an advantage.

-        lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
+        get_dirent_type(path, xlde, false, LOG) != PGFILETYPE_REG &&

I think you wanted ==, but maybe it'd be nicer to pass in a
"recyclable" boolean to RemoveXlogFile()?

If you're looking for more, rmtree() looks like another.



On Wed, Aug 10, 2022 at 2:11 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Aug 10, 2022 at 7:15 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > The main idea of using get_dirent_type() instead of lstat or stat is
> > to benefit from avoiding system calls on platforms where the directory
> > entry type is stored in dirent structure, but not to cause errors for
> > lstat or stat system calls failures where we previously didn't.
>
> Yeah, I get it... I'm OK with separating mechanical changes like
> lstat() -> get_dirent_type() from policy changes on error handling.  I
> initially thought the ERROR was surely better than what we're doing
> now (making extra system calls to avoid unlinking unexpected things,
> for which our only strategy on failure is to try to unlink the thing
> anyway), but it's better to discuss that separately.
>
> > I
> > think 0001 must be just replacing lstat or stat with get_dirent_type()
> > with elevel ERROR if lstat or stat failure is treated as ERROR
> > previously, otherwise with elevel LOG. I modified it as attached. Once
> > this patch gets committed, then we can continue the discussion as to
> > whether we make elog-LOG to elog-ERROR or vice-versa.
>
> Agreed, will look at this version tomorrow.

Thanks.

> > I'm tempted to use get_dirent_type() in RemoveXlogFile() but that
> > requires us to pass struct dirent * from RemoveOldXlogFiles() (as
> > attached 0002 for now), If done, this avoids one lstat() system call
> > per WAL file. IMO, that's okay to pass an extra function parameter to
> > RemoveXlogFile() as it is a static function and there can be many
> > (thousands of) WAL files at times, so saving system calls (lstat() *
> > number of WAL files) is definitely an advantage.
>
> -        lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
> +        get_dirent_type(path, xlde, false, LOG) != PGFILETYPE_REG &&
>
> I think you wanted ==, but maybe it'd be nicer to pass in a
> "recyclable" boolean to RemoveXlogFile()?

Ah, my bad, I corrected it now.

If you mean to do "bool recyclable = (get_dirent_type(path, xlde,
false, LOG) == PGFILETYPE_REG)"" and pass it as parameter to
RemoveXlogFile(), then we have to do get_dirent_type calls for every
WAL file even when any of (wal_recycle && *endlogSegNo <= recycleSegNo
&& XLogCtl->InstallXLogFileSegmentActive) is false which is
unnecessary and it's better to pass dirent structure as a parameter.

> If you're looking for more, rmtree() looks like another.

Are you suggesting to not use pgfnames() in rmtree() and do
opendir()-readdir()-get_dirent_type() directly? If yes, I don't think
it's a great idea because rmtree() has recursive calls and holding
opendir()-readdir() for all rmtree() recursive calls without
closedir() may eat up dynamic memory if there are deeper level
directories. I would like to leave it that way. Do you have any other
ideas in mind?

Please find the v15 patch, I merged the RemoveXlogFile() changes into 0001.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachment
On Wed, Aug 10, 2022 at 03:28:25PM +0530, Bharath Rupireddy wrote:
>          snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
> -        if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
> +        if (get_dirent_type(path, mapping_de, false, LOG) != PGFILETYPE_REG)
>              continue;

Previously, failure to lstat() wouldn't lead to skipping the entry.  With
this patch, a failure to determine the file type will cause the entry to be
skipped.  This might be okay in some places (e.g., CheckPointSnapBuild())
but not in others.  For example, in CheckPointLogicalRewriteHeap(), this
could cause us to skip fsync-ing a file due to a get_dirent_type() failure,
which seems bad.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Wed, Aug 17, 2022 at 2:02 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Wed, Aug 10, 2022 at 03:28:25PM +0530, Bharath Rupireddy wrote:
> >               snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
> > -             if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
> > +             if (get_dirent_type(path, mapping_de, false, LOG) != PGFILETYPE_REG)
> >                       continue;
>
> Previously, failure to lstat() wouldn't lead to skipping the entry.  With
> this patch, a failure to determine the file type will cause the entry to be
> skipped.  This might be okay in some places (e.g., CheckPointSnapBuild())
> but not in others.  For example, in CheckPointLogicalRewriteHeap(), this
> could cause us to skip fsync-ing a file due to a get_dirent_type() failure,
> which seems bad.

Hm. I corrected it in the v16 patch, please review.

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachment
On Wed, Aug 17, 2022 at 12:39:26PM +0530, Bharath Rupireddy wrote:
> +    /*
> +     * We're only handling directories here, skip if it's not ours. Also, skip
> +     * if the caller has already performed this check.
> +     */
> +    if (!slot_dir_checked &&
> +        lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
>          return;

There was previous discussion about removing this stanza from
ReorderBufferCleanupSeralizedTXNs() completely [0].  If that is still
possible, I think I would choose that over having callers specify whether
to do the directory check.

> +        /* we're only handling directories here, skip if it's not one */
> +        sprintf(path, "pg_replslot/%s", logical_de->d_name);
> +        if (get_dirent_type(path, logical_de, false, LOG) != PGFILETYPE_DIR)
> +            continue;

IIUC an error in get_dirent_type() could cause slots to be skipped here,
which is a behavior change.

[0] https://postgr.es/m/20220329224832.GA560657%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Tue, Aug 23, 2022 at 11:37 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Wed, Aug 17, 2022 at 12:39:26PM +0530, Bharath Rupireddy wrote:
> > +     /*
> > +      * We're only handling directories here, skip if it's not ours. Also, skip
> > +      * if the caller has already performed this check.
> > +      */
> > +     if (!slot_dir_checked &&
> > +             lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
> >               return;
>
> There was previous discussion about removing this stanza from
> ReorderBufferCleanupSeralizedTXNs() completely [0].  If that is still
> possible, I think I would choose that over having callers specify whether
> to do the directory check.
>
> [0] https://postgr.es/m/20220329224832.GA560657%40nathanxps13

From [0]:

> My guess is that this was done because ReorderBufferCleanupSerializedTXNs()
> is also called from ReorderBufferAllocate() and ReorderBufferFree().
> However, it is odd that we just silently return if the slot path isn't a
> directory in those cases. I think we could use get_dirent_type() in
> StartupReorderBuffer() as you suggested, and then we could let ReadDir()
> ERROR for non-directories for the other callers of
> ReorderBufferCleanupSerializedTXNs(). WDYT?

Firstly, removing lstat() completely from
ReorderBufferCleanupSerializedTXNs() is a behavioural change.
ReorderBufferCleanupSerializedTXNs() returning silently to callers
ReorderBufferAllocate() and ReorderBufferFree(), when the slot path
isn't a directory completely makes sense to me because the callers are
trying to clean something and if it doesn't exist that's okay, they
can go ahead. Also, the usage of the ReorderBufferAllocate() and
ReorderBufferFree() is pretty wide and I don't want to risk the new
behaviour.

> > +             /* we're only handling directories here, skip if it's not one */
> > +             sprintf(path, "pg_replslot/%s", logical_de->d_name);
> > +             if (get_dirent_type(path, logical_de, false, LOG) != PGFILETYPE_DIR)
> > +                     continue;
>
> IIUC an error in get_dirent_type() could cause slots to be skipped here,
> which is a behavior change.

That behaviour hasn't changed, no? Currently, if lstat() fails in
ReorderBufferCleanupSerializedTXNs() it returns to
StartupReorderBuffer()'s while loop which is in a way continuing with
the other slots, this patch does nothing new.

So, no new patch, please find the latest v16 patch at [1].

[1] https://www.postgresql.org/message-id/CALj2ACXZPMYXrPZ%2BCUE0_5DKDnxyqDGRftSgPPx--PWhWB3RtQ%40mail.gmail.com

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/



On Thu, Aug 25, 2022 at 10:48:08AM +0530, Bharath Rupireddy wrote:
> On Tue, Aug 23, 2022 at 11:37 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> IIUC an error in get_dirent_type() could cause slots to be skipped here,
>> which is a behavior change.
>
> That behaviour hasn't changed, no? Currently, if lstat() fails in
> ReorderBufferCleanupSerializedTXNs() it returns to
> StartupReorderBuffer()'s while loop which is in a way continuing with
> the other slots, this patch does nothing new.

Are you sure?  FWIW, the changes in reorderbuffer.c for
ReorderBufferCleanupSerializedTXNs() reduce the code readability, in
my opinion, so that's one less argument in favor of this change.

The gain in ParseConfigDirectory() is kind of cool.
pg_tzenumerate_next(), copydir(), RemoveXlogFile()
StartupReplicationSlots(), CheckPointLogicalRewriteHeap() and
RemovePgTempFilesInDir() seem fine, as well.  At least these avoid
extra lstat() calls when the file type is unknown, which would be only
a limited number of users where some of the three DT_* are missing
(right?).
--
Michael

Attachment
On Thu, Aug 25, 2022 at 5:51 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> FWIW, the changes in reorderbuffer.c for
> ReorderBufferCleanupSerializedTXNs() reduce the code readability, in
> my opinion, so that's one less argument in favor of this change.

Agreed. I reverted the changes.

> The gain in ParseConfigDirectory() is kind of cool.
> pg_tzenumerate_next(), copydir(), RemoveXlogFile()
> StartupReplicationSlots(), CheckPointLogicalRewriteHeap() and
> RemovePgTempFilesInDir() seem fine, as well.  At least these avoid
> extra lstat() calls when the file type is unknown, which would be only
> a limited number of users where some of the three DT_* are missing
> (right?).

Yes, the idea is to avoid lstat() system calls when possible.

PSA v17 patch with reorderbuffer.c changes reverted.

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachment
On Sat, Aug 27, 2022 at 02:06:32PM +0530, Bharath Rupireddy wrote:
> PSA v17 patch with reorderbuffer.c changes reverted.

LGTM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



On Sun, Aug 28, 2022 at 02:52:43PM -0700, Nathan Bossart wrote:
> On Sat, Aug 27, 2022 at 02:06:32PM +0530, Bharath Rupireddy wrote:
>> PSA v17 patch with reorderbuffer.c changes reverted.
>
> LGTM

Yeah, what you have here looks pretty fine to me, so this is IMO
committable.  Let's wait a bit to see if there are any objections from
the others, in case.
--
Michael

Attachment
Hi,

On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote:
> @@ -50,7 +51,7 @@ copydir(char *fromdir, char *todir, bool recurse)
>  
>      while ((xlde = ReadDir(xldir, fromdir)) != NULL)
>      {
> -        struct stat fst;
> +        PGFileType    xlde_type;
>  
>          /* If we got a cancel signal during the copy of the directory, quit */
>          CHECK_FOR_INTERRUPTS();
> @@ -62,18 +63,15 @@ copydir(char *fromdir, char *todir, bool recurse)
>          snprintf(fromfile, sizeof(fromfile), "%s/%s", fromdir, xlde->d_name);
>          snprintf(tofile, sizeof(tofile), "%s/%s", todir, xlde->d_name);
>  
> -        if (lstat(fromfile, &fst) < 0)
> -            ereport(ERROR,
> -                    (errcode_for_file_access(),
> -                     errmsg("could not stat file \"%s\": %m", fromfile)));
> +        xlde_type = get_dirent_type(fromfile, xlde, false, ERROR);
>  
> -        if (S_ISDIR(fst.st_mode))
> +        if (xlde_type == PGFILETYPE_DIR)
>          {
>              /* recurse to handle subdirectories */
>              if (recurse)
>                  copydir(fromfile, tofile, true);
>          }
> -        else if (S_ISREG(fst.st_mode))
> +        else if (xlde_type == PGFILETYPE_REG)
>              copy_file(fromfile, tofile);
>      }
>      FreeDir(xldir);

It continues to make no sense to me to add behaviour changes around
error-handling as part of a conversion to get_dirent_type(). I don't at all
understand why e.g. the above change to make copydir() silently skip over
files it can't stat is ok?

Greetings,

Andres Freund



On Wed, Aug 31, 2022 at 12:14:33PM -0700, Andres Freund wrote:
> On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote:
>> -        if (lstat(fromfile, &fst) < 0)
>> -            ereport(ERROR,
>> -                    (errcode_for_file_access(),
>> -                     errmsg("could not stat file \"%s\": %m", fromfile)));
>> +        xlde_type = get_dirent_type(fromfile, xlde, false, ERROR);
>>  
>> -        if (S_ISDIR(fst.st_mode))
>> +        if (xlde_type == PGFILETYPE_DIR)
>>          {
>>              /* recurse to handle subdirectories */
>>              if (recurse)
>>                  copydir(fromfile, tofile, true);
>>          }
>> -        else if (S_ISREG(fst.st_mode))
>> +        else if (xlde_type == PGFILETYPE_REG)
>>              copy_file(fromfile, tofile);
>>      }
>>      FreeDir(xldir);
> 
> It continues to make no sense to me to add behaviour changes around
> error-handling as part of a conversion to get_dirent_type(). I don't at all
> understand why e.g. the above change to make copydir() silently skip over
> files it can't stat is ok?

In this example, the call to get_dirent_type() should ERROR if the call to
lstat() fails (the "elevel" argument is set to ERROR).
-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Hi,

On 2022-08-31 12:24:28 -0700, Nathan Bossart wrote:
> On Wed, Aug 31, 2022 at 12:14:33PM -0700, Andres Freund wrote:
> > On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote:
> > It continues to make no sense to me to add behaviour changes around
> > error-handling as part of a conversion to get_dirent_type(). I don't at all
> > understand why e.g. the above change to make copydir() silently skip over
> > files it can't stat is ok?
> 
> In this example, the call to get_dirent_type() should ERROR if the call to
> lstat() fails (the "elevel" argument is set to ERROR).

Oh, oops. Skimmed code too quickly...

Greetings,

Andres Freund



On Wed, Aug 31, 2022 at 04:10:59PM +0900, Michael Paquier wrote:
> Yeah, what you have here looks pretty fine to me, so this is IMO
> committable.  Let's wait a bit to see if there are any objections from
> the others, in case.

I had a few hours and I've spent them looking at what you had here in
details, and there were a few things I have tweaked before applying
the patch.  First, elevel was set to LOG for three calls of
get_dirent_type() in code paths where we want to skip entries.  This
would have become very noisy, so I've switched two of them to DEBUG1
and the third one to DEBUG2 for consistency with the surroundings.  A
second thing was RemoveXlogFile(), where we passed down a dirent entry
*and* its d_name.  With this design, it would be easy to mess up
things and pass down a file name that does not match with its dirent
entry, so I have finished by replacing "segname" by the dirent
structure.  A last thing was about the two extra comment blocks in
fd.c, and I could not convince myself that these were helpful hints so
I have removed both of them.
--
Michael

Attachment
On Fri, Sep 02, 2022 at 05:09:14PM +0900, Michael Paquier wrote:
> I had a few hours and I've spent them looking at what you had here in
> details, and there were a few things I have tweaked before applying
> the patch.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com





On Sat, Sep 3, 2022 at 3:09 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Fri, Sep 02, 2022 at 05:09:14PM +0900, Michael Paquier wrote:
> I had a few hours and I've spent them looking at what you had here in
> details, and there were a few things I have tweaked before applying
> the patch.

Thanks!

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


Hi,
Your patch require a rebase. Please provide the latest rebase patch.
=== applying patch ./v17-0001-Make-more-use-of-get_dirent_type-in-place-of-sta.patch
patching file src/backend/access/heap/rewriteheap.c
Hunk #1 FAILED at 113.
Hunk #2 FAILED at 1213.
Hunk #3 FAILED at 1221.
3 out of 3 hunks FAILED -- saving rejects to file src/backend/access/heap/rewriteheap.c.rej

--
Ibrar Ahmed
On Sat, Sep 03, 2022 at 10:47:32AM +0500, Ibrar Ahmed wrote:
> Your patch require a rebase. Please provide the latest rebase patch.

I was looking for a CF entry yesterday when I looked at this patch,
missing https://commitfest.postgresql.org/39/3496/.  This has been
applied as of bfb9dfd.
--
Michael

Attachment