
O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
Hi hackers,

Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
avoid individually syncing all database files after a crash.  However,
as noted earlier this year [0], there are still a number of O(n) tasks
that affect startup and checkpointing that I'd like to improve.
Below, I've attempted to summarize each task and to offer ideas for
improving matters.  I'll likely split each of these into its own
thread, given there is community interest in such changes.

1) CheckPointSnapBuild(): This function loops through
   pg_logical/snapshots to remove all snapshots that are no longer
   needed.  If there are many entries in this directory, this can take
   a long time.  The note above this function indicates that this is
   done during checkpoints simply because it is convenient.  IIUC
   there is no requirement that this function actually completes for a
   given checkpoint.  My current idea is to move this to a new
   maintenance worker.
2) CheckPointLogicalRewriteHeap(): This function loops through
   pg_logical/mappings to remove old mappings and flush all remaining
   ones.  IIUC there is no requirement that the "remove old mappings"
   part must complete for a given checkpoint, but the "flush all
   remaining" portion allows replay after a checkpoint to only "deal
   with the parts of a mapping that have been written out after the
   checkpoint started."  Therefore, I think we should move the "remove
   old mappings" part to a new maintenance worker (probably the same
   one as for 1), and we should consider using syncfs() for the "flush
   all remaining" part.  (I suspect the main argument against the
   latter will be that it could cause IO spikes.)
3) RemovePgTempFiles(): This step can delay startup if there are many
   temporary files to individually remove.  This step is already
   optionally done after a crash via the remove_temp_files_after_crash
   GUC.  I propose that we have startup move the temporary file
   directories aside and create new ones, and then a separate worker
   (probably the same one from 1 and 2) could clean up the old files.
4) StartupReorderBuffer(): This step deletes logical slot data that
   has been spilled to disk.  This code appears to be written to avoid
   deleting different types of files in these directories, but AFAICT
   there shouldn't be any other files.  Therefore, I think we could do
   something similar to 3 (i.e., move the directories aside during
   startup and clean them up via a new maintenance worker).
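The "move aside" approach in items 3 and 4 can be sketched as follows. This is an illustrative, standalone C pattern, not PostgreSQL source; the directory names and function names are hypothetical:

```c
/*
 * Illustrative sketch only -- not PostgreSQL source.  Startup renames a
 * directory full of removable files to a "to be deleted" name and
 * recreates an empty one, so the O(n) unlink loop can run later in a
 * background process instead of delaying startup.
 */
#include <assert.h>
#include <stdio.h>				/* rename() */
#include <stdlib.h>				/* system(), used by the checks below */
#include <sys/stat.h>			/* mkdir(), stat() */

/* One cheap rename() plus one mkdir() instead of O(n) unlink() calls. */
int
stage_dir_for_cleanup(const char *dir, const char *aside)
{
	if (rename(dir, aside) != 0)
		return -1;
	return mkdir(dir, 0700);
}

/* Helper for the checks below. */
int
path_exists(const char *path)
{
	struct stat st;

	return stat(path, &st) == 0;
}
```

Startup then pays for one rename() and one mkdir() regardless of how many files the directory contains, and the O(n) unlink loop moves off the critical path.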

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing.  I imagine such a
process would come in handy down the road, too.  WDYT?

Nathan

[0] https://postgr.es/m/32B59582-AA6C-4609-B08F-2256A271F7A5%40amazon.com


Re: O(n) tasks cause lengthy startups and checkpoints

From
SATYANARAYANA NARLAPURAM
Date:
+1 to the idea. I don't see a reason why the checkpointer has to do all of that. Keeping checkpoints to the minimal essential work helps servers recover faster in the event of a crash.

RemoveOldXlogFiles() is also an O(n) operation that could at least be avoided during the end-of-recovery (CHECKPOINT_END_OF_RECOVERY) checkpoint. When a sufficient number of WAL files have accumulated and the previous checkpoint did not get a chance to clean them up, this can prolong the server's unavailability.

    RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);




Re: O(n) tasks cause lengthy startups and checkpoints

From
Andres Freund
Date:
Hi,

On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:
> I realize adding a new maintenance worker might be a bit heavy-handed,
> but I think it would be nice to have somewhere to offload tasks that
> really shouldn't impact startup and checkpointing.  I imagine such a
> process would come in handy down the road, too.  WDYT?

-1. I think the overhead of an additional worker is disproportionate here. And
there are simplicity benefits in having a predictable cleanup interlock as well.

I think particularly for the snapshot stuff it'd be better to optimize away
unnecessary snapshot files, rather than making the cleanup more asynchronous.

Greetings,

Andres Freund



Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/1/21, 2:56 PM, "Andres Freund" <andres@anarazel.de> wrote:
> On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:
>> I realize adding a new maintenance worker might be a bit heavy-handed,
>> but I think it would be nice to have somewhere to offload tasks that
>> really shouldn't impact startup and checkpointing.  I imagine such a
>> process would come in handy down the road, too.  WDYT?
>
> -1. I think the overhead of an additional worker is disproportionate here. And
> there are simplicity benefits in having a predictable cleanup interlock as well.

Another idea I had was to put some upper limit on how much time is
spent on such tasks.  For example, a checkpoint would only spend X
minutes on CheckPointSnapBuild() before giving up until the next one.
I think the main downside of that approach is that it could lead to
unbounded growth, so perhaps we would limit (or even skip) such tasks
only for end-of-recovery and shutdown checkpoints.  Perhaps the
startup tasks could be limited in a similar fashion.

> I think particularly for the snapshot stuff it'd be better to optimize away
> unnecessary snapshot files, rather than making the cleanup more asynchronous.

I can look into this.  Any pointers would be much appreciated.

Nathan


Re: O(n) tasks cause lengthy startups and checkpoints

From
"Euler Taveira"
Date:
On Wed, Dec 1, 2021, at 9:19 PM, Bossart, Nathan wrote:
> On 12/1/21, 2:56 PM, "Andres Freund" <andres@anarazel.de> wrote:
> > On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:
> >> I realize adding a new maintenance worker might be a bit heavy-handed,
> >> but I think it would be nice to have somewhere to offload tasks that
> >> really shouldn't impact startup and checkpointing.  I imagine such a
> >> process would come in handy down the road, too.  WDYT?
> >
> > -1. I think the overhead of an additional worker is disproportionate here. And
> > there are simplicity benefits in having a predictable cleanup interlock as well.
>
> Another idea I had was to put some upper limit on how much time is
> spent on such tasks.  For example, a checkpoint would only spend X
> minutes on CheckPointSnapBuild() before giving up until the next one.
> I think the main downside of that approach is that it could lead to
> unbounded growth, so perhaps we would limit (or even skip) such tasks
> only for end-of-recovery and shutdown checkpoints.  Perhaps the
> startup tasks could be limited in a similar fashion.

Saying that a certain task is O(n) doesn't mean it needs a separate
process to handle it. Do you have a use case or, even better, numbers
(% of checkpoint / startup time) that make your proposal worthwhile?

I would try to optimize (1) and (2). However, delayed removal can become a
long-term issue if the new routine cannot keep up with the pace of file
creation (especially if checkpoints are far apart).

For (3), there is already a GUC that avoids the slowdown during startup.
Use it if you think startup time is more important than the disk space
occupied by useless files.

For (4), you are forgetting that the on-disk state of replication slots is
stored in pg_replslot/SLOTNAME/state. It seems you cannot just rename the
replication slot directory and copy the state file. What happens if there is a
crash before the state file is copied?

While we are talking about items (1), (2) and (4), we could probably add an
option to create some ephemeral logical decoding files on a ramdisk (similar
to the statistics directory). I don't want to hijack this thread, but this
proposal could alleviate the possible issues you pointed out. If people are
interested, I can start a new thread about it.


--
Euler Taveira

Re: O(n) tasks cause lengthy startups and checkpoints

From
Bharath Rupireddy
Date:

+1 for the overall idea of making checkpoints faster. In fact, our
team has been thinking about this problem for a while. If there are a
lot of files that a checkpoint has to loop over and remove, IMO that
task can be delegated to someone else (maybe a background worker
called the background cleaner or bg cleaner; of course, we can have a
GUC to enable or disable it). The checkpointer could just write some
marker files (for instance, snapshot_<cutofflsn> files, with the file
name itself representing the cutoff LSN) so that the new bg cleaner
can remove the snapshot files; similarly, it could write marker files
for other kinds of file removal. Having said that, a new bg cleaner
deleting files asynchronously on behalf of the checkpointer can look
like overkill until we have some numbers showing what this approach
would save. For that purpose, I ran a small experiment to see how
long file deletion usually takes [1]: on an SSD, deleting 1 million
files took about 8 seconds, and I'm sure it would take much longer on
an HDD.
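The marker-file idea might look roughly like this (a sketch under an assumed naming scheme; the cutoff_<lsn> marker prefix and <lsn>.snap layout are hypothetical stand-ins):

```c
/*
 * Illustrative sketch only -- not PostgreSQL source.  The checkpointer
 * writes a marker file whose name encodes the cutoff LSN; a background
 * cleaner later removes any snapshot file whose LSN sorts below it.
 * The "cutoff_<lsn>" marker name and "<lsn>.snap" file layout are
 * assumptions made for this example.
 */
#include <assert.h>
#include <stdlib.h>				/* strtoull() */
#include <string.h>				/* strncmp(), strlen() */

/* Parse the hex LSN out of a "cutoff_<lsn>" marker file name. */
unsigned long long
marker_cutoff_lsn(const char *marker_name)
{
	const char *prefix = "cutoff_";

	if (strncmp(marker_name, prefix, strlen(prefix)) != 0)
		return 0;				/* not a marker file */
	return strtoull(marker_name + strlen(prefix), NULL, 16);
}

/* A "<lsn>.snap" file is removable once its LSN falls below the cutoff. */
int
snap_is_removable(const char *snap_name, unsigned long long cutoff)
{
	return strtoull(snap_name, NULL, 16) < cutoff;
}
```

The cleaner only has to scan directory entries and compare names against the latest marker, so the checkpointer's contribution shrinks to writing one small file.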

The bg cleaner could also be used for RemovePgTempFiles(): the
postmaster would just rename pgsql_tmp to something like
pgsql_tmp_delete and then proceed with server startup, and the bg
cleaner would then delete the files. We could do something similar
for removing/recycling old xlog files and for StartupReorderBuffer().

Another idea could be to parallelize the checkpoint i.e. IIUC, the
tasks that checkpoint do in CheckPointGuts are independent and if we
have some counters like (how many snapshot/mapping files that the
server generated)

[1] on SSD:
deletion of 1000000 files took 7.930380 seconds
deletion of 500000 files took 3.921676 seconds
deletion of 100000 files took 0.768772 seconds
deletion of 50000 files took 0.400623 seconds
deletion of 10000 files took 0.077565 seconds
deletion of 1000 files took 0.006232 seconds

Regards,
Bharath Rupireddy.



Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/1/21, 6:06 PM, "Euler Taveira" <euler@eulerto.com> wrote:
> Saying that a certain task is O(n) doesn't mean it needs a separate
> process to handle it. Do you have a use case or, even better, numbers
> (% of checkpoint / startup time) that make your proposal worthwhile?

I don't have specific numbers on hand, but each of the four functions
I listed is something I routinely see impacting customers.

> For (3), there is already a GUC that avoids the slowdown during startup.
> Use it if you think startup time is more important than the disk space
> occupied by useless files.

Setting remove_temp_files_after_crash to false only prevents temp file
cleanup during a restart after a backend crash.  The cleanup is always
performed during other startups.

> For (4), you are forgetting that the on-disk state of replication slots is
> stored in pg_replslot/SLOTNAME/state. It seems you cannot just rename the
> replication slot directory and copy the state file. What happens if there is a
> crash before the state file is copied?

Good point.  I think it's possible to deal with this, though.  Perhaps
the files that should be deleted on startup should go in a separate
directory, or maybe we could devise a way to ensure the state file is
copied even if there is a crash at an inconvenient time.

Nathan


Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/1/21, 6:48 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> +1 for the overall idea of making checkpoints faster. In fact, our
> team has been thinking about this problem for a while. If there are a
> lot of files that a checkpoint has to loop over and remove, IMO that
> task can be delegated to someone else (maybe a background worker
> called the background cleaner or bg cleaner; of course, we can have a
> GUC to enable or disable it). The checkpointer could just write some

Right.  IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks.  I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like
a natural solution for this problem.  That being said, I'm all for
exploring other ways to handle this.

> Another idea could be to parallelize the checkpoint i.e. IIUC, the
> tasks that checkpoint do in CheckPointGuts are independent and if we
> have some counters like (how many snapshot/mapping files that the
> server generated)

Could you elaborate on this?  Is your idea that the checkpointer would
create worker processes like autovacuum does?

Nathan


Re: O(n) tasks cause lengthy startups and checkpoints

From
Bharath Rupireddy
Date:
On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> Right.  IMO it isn't optimal to have critical things like startup and
> checkpointing depend on somewhat-unrelated tasks.  I understand the
> desire to avoid adding additional processes, and maybe it is a bigger
> hammer than what is necessary to reduce the impact, but it seemed like
> a natural solution for this problem.  That being said, I'm all for
> exploring other ways to handle this.

Having a generic background cleaner process (controllable via a few
GUCs) that can delete a bunch of files (snapshot, mapping, old WAL,
temp files, etc.) or perform other tasks on behalf of the checkpointer
seems to be the easiest solution.

I'm open to other ideas, too.

> > Another idea could be to parallelize the checkpoint i.e. IIUC, the
> > tasks that checkpoint do in CheckPointGuts are independent and if we
> > have some counters like (how many snapshot/mapping files that the
> > server generated)
>
> Could you elaborate on this?  Is your idea that the checkpointer would
> create worker processes like autovacuum does?

Yes, I was thinking that the checkpointer would create one or more
dynamic background workers (we can assume one background worker for
now) to delete the files. If the number of files crosses a threshold
(say, the snapshot file count exceeds it), the new worker gets
spawned; it enumerates the files and deletes the unneeded ones while
the checkpointer proceeds with its other tasks and finishes the
checkpoint. Having said this, I prefer the background cleaner
approach over the dynamic background worker. The advantage of the
background cleaner is that it can take on other tasks (like other
kinds of file deletion).

Another idea could be to use the existing background writer to do the
file deletion while the checkpoint is happening. But again, this
might cause problems, because the bgwriter's flushing of dirty
buffers would get delayed.

Regards,
Bharath Rupireddy.



Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/3/21, 5:57 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Having a generic background cleaner process (controllable via a few
> GUCs) that can delete a bunch of files (snapshot, mapping, old WAL,
> temp files, etc.) or perform other tasks on behalf of the checkpointer
> seems to be the easiest solution.
>
> I'm open to other ideas, too.

I might hack something together for the separate worker approach, if
for no other reason than to make sure I really understand how these
functions work.  If/when a better idea emerges, we can alter course.

Nathan


Re: O(n) tasks cause lengthy startups and checkpoints

From
Bharath Rupireddy
Date:
On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>
>
> I might hack something together for the separate worker approach, if
> for no other reason than to make sure I really understand how these
> functions work.  If/when a better idea emerges, we can alter course.

Thanks. As I said upthread, we've been discussing this approach of
offloading some of the checkpoint tasks (like deleting snapshot
files) internally for quite some time, and I would like to share a
patch that adds a new background cleaner process. It is currently
able to delete the logical replication snapshot files and, if
required, can be extended to do other tasks as well. I don't mind if
it gets rejected. Please have a look.

Regards,
Bharath Rupireddy.

Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/6/21, 3:44 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> I might hack something together for the separate worker approach, if
>> for no other reason than to make sure I really understand how these
>> functions work.  If/when a better idea emerges, we can alter course.
>
> Thanks. As I said upthread, we've been discussing this approach of
> offloading some of the checkpoint tasks (like deleting snapshot
> files) internally for quite some time, and I would like to share a
> patch that adds a new background cleaner process. It is currently
> able to delete the logical replication snapshot files and, if
> required, can be extended to do other tasks as well. I don't mind if
> it gets rejected. Please have a look.

Thanks for sharing!  I've also spent some time on a patch set, which I
intend to share once I have handling for all four tasks (so far I have
handling for CheckPointSnapBuild() and RemovePgTempFiles()).  I'll
take a look at your patch as well.

Nathan


Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/6/21, 11:23 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
>
> Thanks for sharing!  I've also spent some time on a patch set, which I
> intend to share once I have handling for all four tasks (so far I have
> handling for CheckPointSnapBuild() and RemovePgTempFiles()).  I'll
> take a look at your patch as well.

Well, I haven't had a chance to look at your patch, and my patch set
still only has handling for CheckPointSnapBuild() and
RemovePgTempFiles(), but I thought I'd share what I have anyway.  I
split it into 5 patches:

0001 - Adds a new "custodian" auxiliary process that does nothing.
0002 - During startup, remove the pgsql_tmp directories instead of
       only clearing the contents.
0003 - Split temporary file cleanup during startup into two stages.
       The first renames the directories, and the second clears them.
0004 - Moves the second stage from 0003 to the custodian process.
0005 - Moves CheckPointSnapBuild() to the custodian process.

This is still very much a work in progress, and I've done minimal
testing so far.

Nathan


Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Robert Haas
Date:
On Fri, Dec 10, 2021 at 2:03 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> Well, I haven't had a chance to look at your patch, and my patch set
> still only has handling for CheckPointSnapBuild() and
> RemovePgTempFiles(), but I thought I'd share what I have anyway.  I
> split it into 5 patches:
>
> 0001 - Adds a new "custodian" auxiliary process that does nothing.
> 0002 - During startup, remove the pgsql_tmp directories instead of
>        only clearing the contents.
> 0003 - Split temporary file cleanup during startup into two stages.
>        The first renames the directories, and the second clears them.
> 0004 - Moves the second stage from 0003 to the custodian process.
> 0005 - Moves CheckPointSnapBuild() to the custodian process.

I don't know whether this kind of idea is good or not.

One thing we've seen a number of times now is that entrusting the same
process with multiple responsibilities often ends poorly. Sometimes
it's busy with one thing when another thing really needs to be done
RIGHT NOW. Perhaps that won't be an issue here since all of these
things are related to checkpointing, but then the process name should
reflect that rather than making it sound like we can just keep piling
more responsibilities onto this process indefinitely. At some point
that seems bound to become an issue.

Another issue is that we don't want to increase the number of
processes without bound. Processes use memory and CPU resources and if
we run too many of them it becomes a burden on the system. Low-end
systems may not have too many resources in total, and high-end systems
can struggle to fit demanding workloads within the resources that they
have. Maybe it would be cheaper to do more things at once if we were
using threads rather than processes, but that day still seems fairly
far off.

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too. Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed. I wonder why you've had a different
experience.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/13/21, 5:54 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> I don't know whether this kind of idea is good or not.

Thanks for chiming in.  I have an almost-complete patch set that I'm
hoping to post to the lists in the next couple of days.

> One thing we've seen a number of times now is that entrusting the same
> process with multiple responsibilities often ends poorly. Sometimes
> it's busy with one thing when another thing really needs to be done
> RIGHT NOW. Perhaps that won't be an issue here since all of these
> things are related to checkpointing, but then the process name should
> reflect that rather than making it sound like we can just keep piling
> more responsibilities onto this process indefinitely. At some point
> that seems bound to become an issue.

Two of the tasks are cleanup tasks that checkpointing handles at the
moment, and two are cleanup tasks that are done at startup.  For now,
all of these tasks are somewhat nonessential.  There's no requirement
that any of these tasks complete in order to finish startup or
checkpointing.  In fact, outside of preventing the server from running
out of disk space, I don't think there's any requirement that these
tasks run at all.  IMO this would have to be a core tenet of a new
auxiliary process like this.

That being said, I totally understand your point.  If there were a
dozen such tasks handled by a single auxiliary process, that could
cause a new set of problems.  Your checkpointing and startup might be
fast, but you might run out of disk space because our cleanup process
can't handle it all.  So a new worker could end up becoming an
availability risk as well.

> Another issue is that we don't want to increase the number of
> processes without bound. Processes use memory and CPU resources and if
> we run too many of them it becomes a burden on the system. Low-end
> systems may not have too many resources in total, and high-end systems
> can struggle to fit demanding workloads within the resources that they
> have. Maybe it would be cheaper to do more things at once if we were
> using threads rather than processes, but that day still seems fairly
> far off.

I do agree that it is important to be very careful about adding new
processes, and if a better idea for how to handle these tasks emerges,
I will readily abandon my current approach.  Upthread, Andres
mentioned optimizing unnecessary snapshot files, and I mentioned
possibly limiting how much time startup and checkpoints spend on these
tasks.  I don't have too many details for the former, and for the
latter, I'm worried about not being able to keep up.  But if the
prospect of adding a new auxiliary process for this stuff is a non-
starter, perhaps I should explore that approach some more.

> But against all that, if these tasks are slowing down checkpoints and
> that's avoidable, that seems pretty important too. Interestingly, I
> can't say that I've ever seen any of these things be a problem for
> checkpoint or startup speed. I wonder why you've had a different
> experience.

Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.

Nathan


Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/13/21, 9:20 AM, "Justin Pryzby" <pryzby@telsasoft.com> wrote:
> On Mon, Dec 13, 2021 at 08:53:37AM -0500, Robert Haas wrote:
>> Another issue is that we don't want to increase the number of
>> processes without bound. Processes use memory and CPU resources and if
>> we run too many of them it becomes a burden on the system. Low-end
>> systems may not have too many resources in total, and high-end systems
>> can struggle to fit demanding workloads within the resources that they
>> have. Maybe it would be cheaper to do more things at once if we were
>> using threads rather than processes, but that day still seems fairly
>> far off.
>
> Maybe that's an argument that this should be a dynamic background worker
> instead of an auxiliary process.  Then maybe it would be controlled by
> max_parallel_maintenance_workers (or something similar).  The checkpointer
> would need to do these tasks itself if parallel workers were disabled or
> couldn't be launched.

I think this is an interesting idea.  I dislike the prospect of having
two code paths for all this stuff, but if it addresses the concerns
about resource usage, maybe it's worth it.

Nathan


Re: O(n) tasks cause lengthy startups and checkpoints

From
Robert Haas
Date:
On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> > But against all that, if these tasks are slowing down checkpoints and
> > that's avoidable, that seems pretty important too. Interestingly, I
> > can't say that I've ever seen any of these things be a problem for
> > checkpoint or startup speed. I wonder why you've had a different
> > experience.
>
> Yeah, it's difficult for me to justify why users should suffer long
> periods of downtime because startup or checkpointing is taking a very
> long time doing things that are arguably unrelated to startup and
> checkpointing.

Well sure. But I've never actually seen that happen.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/13/21, 12:37 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> > But against all that, if these tasks are slowing down checkpoints and
>> > that's avoidable, that seems pretty important too. Interestingly, I
>> > can't say that I've ever seen any of these things be a problem for
>> > checkpoint or startup speed. I wonder why you've had a different
>> > experience.
>>
>> Yeah, it's difficult for me to justify why users should suffer long
>> periods of downtime because startup or checkpointing is taking a very
>> long time doing things that are arguably unrelated to startup and
>> checkpointing.
>
> Well sure. But I've never actually seen that happen.

I'll admit that surprises me.  As noted elsewhere [0], we were seeing
this enough with pgsql_tmp that we started moving the directory aside
before starting the server.  Discussions about handling this usually
prompt questions about why there are so many temporary files in the
first place (which is fair).  FWIW all four functions noted in my
original message [1] are things I've seen firsthand affecting users.

Nathan

[0] https://postgr.es/m/E7573D54-A8C9-40A8-89D7-0596A36ED124%40amazon.com
[1] https://postgr.es/m/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com


Re: O(n) tasks cause lengthy startups and checkpoints

From
Bruce Momjian
Date:
On Mon, Dec 13, 2021 at 11:05:46PM +0000, Bossart, Nathan wrote:
> On 12/13/21, 12:37 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> > On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> >> > But against all that, if these tasks are slowing down checkpoints and
> >> > that's avoidable, that seems pretty important too. Interestingly, I
> >> > can't say that I've ever seen any of these things be a problem for
> >> > checkpoint or startup speed. I wonder why you've had a different
> >> > experience.
> >>
> >> Yeah, it's difficult for me to justify why users should suffer long
> >> periods of downtime because startup or checkpointing is taking a very
> >> long time doing things that are arguably unrelated to startup and
> >> checkpointing.
> >
> > Well sure. But I've never actually seen that happen.
> 
> I'll admit that surprises me.  As noted elsewhere [0], we were seeing
> this enough with pgsql_tmp that we started moving the directory aside
> before starting the server.  Discussions about handling this usually
> prompt questions about why there are so many temporary files in the
> first place (which is fair).  FWIW all four functions noted in my
> original message [1] are things I've seen firsthand affecting users.

Have we changed temporary file handling in any recent major releases,
meaning: is this a current problem or one already improved in PG 14?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/14/21, 9:00 AM, "Bruce Momjian" <bruce@momjian.us> wrote:
> Have we changed temporary file handling in any recent major releases,
> meaning is this a current problem or one already improved in PG 14.

I haven't noticed any recent improvements while working in this area.

Nathan


Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/14/21, 12:09 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 12/14/21, 9:00 AM, "Bruce Momjian" <bruce@momjian.us> wrote:
>> Have we changed temporary file handling in any recent major releases,
>> meaning is this a current problem or one already improved in PG 14.
>
> I haven't noticed any recent improvements while working in this area.

On second thought, the addition of the remove_temp_files_after_crash
GUC is arguably an improvement since it could prevent files from
accumulating after repeated crashes.

Nathan


Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 12/13/21, 10:21 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> Thanks for chiming in.  I have an almost-complete patch set that I'm
> hoping to post to the lists in the next couple of days.

As promised, here is v2.  This patch set includes handling for all
four tasks noted upthread.  I'd still consider this a work-in-
progress, as I've done minimal testing.  At the very least, it should
demonstrate what an auxiliary process approach might look like.

Nathan


Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Andres Freund
Date:
Hi,

On 2021-12-14 20:23:57 +0000, Bossart, Nathan wrote:
> As promised, here is v2.  This patch set includes handling for all
> four tasks noted upthread.  I'd still consider this a work-in-
> progress, as I've done minimal testing.  At the very least, it should
> demonstrate what an auxiliary process approach might look like.

This generates a compiler warning:
https://cirrus-ci.com/task/5740581082103808?logs=mingw_cross_warning#L378

Greetings,

Andres Freund



Re: O(n) tasks cause lengthy startups and checkpoints

From
Amul Sul
Date:
On Mon, Jan 3, 2022 at 2:56 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-12-14 20:23:57 +0000, Bossart, Nathan wrote:
> > As promised, here is v2.  This patch set includes handling for all
> > four tasks noted upthread.  I'd still consider this a work-in-
> > progress, as I've done minimal testing.  At the very least, it should
> > demonstrate what an auxiliary process approach might look like.
>
> This generates a compiler warning:
> https://cirrus-ci.com/task/5740581082103808?logs=mingw_cross_warning#L378
>

Somehow, I am not getting these compiler warnings on the latest master
head (69872d0bbe6).

Here are the few minor comments for the v2 version, I thought would help:

+ * Copyright (c) 2021, PostgreSQL Global Development Group

Time to change the year :)
--

+
+       /* These operations are really just a minimal subset of
+        * AbortTransaction().  We don't have very many resources to worry
+        * about.
+        */

Incorrect formatting, the first line should be empty in the multiline
code comment.
--

+   XLogRecPtr  logical_rewrite_mappings_cutoff;    /* can remove
older mappings */
+   XLogRecPtr  logical_rewrite_mappings_cutoff_set;

Look like logical_rewrite_mappings_cutoff gets to set only once and
never get reset, if it is true then I think that variable can be
skipped completely and set the initial logical_rewrite_mappings_cutoff
to InvalidXLogRecPtr, that will do the needful.
--

Regards,
Amul



Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
Thanks for your review.

On 1/2/22, 11:00 PM, "Amul Sul" <sulamul@gmail.com> wrote:
> On Mon, Jan 3, 2022 at 2:56 AM Andres Freund <andres@anarazel.de> wrote:
>> This generates a compiler warning:
>> https://cirrus-ci.com/task/5740581082103808?logs=mingw_cross_warning#L378
>
> Somehow, I am not getting these compiler warnings on the latest master
> head (69872d0bbe6).

I attempted to fix this by including time.h in custodian.c.

> Here are the few minor comments for the v2 version, I thought would help:
>
> + * Copyright (c) 2021, PostgreSQL Global Development Group
>
> Time to change the year :)

Fixed in v3.

> +
> +       /* These operations are really just a minimal subset of
> +        * AbortTransaction().  We don't have very many resources to worry
> +        * about.
> +        */
>
> Incorrect formatting, the first line should be empty in the multiline
> code comment.

Fixed in v3.

> +   XLogRecPtr  logical_rewrite_mappings_cutoff;    /* can remove
> older mappings */
> +   XLogRecPtr  logical_rewrite_mappings_cutoff_set;
>
> Look like logical_rewrite_mappings_cutoff gets to set only once and
> never get reset, if it is true then I think that variable can be
> skipped completely and set the initial logical_rewrite_mappings_cutoff
> to InvalidXLogRecPtr, that will do the needful.

I think the problem with this is that when the cutoff is
InvalidXLogRecPtr, it is taken to mean that all logical rewrite files
can be removed.  If we just used the cutoff variable, we could remove
files we need if the custodian ran before the cutoff was set.  I
suppose we could initially set the cutoff to MaxXLogRecPtr to indicate
that the value is not yet set, but I see no real advantage to doing it
that way versus just using a bool.  Speaking of which,
logical_rewrite_mappings_cutoff_set obviously should be a bool.  I've
fixed that in v3.
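To illustrate the distinction, here is a rough standalone sketch (hypothetical names; XLogRecPtr and its sentinel are replaced with plain stand-ins, since this won't compile against the real headers) of why the separate bool avoids the ambiguity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for PostgreSQL's XLogRecPtr and its sentinel. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * With only the cutoff field, InvalidXLogRecPtr is ambiguous: it could mean
 * "not yet computed" or "all mappings are removable".  A separate flag keeps
 * those states distinct, so a custodian that runs before the cutoff is set
 * removes nothing rather than everything.
 */
typedef struct RewriteMappingState
{
    XLogRecPtr  cutoff;     /* mappings older than this may be removed */
    bool        cutoff_set; /* has cutoff been established yet? */
} RewriteMappingState;

static bool
mapping_is_removable(const RewriteMappingState *state, XLogRecPtr mapping_lsn)
{
    /* Never remove anything until the cutoff has been computed. */
    if (!state->cutoff_set)
        return false;
    return mapping_lsn < state->cutoff;
}
```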

Nathan


Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Maxim Orlov
Date:
The code seems to be in good condition. All the tests are running ok with no errors.

I like the whole idea of shifting additional checkpointer jobs as much as possible to another worker. In my view, it is more appropriate to call this worker "bg cleaner" or "bg file cleaner" or smth.

It could be useful for systems with high load, which may deal with deleting many files at once, but I'm not sure about "small" installations. An extra bg worker needs more resources to do occasional deletion of small amounts of files. I really do not know how to do it better; maybe we could have two different code paths switched by a GUC?
 
Should we also think about adding WAL preallocation into the custodian worker from the patch "Pre-allocating WAL files" [1]?

--
Best regards,
Maxim Orlov.

Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 1/14/22, 3:43 AM, "Maxim Orlov" <orlovmg@gmail.com> wrote:
> The code seems to be in good condition. All the tests are running ok with no errors.

Thanks for your review.

> I like the whole idea of shifting additional checkpointer jobs as much as
> possible to another worker. In my view, it is more appropriate to call this
> worker "bg cleaner" or "bg file cleaner" or smth.
>
> It could be useful for systems with high load, which may deal with deleting
> many files at once, but I'm not sure about "small" installations. Extra bg
> worker need more resources to do occasional deletion of small amounts of
> files. I really do not know how to do it better, maybe to have two
> different code paths switched by GUC?

I'd personally like to avoid creating two code paths for the same
thing.  Are there really cases when this one extra auxiliary process
would be too many?  And if so, how would a user know when to adjust
this GUC?  I understand the point that we should introduce new
processes sparingly to avoid burdening low-end systems, but I don't
think we should be afraid to add new ones when it is needed.

That being said, if making the extra worker optional addresses the
concerns about resource usage, maybe we should consider it.  Justin
suggested using something like max_parallel_maintenance_workers
upthread [0].

> Should we also think about adding WAL preallocation into the custodian
> worker from the patch "Pre-allocating WAL files" [1]?

This was brought up in the pre-allocation thread [1].  I don't think
the custodian process would be the right place for it, and I'm also
not as concerned about it because it will generally be a small, fixed,
and configurable amount of work.  In any case, I don't sense a ton of
support for a new auxiliary process in this thread, so I'm hesitant to
go down the same path for pre-allocation.

Nathan

[0] https://postgr.es/m/20211213171935.GX17618%40telsasoft.com
[1] https://postgr.es/m/B2ACCC5A-F9F2-41D9-AC3B-251362A0A254%40amazon.com


Re: O(n) tasks cause lengthy startups and checkpoints

From
Bharath Rupireddy
Date:
On Sat, Jan 15, 2022 at 12:46 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> On 1/14/22, 3:43 AM, "Maxim Orlov" <orlovmg@gmail.com> wrote:
> > The code seems to be in good condition. All the tests are running ok with no errors.
>
> Thanks for your review.
>
> > I like the whole idea of shifting additional checkpointer jobs as much
> > as possible to another worker. In my view, it is more appropriate to
> > call this worker "bg cleaner" or "bg file cleaner" or smth.

I personally prefer "background cleaner" as the new process name in
line with "background writer" and "background worker".

> > It could be useful for systems with high load, which may deal with
> > deleting many files at once, but I'm not sure about "small"
> > installations. Extra bg worker need more resources to do occasional
> > deletion of small amounts of files. I really do not know how to do it
> > better, maybe to have two different code paths switched by GUC?
>
> I'd personally like to avoid creating two code paths for the same
> thing.  Are there really cases when this one extra auxiliary process
> would be too many?  And if so, how would a user know when to adjust
> this GUC?  I understand the point that we should introduce new
> processes sparingly to avoid burdening low-end systems, but I don't
> think we should be afraid to add new ones when it is needed.

IMO, having a GUC for enabling/disabling this new worker and its
related code would be a better idea. The reason is that if postgres
has no replication slots at all (which is quite possible in real
stand-alone production environments), or if the file enumeration
(directory traversal and file removal) is fast enough on the servers,
there's no point in having this new worker; the checkpointer itself
can take care of the work as it is doing today.

> That being said, if making the extra worker optional addresses the
> concerns about resource usage, maybe we should consider it.  Justin
> suggested using something like max_parallel_maintenance_workers
> upthread [0].

I don't think this new process should be tied to
max_parallel_maintenance_workers; instead, I prefer to have it as an
auxiliary process much like "background writer", "wal writer", and so
on.

I think now it's the time for us to run some use cases and get the
perf reports to see how beneficial this new process is going to be, in
terms of improving the checkpoint timings.

> > Should we also think about adding WAL preallocation into the custodian
> > worker from the patch "Pre-allocating WAL files" [1]?
>
> This was brought up in the pre-allocation thread [1].  I don't think
> the custodian process would be the right place for it, and I'm also
> not as concerned about it because it will generally be a small, fixed,
> and configurable amount of work.  In any case, I don't sense a ton of
> support for a new auxiliary process in this thread, so I'm hesitant to
> go down the same path for pre-allocation.
>
> [0] https://postgr.es/m/20211213171935.GX17618%40telsasoft.com
> [1] https://postgr.es/m/B2ACCC5A-F9F2-41D9-AC3B-251362A0A254%40amazon.com

I think the idea of weaving every non-critical task to a common
background process is a good idea but let's not mix up with the new
background cleaner process here for now, at least until we get some
numbers and prove that the idea proposed here will be beneficial.

Regards,
Bharath Rupireddy.



Re: O(n) tasks cause lengthy startups and checkpoints

From
"Bossart, Nathan"
Date:
On 1/14/22, 11:26 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> On Sat, Jan 15, 2022 at 12:46 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>> I'd personally like to avoid creating two code paths for the same
>> thing.  Are there really cases when this one extra auxiliary process
>> would be too many?  And if so, how would a user know when to adjust
>> this GUC?  I understand the point that we should introduce new
>> processes sparingly to avoid burdening low-end systems, but I don't
>> think we should be afraid to add new ones when it is needed.
>
> IMO, having a GUC for enabling/disabling this new worker and it's
> related code would be a better idea. The reason is that if the
> postgres has no replication slots at all(which is quite possible in
> real stand-alone production environments) or if the file enumeration
> (directory traversal and file removal) is fast enough on the servers,
> there's no point having this new worker, the checkpointer itself can
> take care of the work as it is doing today.

IMO introducing a GUC wouldn't be doing users many favors.  Their
cluster might work just fine for a long time before they begin
encountering problems during startups/checkpoints.  Once the user
discovers the underlying reason, they have to then find a GUC for
enabling a special background worker that makes this problem go away.
Why not just fix the problem for everybody by default?

I've been thinking about what other approaches we could take besides
creating more processes.  The root of the problem seems to be that
there are a number of tasks that are performed synchronously that can
take a long time.  The process approach essentially makes these tasks
asynchronous so that they do not block startup and checkpointing.  But
perhaps this can be done in an existing process, possibly even the
checkpointer.  Like the current WAL pre-allocation patch, we could do
this work when the checkpointer isn't checkpointing, and we could also
do small amounts of work in CheckpointWriteDelay() (or a new function
called in a similar way).  In theory, this would help avoid delaying
checkpoints too long while doing cleanup at every opportunity to lower
the chances it falls far behind.
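As a rough standalone sketch of that bounded-work idea (hypothetical names; the actual directory walk is reduced to a counter here, since the real code would live next to CheckpointWriteDelay()):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Instead of removing all pending files in one synchronous pass, the
 * checkpointer could clean up a bounded batch each time it has spare
 * cycles.  "pending_files" stands in for a directory of files to unlink.
 */
static int pending_files = 0;

/*
 * Remove at most max_items files; return true if work remains, so the
 * caller knows to come back at the next opportunity.
 */
static bool
cleanup_some(int max_items)
{
    int         removed = 0;

    while (pending_files > 0 && removed < max_items)
    {
        pending_files--;        /* stand-in for unlink() of one file */
        removed++;
    }
    return pending_files > 0;
}
```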

Nathan


Re: O(n) tasks cause lengthy startups and checkpoints

From
Andres Freund
Date:
Hi,

On 2022-02-16 16:50:57 -0800, Nathan Bossart wrote:
> + * The custodian process is new as of Postgres 15.

I think this kind of comment tends to age badly and not be very useful.


> Its main purpose is to
> + * offload tasks that could otherwise delay startup and checkpointing, but
> + * it needn't be restricted to just those things.  Offloaded tasks should
> + * not be synchronous (e.g., checkpointing shouldn't need to wait for the
> + * custodian to complete a task before proceeding).  Also, ensure that any
> + * offloaded tasks are either not required during single-user mode or are
> + * performed separately during single-user mode.
> + *
> + * The custodian is not an essential process and can shutdown quickly when
> + * requested.  The custodian will wake up approximately once every 5
> + * minutes to perform its tasks, but backends can (and should) set its
> + * latch to wake it up sooner.

Hm. This kind of policy makes it easy to introduce bugs where the
occasional runs mask forgotten notifications, etc.


> + * Normal termination is by SIGTERM, which instructs the bgwriter to
> + * exit(0).

s/bgwriter/.../

> Emergency termination is by SIGQUIT; like any backend, the
> + * custodian will simply abort and exit on SIGQUIT.
> + *
> + * If the custodian exits unexpectedly, the postmaster treats that the same
> + * as a backend crash: shared memory may be corrupted, so remaining
> + * backends should be killed by SIGQUIT and then a recovery cycle started.

This doesn't really seem useful stuff to me.



> +    /*
> +     * If an exception is encountered, processing resumes here.
> +     *
> +     * You might wonder why this isn't coded as an infinite loop around a
> +     * PG_TRY construct.  The reason is that this is the bottom of the
> +     * exception stack, and so with PG_TRY there would be no exception handler
> +     * in force at all during the CATCH part.  By leaving the outermost setjmp
> +     * always active, we have at least some chance of recovering from an error
> +     * during error recovery.  (If we get into an infinite loop thereby, it
> +     * will soon be stopped by overflow of elog.c's internal state stack.)
> +     *
> +     * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask
> +     * (to wit, BlockSig) will be restored when longjmp'ing to here.  Thus,
> +     * signals other than SIGQUIT will be blocked until we complete error
> +     * recovery.  It might seem that this policy makes the HOLD_INTERRUPS()
> +     * call redundant, but it is not since InterruptPending might be set
> +     * already.
> +     */

I think it's bad to copy this comment into even more places.


> +        /* Since not using PG_TRY, must reset error stack by hand */
> +    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
> +    {

I also think it's a bad idea to introduce even more copies of the error
handling body.  I think we need to unify this. And yes, it's unfair to stick
you with it, but it's been a while since a new aux process has been added.


> +        /*
> +         * These operations are really just a minimal subset of
> +         * AbortTransaction().  We don't have very many resources to worry
> +         * about.
> +         */

Given what you're proposing this for, are you actually confident that we don't
need more than this?


> From d9826f75ad2259984d55fc04622f0b91ebbba65a Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn@amazon.com>
> Date: Sun, 5 Dec 2021 19:38:20 -0800
> Subject: [PATCH v5 2/8] Also remove pgsql_tmp directories during startup.
>
> Presently, the server only removes the contents of the temporary
> directories during startup, not the directory itself.  This changes
> that to prepare for future commits that will move temporary file
> cleanup to a separate auxiliary process.

Is this actually safe? Is there a guarantee no process can access a temp table
stored in one of these? Because without WAL guaranteeing consistency, we can't
just access e.g. temp tables written before a crash.


> +extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
> +                            bool unlink_all);

I don't like functions with multiple consecutive booleans, they tend to get
swapped around. Why not just split unlink_all=true/false into different
functions?
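A rough standalone sketch of that split (hypothetical names; the real directory removal is stubbed out with a string so the shape of the API is visible):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Records what the stubbed-out worker was asked to do. */
static char last_action[64];

/* One static worker keeps the shared logic in a single place. */
static void
remove_dir_internal(const char *path, bool missing_ok, bool unlink_all)
{
    /* stand-in for the real directory walk */
    snprintf(last_action, sizeof(last_action), "%s:%d:%d",
             path, (int) missing_ok, (int) unlink_all);
}

/* Remove only the directory's contents, leaving the directory itself. */
static void
RemovePgTempDirContents(const char *path, bool missing_ok)
{
    remove_dir_internal(path, missing_ok, false);
}

/* Remove the directory and everything in it. */
static void
RemovePgTempDirRecursively(const char *path, bool missing_ok)
{
    remove_dir_internal(path, missing_ok, true);
}
```

Call sites then read unambiguously, with no adjacent booleans to swap.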


> Subject: [PATCH v5 3/8] Split pgsql_tmp cleanup into two stages.
>
> First, pgsql_tmp directories will be renamed to stage them for
> removal.

What if the target name already exists?


> Then, all files in pgsql_tmp are removed before removing
> the staged directories themselves.  This change is being made in
> preparation for a follow-up change to offload most temporary file
> cleanup to the new custodian process.
>
> Note that temporary relation files cannot be cleaned up via the
> aforementioned strategy and will not be offloaded to the custodian.

This should be in the prior commit message, otherwise people will ask the same
question as I did.



> +    /*
> +     * Find a name for the stage directory.  We just increment an integer at the
> +     * end of the name until we find one that doesn't exist.
> +     */
> +    for (int n = 0; n <= INT_MAX; n++)
> +    {
> +        snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
> +                 PG_TEMP_DIR_TO_REMOVE_PREFIX, n);

Uninterruptible loops up to INT_MAX do not seem like a good idea.


> +        dir = AllocateDir(stage_path);
> +        if (dir == NULL)
> +        {

Why not just use stat()? That's cheaper, and there's no
time-of-check-to-time-of-use issue here, we're the only one writing.


> +            if (errno == ENOENT)
> +                break;
> +
> +            ereport(LOG,
> +                    (errcode_for_file_access(),
> +                     errmsg("could not open directory \"%s\": %m",
> +                            stage_path)));

I think this kind of lenience is just hiding bugs.
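A rough standalone sketch of what the reworked name search could look like (hypothetical names; a real version would also CHECK_FOR_INTERRUPTS() in the loop and use ereport rather than a plain return code):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Bound the search instead of scanning up to INT_MAX uninterruptibly. */
#define MAX_STAGE_ATTEMPTS 1000

/*
 * Find a free staging-directory name under parent by appending an integer
 * to prefix.  Uses stat(), which is cheaper than opening the directory;
 * failure to stat() means the name is free.  Returns the chosen suffix and
 * fills out with the full path, or returns -1 if no free name was found
 * (which the caller should treat as a hard error).
 */
static int
find_free_stage_suffix(const char *parent, const char *prefix,
                       char *out, size_t outlen)
{
    for (int n = 0; n < MAX_STAGE_ATTEMPTS; n++)
    {
        struct stat st;

        snprintf(out, outlen, "%s/%s%d", parent, prefix, n);
        if (stat(out, &st) != 0)
            return n;           /* ENOENT (or similar): name is free */
    }
    return -1;
}
```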




>  File
>  PathNameCreateTemporaryFile(const char *path, bool error_on_failure)
> @@ -3175,7 +3178,8 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
>       */
>      spc_dir = AllocateDir("pg_tblspc");
>
> -    while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
> +    while (!ShutdownRequestPending &&
> +           (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)

Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
not do their jobs when ShutdownRequestPending is set, particularly without a
huge fat comment.


>      {
>          if (strcmp(spc_de->d_name, ".") == 0 ||
>              strcmp(spc_de->d_name, "..") == 0)
> @@ -3211,6 +3215,14 @@ RemovePgTempFiles(bool stage, bool remove_relation_files)
>       * would create a race condition.  It's done separately, earlier in
>       * postmaster startup.
>       */
> +
> +    /*
> +     * If we just staged some pgsql_tmp directories for removal, wake up the
> +     * custodian process so that it deletes all the files in the staged
> +     * directories as well as the directories themselves.
> +     */
> +    if (stage && ProcGlobal->custodianLatch)
> +        SetLatch(ProcGlobal->custodianLatch);

Just signalling without letting the custodian know what it's expected to do
strikes me as a bad idea.
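A rough standalone sketch of one way to let the custodian know what is being requested (hypothetical names; the shared-memory field, locking/atomics, and the actual SetLatch() call are omitted or stubbed for brevity):

```c
#include <assert.h>

/* Tasks a backend can ask the custodian to perform. */
typedef enum CustodianTask
{
    CUSTODIAN_REMOVE_TEMP_DIRS = 1 << 0,
    CUSTODIAN_REMOVE_SNAPSHOTS = 1 << 1,
    CUSTODIAN_REMOVE_MAPPINGS  = 1 << 2
} CustodianTask;

/* Stand-in for a field in shared memory (would need pg_atomic_* or a lock). */
static unsigned requested_tasks = 0;

/* Requester side: record the task, then wake the custodian. */
static void
RequestCustodianTask(CustodianTask task)
{
    requested_tasks |= task;
    /* then: SetLatch(ProcGlobal->custodianLatch); */
}

/* Custodian side: take and clear the pending task set atomically-in-spirit. */
static unsigned
ConsumeCustodianTasks(void)
{
    unsigned    tasks = requested_tasks;

    requested_tasks = 0;
    return tasks;
}
```

The custodian then acts only on the bits it consumed, so a latch set for one task cannot silently stand in for a forgotten notification about another.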



> From 9c2013d53cc5c857ef8aca3df044613e66215aee Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn@amazon.com>
> Date: Sun, 5 Dec 2021 22:02:40 -0800
> Subject: [PATCH v5 5/8] Move removal of old serialized snapshots to custodian.
>
> This was only done during checkpoints because it was a convenient
> place to put it.  However, if there are many snapshots to remove,
> it can significantly extend checkpoint time.  To avoid this, move
> this work to the newly-introduced custodian process.
> ---
>  src/backend/access/transam/xlog.c           |  2 --
>  src/backend/postmaster/custodian.c          | 11 +++++++++++
>  src/backend/replication/logical/snapbuild.c | 13 +++++++------
>  src/include/replication/snapbuild.h         |  2 +-
>  4 files changed, 19 insertions(+), 9 deletions(-)

Why does this not open us up to new xid wraparound issues? Before there was a
hard bound on how long these files could linger around. Now there's not
anymore.



> -    while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)
> +    while (!ShutdownRequestPending &&
> +           (snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL)

I really really strenuously object to these checks.



> Subject: [PATCH v5 6/8] Move removal of old logical rewrite mapping files to
>  custodian.

> If there are many such files to remove, checkpoints can take much
> longer.  To avoid this, move this work to the newly-introduced
> custodian process.

Same wraparound concerns.


> +#include "postmaster/bgwriter.h"

I think it's a bad idea to put these functions into bgwriter.h



> From cfca62dd55d7be7e0025e5625f18d3ab9180029c Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn@amazon.com>
> Date: Mon, 13 Dec 2021 20:20:12 -0800
> Subject: [PATCH v5 7/8] Use syncfs() in CheckPointLogicalRewriteHeap() for
>  shutdown and end-of-recovery checkpoints.
>
> This may save quite a bit of time when there are many mapping files
> to flush to disk.

Seems like a mostly independent proposal.


> +#ifdef HAVE_SYNCFS
> +
> +    /*
> +     * If we are doing a shutdown or end-of-recovery checkpoint, let's use
> +     * syncfs() to flush the mappings to disk instead of flushing each one
> +     * individually.  This may save us quite a bit of time when there are many
> +     * such files to flush.
> +     */

I am doubtful this is a good idea. This will cause all dirty files to be
written back, even ones we don't need to be written back. At once. Very
possibly *slowing down* the shutdown.

What is even the theory of the case here? That there's so many dirty mapping
files that fsyncing them will take too long? That iterating would take too
long?


> From b5923b1b76a1fab6c21d6aec086219160473f464 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <nathandbossart@gmail.com>
> Date: Fri, 11 Feb 2022 09:43:57 -0800
> Subject: [PATCH v5 8/8] Move removal of spilled logical slot data to
>  custodian.
>
> If there are many such files, startup can take much longer than
> necessary.  To handle this, startup creates a new slot directory,
> copies the state file, and swaps the new directory with the old
> one.  The custodian then asynchronously cleans up the old slot
> directory.

You guessed it: I don't see what prevents wraparound issues.


> 5 files changed, 317 insertions(+), 9 deletions(-)

This seems such an increase in complexity and fragility that I really doubt
this is a good idea.



> +/*
> + * This function renames the given directory with a special suffix that the
> + * custodian will know to look for.  An integer is appended to the end of the
> + * new directory name in case previously staged slot directories have not yet
> + * been removed.
> + */
> +static void
> +StageSlotDirForRemoval(const char *slotname, const char *slotpath)
> +{
> +    char        stage_path[MAXPGPATH];
> +
> +    /*
> +     * Find a name for the stage directory.  We just increment an integer at the
> +     * end of the name until we find one that doesn't exist.
> +     */
> +    for (int n = 0; n <= INT_MAX; n++)
> +    {
> +        DIR           *dir;
> +
> +        sprintf(stage_path, "pg_replslot/%s.to_remove_%d", slotname, n);
> +
> +        dir = AllocateDir(stage_path);
> +        if (dir == NULL)
> +        {
> +            if (errno == ENOENT)
> +                break;
> +
> +            ereport(ERROR,
> +                    (errcode_for_file_access(),
> +                     errmsg("could not open directory \"%s\": %m",
> +                            stage_path)));
> +        }
> +        FreeDir(dir);
> +
> +        stage_path[0] = '\0';
> +    }

Copy of "find free name" logic.


Greetings,

Andres Freund



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
Hi Andres,

I appreciate the feedback.

On Wed, Feb 16, 2022 at 05:50:52PM -0800, Andres Freund wrote:
>> +        /* Since not using PG_TRY, must reset error stack by hand */
>> +    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
>> +    {
> 
> I also think it's a bad idea to introduce even more copies of the error
> handling body.  I think we need to unify this. And yes, it's unfair to stick
> you with it, but it's been a while since a new aux process has been added.

+1, I think this is useful refactoring.  I might spin this off to its own
thread.

>> +        /*
>> +         * These operations are really just a minimal subset of
>> +         * AbortTransaction().  We don't have very many resources to worry
>> +         * about.
>> +         */
> 
> Given what you're proposing this for, are you actually confident that we don't
> need more than this?

I will give this a closer look.

>> +extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok,
>> +                            bool unlink_all);
> 
> I don't like functions with multiple consecutive booleans, they tend to get
> swapped around. Why not just split unlink_all=true/false into different
> functions?

Will do.

>> Subject: [PATCH v5 3/8] Split pgsql_tmp cleanup into two stages.
>>
>> First, pgsql_tmp directories will be renamed to stage them for
>> removal.
> 
> What if the target name already exists?

The integer at the end of the target name is incremented until we find a
unique name.

>> Note that temporary relation files cannot be cleaned up via the
>> aforementioned strategy and will not be offloaded to the custodian.
> 
> This should be in the prior commit message, otherwise people will ask the same
> question as I did.

Will do.

>> +    /*
>> +     * Find a name for the stage directory.  We just increment an integer at the
>> +     * end of the name until we find one that doesn't exist.
>> +     */
>> +    for (int n = 0; n <= INT_MAX; n++)
>> +    {
>> +        snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
>> +                 PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
> 
> Uninterruptible loops up to INT_MAX do not seem like a good idea.

I modeled this after ChooseRelationName() in indexcmds.c.  Looking again, I
see that it loops forever until a unique name is found.  I suspect this is
unlikely to be a problem in practice.  What strategy would you recommend
for choosing a unique name?  Should we just append a couple of random
characters?

>> +        dir = AllocateDir(stage_path);
>> +        if (dir == NULL)
>> +        {
> 
> Why not just use stat()? That's cheaper, and there's no
> time-to-check-time-to-use issue here, we're the only one writing.

I'm not sure why I didn't use stat().  I will update this.

>> -    while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
>> +    while (!ShutdownRequestPending &&
>> +           (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
> 
> Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
> not do their jobs when ShutdownRequestPending is set, particularly without a
> huge fat comment.

The idea was to avoid delaying shutdown because we're waiting for the
custodian to finish relatively nonessential tasks.  Another option might be
to just exit immediately when the custodian receives a shutdown request.

>> +    /*
>> +     * If we just staged some pgsql_tmp directories for removal, wake up the
>> +     * custodian process so that it deletes all the files in the staged
>> +     * directories as well as the directories themselves.
>> +     */
>> +    if (stage && ProcGlobal->custodianLatch)
>> +        SetLatch(ProcGlobal->custodianLatch);
> 
> Just signalling without letting the custodian know what it's expected to do
> strikes me as a bad idea.

Good point.  I will work on that.

>> From 9c2013d53cc5c857ef8aca3df044613e66215aee Mon Sep 17 00:00:00 2001
>> From: Nathan Bossart <bossartn@amazon.com>
>> Date: Sun, 5 Dec 2021 22:02:40 -0800
>> Subject: [PATCH v5 5/8] Move removal of old serialized snapshots to custodian.
>>
>> This was only done during checkpoints because it was a convenient
>> place to put it.  However, if there are many snapshots to remove,
>> it can significantly extend checkpoint time.  To avoid this, move
>> this work to the newly-introduced custodian process.
>> ---
>>  src/backend/access/transam/xlog.c           |  2 --
>>  src/backend/postmaster/custodian.c          | 11 +++++++++++
>>  src/backend/replication/logical/snapbuild.c | 13 +++++++------
>>  src/include/replication/snapbuild.h         |  2 +-
>>  4 files changed, 19 insertions(+), 9 deletions(-)
> 
> Why does this not open us up to new xid wraparound issues? Before there was a
> hard bound on how long these files could linger around. Now there's not
> anymore.

Sorry, I'm probably missing something obvious, but I'm not sure how this
adds transaction ID wraparound risk.  These files are tied to LSNs, and
AFAIK they won't impact slots' xmins.

>> +#ifdef HAVE_SYNCFS
>> +
>> +    /*
>> +     * If we are doing a shutdown or end-of-recovery checkpoint, let's use
>> +     * syncfs() to flush the mappings to disk instead of flushing each one
>> +     * individually.  This may save us quite a bit of time when there are many
>> +     * such files to flush.
>> +     */
> 
> I am doubtful this is a good idea. This will cause all dirty files to be
> written back, even ones we don't need to be written back. At once. Very
> possibly *slowing down* the shutdown.
> 
> What is even the theory of the case here? That there's so many dirty mapping
> files that fsyncing them will take too long? That iterating would take too
> long?

Well, yes.  My idea was to model this after 61752af, which allows using
syncfs() instead of individually fsync-ing every file in the data
directory.  However, I would likely need to introduce a GUC because 1) as
you pointed out, it might be slower and 2) syncfs() doesn't report errors
on older versions of Linux.

TBH I do feel like this one is a bit of a stretch, so I am okay with
leaving it out for now.

>> 5 files changed, 317 insertions(+), 9 deletions(-)
> 
> This seems such an increase in complexity and fragility that I really doubt
> this is a good idea.

I think that's a fair point.  I'm okay with leaving this one out for now,
too.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Andres Freund
Date:
Hi,

On 2022-02-16 20:14:04 -0800, Nathan Bossart wrote:
> >> -    while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
> >> +    while (!ShutdownRequestPending &&
> >> +           (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
> >
> > Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
> > not do their jobs when ShutdownRequestPending is set, particularly without a
> > huge fat comment.
>
> The idea was to avoid delaying shutdown because we're waiting for the
> custodian to finish relatively nonessential tasks.  Another option might be
> to just exit immediately when the custodian receives a shutdown request.

I think we should just not do either of these and let the functions
finish. For the cases where shutdown really needs to be immediate
there's, uhm, immediate mode shutdowns.


> > Why does this not open us up to new xid wraparound issues? Before there was a
> > hard bound on how long these files could linger around. Now there's not
> > anymore.
>
> Sorry, I'm probably missing something obvious, but I'm not sure how this
> adds transaction ID wraparound risk.  These files are tied to LSNs, and
> AFAIK they won't impact slots' xmins.

They're accessed by xid. The LSN is just for cleanup. Accessing files
left over from a previous transaction with the same xid wouldn't be
good - we'd read wrong catalog state for decoding...

Andres



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote:
> On 2022-02-16 20:14:04 -0800, Nathan Bossart wrote:
>> >> -    while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
>> >> +    while (!ShutdownRequestPending &&
>> >> +           (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
>> >
>> > Uh, huh? It strikes me as a supremely bad idea to have functions *silently*
>> > not do their jobs when ShutdownRequestPending is set, particularly without a
>> > huge fat comment.
>>
>> The idea was to avoid delaying shutdown because we're waiting for the
>> custodian to finish relatively nonessential tasks.  Another option might be
>> to just exit immediately when the custodian receives a shutdown request.
> 
> I think we should just not do either of these and let the functions
> finish. For the cases where shutdown really needs to be immediate
> there's, uhm, immediate mode shutdowns.

Alright.

>> > Why does this not open us up to new xid wraparound issues? Before there was a
>> > hard bound on how long these files could linger around. Now there's not
>> > anymore.
>>
>> Sorry, I'm probably missing something obvious, but I'm not sure how this
>> adds transaction ID wraparound risk.  These files are tied to LSNs, and
>> AFAIK they won't impact slots' xmins.
> 
> They're accessed by xid. The LSN is just for cleanup. Accessing files
> left over from a previous transaction with the same xid wouldn't be
> good - we'd read wrong catalog state for decoding...

Okay, that part makes sense to me.  However, I'm still confused about how
this is handled today and why moving cleanup to a separate auxiliary
process makes matters worse.  I've done quite a bit of reading, and I
haven't found anything that seems intended to prevent this problem.  Do you
have any pointers?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Andres Freund
Date:
Hi,

On 2022-02-17 10:23:37 -0800, Nathan Bossart wrote:
> On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote:
> > They're accessed by xid. The LSN is just for cleanup. Accessing files
> > left over from a previous transaction with the same xid wouldn't be
> > good - we'd read wrong catalog state for decoding...
> 
> Okay, that part makes sense to me.  However, I'm still confused about how
> this is handled today and why moving cleanup to a separate auxiliary
> process makes matters worse.

Right now cleanup happens every checkpoint. So cleanup can't be deferred all
that far. We currently include a bunch of 32bit xids inside checkpoints, so
if they're rarer than 2^31-1, we're in trouble independent of logical
decoding.

But with this patch cleanup of logical decoding mapping files (and other
pieces) can be *indefinitely* deferred, without being noticeable.


One possible way to improve this would be to switch the on-disk filenames to
be based on 64bit xids. But that might also present some problems (file name
length, cost of converting 32bit xids to 64bit xids).


> I've done quite a bit of reading, and I haven't found anything that seems
> intended to prevent this problem.  Do you have any pointers?

I don't know if we have an iron-clad enforcement of checkpoints happening
every 2^31-1 xids. It's very unlikely to happen - you'd run out of space
etc. But it'd be good to have something better than that.

Greetings,

Andres Freund



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Thu, Feb 17, 2022 at 11:27:09AM -0800, Andres Freund wrote:
> On 2022-02-17 10:23:37 -0800, Nathan Bossart wrote:
>> On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote:
>> > They're accessed by xid. The LSN is just for cleanup. Accessing files
>> > left over from a previous transaction with the same xid wouldn't be
>> > good - we'd read wrong catalog state for decoding...
>> 
>> Okay, that part makes sense to me.  However, I'm still confused about how
>> this is handled today and why moving cleanup to a separate auxiliary
>> process makes matters worse.
> 
> Right now cleanup happens every checkpoint. So cleanup can't be deferred all
> that far. We currently include a bunch of 32bit xids inside checkpoints, so
> if they're rarer than 2^31-1, we're in trouble independent of logical
> decoding.
> 
> But with this patch cleanup of logical decoding mapping files (and other
> pieces) can be *indefinitely* deferred, without being noticeable.

I see.  The custodian should ordinarily remove the files as quickly as
possible.  In fact, I bet it will typically line up with checkpoints for
most users, as the checkpointer will set the latch.  However, if there are
many temporary files to clean up, removing the logical decoding files could
be delayed for some time, as you said.

> One possible way to improve this would be to switch the on-disk filenames to
> be based on 64bit xids. But that might also present some problems (file name
> length, cost of converting 32bit xids to 64bit xids).

Okay.

>> I've done quite a bit of reading, and I haven't found anything that seems
>> intended to prevent this problem.  Do you have any pointers?
> 
> I don't know if we have an iron-clad enforcement of checkpoints happening
> every 2^31-1 xids. It's very unlikely to happen - you'd run out of space
> etc. But it'd be good to have something better than that.

Okay.  So IIUC the problem might already exist today, but offloading these
tasks to a separate process could make it more likely.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Andres Freund
Date:
Hi,

On 2022-02-17 13:00:22 -0800, Nathan Bossart wrote:
> Okay.  So IIUC the problem might already exist today, but offloading these
> tasks to a separate process could make it more likely.

Vastly more, yes. Before, checkpoints not happening would be a (though not a
great) form of backpressure. You can't cancel them without triggering a
crash-restart, whereas the custodian can be cancelled etc.


As I said before, I think this is tackling things from the wrong end. Instead
of moving the sometimes expensive task out of the way, but still expensive,
the focus should be to make the expensive task cheaper.

As far as I understand, the primary concern are logical decoding serialized
snapshots, because a lot of them can accumulate if there e.g. is an old unused
/ far behind slot. It should be easy to reduce the number of those snapshots
by e.g. eliding some redundant ones. Perhaps we could also make backends in
logical decoding occasionally do a bit of cleanup themselves.

I've not seen reports of the number of mapping files being a real issue?


The improvements around deleting temporary files and serialized snapshots
afaict don't require a dedicated process - they're only relevant during
startup. We could use the approach of renaming the directory out of the way as
done in this patchset but perform the cleanup in the startup process after
we're up.

Greetings,

Andres Freund



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Thu, Feb 17, 2022 at 02:28:29PM -0800, Andres Freund wrote:
> As far as I understand, the primary concern are logical decoding serialized
> snapshots, because a lot of them can accumulate if there e.g. is an old unused
> / far behind slot. It should be easy to reduce the number of those snapshots
> by e.g. eliding some redundant ones. Perhaps we could also make backends in
> logical decoding occasionally do a bit of cleanup themselves.
> 
> I've not seen reports of the number of mapping files being a real issue?

I routinely see all four of these tasks impacting customers, but I'd say
the most common one is the temporary file cleanup.  Besides eliminating
some redundant files and having backends perform some cleanup, what do you
think about skipping the logical decoding cleanup during
end-of-recovery/shutdown checkpoints?  This was something that Bharath
brought up a while back [0].  As I noted in that thread, startup and
shutdown could still take a while if checkpoints are regularly delayed due
to logical decoding cleanup, but that might still help avoid a bit of
downtime.

> The improvements around deleting temporary files and serialized snapshots
> afaict don't require a dedicated process - they're only relevant during
> startup. We could use the approach of renaming the directory out of the way as
> done in this patchset but perform the cleanup in the startup process after
> we're up.

Perhaps this is a good place to start.  As I mentioned above, IME the
temporary file cleanup is the most common problem, so I think even getting
that one fixed would be a huge improvement.

[0] https://postgr.es/m/CALj2ACXkkSL8EBpR7m%3DMt%3DyRGBhevcCs3x4fsp3Bc-D13yyHOg%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Andres Freund
Date:
Hi,

On 2022-02-17 14:58:38 -0800, Nathan Bossart wrote:
> On Thu, Feb 17, 2022 at 02:28:29PM -0800, Andres Freund wrote:
> > As far as I understand, the primary concern are logical decoding serialized
> > snapshots, because a lot of them can accumulate if there e.g. is an old unused
> > / far behind slot. It should be easy to reduce the number of those snapshots
> > by e.g. eliding some redundant ones. Perhaps we could also make backends in
> > logical decoding occasionally do a bit of cleanup themselves.
> > 
> > I've not seen reports of the number of mapping files being a real issue?
> 
> I routinely see all four of these tasks impacting customers, but I'd say
> the most common one is the temporary file cleanup.

I took temp file cleanup and StartupReorderBuffer() "out of consideration" for
custodian, because they're not needed during normal running.


> Besides eliminating some redundant files and having backends perform some
> cleanup, what do you think about skipping the logical decoding cleanup
> during end-of-recovery/shutdown checkpoints?

I strongly disagree with it. Then you might never get the cleanup done, but
keep on operating until you hit corruption issues.


> > The improvements around deleting temporary files and serialized snapshots
> > afaict don't require a dedicated process - they're only relevant during
> > startup. We could use the approach of renaming the directory out of the way as
> > done in this patchset but perform the cleanup in the startup process after
> > we're up.
> 
> Perhaps this is a good place to start.  As I mentioned above, IME the
> temporary file cleanup is the most common problem, so I think even getting
> that one fixed would be a huge improvement.

Cool.

Greetings,

Andres Freund



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Thu, Feb 17, 2022 at 03:12:47PM -0800, Andres Freund wrote:
>> > The improvements around deleting temporary files and serialized snapshots
>> > afaict don't require a dedicated process - they're only relevant during
>> > startup. We could use the approach of renaming the directory out of the way as
>> > done in this patchset but perform the cleanup in the startup process after
>> > we're up.
>> 
>> Perhaps this is a good place to start.  As I mentioned above, IME the
>> temporary file cleanup is the most common problem, so I think even getting
>> that one fixed would be a huge improvement.
> 
> Cool.

Hm.  How should this work for standbys?  I can think of the following
options:
    1. Do temporary file cleanup in the postmaster (as it does today).
    2. Pause after allowing connections to clean up temporary files.
    3. Do small amounts of temporary file cleanup whenever there is an
       opportunity during recovery.
    4. Wait until recovery completes before cleaning up temporary files.

I'm not too thrilled about any of these options.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
> On Thu, Feb 17, 2022 at 03:12:47PM -0800, Andres Freund wrote:
>>> > The improvements around deleting temporary files and serialized snapshots
>>> > afaict don't require a dedicated process - they're only relevant during
>>> > startup. We could use the approach of renaming the directory out of the way as
>>> > done in this patchset but perform the cleanup in the startup process after
>>> > we're up.

BTW I know you don't like the dedicated process approach, but one
improvement to that approach could be to shut down the custodian process
when it has nothing to do.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
It seems unlikely that anything discussed in this thread will be committed
for v15, so I've adjusted the commitfest entry to v16 and moved it to the
next commitfest.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Simon Riggs
Date:
On Fri, 18 Feb 2022 at 20:51, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> > On Thu, Feb 17, 2022 at 03:12:47PM -0800, Andres Freund wrote:
> >>> > The improvements around deleting temporary files and serialized snapshots
> >>> > afaict don't require a dedicated process - they're only relevant during
> >>> > startup. We could use the approach of renaming the directory out of the way as
> >>> > done in this patchset but perform the cleanup in the startup process after
> >>> > we're up.
>
> BTW I know you don't like the dedicated process approach, but one
> improvement to that approach could be to shut down the custodian process
> when it has nothing to do.

Having a central cleanup process makes a lot of sense. There is a long
list of potential tasks for such a process. My understanding is that
autovacuum already has an interface for handling additional workload
types, which is how BRIN indexes are handled. Do we really need a new
process? If so, let's do this now.

Nathan's point that certain tasks are blocking fast startup is a good
one and higher availability is a critical end goal. The thought that
we should complete these tasks during checkpoint is a good one, but
checkpoints should NOT be delayed by long running tasks that are
secondary to availability.

Andres' point that it would be better to avoid long running tasks is
good, if that is possible. That can be done better over time. This
point does not block the higher level goal of better availability
asap, so I support Nathan's overall proposals.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: O(n) tasks cause lengthy startups and checkpoints

From
Robert Haas
Date:
On Thu, Jun 23, 2022 at 7:58 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> Having a central cleanup process makes a lot of sense. There is a long
> list of potential tasks for such a process. My understanding is that
> autovacuum already has an interface for handling additional workload
> types, which is how BRIN indexes are handled. Do we really need a new
> process?

It seems to me that if there's a long list of possible tasks for such
a process, that's actually a trickier situation than if there were
only one or two, because it may happen that when task X is really
urgent, the process is already busy with task Y.

I don't think that piggybacking more stuff onto autovacuum is a very
good idea for this exact reason. We already know that autovacuum
workers can get so busy that they can't keep up with the need to
vacuum and analyze tables. If we give them more things to do, that
figures to make it worse, at least on busy systems.

I do agree that a general mechanism for getting cleanup tasks done in
the background could be a useful thing to have, but I feel like it's
hard to see exactly how to make it work well. We can't just allow it
to spin up a million new processes, but at the same time, if it can't
guarantee that time-critical tasks get performed relatively quickly,
it's pretty worthless.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Simon Riggs
Date:
On Thu, 23 Jun 2022 at 14:46, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 23, 2022 at 7:58 AM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> > Having a central cleanup process makes a lot of sense. There is a long
> > list of potential tasks for such a process. My understanding is that
> > autovacuum already has an interface for handling additional workload
> > types, which is how BRIN indexes are handled. Do we really need a new
> > process?
>
> It seems to me that if there's a long list of possible tasks for such
> a process, that's actually a trickier situation than if there were
> only one or two, because it may happen that when task X is really
> urgent, the process is already busy with task Y.
>
> I don't think that piggybacking more stuff onto autovacuum is a very
> good idea for this exact reason. We already know that autovacuum
> workers can get so busy that they can't keep up with the need to
> vacuum and analyze tables. If we give them more things to do, that
> figures to make it worse, at least on busy systems.
>
> I do agree that a general mechanism for getting cleanup tasks done in
> the background could be a useful thing to have, but I feel like it's
> hard to see exactly how to make it work well. We can't just allow it
> to spin up a million new processes, but at the same time, if it can't
> guarantee that time-critical tasks get performed relatively quickly,
> it's pretty worthless.

Most of the tasks mentioned aren't time critical.

I have no objection to a new auxiliary process to execute those tasks,
which can be spawned when needed.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Thu, Jun 23, 2022 at 09:46:28AM -0400, Robert Haas wrote:
> I do agree that a general mechanism for getting cleanup tasks done in
> the background could be a useful thing to have, but I feel like it's
> hard to see exactly how to make it work well. We can't just allow it
> to spin up a million new processes, but at the same time, if it can't
> guarantee that time-critical tasks get performed relatively quickly,
> it's pretty worthless.

My intent with this new auxiliary process is to offload tasks that aren't
particularly time-critical.  They are only time-critical in the sense that
1) you might eventually run out of space and 2) you might encounter
wraparound with the logical replication files.  But AFAICT these same risks
exist today in the checkpointer approach, although maybe not to the same
extent.  In any case, (2) seems solvable to me outside of this patch set.

I'm grateful for the discussion in this thread so far, but I'm not seeing a
clear path forward.  I'm glad to see threads like the one to stop doing
end-of-recovery checkpoints [0], but I don't know if it will be possible to
solve all of these availability concerns in a piecemeal fashion.  I remain
open to exploring other suggested approaches beyond creating a new
auxiliary process.

[0] https://postgr.es/m/CA%2BTgmobrM2jvkiccCS9NgFcdjNSgAvk1qcAPx5S6F%2BoJT3D2mQ%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Simon Riggs
Date:
On Thu, 23 Jun 2022 at 18:15, Nathan Bossart <nathandbossart@gmail.com> wrote:

> I'm grateful for the discussion in this thread so far, but I'm not seeing a
> clear path forward.

+1 to add the new auxiliary process.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Fri, Jun 24, 2022 at 11:45:22AM +0100, Simon Riggs wrote:
> On Thu, 23 Jun 2022 at 18:15, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> I'm grateful for the discussion in this thread so far, but I'm not seeing a
>> clear path forward.
> 
> +1 to add the new auxiliary process.

I went ahead and put together a new patch set for this in which I've
attempted to address most of the feedback from upthread.  Notably, I've
abandoned 0007 and 0008, added a way for processes to request specific
tasks for the custodian, and removed all the checks for
ShutdownRequestPending.

I haven't addressed the existing transaction ID wraparound risk with the
logical replication files.  My instinct is that this deserveѕ its own
thread, and it might need to be considered a prerequisite to this change
based on the prior discussion here.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Andres Freund
Date:
Hi,

On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
> +        /* Obtain requested tasks */
> +        SpinLockAcquire(&CustodianShmem->cust_lck);
> +        flags = CustodianShmem->cust_flags;
> +        CustodianShmem->cust_flags = 0;
> +        SpinLockRelease(&CustodianShmem->cust_lck);

Just resetting the flags to 0 is problematic. Consider what happens if there are
two tasks and the one processed first errors out. You'll lose information
about needing to run the second task.


> +        /* TODO: offloaded tasks go here */

Seems we're going to need some sorting of which tasks are most "urgent" / need
to be processed next if we plan to make this into some generic facility.


> +/*
> + * RequestCustodian
> + *        Called to request a custodian task.
> + */
> +void
> +RequestCustodian(int flags)
> +{
> +    SpinLockAcquire(&CustodianShmem->cust_lck);
> +    CustodianShmem->cust_flags |= flags;
> +    SpinLockRelease(&CustodianShmem->cust_lck);
> +
> +    if (ProcGlobal->custodianLatch)
> +        SetLatch(ProcGlobal->custodianLatch);
> +}

With this representation we can't really implement waiting for a task or
such. And it doesn't seem like a great API for the caller to just specify a
mix of flags.


> +        /* Calculate how long to sleep */
> +        end_time = (pg_time_t) time(NULL);
> +        elapsed_secs = end_time - start_time;
> +        if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
> +            continue;            /* no sleep for us */
> +        cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
> +
> +        (void) WaitLatch(MyLatch,
> +                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> +                         cur_timeout * 1000L /* convert to ms */ ,
> +                         WAIT_EVENT_CUSTODIAN_MAIN);
> +    }

I don't think we should have this thing wake up on a regular basis. We're
doing way too much of that already, and I don't think we should add
more. Either we need a list of times when tasks need to be processed and wake
up at that time, or just wake up if somebody requests a task.


> From 5e95666efa31d6c8aa351e430c37ead6e27acb72 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn@amazon.com>
> Date: Sun, 5 Dec 2021 21:16:44 -0800
> Subject: [PATCH v6 3/6] Split pgsql_tmp cleanup into two stages.
> 
> First, pgsql_tmp directories will be renamed to stage them for
> removal.  Then, all files in pgsql_tmp are removed before removing
> the staged directories themselves.  This change is being made in
> preparation for a follow-up change to offload most temporary file
> cleanup to the new custodian process.

> Note that temporary relation files cannot be cleaned up via the
> aforementioned strategy and will not be offloaded to the custodian.

> ---
>  src/backend/postmaster/postmaster.c |   8 +-
>  src/backend/storage/file/fd.c       | 174 ++++++++++++++++++++++++----
>  src/include/storage/fd.h            |   2 +-
>  3 files changed, 160 insertions(+), 24 deletions(-)
> 
> diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
> index e67370012f..82aa0c6307 100644
> --- a/src/backend/postmaster/postmaster.c
> +++ b/src/backend/postmaster/postmaster.c
> @@ -1402,7 +1402,8 @@ PostmasterMain(int argc, char *argv[])
>       * Remove old temporary files.  At this point there can be no other
>       * Postgres processes running in this directory, so this should be safe.
>       */
> -    RemovePgTempFiles();
> +    RemovePgTempFiles(true, true);
> +    RemovePgTempFiles(false, false);

This is imo hard to read and easy to get wrong. Make it multiple functions or
pass named flags in.


> + * StagePgTempDirForRemoval
> + *
> + * This function renames the given directory with a special prefix that
> + * RemoveStagedPgTempDirs() will know to look for.  An integer is appended to
> + * the end of the new directory name in case previously staged pgsql_tmp
> + * directories have not yet been removed.
> + */

It doesn't seem great to need to iterate through a directory that contains
other files, potentially a significant number. How about having a
staged_for_removal/ directory, and then only scanning that?


> +static void
> +StagePgTempDirForRemoval(const char *tmp_dir)
> +{
> +    DIR           *dir;
> +    char        stage_path[MAXPGPATH * 2];
> +    char        parent_path[MAXPGPATH * 2];
> +    struct stat statbuf;
> +
> +    /*
> +     * If tmp_dir doesn't exist, there is nothing to stage.
> +     */
> +    dir = AllocateDir(tmp_dir);
> +    if (dir == NULL)
> +    {
> +        if (errno != ENOENT)
> +            ereport(LOG,
> +                    (errcode_for_file_access(),
> +                     errmsg("could not open directory \"%s\": %m", tmp_dir)));
> +        return;
> +    }
> +    FreeDir(dir);
> +
> +    strlcpy(parent_path, tmp_dir, MAXPGPATH * 2);
> +    get_parent_directory(parent_path);
> +
> +    /*
> +     * get_parent_directory() returns an empty string if the input argument is
> +     * just a file name (see comments in path.c), so handle that as being the
> +     * current directory.
> +     */
> +    if (strlen(parent_path) == 0)
> +        strlcpy(parent_path, ".", MAXPGPATH * 2);
> +
> +    /*
> +     * Find a name for the stage directory.  We just increment an integer at the
> +     * end of the name until we find one that doesn't exist.
> +     */
> +    for (int n = 0; n <= INT_MAX; n++)
> +    {
> +        snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
> +                 PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
> +
> +        if (stat(stage_path, &statbuf) != 0)
> +        {
> +            if (errno == ENOENT)
> +                break;
> +
> +            ereport(LOG,
> +                    (errcode_for_file_access(),
> +                     errmsg("could not stat file \"%s\": %m", stage_path)));
> +            return;
> +        }
> +
> +        stage_path[0] = '\0';

I still dislike this approach. Loops until INT_MAX, not interruptible... Can't
we prevent conflicts by adding a timestamp or such?


> +    }
> +
> +    /*
> +     * In the unlikely event that we couldn't find a name for the stage
> +     * directory, bail out.
> +     */
> +    if (stage_path[0] == '\0')
> +    {
> +        ereport(LOG,
> +                (errmsg("could not stage \"%s\" for deletion",
> +                        tmp_dir)));
> +        return;
> +    }

That's imo very much not ok. Just continuing in unexpected situations is a
recipe for introducing bugs / being hard to debug.


> From 43042799b96b588a446c509637b5acf570e2a325 Mon Sep 17 00:00:00 2001

> From a58a6bb70785a557a150680b64cd8ce78ce1b73a Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn@amazon.com>
> Date: Sun, 5 Dec 2021 22:02:40 -0800
> Subject: [PATCH v6 5/6] Move removal of old serialized snapshots to custodian.
> 
> This was only done during checkpoints because it was a convenient
> place to put it.

As mentioned before, having it done as part of checkpoints provides pretty
decent wraparound protection - yes, it's not theoretically perfect, but in
reality it's very unlikely you can have an xid wraparound within one
checkpoint. I've mentioned this before, so at the very least I'd like to see
this acknowledged in the commit message.


> However, if there are many snapshots to remove, it can significantly extend
> checkpoint time.

I'd really like to see a reproducer or profile for this...


> +        /*
> +         * Remove serialized snapshots that are no longer required by any
> +         * logical replication slot.
> +         *
> +         * It is not important for these to be removed in single-user mode, so
> +         * we don't need any extra handling outside of the custodian process for
> +         * this.
> +         */

I don't think this claim is correct.



> From 0add8bb19a4ee83c6a6ec1f313329d737bf304a5 Mon Sep 17 00:00:00 2001
> From: Nathan Bossart <bossartn@amazon.com>
> Date: Sun, 12 Dec 2021 22:07:11 -0800
> Subject: [PATCH v6 6/6] Move removal of old logical rewrite mapping files to
>  custodian.
> 
> If there are many such files to remove, checkpoints can take much
> longer.  To avoid this, move this work to the newly-introduced
> custodian process.

As above I'd like to know why this could take that long. What are you doing
that there's so many mapping files (which only exist for catalog tables!) that
this is a significant fraction of a checkpoint?


> ---
>  src/backend/access/heap/rewriteheap.c | 79 +++++++++++++++++++++++----
>  src/backend/postmaster/custodian.c    | 44 +++++++++++++++
>  src/include/access/rewriteheap.h      |  1 +
>  src/include/postmaster/custodian.h    |  5 ++
>  4 files changed, 119 insertions(+), 10 deletions(-)
> 
> diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
> index 2a53826736..edeab65e60 100644
> --- a/src/backend/access/heap/rewriteheap.c
> +++ b/src/backend/access/heap/rewriteheap.c
> @@ -116,6 +116,7 @@
>  #include "lib/ilist.h"
>  #include "miscadmin.h"
>  #include "pgstat.h"
> +#include "postmaster/custodian.h"
>  #include "replication/logical.h"
>  #include "replication/slot.h"
>  #include "storage/bufmgr.h"
> @@ -1182,7 +1183,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
>   * Perform a checkpoint for logical rewrite mappings
>   *
>   * This serves two tasks:
> - * 1) Remove all mappings not needed anymore based on the logical restart LSN
> + * 1) Alert the custodian to remove all mappings not needed anymore based on the
> + *    logical restart LSN
>   * 2) Flush all remaining mappings to disk, so that replay after a checkpoint
>   *      only has to deal with the parts of a mapping that have been written out
>   *      after the checkpoint started.
> @@ -1210,6 +1212,10 @@ CheckPointLogicalRewriteHeap(void)
>      if (cutoff != InvalidXLogRecPtr && redo < cutoff)
>          cutoff = redo;
>  
> +    /* let the custodian know what it can remove */
> +    CustodianSetLogicalRewriteCutoff(cutoff);

Setting this variable in a custodian datastructure and then fetching it from
there seems architecturally wrong to me.


> +    RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS);

What about single user mode?


ISTM that RequestCustodian() needs to either assert out if called in single
user mode, or execute tasks immediately in that context.


> +
> +/*
> + * Remove all mappings not needed anymore based on the logical restart LSN saved
> + * by the checkpointer.  We use this saved value instead of calling
> + * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
> + * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
> + * disk.
> + */

What interference could there be?


> +void
> +RemoveOldLogicalRewriteMappings(void)
> +{
> +    XLogRecPtr    cutoff;
> +    DIR           *mappings_dir;
> +    struct dirent *mapping_de;
> +    char        path[MAXPGPATH + 20];
> +    bool        value_set = false;
> +
> +    cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
> +    if (!value_set)
> +        return;

Afaics nothing clears values_set - is that a good idea?

Greetings,

Andres Freund



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
Hi Andres,

Thanks for the prompt review.

On Sat, Jul 02, 2022 at 03:54:56PM -0700, Andres Freund wrote:
> On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
>> +        /* Obtain requested tasks */
>> +        SpinLockAcquire(&CustodianShmem->cust_lck);
>> +        flags = CustodianShmem->cust_flags;
>> +        CustodianShmem->cust_flags = 0;
>> +        SpinLockRelease(&CustodianShmem->cust_lck);
> 
> Just resetting the flags to 0 is problematic. Consider what happens if there
> are two tasks and the one processed first errors out. You'll lose information
> about needing to run the second task.

I think we also want to retry any failed tasks.  The way v6 handles this is
by requesting all tasks again after an exception.  Another way to handle this
could be to reset each individual flag before its task is executed, and then
surround each task with a PG_CATCH block that sets the flag again on failure.
I'll do it this way in the next revision.
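A minimal standalone sketch of that clear-then-restore scheme (the flag names and the runner are illustrative, not taken from the patch; the `run_task` failure return plays the role of PG_CATCH catching an ERROR):

```c
/*
 * Hypothetical task flags, standing in for the patch's CUSTODIAN_* bits.
 */
#define TASK_REMOVE_TEMP_FILES (1 << 0)
#define TASK_REMOVE_SNAPSHOTS  (1 << 1)

/* Example runner: pretend snapshot removal errors out. */
static int
example_run_task(int flag)
{
    return flag != TASK_REMOVE_SNAPSHOTS;
}

/*
 * Clear each flag just before its task runs; if the task fails (the
 * moral equivalent of PG_CATCH catching an ERROR), set the flag again
 * so the request isn't lost.  Returns the flags still pending.
 */
static int
process_tasks_once(int pending, int (*run_task)(int))
{
    int still_pending = 0;

    while (pending != 0)
    {
        int flag = pending & -pending;  /* lowest set bit */

        pending &= ~flag;               /* reset before running */
        if (!run_task(flag))
            still_pending |= flag;      /* failure: request survives */
    }
    return still_pending;
}
```

With this shape, an ERROR in one task no longer discards the other task's request, though (as discussed below) immediate retry of the failed task is a separate question.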

>> +/*
>> + * RequestCustodian
>> + *        Called to request a custodian task.
>> + */
>> +void
>> +RequestCustodian(int flags)
>> +{
>> +    SpinLockAcquire(&CustodianShmem->cust_lck);
>> +    CustodianShmem->cust_flags |= flags;
>> +    SpinLockRelease(&CustodianShmem->cust_lck);
>> +
>> +    if (ProcGlobal->custodianLatch)
>> +        SetLatch(ProcGlobal->custodianLatch);
>> +}
> 
> With this representation we can't really implement waiting for a task or
> such. And it doesn't seem like a great API for the caller to just specify a
> mix of flags.

At the moment, the idea is that nothing should need to wait for a task
because the custodian only handles things that are relatively non-critical.
If that changes, this could probably be expanded to look more like
RequestCheckpoint().

What would you suggest using instead of a mix of flags?

>> +        /* Calculate how long to sleep */
>> +        end_time = (pg_time_t) time(NULL);
>> +        elapsed_secs = end_time - start_time;
>> +        if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
>> +            continue;            /* no sleep for us */
>> +        cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
>> +
>> +        (void) WaitLatch(MyLatch,
>> +                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>> +                         cur_timeout * 1000L /* convert to ms */ ,
>> +                         WAIT_EVENT_CUSTODIAN_MAIN);
>> +    }
> 
> I don't think we should have this thing wake up on a regular basis. We're
> doing way too much of that already, and I don't think we should add
> more. Either we need a list of times when tasks need to be processed and wake
> up at that time, or just wake up if somebody requests a task.

I agree.  I will remove the timeout in the next revision.

>> diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
>> index e67370012f..82aa0c6307 100644
>> --- a/src/backend/postmaster/postmaster.c
>> +++ b/src/backend/postmaster/postmaster.c
>> @@ -1402,7 +1402,8 @@ PostmasterMain(int argc, char *argv[])
>>       * Remove old temporary files.  At this point there can be no other
>>       * Postgres processes running in this directory, so this should be safe.
>>       */
>> -    RemovePgTempFiles();
>> +    RemovePgTempFiles(true, true);
>> +    RemovePgTempFiles(false, false);
> 
> This is imo hard to read and easy to get wrong. Make it multiple functions or
> pass named flags in.

Will do.

>> + * StagePgTempDirForRemoval
>> + *
>> + * This function renames the given directory with a special prefix that
>> + * RemoveStagedPgTempDirs() will know to look for.  An integer is appended to
>> + * the end of the new directory name in case previously staged pgsql_tmp
>> + * directories have not yet been removed.
>> + */
> 
> It doesn't seem great to need to iterate through a directory that contains
> other files, potentially a significant number. How about having a
> staged_for_removal/ directory, and then only scanning that?

Yeah, that seems like a good idea.  Will do.

>> +    /*
>> +     * Find a name for the stage directory.  We just increment an integer at the
>> +     * end of the name until we find one that doesn't exist.
>> +     */
>> +    for (int n = 0; n <= INT_MAX; n++)
>> +    {
>> +        snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
>> +                 PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
>> +
>> +        if (stat(stage_path, &statbuf) != 0)
>> +        {
>> +            if (errno == ENOENT)
>> +                break;
>> +
>> +            ereport(LOG,
>> +                    (errcode_for_file_access(),
>> +                     errmsg("could not stat file \"%s\": %m", stage_path)));
>> +            return;
>> +        }
>> +
>> +        stage_path[0] = '\0';
> 
> I still dislike this approach. Loops until INT_MAX, not interruptible... Can't
> we prevent conflicts by adding a timestamp or such?

I suppose it's highly unlikely that we'd see a conflict if we used the
timestamp instead.  I'll do it this way in the next revision if that seems
good enough.
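A standalone sketch of what the timestamp-based staging name could look like (the prefix and helper name are made up for illustration; the real naming happens in fd.c):

```c
#include <stdio.h>

/* Assumed prefix, for illustration only. */
#define STAGE_PREFIX "pgsql_tmp_staged_"

/*
 * Name the stage directory after an epoch timestamp instead of probing
 * integer suffixes with stat() in an uninterruptible loop.  A collision
 * now requires two staging attempts within the same second, which a
 * single retry (e.g. bumping the timestamp) could resolve.
 */
static void
build_stage_path(char *dst, size_t dstlen, const char *parent,
                 long long stage_time)
{
    snprintf(dst, dstlen, "%s/%s%lld", parent, STAGE_PREFIX, stage_time);
}
```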

>> From a58a6bb70785a557a150680b64cd8ce78ce1b73a Mon Sep 17 00:00:00 2001
>> From: Nathan Bossart <bossartn@amazon.com>
>> Date: Sun, 5 Dec 2021 22:02:40 -0800
>> Subject: [PATCH v6 5/6] Move removal of old serialized snapshots to custodian.
>> 
>> This was only done during checkpoints because it was a convenient
>> place to put it.
> 
> As mentioned before, having it done as part of checkpoints provides pretty
> decent wraparound protection - yes, it's not theoretically perfect, but in
> reality it's very unlikely you can have an xid wraparound within one
> checkpoint. I've mentioned this before, so at the very least I'd like to see
> this acknowledged in the commit message.

Will do.

>> +    /* let the custodian know what it can remove */
>> +    CustodianSetLogicalRewriteCutoff(cutoff);
> 
> Setting this variable in a custodian datastructure and then fetching it from
> there seems architecturally wrong to me.

Where do you think it should go?  I previously had it in the checkpointer's
shared memory, but you didn't like that the functions were declared in
bgwriter.h (along with the other checkpoint stuff).  If the checkpointer
shared memory is the right place, should we create checkpointer.h and use
that instead?

>> +    RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS);
> 
> What about single user mode?
> 
> 
> ISTM that RequestCustodian() needs to either assert out if called in single
> user mode, or execute tasks immediately in that context.

I like the idea of executing the tasks immediately since that's what
happens today in single-user mode.  I will try doing it that way.
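A toy model of that behavior (the globals stand in for IsPostmasterEnvironment and the custodian's shared memory; all names are illustrative):

```c
#include <stdbool.h>

static bool is_postmaster_env = true;
static int  queued_flags;
static int  executed_flags;

static void
execute_custodian_tasks(int flags)
{
    executed_flags |= flags;    /* pretend the work happened here */
}

/*
 * Request a task; with no custodian process around (single-user mode),
 * just do the work inline, mirroring what effectively happens today.
 */
static void
request_custodian(int flags)
{
    if (!is_postmaster_env)
        execute_custodian_tasks(flags);
    else
        queued_flags |= flags;  /* real code would also set the latch */
}
```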

>> +/*
>> + * Remove all mappings not needed anymore based on the logical restart LSN saved
>> + * by the checkpointer.  We use this saved value instead of calling
>> + * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
>> + * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
>> + * disk.
>> + */
> 
> What interference could there be?

My concern is that the custodian could obtain a later cutoff than what the
checkpointer does, which might cause files to be concurrently unlinked and
fsync'd.  If we always use the checkpointer's cutoff, that shouldn't be a
problem.  This could probably be better explained in this comment.

>> +void
>> +RemoveOldLogicalRewriteMappings(void)
>> +{
>> +    XLogRecPtr    cutoff;
>> +    DIR           *mappings_dir;
>> +    struct dirent *mapping_de;
>> +    char        path[MAXPGPATH + 20];
>> +    bool        value_set = false;
>> +
>> +    cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
>> +    if (!value_set)
>> +        return;
> 
> Afaics nothing clears values_set - is that a good idea?

I'm using value_set to differentiate the case where InvalidXLogRecPtr means
the checkpointer hasn't determined a value yet versus the case where it
has.  In the former, we don't want to take any action.  In the latter, we
want to unlink all the files.  Since we're moving to a request model for
the custodian, I might be able to remove this value_set stuff completely.
If that's not possible, it probably deserves a better comment.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Andres Freund
Date:
Hi,

On 2022-07-03 10:07:54 -0700, Nathan Bossart wrote:
> Thanks for the prompt review.
> 
> On Sat, Jul 02, 2022 at 03:54:56PM -0700, Andres Freund wrote:
> > On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
> >> +        /* Obtain requested tasks */
> >> +        SpinLockAcquire(&CustodianShmem->cust_lck);
> >> +        flags = CustodianShmem->cust_flags;
> >> +        CustodianShmem->cust_flags = 0;
> >> +        SpinLockRelease(&CustodianShmem->cust_lck);
> > 
> > Just resetting the flags to 0 is problematic. Consider what happens if there
> > are two tasks and the one processed first errors out. You'll lose information
> > about needing to run the second task.
> 
> I think we also want to retry any failed tasks.

I don't think so, at least not if it's just going to retry that task straight
away - then we'll get stuck on that one task forever. If we had the ability to
"queue" it at the end, to be processed after other already-dequeued tasks, it'd
be a different story.


> The way v6 handles this is by requesting all tasks again after an exception.

Ick. That strikes me as a bad idea.


> >> +/*
> >> + * RequestCustodian
> >> + *        Called to request a custodian task.
> >> + */
> >> +void
> >> +RequestCustodian(int flags)
> >> +{
> >> +    SpinLockAcquire(&CustodianShmem->cust_lck);
> >> +    CustodianShmem->cust_flags |= flags;
> >> +    SpinLockRelease(&CustodianShmem->cust_lck);
> >> +
> >> +    if (ProcGlobal->custodianLatch)
> >> +        SetLatch(ProcGlobal->custodianLatch);
> >> +}
> > 
> > With this representation we can't really implement waiting for a task or
> > such. And it doesn't seem like a great API for the caller to just specify a
> > mix of flags.
> 
> At the moment, the idea is that nothing should need to wait for a task
> because the custodian only handles things that are relatively non-critical.

Which is just plainly not true as the patchset stands...

I think we're going to have to block if some cleanup as part of a checkpoint
hasn't been completed by the next checkpoint - otherwise it'll just end up
being way too confusing and there's absolutely no backpressure anymore.


> If that changes, this could probably be expanded to look more like
> RequestCheckpoint().
> 
> What would you suggest using instead of a mix of flags?

I suspect an array of tasks with requested and completed counters or such?
With a condition variable to wait on?
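A rough standalone sketch of that bookkeeping, with the condition variable elided and illustrative names:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Per-task counters: a requester bumps "requested" and gets a ticket it
 * can later wait on (sleeping on a condition variable in the real thing)
 * until "completed" catches up.
 */
typedef struct CustodianTask
{
    uint64_t requested;     /* bumped by RequestCustodian() */
    uint64_t completed;     /* bumped by the custodian per run */
} CustodianTask;

static uint64_t
request_task(CustodianTask *task)
{
    return ++task->requested;
}

static void
complete_task(CustodianTask *task)
{
    task->completed++;      /* real code would broadcast the CV here */
}

static bool
task_done(const CustodianTask *task, uint64_t ticket)
{
    return task->completed >= ticket;
}
```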


> >> +    /* let the custodian know what it can remove */
> >> +    CustodianSetLogicalRewriteCutoff(cutoff);
> > 
> > Setting this variable in a custodian datastructure and then fetching it from
> > there seems architecturally wrong to me.
> 
> Where do you think it should go?  I previously had it in the checkpointer's
> shared memory, but you didn't like that the functions were declared in
> bgwriter.h (along with the other checkpoint stuff).  If the checkpointer
> shared memory is the right place, should we create checkpointer.h and use
> that instead?

Well, so far I have not understood what the whole point of the shared state
is, so I have a bit of a hard time answering this ;)


> >> +/*
> >> + * Remove all mappings not needed anymore based on the logical restart LSN saved
> >> + * by the checkpointer.  We use this saved value instead of calling
> >> + * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
> >> + * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
> >> + * disk.
> >> + */
> > 
> > What interference could there be?
> 
> My concern is that the custodian could obtain a later cutoff than what the
> checkpointer does, which might cause files to be concurrently unlinked and
> fsync'd.  If we always use the checkpointer's cutoff, that shouldn't be a
> problem.  This could probably be better explained in this comment.

How about having a Datum argument to RequestCustodian() that is forwarded to
the task?
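A toy model of such argument forwarding (DemoDatum and the single-slot queue are illustrative stand-ins for Datum and the real request machinery):

```c
#include <stdint.h>

typedef uintptr_t DemoDatum;

static int       queued_task;
static DemoDatum queued_arg;

/*
 * The caller-supplied argument (e.g. the checkpointer's logical-rewrite
 * cutoff) travels with the request instead of living in a
 * custodian-owned shared variable.
 */
static void
request_custodian(int task, DemoDatum arg)
{
    queued_task = task;
    queued_arg = arg;       /* stored alongside the queued task */
}

/* The custodian later hands the saved argument to the task. */
static DemoDatum
fetch_task_arg(void)
{
    return queued_arg;
}
```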


> >> +void
> >> +RemoveOldLogicalRewriteMappings(void)
> >> +{
> >> +    XLogRecPtr    cutoff;
> >> +    DIR           *mappings_dir;
> >> +    struct dirent *mapping_de;
> >> +    char        path[MAXPGPATH + 20];
> >> +    bool        value_set = false;
> >> +
> >> +    cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
> >> +    if (!value_set)
> >> +        return;
> > 
> > Afaics nothing clears values_set - is that a good idea?
> 
> I'm using value_set to differentiate the case where InvalidXLogRecPtr means
> the checkpointer hasn't determined a value yet versus the case where it
> has.  In the former, we don't want to take any action.  In the latter, we
> want to unlink all the files.  Since we're moving to a request model for
> the custodian, I might be able to remove this value_set stuff completely.
> If that's not possible, it probably deserves a better comment.

It would.

Greetings,

Andres Freund



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
Here's a new revision where I've attempted to address all the feedback I've
received thus far.  Notably, the custodian now uses a queue for registering
tasks and determining which tasks to execute.  Other changes include
splitting the temporary file functions apart to avoid consecutive boolean
flags, using a timestamp instead of an integer for the staging name for
temporary directories, moving temporary directories to a dedicated
directory so that the custodian doesn't need to scan relation files,
ERROR-ing when something goes wrong when cleaning up temporary files,
executing requested tasks immediately in single-user mode, and more.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Wed, Jul 06, 2022 at 09:51:10AM -0700, Nathan Bossart wrote:
> Here's a new revision where I've attempted to address all the feedback I've
> received thus far.  Notably, the custodian now uses a queue for registering
> tasks and determining which tasks to execute.  Other changes include
> splitting the temporary file functions apart to avoid consecutive boolean
> flags, using a timestamp instead of an integer for the staging name for
> temporary directories, moving temporary directories to a dedicated
> directory so that the custodian doesn't need to scan relation files,
> ERROR-ing when something goes wrong when cleaning up temporary files,
> executing requested tasks immediately in single-user mode, and more.

Here is a rebased patch set for cfbot.  There are no other differences
between v7 and v8.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Thu, Aug 11, 2022 at 04:09:21PM -0700, Nathan Bossart wrote:
> Here is a rebased patch set for cfbot.  There are no other differences
> between v7 and v8.

Another rebase for cfbot.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Simon Riggs
Date:
On Thu, 24 Nov 2022 at 00:19, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Sun, Nov 06, 2022 at 02:38:42PM -0800, Nathan Bossart wrote:
> > rebased
>
> another rebase for cfbot

0001 seems good to me
* I like that it sleeps forever until requested
* not sure I believe that everything it does can always be aborted out
of and shut down - to achieve that you will need CHECK_FOR_INTERRUPTS()
calls in the loops in patches 5 and 6 at least
* not sure why you want immediate execution of custodian tasks - I
feel supporting two modes will be a lot harder. For me, I would run
locally when !IsUnderPostmaster and also in an Assert build, so we can
test it works right - i.e. running in its own process is just a
production optimization for performance (which is the stated reason
for having this)

0005 seems good from what I know
* There is no check to see if it worked in any sane time
* It seems possible that "Old" might change meaning - will that make
it break/fail?

0006 seems good also
* same comments for 5

Rather than explicitly use DEBUG1 everywhere I would have a
#define CUSTODIAN_LOG_LEVEL     LOG
so we can run with it in LOG mode and then set it to DEBUG1 with a one
line change in a later phase of Beta

I can't really comment with knowledge on sub-patches 0002 to 0004.

Perhaps you should aim to get 1, 5, 6 committed first and then return
to the others in a later CF/separate thread?

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
Thanks for taking a look!

On Thu, Nov 24, 2022 at 05:31:02PM +0000, Simon Riggs wrote:
> * not sure I believe that everything it does can always be aborted out
> of and shut down - to achieve that you will need CHECK_FOR_INTERRUPTS()
> calls in the loops in patches 5 and 6 at least

I did something like this earlier, but was advised to simply let the
functions finish as usual during shutdown [0].  I think this is what the
checkpointer process does today, anyway.

> * not sure why you want immediate execution of custodian tasks - I
> feel supporting two modes will be a lot harder. For me, I would run
> locally when !IsUnderPostmaster and also in an Assert build, so we can
> test it works right - i.e. running in its own process is just a
> production optimization for performance (which is the stated reason
> for having this)

I added this because 0004 involves requesting a task from the postmaster,
so checking for IsUnderPostmaster doesn't work.  Those tasks would always
run immediately.  However, we could use IsPostmasterEnvironment instead,
which would allow us to remove the "immediate" argument.  I did it this way
in v14.

I'm not sure about running locally in Assert builds.  It's true that would
help ensure there's test coverage for the task logic, but it would also
reduce coverage for the custodian logic.  And in general, I'm worried about
having Assert builds use a different code path than production builds.

> 0005 seems good from what I know
> * There is no check to see if it worked in any sane time

What did you have in mind?  Should the custodian begin emitting WARNINGs
after a while?

> * It seems possible that "Old" might change meaning - will that make
> it break/fail?

I don't believe so.

> Rather than explicitly use DEBUG1 everywhere I would have a
> #define CUSTODIAN_LOG_LEVEL     LOG
> so we can run with it in LOG mode and then set it to DEBUG1 with a one
> line change in a later phase of Beta

I can create a separate patch for this, but I don't think I've ever seen
this sort of thing before.  Is the idea just to help with debugging during
the development phase?

> I can't really comment with knowledge on sub-patches 0002 to 0004.
> 
> Perhaps you should aim to get 1, 5, 6 committed first and then return
> to the others in a later CF/separate thread?

That seems like a good idea since those are all relatively self-contained.
I removed 0002-0004 in v14.

[0] https://postgr.es/m/20220217065938.x2esfdppzypegn5j%40alap3.anarazel.de

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Simon Riggs
Date:
On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Thanks for taking a look!
>
> On Thu, Nov 24, 2022 at 05:31:02PM +0000, Simon Riggs wrote:
> > * not sure I believe that everything it does can always be aborted out
> > of and shut down - to achieve that you will need CHECK_FOR_INTERRUPTS()
> > calls in the loops in patches 5 and 6 at least
>
> I did something like this earlier, but was advised to simply let the
> functions finish as usual during shutdown [0].  I think this is what the
> checkpointer process does today, anyway.

If we say "The custodian is not an essential process and can shut down
quickly when requested.", and yet we know it's not true in all cases,
then that will lead to misunderstandings and bugs.

If we perform a restart and the custodian is performing extra work
that delays shutdown, then it also delays restart. Given the title of
the thread, we should be looking to improve that, or at least know it
occurred.

> > * not sure why you want immediate execution of custodian tasks - I
> > feel supporting two modes will be a lot harder. For me, I would run
> > locally when !IsUnderPostmaster and also in an Assert build, so we can
> > test it works right - i.e. running in its own process is just a
> > production optimization for performance (which is the stated reason
> > for having this)
>
> I added this because 0004 involves requesting a task from the postmaster,
> so checking for IsUnderPostmaster doesn't work.  Those tasks would always
> run immediately.  However, we could use IsPostmasterEnvironment instead,
> which would allow us to remove the "immediate" argument.  I did it this way
> in v14.

Thanks

> > 0005 seems good from what I know
> > * There is no check to see if it worked in any sane time
>
> What did you have in mind?  Should the custodian begin emitting WARNINGs
> after a while?

I think it might be useful if it logged anything that took an
"extended period", TBD.

Maybe that is already covered by startup process logging. Please tell
me that still works?

> > Rather than explicitly use DEBUG1 everywhere I would have a
> > #define CUSTODIAN_LOG_LEVEL     LOG
> > so we can run with it in LOG mode and then set it to DEBUG1 with a one
> > line change in a later phase of Beta
>
> I can create a separate patch for this, but I don't think I've ever seen
> this sort of thing before.

Much of recovery is coded that way, for the same reason.

> Is the idea just to help with debugging during
> the development phase?

"Just", yes. Tests would be desirable also, under src/test/modules.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: O(n) tasks cause lengthy startups and checkpoints

From
Andres Freund
Date:
On 2022-11-28 13:08:57 +0000, Simon Riggs wrote:
> On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote:
> > > Rather than explicitly use DEBUG1 everywhere I would have a
> > > #define CUSTODIAN_LOG_LEVEL     LOG
> > > so we can run with it in LOG mode and then set it to DEBUG1 with a one
> > > line change in a later phase of Beta
> >
> > I can create a separate patch for this, but I don't think I've ever seen
> > this sort of thing before.
> 
> Much of recovery is coded that way, for the same reason.

I think that's not a good thing to copy without a lot more justification than
"some old code also does it that way". It's sometimes justified, but it also
makes code harder to read (one doesn't know what it does without looking up
the #define, and it adds to line length).



Re: O(n) tasks cause lengthy startups and checkpoints

From
Robert Haas
Date:
On Mon, Nov 28, 2022 at 1:31 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-11-28 13:08:57 +0000, Simon Riggs wrote:
> > On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote:
> > > > Rather than explicitly use DEBUG1 everywhere I would have an
> > > > #define CUSTODIAN_LOG_LEVEL     LOG
> > > > so we can run with it in LOG mode and then set it to DEBUG1 with a one
> > > > line change in a later phase of Beta
> > >
> > > I can create a separate patch for this, but I don't think I've ever seen
> > > this sort of thing before.
> >
> > Much of recovery is coded that way, for the same reason.
>
> I think that's not a good thing to copy without a lot more justification than
> "some old code also does it that way". It's sometimes justified, but also
> makes code harder to read (one doesn't know what it does without looking up
> the #define, line length).

Yeah. If people need some of the log messages at a higher level during
development, they can patch their own copies.

I think there might be some argument for having a facility that lets
you pick subsystems or even individual messages that you want to trace
and pump up the log level for just those call sites. But I don't know
exactly what that would look like, and I don't think inventing one-off
mechanisms for particular cases is a good idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
Okay, here is a new patch set.  0004 adds logic to prevent custodian tasks
from delaying shutdown.

I haven't added any logging for long-running tasks yet.  Tasks might
ordinarily take a while, so such logs wouldn't necessarily indicate
something is wrong.  Perhaps we could add a GUC for the amount of time to
wait before logging.  This feature would be off by default.  Another option
could be to create a log_custodian GUC that causes tasks to be logged when
completed, similar to log_checkpoints.  Thoughts?

On Mon, Nov 28, 2022 at 01:37:01PM -0500, Robert Haas wrote:
> On Mon, Nov 28, 2022 at 1:31 PM Andres Freund <andres@anarazel.de> wrote:
>> On 2022-11-28 13:08:57 +0000, Simon Riggs wrote:
>> > On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> > > > Rather than explicitly use DEBUG1 everywhere I would have an
>> > > > #define CUSTODIAN_LOG_LEVEL     LOG
>> > > > so we can run with it in LOG mode and then set it to DEBUG1 with a one
>> > > > line change in a later phase of Beta
>> > >
>> > > I can create a separate patch for this, but I don't think I've ever seen
>> > > this sort of thing before.
>> >
>> > Much of recovery is coded that way, for the same reason.
>>
>> I think that's not a good thing to copy without a lot more justification than
>> "some old code also does it that way". It's sometimes justified, but also
>> makes code harder to read (one doesn't know what it does without looking up
>> the #define, line length).
> 
> Yeah. If people need some of the log messages at a higher level during
> development, they can patch their own copies.
> 
> I think there might be some argument for having a facility that lets
> you pick subsystems or even individual messages that you want to trace
> and pump up the log level for just those call sites. But I don't know
> exactly what that would look like, and I don't think inventing one-off
> mechanisms for particular cases is a good idea.

Given this discussion, I haven't made any changes to the logging in the new
patch set.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Simon Riggs
Date:
On Mon, 28 Nov 2022 at 23:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Okay, here is a new patch set.  0004 adds logic to prevent custodian tasks
> from delaying shutdown.

That all seems good, thanks.

The last important point for me is tests, in src/test/modules
probably. It might be possible to reuse the final state of other
modules' tests to test cleanup, or at least integrate a custodian test
into each module.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Tue, Nov 29, 2022 at 12:02:44PM +0000, Simon Riggs wrote:
> The last important point for me is tests, in src/test/modules
> probably. It might be possible to reuse the final state of other
> modules' tests to test cleanup, or at least integrate a custodian test
> into each module.

Of course.  I found some existing tests for the test_decoding plugin that
appear to reliably generate the files we want the custodian to clean up, so
I added them there.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Tue, Nov 29, 2022 at 07:56:53PM -0800, Nathan Bossart wrote:
> On Tue, Nov 29, 2022 at 12:02:44PM +0000, Simon Riggs wrote:
>> The last important point for me is tests, in src/test/modules
>> probably. It might be possible to reuse the final state of other
>> modules' tests to test cleanup, or at least integrate a custodian test
>> into each module.
> 
> Of course.  I found some existing tests for the test_decoding plugin that
> appear to reliably generate the files we want the custodian to clean up, so
> I added them there.

cfbot is not happy with v16.  AFAICT this is just due to poor placement, so
here is another attempt with the tests moved to a new location.  Apologies
for the noise.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Simon Riggs
Date:
On Wed, 30 Nov 2022 at 03:56, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Nov 29, 2022 at 12:02:44PM +0000, Simon Riggs wrote:
> > The last important point for me is tests, in src/test/modules
> > probably. It might be possible to reuse the final state of other
> > modules' tests to test cleanup, or at least integrate a custodian test
> > into each module.
>
> Of course.  I found some existing tests for the test_decoding plugin that
> appear to reliably generate the files we want the custodian to clean up, so
> I added them there.

Thanks for adding the tests; I can see they run clean.

The only minor thing I would personally add is a note in each piece of
code to explain where the tests for it live, and/or something in the
main custodian file that says tests exist under src/test/modules.

Otherwise, ready for committer.

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: O(n) tasks cause lengthy startups and checkpoints

From
Bharath Rupireddy
Date:
On Wed, Nov 30, 2022 at 10:48 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
>
> cfbot is not happy with v16.  AFAICT this is just due to poor placement, so
> here is another attempt with the tests moved to a new location.  Apologies
> for the noise.

Thanks for the patches. I spent some time on reviewing v17 patch set
and here are my comments:

0001:
1. I think the custodian process needs documentation - it needs a
definition in glossary.sgml and perhaps a dedicated page describing
what tasks it takes care of.

2.
+        LWLockReleaseAll();
+        ConditionVariableCancelSleep();
+        AbortBufferIO();
+        UnlockBuffers();
+        ReleaseAuxProcessResources(false);
+        AtEOXact_Buffers(false);
+        AtEOXact_SMgr();
+        AtEOXact_Files(false);
+        AtEOXact_HashTables(false);
Do we need all of these in the exit path? Isn't the stuff that
ShutdownAuxiliaryProcess() does enough for the custodian process?
AFAICS, the custodian process uses LWLocks (which the
ShutdownAuxiliaryProcess() takes care of) and it doesn't access shared
buffers and so on.
Having said that, I'm fine with keeping them for future use, and all of
those cleanup functions return early if nothing relevant occurs.

3.
+     * Advertise out latch that backends can use to wake us up while we're
Typo - %s/out/our

4. Is it a good idea to add log messages in the DoCustodianTasks()
loop? Maybe at a debug level? The log message can say the current task
the custodian is processing. And/Or setting the custodian's status on
the ps display is also a good idea IMO.

0002 and 0003:
1.
+CHECKPOINT;
+DO $$
I think we need to ensure that there are some snapshot files before
the checkpoint. Otherwise, it may happen that the above test case
exits without the custodian process doing anything.

2. I think the best way to test the custodian process code is by
adding a TAP test module to see that the custodian process actually
kicks in. Perhaps add elog(DEBUGX, ...) messages to various custodian
process functions and check whether they show up in the server logs.

0004:
I think the 0004 patch can be merged into 0001, 0002 and 0003 patches.
Otherwise the patch LGTM.

Few thoughts:
1. I think we can trivially extend the custodian process to remove any
future WAL files on the old timeline, something like the attached
0001-Move-removal-of-future-WAL-files-on-the-old-timeline.text file).
While this offloads the recovery a bit, the server may archive such
WAL files before the custodian removes them. We can do a bit more to
stop the server from archiving such WAL files, but that needs more
coding. I don't think we need to do all that now, perhaps, we can give
it a try once the basic custodian stuff gets in.
2. Moving RemovePgTempFiles() to the custodian can bring up the server
soon. The idea is that the postmaster just renames the temp
directories and informs the custodian so that it can go delete such
temp files and directories. I have personally seen cases where the
server spent a good amount of time cleaning up temp files. We can park
it for later.
3. Moving RemoveOldXlogFiles() to the custodian can make checkpoints faster.
4. PreallocXlogFiles() - if we ever have plans to make pre-allocation
more aggressive (pre-allocate more than 1 WAL file), perhaps letting
custodian do that is a good idea. Again, too many tasks for a single
process.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Bharath Rupireddy
Date:
On Wed, Nov 30, 2022 at 4:52 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Nov 30, 2022 at 10:48 AM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
> >
> >
> > cfbot is not happy with v16.  AFAICT this is just due to poor placement, so
> > here is another attempt with the tests moved to a new location.  Apologies
> > for the noise.
>
> Thanks for the patches. I spent some time on reviewing v17 patch set
> and here are my comments:
>
> 0001:
> 1. I think the custodian process needs documentation - it needs a
> definition in glossary.sgml and perhaps a dedicated page describing
> what tasks it takes care of.
>
> 2.
> +        LWLockReleaseAll();
> +        ConditionVariableCancelSleep();
> +        AbortBufferIO();
> +        UnlockBuffers();
> +        ReleaseAuxProcessResources(false);
> +        AtEOXact_Buffers(false);
> +        AtEOXact_SMgr();
> +        AtEOXact_Files(false);
> +        AtEOXact_HashTables(false);
> Do we need all of these in the exit path? Isn't the stuff that
> ShutdownAuxiliaryProcess() does enough for the custodian process?
> AFAICS, the custodian process uses LWLocks (which the
> ShutdownAuxiliaryProcess() takes care of) and it doesn't access shared
> buffers and so on.
> Having said that, I'm fine to keep them for future use and all of
> those cleanup functions exit if nothing related occurs.
>
> 3.
> +     * Advertise out latch that backends can use to wake us up while we're
> Typo - %s/out/our
>
> 4. Is it a good idea to add log messages in the DoCustodianTasks()
> loop? Maybe at a debug level? The log message can say the current task
> the custodian is processing. And/Or setting the custodian's status on
> the ps display is also a good idea IMO.
>
> 0002 and 0003:
> 1.
> +CHECKPOINT;
> +DO $$
> I think we need to ensure that there are some snapshot files before
> the checkpoint. Otherwise, it may happen that the above test case
> exits without the custodian process doing anything.
>
> 2. I think the best way to test the custodian process code is by
> adding a TAP test module to see actually the custodian process kicks
> in. Perhaps, add elog(DEBUGX,...) messages to various custodian
> process functions and see if we see the logs in server logs.
>
> 0004:
> I think the 0004 patch can be merged into 0001, 0002 and 0003 patches.
> Otherwise the patch LGTM.
>
> Few thoughts:
> 1. I think we can trivially extend the custodian process to remove any
> future WAL files on the old timeline, something like the attached
> 0001-Move-removal-of-future-WAL-files-on-the-old-timeline.text file).
> While this offloads the recovery a bit, the server may archive such
> WAL files before the custodian removes them. We can do a bit more to
> stop the server from archiving such WAL files, but that needs more
> coding. I don't think we need to do all that now, perhaps, we can give
> it a try once the basic custodian stuff gets in.
> 2. Moving RemovePgTempFiles() to the custodian can bring up the server
> soon. The idea is that the postmaster just renames the temp
> directories and informs the custodian so that it can go delete such
> temp files and directories. I have personally seen cases where the
> server spent a good amount of time cleaning up temp files. We can park
> it for later.
> 3. Moving RemoveOldXlogFiles() to the custodian can make checkpoints faster.
> 4. PreallocXlogFiles() - if we ever have plans to make pre-allocation
> more aggressive (pre-allocate more than 1 WAL file), perhaps letting
> custodian do that is a good idea. Again, too many tasks for a single
> process.

Another comment:
IIUC, there's no custodian_delay GUC because we want to avoid unnecessary
wakeups for power savings (being discussed in the other thread).
However, can it happen that the custodian misses SetLatch wakeups
from other backends? In other words, can the custodian process be
sleeping while there's work to do?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Wed, Nov 30, 2022 at 05:27:10PM +0530, Bharath Rupireddy wrote:
> On Wed, Nov 30, 2022 at 4:52 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
>> Thanks for the patches. I spent some time on reviewing v17 patch set
>> and here are my comments:

Thanks for reviewing!

>> 0001:
>> 1. I think the custodian process needs documentation - it needs a
>> definition in glossary.sgml and perhaps a dedicated page describing
>> what tasks it takes care of.

Good catch.  I added this in v18.  I stopped short of adding a dedicated
page to describe the tasks because 1) there are no parameters for the
custodian and 2) AFAICT none of its tasks are described in the docs today.

>> 2.
>> +        LWLockReleaseAll();
>> +        ConditionVariableCancelSleep();
>> +        AbortBufferIO();
>> +        UnlockBuffers();
>> +        ReleaseAuxProcessResources(false);
>> +        AtEOXact_Buffers(false);
>> +        AtEOXact_SMgr();
>> +        AtEOXact_Files(false);
>> +        AtEOXact_HashTables(false);
>> Do we need all of these in the exit path? Isn't the stuff that
>> ShutdownAuxiliaryProcess() does enough for the custodian process?
>> AFAICS, the custodian process uses LWLocks (which the
>> ShutdownAuxiliaryProcess() takes care of) and it doesn't access shared
>> buffers and so on.
>> Having said that, I'm fine to keep them for future use and all of
>> those cleanup functions exit if nothing related occurs.

Yeah, I don't think we need a few of these.  In v18, I've kept the
following:
    * LWLockReleaseAll()
    * ConditionVariableCancelSleep()
    * ReleaseAuxProcessResources(false)
    * AtEOXact_Files(false)

>> 3.
>> +     * Advertise out latch that backends can use to wake us up while we're
>> Typo - %s/out/our

fixed

>> 4. Is it a good idea to add log messages in the DoCustodianTasks()
>> loop? Maybe at a debug level? The log message can say the current task
>> the custodian is processing. And/Or setting the custodian's status on
>> the ps display is also a good idea IMO.

I'd like to pick these up in a new thread if/when this initial patch set is
committed.  The tasks already do some logging, and the checkpointer process
doesn't update the ps display for these tasks today.

>> 0002 and 0003:
>> 1.
>> +CHECKPOINT;
>> +DO $$
>> I think we need to ensure that there are some snapshot files before
>> the checkpoint. Otherwise, it may happen that the above test case
>> exits without the custodian process doing anything.
>>
>> 2. I think the best way to test the custodian process code is by
>> adding a TAP test module to see actually the custodian process kicks
>> in. Perhaps, add elog(DEBUGX,...) messages to various custodian
>> process functions and see if we see the logs in server logs.

The test appears to reliably create snapshot and mapping files, so if the
directories are empty at some point after the checkpoint at the end, we can
be reasonably certain the custodian took action.  I didn't add explicit
checks that there are files in the directories before the checkpoint
because a concurrent checkpoint could make such checks unreliable.

>> 0004:
>> I think the 0004 patch can be merged into 0001, 0002 and 0003 patches.
>> Otherwise the patch LGTM.

I'm keeping this one separate because I've received conflicting feedback
about the idea.

>> 1. I think we can trivially extend the custodian process to remove any
>> future WAL files on the old timeline, something like the attached
>> 0001-Move-removal-of-future-WAL-files-on-the-old-timeline.text file).
>> While this offloads the recovery a bit, the server may archive such
>> WAL files before the custodian removes them. We can do a bit more to
>> stop the server from archiving such WAL files, but that needs more
>> coding. I don't think we need to do all that now, perhaps, we can give
>> it a try once the basic custodian stuff gets in.
>> 2. Moving RemovePgTempFiles() to the custodian can bring up the server
>> soon. The idea is that the postmaster just renames the temp
>> directories and informs the custodian so that it can go delete such
>> temp files and directories. I have personally seen cases where the
>> server spent a good amount of time cleaning up temp files. We can park
>> it for later.
>> 3. Moving RemoveOldXlogFiles() to the custodian can make checkpoints faster.
>> 4. PreallocXlogFiles() - if we ever have plans to make pre-allocation
>> more aggressive (pre-allocate more than 1 WAL file), perhaps letting
>> custodian do that is a good idea. Again, too many tasks for a single
>> process.

I definitely want to do #2.  I have some patches for that upthread, but I
removed them for now based on Simon's feedback.  I intend to pick that up
in a new thread.  I haven't thought too much about the others yet.

> Another comment:
> IIUC, there's no custodian_delay GUC as we want to avoid unnecessary
> wakeups for power savings (being discussed in the other thread).
> However, can it happen that the custodian missed to capture SetLatch
> wakeups by other backends? In other words, can the custodian process
> be sleeping when there's work to do?

I'm not aware of any way this could happen, but if there is one, I think we
should treat it as a bug instead of relying on the custodian process to
periodically wake up and check for work to do.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Bharath Rupireddy
Date:
On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> >> 4. Is it a good idea to add log messages in the DoCustodianTasks()
> >> loop? Maybe at a debug level? The log message can say the current task
> >> the custodian is processing. And/Or setting the custodian's status on
> >> the ps display is also a good idea IMO.
>
> I'd like to pick these up in a new thread if/when this initial patch set is
> committed.  The tasks already do some logging, and the checkpointer process
> doesn't update the ps display for these tasks today.

It'll be good to have some kind of dedicated monitoring for the
custodian process, since it can do a "good" amount of work at times,
and users should have a way to know what it is currently doing -
whether that's logs at debug level, progress reporting via an
ereport_startup_progress()-style mechanism, the ps display, a
pg_stat_custodian view, a special function that reports some details,
or something else. In any case, I agree with parking this for later.

> >> 0002 and 0003:
> >> 1.
> >> +CHECKPOINT;
> >> +DO $$
> >> I think we need to ensure that there are some snapshot files before
> >> the checkpoint. Otherwise, it may happen that the above test case
> >> exits without the custodian process doing anything.
> >>
> >> 2. I think the best way to test the custodian process code is by
> >> adding a TAP test module to see actually the custodian process kicks
> >> in. Perhaps, add elog(DEBUGX,...) messages to various custodian
> >> process functions and see if we see the logs in server logs.
>
> The test appears to reliably create snapshot and mapping files, so if the
> directories are empty at some point after the checkpoint at the end, we can
> be reasonably certain the custodian took action.  I didn't add explicit
> checks that there are files in the directories before the checkpoint
> because a concurrent checkpoint could make such checks unreliable.

I think you're right. I added SQL checks to verify that the snapshot
and mapping file counts are > 0 (see [1]), and the cirrus-ci runs are
happy too - https://github.com/BRupireddy/postgres/tree/custodian_review_2.
I think we can consider adding these count > 0 checks to the tests.

> >> 0004:
> >> I think the 0004 patch can be merged into 0001, 0002 and 0003 patches.
> >> Otherwise the patch LGTM.
>
> I'm keeping this one separate because I've received conflicting feedback
> about the idea.

If we classify the custodian as a process doing non-critical tasks that
have nothing to do with regular server functioning, then processing
ShutdownRequestPending promptly looks okay. However, delaying these
non-critical tasks, such as file removals that reclaim disk space,
might impact the server overall, especially when it's approaching 100%
disk usage and we want the custodian to finish its job before we shut
down the server.

On the other hand, delaying the processing of shutdown requests can
also impact the server overall (it might delay restarts, failovers,
etc.), because at times there can be a lot of tasks with a good amount
of work pending in the custodian's task queue.

Having said that, I'm okay with processing ShutdownRequestPending as
early as possible; however, should we also add CHECK_FOR_INTERRUPTS()
alongside the ShutdownRequestPending checks?

Also, I think it's enough to check ShutdownRequestPending in
DoCustodianTasks()'s main loop, and we can let
RemoveOldSerializedSnapshots() and RemoveOldLogicalRewriteMappings()
run to completion as they do today.

While thinking about this, one thing that really struck me is what
happens if we let the custodian exit (say, after processing
ShutdownRequestPending immediately, or after a restart), leaving other
queued tasks behind. The custodian will never get to those tasks
unless the requestors (say, the checkpointer or some other process)
request them again after restart. Maybe we don't need to worry about
it; maybe we do. Maybe it's overkill to save the custodian's task
state to disk so that it can come up and finish the leftover tasks
upon restart.

> > Another comment:
> > IIUC, there's no custodian_delay GUC as we want to avoid unnecessary
> > wakeups for power savings (being discussed in the other thread).
> > However, can it happen that the custodian missed to capture SetLatch
> > wakeups by other backends? In other words, can the custodian process
> > be sleeping when there's work to do?
>
> I'm not aware of any way this could happen, but if there is one, I think we
> should treat it as a bug instead of relying on the custodian process to
> periodically wake up and check for work to do.

One possible scenario is that the requestor adds its task details to
the queue and sets the latch while the custodian is in the midst of
processing a task, so the custodian doesn't act on that SetLatch()
immediately. However, the latch remains set, which guarantees the
requestor that the added task will be processed once the current task
completes. Beyond that, I don't know of other ways the custodian could
miss a SetLatch().

[1]
diff --git a/contrib/test_decoding/expected/rewrite.out
b/contrib/test_decoding/expected/rewrite.out
index 214a514a0a..0029e48852 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -163,6 +163,20 @@ DROP FUNCTION iamalongfunction();
 DROP FUNCTION exec(text);
 DROP ROLE regress_justforcomments;
 -- make sure custodian cleans up files
+-- make sure snapshot files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalsnapdir();
+ ?column?
+----------
+ t
+(1 row)
+
+-- make sure rewrite mapping files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalmapdir();
+ ?column?
+----------
+ t
+(1 row)
+
 CHECKPOINT;
 DO $$
 DECLARE
diff --git a/contrib/test_decoding/sql/rewrite.sql
b/contrib/test_decoding/sql/rewrite.sql
index d66f70f837..c076809f37 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -107,6 +107,13 @@ DROP FUNCTION exec(text);
 DROP ROLE regress_justforcomments;

 -- make sure custodian cleans up files
+
+-- make sure snapshot files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalsnapdir();
+
+-- make sure rewrite mapping files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalmapdir();
+
 CHECKPOINT;
 DO $$
 DECLARE

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote:
> On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> The test appears to reliably create snapshot and mapping files, so if the
>> directories are empty at some point after the checkpoint at the end, we can
>> be reasonably certain the custodian took action.  I didn't add explicit
>> checks that there are files in the directories before the checkpoint
>> because a concurrent checkpoint could make such checks unreliable.
> 
> I think you're right. I added sqls to see if the snapshot and mapping
> files count > 0, see [1] and the cirrus-ci members are happy too -
> https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
> think we can consider adding these count > 0 checks to tests.

My worry about adding "count > 0" checks is that a concurrent checkpoint
could make them unreliable.  In other words, those checks might ordinarily
work, but if an automatic checkpoint causes the files to be cleaned up just
beforehand, they will fail.

> Having said above, I'm okay to process ShutdownRequestPending as early
> as possible, however, should we also add CHECK_FOR_INTERRUPTS()
> alongside ShutdownRequestPending?

I'm not seeing a need for CHECK_FOR_INTERRUPTS.  Do you see one?

> While thinking about this, one thing that really struck me is what
> happens if we let the custodian exit, say after processing
> ShutdownRequestPending immediately or after a restart, leaving other
> queued tasks? The custodian will never get to work on those tasks
> unless the requestors (say checkpoint or some other process) requests
> it to do so after restart. Maybe, we don't need to worry about it.
> Maybe we need to worry about it. Maybe it's an overkill to save the
> custodian's task state to disk so that it can come up and do the
> leftover tasks upon restart.

Yes, tasks will need to be retried when the server starts again.  The ones
in this patch set should be requested again during the next checkpoint.
Temporary file cleanup would always be requested during server start, so
that should be handled as well.  Even today, the server might abruptly shut
down while executing these tasks, and we don't have any special handling
for that.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Bharath Rupireddy
Date:
On Sat, Dec 3, 2022 at 12:45 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote:
> > On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> The test appears to reliably create snapshot and mapping files, so if the
> >> directories are empty at some point after the checkpoint at the end, we can
> >> be reasonably certain the custodian took action.  I didn't add explicit
> >> checks that there are files in the directories before the checkpoint
> >> because a concurrent checkpoint could make such checks unreliable.
> >
> > I think you're right. I added sqls to see if the snapshot and mapping
> > files count > 0, see [1] and the cirrus-ci members are happy too -
> > https://github.com/BRupireddy/postgres/tree/custodian_review_2. I
> > think we can consider adding these count > 0 checks to tests.
>
> My worry about adding "count > 0" checks is that a concurrent checkpoint
> could make them unreliable.  In other words, those checks might ordinarily
> work, but if an automatic checkpoint causes the files be cleaned up just
> beforehand, they will fail.

Hm. A TAP test module would have been better for testing the custodian
code reliably. Anyway, that needn't stop the patch from getting in. If
required, we can park the TAP test module for later - IMO. Others may
have different thoughts here.

> > Having said above, I'm okay to process ShutdownRequestPending as early
> > as possible, however, should we also add CHECK_FOR_INTERRUPTS()
> > alongside ShutdownRequestPending?
>
> I'm not seeing a need for CHECK_FOR_INTERRUPTS.  Do you see one?

Since the custodian uses SignalHandlerForShutdownRequest as its SIGINT
and SIGTERM handler (rather than StatementCancelHandler and die,
respectively), there's no need for CFI, I guess. Also, none of the
flags that CHECK_FOR_INTERRUPTS() responds to apply to the custodian.

> > While thinking about this, one thing that really struck me is what
> > happens if we let the custodian exit, say after processing
> > ShutdownRequestPending immediately or after a restart, leaving other
> > queued tasks? The custodian will never get to work on those tasks
> > unless the requestors (say checkpoint or some other process) requests
> > it to do so after restart. Maybe, we don't need to worry about it.
> > Maybe we need to worry about it. Maybe it's an overkill to save the
> > custodian's task state to disk so that it can come up and do the
> > leftover tasks upon restart.
>
> Yes, tasks will need to be retried when the server starts again.  The ones
> in this patch set should be requested again during the next checkpoint.
> Temporary file cleanup would always be requested during server start, so
> that should be handled as well.  Even today, the server might abruptly shut
> down while executing these tasks, and we don't have any special handling
> for that.

Right.

The v18 patch set posted upthread
https://www.postgresql.org/message-id/20221201214026.GA1799688%40nathanxps13
looks good to me. I see the CF entry is marked RfC -
https://commitfest.postgresql.org/41/3448/.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Thu, Feb 02, 2023 at 09:48:08PM -0800, Nathan Bossart wrote:
> rebased for cfbot

another rebase for cfbot

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: O(n) tasks cause lengthy startups and checkpoints

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> another rebase for cfbot

I took a brief look through v20, and generally liked what I saw,
but there are a few things troubling me:

* The comments for CustodianEnqueueTask claim that it won't enqueue an
already-queued task, but I don't think I believe that, because it stops
scanning as soon as it finds an empty slot.  That data structure seems
quite oddly designed in any case.  Why isn't it simply an array of
need-to-run-this-one booleans indexed by the CustodianTask enum?
Fairness of dispatch could be ensured by the same state variable that
CustodianGetNextTask already uses to track which array element to
inspect next.  While that wouldn't guarantee that tasks A and B are
dispatched in the same order they were requested in, I'm not sure why
we should care.
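
The structure Tom describes might look something like the following standalone
sketch (the task names are made up for illustration; the real enum and the
round-robin state live in the custodian patch):

```c
#include <stdbool.h>
#include <assert.h>

/* Hypothetical task list; the actual enum is defined by the patch. */
typedef enum CustodianTask
{
    CUSTODIAN_REMOVE_TEMP_FILES,
    CUSTODIAN_REMOVE_SNAPSHOTS,
    CUSTODIAN_REMOVE_MAPPINGS,
    NUM_CUSTODIAN_TASKS
} CustodianTask;

/* One need-to-run flag per task, indexed by the enum. */
static bool task_needed[NUM_CUSTODIAN_TASKS];

/* Round-robin cursor, like the one CustodianGetNextTask already keeps. */
static int next_task = 0;

/*
 * Enqueueing is idempotent by construction: requesting an already-queued
 * task just sets an already-set flag, so no duplicate-scan is needed.
 */
static void
enqueue_task(CustodianTask task)
{
    task_needed[task] = true;
}

/*
 * Return the next requested task (clearing its flag) and advance the
 * cursor past it for fairness; return -1 if nothing is queued.
 */
static int
get_next_task(void)
{
    for (int i = 0; i < NUM_CUSTODIAN_TASKS; i++)
    {
        int candidate = (next_task + i) % NUM_CUSTODIAN_TASKS;

        if (task_needed[candidate])
        {
            task_needed[candidate] = false;
            next_task = (candidate + 1) % NUM_CUSTODIAN_TASKS;
            return candidate;
        }
    }
    return -1;
}
```

As Tom notes, this dispatches tasks in enum-rotation order rather than
request order, which should be acceptable for maintenance work.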

* I don't much like cust_lck, mainly because you didn't bother to
document what it protects (in general, CustodianShmemStruct deserves
more than zero commentary).  Do we need it at all?  If the task-needed
flags were sig_atomic_t not bool, we probably don't need it for the
basic job of tracking which tasks remain to be run.  I see that some
of the tasks have possibly-non-atomically-assigned parameters to be
transmitted, but restricting cust_lck to protect those seems like a
better idea.
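
The lock-free part of this can be illustrated with a minimal standalone
sketch (the flag name is invented; sig_atomic_t guarantees that a single
flag can be read and written atomically with respect to signal handlers,
which is why plain set/clear flags need no lock, while multi-word task
parameters would still want cust_lck protecting just them):

```c
#include <signal.h>
#include <assert.h>

/*
 * Async-signal-safe request flag.  A sig_atomic_t store is atomic with
 * respect to signal handlers, so neither the handler nor the reader
 * needs a lock for the flag itself.
 */
static volatile sig_atomic_t temp_files_requested = 0;

static void
request_handler(int signo)
{
    (void) signo;
    /* Safe in a handler: only a sig_atomic_t store, no locks, no malloc. */
    temp_files_requested = 1;
}
```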

* Not quite convinced about handle_arg_func, mainly because the Datum
API would be pretty inconvenient for any task with more than one arg.
Why do we need that at all, rather than saying that callers should
set up any required parameters separately before invoking
RequestCustodian?

* Why does LookupCustodianFunctions think it needs to search the
constant array?

* The original proposal included moving RemovePgTempFiles into this
mechanism, which I thought was probably the most useful bit of the
whole thing.  I'm sad to see that gone, what became of it?

            regards, tom lane



Re: O(n) tasks cause lengthy startups and checkpoints

From
Andres Freund
Date:
Hi,

On 2023-04-02 13:40:05 -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
> > another rebase for cfbot
> 
> I took a brief look through v20, and generally liked what I saw,
> but there are a few things troubling me:

Just want to note that I've repeatedly objected to 0002 and 0003, i.e. moving
serialized logical decoding snapshots and mapping files to the custodian, and
still do. Without further work it increases wraparound risks (the filenames
contain xids), and afaict nothing has been done to ameliorate that.

Without those, the current patch series does not have any tasks:

> * The original proposal included moving RemovePgTempFiles into this
> mechanism, which I thought was probably the most useful bit of the
> whole thing.  I'm sad to see that gone, what became of it?

Greetings,

Andres Freund



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Sun, Apr 02, 2023 at 01:40:05PM -0400, Tom Lane wrote:
> I took a brief look through v20, and generally liked what I saw,
> but there are a few things troubling me:

Thanks for taking a look.

> * The comments for CustodianEnqueueTask claim that it won't enqueue an
> already-queued task, but I don't think I believe that, because it stops
> scanning as soon as it finds an empty slot.  That data structure seems
> quite oddly designed in any case.  Why isn't it simply an array of
> need-to-run-this-one booleans indexed by the CustodianTask enum?
> Fairness of dispatch could be ensured by the same state variable that
> CustodianGetNextTask already uses to track which array element to
> inspect next.  While that wouldn't guarantee that tasks A and B are
> dispatched in the same order they were requested in, I'm not sure why
> we should care.

That works.  Will update.

> * I don't much like cust_lck, mainly because you didn't bother to
> document what it protects (in general, CustodianShmemStruct deserves
> more than zero commentary).  Do we need it at all?  If the task-needed
> flags were sig_atomic_t not bool, we probably don't need it for the
> basic job of tracking which tasks remain to be run.  I see that some
> of the tasks have possibly-non-atomically-assigned parameters to be
> transmitted, but restricting cust_lck to protect those seems like a
> better idea.

Will do.

> * Not quite convinced about handle_arg_func, mainly because the Datum
> API would be pretty inconvenient for any task with more than one arg.
> Why do we need that at all, rather than saying that callers should
> set up any required parameters separately before invoking
> RequestCustodian?

I had done it this way earlier, but added the Datum argument based on
feedback upthread [0].  It presently has only one proposed use, anyway, so
I think it would be fine to switch it back for now.

> * Why does LookupCustodianFunctions think it needs to search the
> constant array?

The order of the tasks in the array isn't guaranteed to match the order in
the CustodianTask enum.

> * The original proposal included moving RemovePgTempFiles into this
> mechanism, which I thought was probably the most useful bit of the
> whole thing.  I'm sad to see that gone, what became of it?

I postponed that based on advice from upthread [1].  I was hoping to start
a dedicated thread for that immediately after the custodian infrastructure
was committed.  FWIW I agree that it's the most useful task of what's
proposed thus far.

[0] https://postgr.es/m/20220703172732.wembjsb55xl63vuw%40awork3.anarazel.de
[1] https://postgr.es/m/CANbhV-EagKLoUH7tLEfg__VcLu37LY78F8gvLMzHrRZyZKm6sw%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Sun, Apr 02, 2023 at 11:42:26AM -0700, Andres Freund wrote:
> Just want to note that I've repeatedly objected to 0002 and 0003, i.e. moving
> serialized logical decoding snapshots and mapping files to the custodian, and
> still do. Without further work it increases wraparound risks (the filenames
> contain xids), and afaict nothing has been done to ameliorate that.

From your feedback earlier [0], I was under the (perhaps false) impression
that adding a note about this existing issue in the commit message was
sufficient, at least initially.  I did add such a note in 0003, but it's
missing from 0002 for some reason.  I suspect I left it out because the
serialized snapshot file names do not contain XIDs.  You cleared that up
earlier [1], so this is my bad.

It's been a little while since I dug into this, but I do see your point
that the wraparound risk could be higher in some cases.  For example, if
you have a billion temp files to clean up, the custodian could be stuck on
that task for a long time.  I will give this some further thought.  I'm all
ears if anyone has ideas about how to reduce this risk.

[0] https://postgr.es/m/20220702225456.zit5kjdtdfqmjujt%40alap3.anarazel.de
[1] https://postgr.es/m/20220217065938.x2esfdppzypegn5j%40alap3.anarazel.de

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> On Sun, Apr 02, 2023 at 01:40:05PM -0400, Tom Lane wrote:
>> * Why does LookupCustodianFunctions think it needs to search the
>> constant array?

> The order of the tasks in the array isn't guaranteed to match the order in
> the CustodianTask enum.

Why not?  It's a constant array, we can surely manage to make its
order match the enum.
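
For the record, the idiom Tom is alluding to can be sketched as follows
(task names and functions are made up; C99 designated initializers pin
each array entry to its enum value, so lookup is a plain index and no
search is needed):

```c
#include <assert.h>
#include <stddef.h>

typedef enum CustodianTask
{
    TASK_REMOVE_TEMP_FILES,
    TASK_REMOVE_SNAPSHOTS,
    NUM_CUSTODIAN_TASKS
} CustodianTask;

typedef void (*custodian_fn) (void);

static void remove_temp_files(void) { /* ... */ }
static void remove_snapshots(void) { /* ... */ }

/*
 * Designated initializers guarantee the array order matches the enum,
 * even if entries are written in a different order or one is added later.
 */
static const custodian_fn custodian_functions[NUM_CUSTODIAN_TASKS] = {
    [TASK_REMOVE_SNAPSHOTS] = remove_snapshots,
    [TASK_REMOVE_TEMP_FILES] = remove_temp_files,
};
```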

>> * The original proposal included moving RemovePgTempFiles into this
>> mechanism, which I thought was probably the most useful bit of the
>> whole thing.  I'm sad to see that gone, what became of it?

> I postponed that based on advice from upthread [1].  I was hoping to start
> a dedicated thread for that immediately after the custodian infrastructure
> was committed.  FWIW I agree that it's the most useful task of what's
> proposed thus far.

Hmm, given Andres' objections there's little point in moving forward
without that task.

            regards, tom lane



Re: O(n) tasks cause lengthy startups and checkpoints

From
Tom Lane
Date:
Nathan Bossart <nathandbossart@gmail.com> writes:
> It's been a little while since I dug into this, but I do see your point
> that the wraparound risk could be higher in some cases.  For example, if
> you have a billion temp files to clean up, the custodian could be stuck on
> that task for a long time.  I will give this some further thought.  I'm all
> ears if anyone has ideas about how to reduce this risk.

I wonder if a single long-lived custodian task is the right model at all.
At least for RemovePgTempFiles, it'd make more sense to write it as a
background worker that spawns, does its work, and then exits,
independently of anything else.  Of course, then you need some mechanism
for ensuring that a bgworker slot is available when needed, but that
doesn't seem horridly difficult --- we could have a few "reserved
bgworker" slots, perhaps.  An idle bgworker slot doesn't cost much.

            regards, tom lane



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Sun, Apr 02, 2023 at 04:23:05PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> On Sun, Apr 02, 2023 at 01:40:05PM -0400, Tom Lane wrote:
>>> * Why does LookupCustodianFunctions think it needs to search the
>>> constant array?
> 
>> The order of the tasks in the array isn't guaranteed to match the order in
>> the CustodianTask enum.
> 
> Why not?  It's a constant array, we can surely manage to make its
> order match the enum.

Alright.  I'll change this.

>>> * The original proposal included moving RemovePgTempFiles into this
>>> mechanism, which I thought was probably the most useful bit of the
>>> whole thing.  I'm sad to see that gone, what became of it?
> 
>> I postponed that based on advice from upthread [1].  I was hoping to start
>> a dedicated thread for that immediately after the custodian infrastructure
>> was committed.  FWIW I agree that it's the most useful task of what's
>> proposed thus far.
> 
> Hmm, given Andres' objections there's little point in moving forward
> without that task.

Yeah.  I should probably tackle that one first and leave the logical tasks
for later, given there is some prerequisite work required.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Sun, Apr 02, 2023 at 04:37:38PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> It's been a little while since I dug into this, but I do see your point
>> that the wraparound risk could be higher in some cases.  For example, if
>> you have a billion temp files to clean up, the custodian could be stuck on
>> that task for a long time.  I will give this some further thought.  I'm all
>> ears if anyone has ideas about how to reduce this risk.
> 
> I wonder if a single long-lived custodian task is the right model at all.
> At least for RemovePgTempFiles, it'd make more sense to write it as a
> background worker that spawns, does its work, and then exits,
> independently of anything else.  Of course, then you need some mechanism
> for ensuring that a bgworker slot is available when needed, but that
> doesn't seem horridly difficult --- we could have a few "reserved
> bgworker" slots, perhaps.  An idle bgworker slot doesn't cost much.

This has crossed my mind.  Even if we use the custodian for several
different tasks, perhaps it could shut down while not in use.  For many
servers, the custodian process will be used sparingly, if at all.  And if
we introduce something like custodian_max_workers, perhaps we could dodge
the wraparound issue a bit by setting the default to the number of
supported tasks.  That being said, this approach adds some complexity.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
I sent this one to the next commitfest and marked it as waiting-on-author
and targeted for v17.  I'm aiming to have something that addresses the
latest feedback ready for the July commitfest.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: O(n) tasks cause lengthy startups and checkpoints

From
Daniel Gustafsson
Date:
> On 4 Apr 2023, at 05:36, Nathan Bossart <nathandbossart@gmail.com> wrote:
> 
> I sent this one to the next commitfest and marked it as waiting-on-author
> and targeted for v17.  I'm aiming to have something that addresses the
> latest feedback ready for the July commitfest.

Have you had a chance to look at this such that there is something ready?

--
Daniel Gustafsson




Re: O(n) tasks cause lengthy startups and checkpoints

From
Nathan Bossart
Date:
On Tue, Jul 04, 2023 at 09:30:43AM +0200, Daniel Gustafsson wrote:
>> On 4 Apr 2023, at 05:36, Nathan Bossart <nathandbossart@gmail.com> wrote:
>> 
>> I sent this one to the next commitfest and marked it as waiting-on-author
>> and targeted for v17.  I'm aiming to have something that addresses the
>> latest feedback ready for the July commitfest.
> 
> Have you had a chance to look at this such that there is something ready?

Not yet, sorry.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com