Thread: O(n) tasks cause lengthy startups and checkpoints
Hi hackers,

Thanks to 61752af, SyncDataDirectory() can make use of syncfs() to
avoid individually syncing all database files after a crash.  However,
as noted earlier this year [0], there are still a number of O(n) tasks
that affect startup and checkpointing that I'd like to improve.
Below, I've attempted to summarize each task and to offer ideas for
improving matters.  I'll likely split each of these into its own
thread, given there is community interest for such changes.

1) CheckPointSnapBuild(): This function loops through
   pg_logical/snapshots to remove all snapshots that are no longer
   needed.  If there are many entries in this directory, this can take
   a long time.  The note above this function indicates that this is
   done during checkpoints simply because it is convenient.  IIUC
   there is no requirement that this function actually completes for a
   given checkpoint.  My current idea is to move this to a new
   maintenance worker.

2) CheckPointLogicalRewriteHeap(): This function loops through
   pg_logical/mappings to remove old mappings and flush all remaining
   ones.  IIUC there is no requirement that the "remove old mappings"
   part must complete for a given checkpoint, but the "flush all
   remaining" portion allows replay after a checkpoint to only "deal
   with the parts of a mapping that have been written out after the
   checkpoint started."  Therefore, I think we should move the "remove
   old mappings" part to a new maintenance worker (probably the same
   one as for 1), and we should consider using syncfs() for the "flush
   all remaining" part.  (I suspect the main argument against the
   latter will be that it could cause IO spikes.)

3) RemovePgTempFiles(): This step can delay startup if there are many
   temporary files to individually remove.  This step is already
   optionally done after a crash via the remove_temp_files_after_crash
   GUC.  I propose that we have startup move the temporary file
   directories aside and create new ones, and then a separate worker
   (probably the same one from 1 and 2) could clean up the old files.

4) StartupReorderBuffer(): This step deletes logical slot data that
   has been spilled to disk.  This code appears to be written to avoid
   deleting different types of files in these directories, but AFAICT
   there shouldn't be any other files.  Therefore, I think we could do
   something similar to 3 (i.e., move the directories aside during
   startup and clean them up via a new maintenance worker).

I realize adding a new maintenance worker might be a bit heavy-handed,
but I think it would be nice to have somewhere to offload tasks that
really shouldn't impact startup and checkpointing.  I imagine such a
process would come in handy down the road, too.  WDYT?

Nathan

[0] https://postgr.es/m/32B59582-AA6C-4609-B08F-2256A271F7A5%40amazon.com
+1 to the idea. I don't see a reason why the checkpointer has to do all of that. Keeping checkpoints to the minimal essential work helps servers recover faster in the event of a crash.
RemoveOldXlogFiles() is also an O(n) operation that can at least be avoided during the end-of-recovery (CHECKPOINT_END_OF_RECOVERY) checkpoint. When a sufficient number of WAL files have accumulated and the previous checkpoint did not get a chance to clean them up, this can increase the unavailability of the server.
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
Hi,

On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:
> I realize adding a new maintenance worker might be a bit heavy-handed,
> but I think it would be nice to have somewhere to offload tasks that
> really shouldn't impact startup and checkpointing. I imagine such a
> process would come in handy down the road, too. WDYT?

-1. I think the overhead of an additional worker is disproportional here.
And there's simplicity benefits in having a predictable cleanup interlock
as well.

I think particularly for the snapshot stuff it'd be better to optimize
away unnecessary snapshot files, rather than making the cleanup more
asynchronous.

Greetings,

Andres Freund
On 12/1/21, 2:56 PM, "Andres Freund" <andres@anarazel.de> wrote:
> On 2021-12-01 20:24:25 +0000, Bossart, Nathan wrote:
>> I realize adding a new maintenance worker might be a bit heavy-handed,
>> but I think it would be nice to have somewhere to offload tasks that
>> really shouldn't impact startup and checkpointing. I imagine such a
>> process would come in handy down the road, too. WDYT?
>
> -1. I think the overhead of an additional worker is disproportional here. And
> there's simplicity benefits in having a predictable cleanup interlock as well.

Another idea I had was to put some upper limit on how much time is spent
on such tasks.  For example, a checkpoint would only spend X minutes on
CheckPointSnapBuild() before giving up until the next one.  I think the
main downside of that approach is that it could lead to unbounded growth,
so perhaps we would limit (or even skip) such tasks only for
end-of-recovery and shutdown checkpoints.  Perhaps the startup tasks
could be limited in a similar fashion.

> I think particularly for the snapshot stuff it'd be better to optimize away
> unnecessary snapshot files, rather than making the cleanup more asynchronous.

I can look into this.  Any pointers would be much appreciated.

Nathan
On Wed, Dec 1, 2021, at 9:19 PM, Bossart, Nathan wrote:
> Another idea I had was to put some upper limit on how much time is
> spent on such tasks.  For example, a checkpoint would only spend X
> minutes on CheckPointSnapBuild() before giving up until the next one.
> I think the main downside of that approach is that it could lead to
> unbounded growth, so perhaps we would limit (or even skip) such tasks
> only for end-of-recovery and shutdown checkpoints.  Perhaps the
> startup tasks could be limited in a similar fashion.
Saying that a certain task is O(n) doesn't mean it needs a separate process to
handle it. Do you have a use case, or even better, numbers (% of checkpoint /
startup time) that make your proposal worthwhile?
I would try to optimize (1) and (2). However, delayed removal can be a
long-term issue if the new routine cannot keep up with the pace of file
creation (especially if the checkpoints are far apart).
For (3), there is already a GUC that avoids the slowdown during startup.
Use it if you think the startup time is more important than the disk space
occupied by useless files.
For (4), you are forgetting that the on-disk state of replication slots is
stored in pg_replslot/SLOTNAME/state. It seems you cannot just rename the
replication slot directory and copy the state file. What happens if there is
a crash before copying the state file?
While we are talking about items (1), (2) and (4), we could probably have an
option to create some ephemeral logical decoding files on a ramdisk (similar
to the statistics directory). I wouldn't like to hijack this thread, but this
proposal could alleviate the possible issues that you pointed out. If people
are interested in this proposal, I can start a new thread about it.
On Thu, Dec 2, 2021 at 1:54 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> I realize adding a new maintenance worker might be a bit heavy-handed,
> but I think it would be nice to have somewhere to offload tasks that
> really shouldn't impact startup and checkpointing. I imagine such a
> process would come in handy down the road, too. WDYT?

+1 for the overall idea of making the checkpoint faster. In fact, we here at
our team have been thinking about this problem for a while. If there are a
lot of files that checkpoint has to loop over and remove, IMO, that task can
be delegated to someone else (maybe a background worker called background
cleaner or bg cleaner; of course, we can have a GUC to enable or disable it).
The checkpoint can just write some marker files (for instance, it can write
snapshot_<cutofflsn> files with the file name itself representing the cutoff
LSN so that the new bg cleaner can remove the snapshot files; similarly, it
can write marker files for other file removals).

Having said that, a new bg cleaner deleting the files asynchronously on
behalf of checkpoint can look like overkill until we have some numbers on
what we could save with this approach. For this purpose, I did a small
experiment to figure out how much time file deletion usually takes [1] on an
SSD: for 1 million files, 8 seconds. I'm sure it will be much more on an HDD.

The bg cleaner can also be used for RemovePgTempFiles: the postmaster could
just rename pgsql_temp to something like pgsql_temp_delete, then proceed
with the server startup, and the bg cleaner can then delete the files. Also,
we could do something similar for removing/recycling old xlog files and
StartupReorderBuffer.

Another idea could be to parallelize the checkpoint, i.e. IIUC, the tasks
that checkpoint does in CheckPointGuts are independent and if we have some
counters like (how many snapshot/mapping files the server generated)

[1] on SSD:
deletion of 1000000 files took 7.930380 seconds
deletion of 500000 files took 3.921676 seconds
deletion of 100000 files took 0.768772 seconds
deletion of 50000 files took 0.400623 seconds
deletion of 10000 files took 0.077565 seconds
deletion of 1000 files took 0.006232 seconds

Regards,
Bharath Rupireddy.
On 12/1/21, 6:06 PM, "Euler Taveira" <euler@eulerto.com> wrote:
> Saying that a certain task is O(n) doesn't mean it needs a separate process to
> handle it. Did you have a use case or even better numbers (% of checkpoint /
> startup time) that makes your proposal worthwhile?

I don't have specific numbers on hand, but each of the four functions I
listed is something I routinely see impacting customers.

> For (3), there is already a GUC that would avoid the slowdown during startup.
> Use it if you think the startup time is more important that disk space occupied
> by useless files.

Setting remove_temp_files_after_crash to false only prevents temp file
cleanup during restart after a backend crash.  It is always called for
other startups.

> For (4), you are forgetting that the on-disk state of replication slots is
> stored in the pg_replslot/SLOTNAME/state. It seems you cannot just rename the
> replication slot directory and copy the state file. What happen if there is a
> crash before copying the state file?

Good point.  I think it's possible to deal with this, though.  Perhaps the
files that should be deleted on startup should go in a separate directory,
or maybe we could devise a way to ensure the state file is copied even if
there is a crash at an inconvenient time.

Nathan
On 12/1/21, 6:48 PM, "Bharath Rupireddy"
<bharath.rupireddyforpostgres@gmail.com> wrote:
> +1 for the overall idea of making the checkpoint faster. In fact, we
> here at our team have been thinking about this problem for a while. If
> there are a lot of files that checkpoint has to loop over and remove,
> IMO, that task can be delegated to someone else (maybe a background
> worker called background cleaner or bg cleaner, of course, we can have
> a GUC to enable or disable it). The checkpoint can just write some

Right.  IMO it isn't optimal to have critical things like startup and
checkpointing depend on somewhat-unrelated tasks.  I understand the
desire to avoid adding additional processes, and maybe it is a bigger
hammer than what is necessary to reduce the impact, but it seemed like a
natural solution for this problem.  That being said, I'm all for
exploring other ways to handle this.

> Another idea could be to parallelize the checkpoint i.e. IIUC, the
> tasks that checkpoint do in CheckPointGuts are independent and if we
> have some counters like (how many snapshot/mapping files that the
> server generated)

Could you elaborate on this?  Is your idea that the checkpointer would
create worker processes like autovacuum does?

Nathan
On Fri, Dec 3, 2021 at 3:01 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> Right. IMO it isn't optimal to have critical things like startup and
> checkpointing depend on somewhat-unrelated tasks. I understand the
> desire to avoid adding additional processes, and maybe it is a bigger
> hammer than what is necessary to reduce the impact, but it seemed like
> a natural solution for this problem. That being said, I'm all for
> exploring other ways to handle this.

Having a generic background cleaner process (controllable via a few GUCs),
which can delete a bunch of files (snapshot, mapping, old WAL, temp files
etc.) or do some other task on behalf of the checkpointer, seems to be the
easiest solution.  I'm open to other ideas, too.

> Could you elaborate on this? Is your idea that the checkpointer would
> create worker processes like autovacuum does?

Yes, I was thinking that the checkpointer creates one or more dynamic
background workers (we can assume one background worker for now) to delete
the files.  If a threshold is crossed (say, the snapshot file count exceeds
it), the new worker gets spawned, which would then enumerate the files and
delete the unneeded ones, and the checkpointer can proceed with the other
tasks and finish the checkpointing.

Having said this, I prefer the background cleaner approach over the dynamic
background worker.  The advantage of the background cleaner is that it can
do other tasks (like other kinds of file deletion).

Another idea could be to use the existing background writer to do the file
deletion while the checkpoint is happening.  But again, this might cause
problems because the bgwriter's flushing of dirty buffers would get delayed.

Regards,
Bharath Rupireddy.
On 12/3/21, 5:57 AM, "Bharath Rupireddy"
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Having a generic background cleaner process (controllable via a few
> GUCs), which can delete a bunch of files (snapshot, mapping, old WAL,
> temp files etc.) or some other task on behalf of the checkpointer,
> seems to be the easiest solution.
>
> I'm too open for other ideas.

I might hack something together for the separate worker approach, if for
no other reason than to make sure I really understand how these functions
work.  If/when a better idea emerges, we can alter course.

Nathan
On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> I might hack something together for the separate worker approach, if
> for no other reason than to make sure I really understand how these
> functions work. If/when a better idea emerges, we can alter course.

Thanks.  As I said upthread, we've been discussing the approach of
offloading some of the checkpoint tasks (like deleting snapshot files)
internally for quite some time, and I would like to share a patch that
adds a new background cleaner process (currently able to delete the
logical replication snapshot files; if required, it can be extended to do
other tasks as well).  I don't mind if it gets rejected.  Please have a
look.

Regards,
Bharath Rupireddy.
Attachment
On 12/6/21, 3:44 AM, "Bharath Rupireddy"
<bharath.rupireddyforpostgres@gmail.com> wrote:
> On Fri, Dec 3, 2021 at 11:50 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> I might hack something together for the separate worker approach, if
>> for no other reason than to make sure I really understand how these
>> functions work. If/when a better idea emerges, we can alter course.
>
> Thanks. As I said upthread we've been discussing the approach of
> offloading some of the checkpoint tasks like (deleting snapshot files)
> internally for quite some time and I would like to share a patch that
> adds a new background cleaner process (currently able to delete the
> logical replication snapshot files, if required can be extended to do
> other tasks as well). I don't mind if it gets rejected. Please have a
> look.

Thanks for sharing!  I've also spent some time on a patch set, which I
intend to share once I have handling for all four tasks (so far I have
handling for CheckPointSnapBuild() and RemovePgTempFiles()).  I'll take a
look at your patch as well.

Nathan
On 12/6/21, 11:23 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> Thanks for sharing! I've also spent some time on a patch set, which I
> intend to share once I have handling for all four tasks (so far I have
> handling for CheckPointSnapBuild() and RemovePgTempFiles()). I'll
> take a look at your patch as well.

Well, I haven't had a chance to look at your patch, and my patch set
still only has handling for CheckPointSnapBuild() and
RemovePgTempFiles(), but I thought I'd share what I have anyway.  I split
it into 5 patches:

0001 - Adds a new "custodian" auxiliary process that does nothing.
0002 - During startup, remove the pgsql_tmp directories instead of only
       clearing the contents.
0003 - Split temporary file cleanup during startup into two stages.  The
       first renames the directories, and the second clears them.
0004 - Moves the second stage from 0003 to the custodian process.
0005 - Moves CheckPointSnapBuild() to the custodian process.

This is still very much a work in progress, and I've done minimal testing
so far.

Nathan
Attachment
On Fri, Dec 10, 2021 at 2:03 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> Well, I haven't had a chance to look at your patch, and my patch set
> still only has handling for CheckPointSnapBuild() and
> RemovePgTempFiles(), but I thought I'd share what I have anyway. I
> split it into 5 patches:
>
> 0001 - Adds a new "custodian" auxiliary process that does nothing.
> 0002 - During startup, remove the pgsql_tmp directories instead of
>        only clearing the contents.
> 0003 - Split temporary file cleanup during startup into two stages.
>        The first renames the directories, and the second clears them.
> 0004 - Moves the second stage from 0003 to the custodian process.
> 0005 - Moves CheckPointSnapBuild() to the custodian process.

I don't know whether this kind of idea is good or not.

One thing we've seen a number of times now is that entrusting the same
process with multiple responsibilities often ends poorly.  Sometimes it's
busy with one thing when another thing really needs to be done RIGHT NOW.
Perhaps that won't be an issue here since all of these things are related
to checkpointing, but then the process name should reflect that rather
than making it sound like we can just keep piling more responsibilities
onto this process indefinitely.  At some point that seems bound to become
an issue.

Another issue is that we don't want to increase the number of processes
without bound.  Processes use memory and CPU resources and if we run too
many of them it becomes a burden on the system.  Low-end systems may not
have too many resources in total, and high-end systems can struggle to
fit demanding workloads within the resources that they have.  Maybe it
would be cheaper to do more things at once if we were using threads
rather than processes, but that day still seems fairly far off.

But against all that, if these tasks are slowing down checkpoints and
that's avoidable, that seems pretty important too.  Interestingly, I
can't say that I've ever seen any of these things be a problem for
checkpoint or startup speed.  I wonder why you've had a different
experience.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 12/13/21, 5:54 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> I don't know whether this kind of idea is good or not.

Thanks for chiming in.  I have an almost-complete patch set that I'm
hoping to post to the lists in the next couple of days.

> One thing we've seen a number of times now is that entrusting the same
> process with multiple responsibilities often ends poorly. Sometimes
> it's busy with one thing when another thing really needs to be done
> RIGHT NOW. Perhaps that won't be an issue here since all of these
> things are related to checkpointing, but then the process name should
> reflect that rather than making it sound like we can just keep piling
> more responsibilities onto this process indefinitely. At some point
> that seems bound to become an issue.

Two of the tasks are cleanup tasks that checkpointing handles at the
moment, and two are cleanup tasks that are done at startup.  For now, all
of these tasks are somewhat nonessential.  There's no requirement that
any of these tasks complete in order to finish startup or checkpointing.
In fact, outside of preventing the server from running out of disk space,
I don't think there's any requirement that these tasks run at all.  IMO
this would have to be a core tenet of a new auxiliary process like this.

That being said, I totally understand your point.  If there were a dozen
such tasks handled by a single auxiliary process, that could cause a new
set of problems.  Your checkpointing and startup might be fast, but you
might run out of disk space because our cleanup process can't handle it
all.  So a new worker could end up becoming an availability risk as well.

> Another issue is that we don't want to increase the number of
> processes without bound. Processes use memory and CPU resources and if
> we run too many of them it becomes a burden on the system. Low-end
> systems may not have too many resources in total, and high-end systems
> can struggle to fit demanding workloads within the resources that they
> have. Maybe it would be cheaper to do more things at once if we were
> using threads rather than processes, but that day still seems fairly
> far off.

I do agree that it is important to be very careful about adding new
processes, and if a better idea for how to handle these tasks emerges, I
will readily abandon my current approach.  Upthread, Andres mentioned
optimizing unnecessary snapshot files, and I mentioned possibly limiting
how much time startup and checkpoints spend on these tasks.  I don't have
too many details for the former, and for the latter, I'm worried about
not being able to keep up.  But if the prospect of adding a new auxiliary
process for this stuff is a non-starter, perhaps I should explore that
approach some more.

> But against all that, if these tasks are slowing down checkpoints and
> that's avoidable, that seems pretty important too. Interestingly, I
> can't say that I've ever seen any of these things be a problem for
> checkpoint or startup speed. I wonder why you've had a different
> experience.

Yeah, it's difficult for me to justify why users should suffer long
periods of downtime because startup or checkpointing is taking a very
long time doing things that are arguably unrelated to startup and
checkpointing.

Nathan
On 12/13/21, 9:20 AM, "Justin Pryzby" <pryzby@telsasoft.com> wrote:
> On Mon, Dec 13, 2021 at 08:53:37AM -0500, Robert Haas wrote:
>> Another issue is that we don't want to increase the number of
>> processes without bound. Processes use memory and CPU resources and if
>> we run too many of them it becomes a burden on the system. Low-end
>> systems may not have too many resources in total, and high-end systems
>> can struggle to fit demanding workloads within the resources that they
>> have. Maybe it would be cheaper to do more things at once if we were
>> using threads rather than processes, but that day still seems fairly
>> far off.
>
> Maybe that's an argument that this should be a dynamic background worker
> instead of an auxilliary process. Then maybe it would be controlled by
> max_parallel_maintenance_workers (or something similar). The checkpointer
> would need to do these tasks itself if parallel workers were disabled or
> couldn't be launched.

I think this is an interesting idea.  I dislike the prospect of having
two code paths for all this stuff, but if it addresses the concerns about
resource usage, maybe it's worth it.

Nathan
On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote: > > But against all that, if these tasks are slowing down checkpoints and > > that's avoidable, that seems pretty important too. Interestingly, I > > can't say that I've ever seen any of these things be a problem for > > checkpoint or startup speed. I wonder why you've had a different > > experience. > > Yeah, it's difficult for me to justify why users should suffer long > periods of downtime because startup or checkpointing is taking a very > long time doing things that are arguably unrelated to startup and > checkpointing. Well sure. But I've never actually seen that happen. -- Robert Haas EDB: http://www.enterprisedb.com
On 12/13/21, 12:37 PM, "Robert Haas" <robertmhaas@gmail.com> wrote: > On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote: >> > But against all that, if these tasks are slowing down checkpoints and >> > that's avoidable, that seems pretty important too. Interestingly, I >> > can't say that I've ever seen any of these things be a problem for >> > checkpoint or startup speed. I wonder why you've had a different >> > experience. >> >> Yeah, it's difficult for me to justify why users should suffer long >> periods of downtime because startup or checkpointing is taking a very >> long time doing things that are arguably unrelated to startup and >> checkpointing. > > Well sure. But I've never actually seen that happen. I'll admit that surprises me. As noted elsewhere [0], we were seeing this enough with pgsql_tmp that we started moving the directory aside before starting the server. Discussions about handling this usually prompt questions about why there are so many temporary files in the first place (which is fair). FWIW all four functions noted in my original message [1] are things I've seen firsthand affecting users. Nathan [0] https://postgr.es/m/E7573D54-A8C9-40A8-89D7-0596A36ED124%40amazon.com [1] https://postgr.es/m/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com
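For illustration, the move-aside trick described here amounts to something like the following sketch. Everything below (paths, names, plain POSIX calls) is invented for the example; it is not what PostgreSQL or any released tooling actually does:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical sketch: rename a temp directory out of the way and recreate
 * an empty one, so that startup doesn't have to unlink every file first.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
static int
move_tmpdir_aside(const char *path)
{
	char		staged[1024];

	snprintf(staged, sizeof(staged), "%s.old", path);
	if (rename(path, staged) != 0)
		return -1;
	return mkdir(path, 0700);
}
```

The rename is O(1) regardless of how many files the directory holds; the staged "*.old" directory can then be deleted at leisure by some background process.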
On Mon, Dec 13, 2021 at 11:05:46PM +0000, Bossart, Nathan wrote: > On 12/13/21, 12:37 PM, "Robert Haas" <robertmhaas@gmail.com> wrote: > > On Mon, Dec 13, 2021 at 1:21 PM Bossart, Nathan <bossartn@amazon.com> wrote: > >> > But against all that, if these tasks are slowing down checkpoints and > >> > that's avoidable, that seems pretty important too. Interestingly, I > >> > can't say that I've ever seen any of these things be a problem for > >> > checkpoint or startup speed. I wonder why you've had a different > >> > experience. > >> > >> Yeah, it's difficult for me to justify why users should suffer long > >> periods of downtime because startup or checkpointing is taking a very > >> long time doing things that are arguably unrelated to startup and > >> checkpointing. > > > > Well sure. But I've never actually seen that happen. > > I'll admit that surprises me. As noted elsewhere [0], we were seeing > this enough with pgsql_tmp that we started moving the directory aside > before starting the server. Discussions about handling this usually > prompt questions about why there are so many temporary files in the > first place (which is fair). FWIW all four functions noted in my > original message [1] are things I've seen firsthand affecting users. Have we changed temporary file handling in any recent major releases, meaning is this a current problem or one already improved in PG 14. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On 12/14/21, 9:00 AM, "Bruce Momjian" <bruce@momjian.us> wrote: > Have we changed temporary file handling in any recent major releases, > meaning is this a current problem or one already improved in PG 14. I haven't noticed any recent improvements while working in this area. Nathan
On 12/14/21, 12:09 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > On 12/14/21, 9:00 AM, "Bruce Momjian" <bruce@momjian.us> wrote: >> Have we changed temporary file handling in any recent major releases, >> meaning is this a current problem or one already improved in PG 14. > > I haven't noticed any recent improvements while working in this area. On second thought, the addition of the remove_temp_files_after_crash GUC is arguably an improvement since it could prevent files from accumulating after repeated crashes. Nathan
On 12/13/21, 10:21 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote: > Thanks for chiming in. I have an almost-complete patch set that I'm > hoping to post to the lists in the next couple of days. As promised, here is v2. This patch set includes handling for all four tasks noted upthread. I'd still consider this a work-in-progress, as I've done minimal testing. At the very least, it should demonstrate what an auxiliary process approach might look like. Nathan
Attachment
- v2-0001-Introduce-custodian.patch
- v2-0008-Move-removal-of-spilled-logical-slot-data-to-cust.patch
- v2-0007-Use-syncfs-in-CheckPointLogicalRewriteHeap-for-sh.patch
- v2-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch
- v2-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch
- v2-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v2-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v2-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
Hi, On 2021-12-14 20:23:57 +0000, Bossart, Nathan wrote: > As promised, here is v2. This patch set includes handling for all > four tasks noted upthread. I'd still consider this a work-in-progress, > as I've done minimal testing. At the very least, it should > demonstrate what an auxiliary process approach might look like. This generates a compiler warning: https://cirrus-ci.com/task/5740581082103808?logs=mingw_cross_warning#L378 Greetings, Andres Freund
On Mon, Jan 3, 2022 at 2:56 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2021-12-14 20:23:57 +0000, Bossart, Nathan wrote: > > As promised, here is v2. This patch set includes handling for all > > four tasks noted upthread. I'd still consider this a work-in-progress, > > as I've done minimal testing. At the very least, it should > > demonstrate what an auxiliary process approach might look like. > > This generates a compiler warning: > https://cirrus-ci.com/task/5740581082103808?logs=mingw_cross_warning#L378 > Somehow, I am not getting these compiler warnings on the latest master head (69872d0bbe6). Here are a few minor comments on the v2 version that I thought would help: + * Copyright (c) 2021, PostgreSQL Global Development Group Time to change the year :) -- + + /* These operations are really just a minimal subset of + * AbortTransaction(). We don't have very many resources to worry + * about. + */ Incorrect formatting, the first line should be empty in the multiline code comment. -- + XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove older mappings */ + XLogRecPtr logical_rewrite_mappings_cutoff_set; It looks like logical_rewrite_mappings_cutoff gets set only once and never gets reset. If that is true, then I think the logical_rewrite_mappings_cutoff_set variable can be skipped completely; setting the initial logical_rewrite_mappings_cutoff to InvalidXLogRecPtr will do the needful. -- Regards, Amul
Thanks for your review. On 1/2/22, 11:00 PM, "Amul Sul" <sulamul@gmail.com> wrote: > On Mon, Jan 3, 2022 at 2:56 AM Andres Freund <andres@anarazel.de> wrote: >> This generates a compiler warning: >> https://cirrus-ci.com/task/5740581082103808?logs=mingw_cross_warning#L378 > > Somehow, I am not getting these compiler warnings on the latest master > head (69872d0bbe6). I attempted to fix this by including time.h in custodian.c. > Here are a few minor comments on the v2 version that I thought would help: > > + * Copyright (c) 2021, PostgreSQL Global Development Group > > Time to change the year :) Fixed in v3. > + > + /* These operations are really just a minimal subset of > + * AbortTransaction(). We don't have very many resources to worry > + * about. > + */ > > Incorrect formatting, the first line should be empty in the multiline > code comment. Fixed in v3. > + XLogRecPtr logical_rewrite_mappings_cutoff; /* can remove > older mappings */ > + XLogRecPtr logical_rewrite_mappings_cutoff_set; > > It looks like logical_rewrite_mappings_cutoff gets set only once and > never gets reset. If that is true, then I think the > logical_rewrite_mappings_cutoff_set variable can be skipped completely; > setting the initial logical_rewrite_mappings_cutoff to InvalidXLogRecPtr > will do the needful. I think the problem with this is that when the cutoff is InvalidXLogRecPtr, it is taken to mean that all logical rewrite files can be removed. If we just used the cutoff variable, we could remove files we need if the custodian ran before the cutoff was set. I suppose we could initially set the cutoff to MaxXLogRecPtr to indicate that the value is not yet set, but I see no real advantage to doing it that way versus just using a bool. Speaking of which, logical_rewrite_mappings_cutoff_set obviously should be a bool. I've fixed that in v3. Nathan
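For illustration, the shape being discussed here (a cutoff plus a separate "is set" flag, rather than a sentinel alone) might look roughly like the sketch below. The types and names are simplified stand-ins for the actual patch's shared state:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for XLogRecPtr and InvalidXLogRecPtr. */
typedef uint64_t DemoRecPtr;
#define DEMO_INVALID_PTR ((DemoRecPtr) 0)

typedef struct DemoSharedState
{
	DemoRecPtr	mappings_cutoff;		/* files older than this are removable */
	bool		mappings_cutoff_set;	/* has the cutoff been published yet? */
} DemoSharedState;

/*
 * May we remove a file whose LSN is file_lsn?  Until the cutoff has been
 * published, the only safe answer is "no"; once published, an invalid
 * cutoff is taken to mean that everything may go.
 */
static bool
can_remove_mapping(const DemoSharedState *state, DemoRecPtr file_lsn)
{
	if (!state->mappings_cutoff_set)
		return false;
	if (state->mappings_cutoff == DEMO_INVALID_PTR)
		return true;
	return file_lsn < state->mappings_cutoff;
}
```

With only the sentinel value, a reader that ran before the writer published the cutoff would see "invalid" and wrongly conclude that every file is removable, which is exactly the hazard described above.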
Attachment
- v3-0008-Move-removal-of-spilled-logical-slot-data-to-cust.patch
- v3-0007-Use-syncfs-in-CheckPointLogicalRewriteHeap-for-sh.patch
- v3-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch
- v3-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch
- v3-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v3-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v3-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
- v3-0001-Introduce-custodian.patch
The code seems to be in good condition. All the tests are running ok with no errors.
I like the whole idea of shifting additional checkpointer jobs as much as possible to another worker. In my view, it is more appropriate to call this worker "bg cleaner" or "bg file cleaner" or something similar.
It could be useful for systems with high load, which may deal with deleting many files at once, but I'm not sure about "small" installations. An extra bg worker needs more resources to do occasional deletion of small numbers of files. I really do not know how to do it better; maybe have two different code paths switched by a GUC?
Should we also think about adding WAL preallocation into the custodian worker from the patch "Pre-allocating WAL files" [1]?
Best regards,
Maxim Orlov.
On 1/14/22, 3:43 AM, "Maxim Orlov" <orlovmg@gmail.com> wrote: > The code seems to be in good condition. All the tests are running ok with no errors. Thanks for your review. > I like the whole idea of shifting additional checkpointer jobs as much as possible to another worker. In my view, it is more appropriate to call this worker "bg cleaner" or "bg file cleaner" or something similar. > > It could be useful for systems with high load, which may deal with deleting many files at once, but I'm not sure about "small" installations. An extra bg worker needs more resources to do occasional deletion of small numbers of files. I really do not know how to do it better; maybe have two different code paths switched by a GUC? I'd personally like to avoid creating two code paths for the same thing. Are there really cases when this one extra auxiliary process would be too many? And if so, how would a user know when to adjust this GUC? I understand the point that we should introduce new processes sparingly to avoid burdening low-end systems, but I don't think we should be afraid to add new ones when it is needed. That being said, if making the extra worker optional addresses the concerns about resource usage, maybe we should consider it. Justin suggested using something like max_parallel_maintenance_workers upthread [0]. > Should we also think about adding WAL preallocation into the custodian worker from the patch "Pre-allocating WAL files" [1]? This was brought up in the pre-allocation thread [1]. I don't think the custodian process would be the right place for it, and I'm also not as concerned about it because it will generally be a small, fixed, and configurable amount of work. In any case, I don't sense a ton of support for a new auxiliary process in this thread, so I'm hesitant to go down the same path for pre-allocation. Nathan [0] https://postgr.es/m/20211213171935.GX17618%40telsasoft.com [1] https://postgr.es/m/B2ACCC5A-F9F2-41D9-AC3B-251362A0A254%40amazon.com
On Sat, Jan 15, 2022 at 12:46 AM Bossart, Nathan <bossartn@amazon.com> wrote: > > On 1/14/22, 3:43 AM, "Maxim Orlov" <orlovmg@gmail.com> wrote: > > The code seems to be in good condition. All the tests are running ok with no errors. > > Thanks for your review. > > > I like the whole idea of shifting additional checkpointer jobs as much as possible to another worker. In my view, it is more appropriate to call this worker "bg cleaner" or "bg file cleaner" or something similar. I personally prefer "background cleaner" as the new process name, in line with "background writer" and "background worker". > > It could be useful for systems with high load, which may deal with deleting many files at once, but I'm not sure about "small" installations. An extra bg worker needs more resources to do occasional deletion of small numbers of files. I really do not know how to do it better; maybe have two different code paths switched by a GUC? > > I'd personally like to avoid creating two code paths for the same > thing. Are there really cases when this one extra auxiliary process > would be too many? And if so, how would a user know when to adjust > this GUC? I understand the point that we should introduce new > processes sparingly to avoid burdening low-end systems, but I don't > think we should be afraid to add new ones when it is needed. IMO, having a GUC for enabling/disabling this new worker and its related code would be a better idea. The reason is that if postgres has no replication slots at all (which is quite possible in real stand-alone production environments) or if the file enumeration (directory traversal and file removal) is fast enough on the servers, there's no point having this new worker; the checkpointer itself can take care of the work as it is doing today. > That being said, if making the extra worker optional addresses the > concerns about resource usage, maybe we should consider it. Justin > suggested using something like max_parallel_maintenance_workers > upthread [0].
I don't think this new process should be controlled by max_parallel_maintenance_workers; instead, I prefer to have it as an auxiliary process much like "background writer", "wal writer" and so on. I think now is the time for us to run some use cases and get the perf reports to see how beneficial this new process is going to be in terms of improving the checkpoint timings. > > Should we also think about adding WAL preallocation into the custodian worker from the patch "Pre-allocating WAL files" [1]? > > This was brought up in the pre-allocation thread [1]. I don't think > the custodian process would be the right place for it, and I'm also > not as concerned about it because it will generally be a small, fixed, > and configurable amount of work. In any case, I don't sense a ton of > support for a new auxiliary process in this thread, so I'm hesitant to > go down the same path for pre-allocation. > > [0] https://postgr.es/m/20211213171935.GX17618%40telsasoft.com > [1] https://postgr.es/m/B2ACCC5A-F9F2-41D9-AC3B-251362A0A254%40amazon.com I think the idea of weaving every non-critical task into a common background process is a good idea, but let's not mix that up with the new background cleaner process here for now, at least until we get some numbers and prove that the idea proposed here will be beneficial. Regards, Bharath Rupireddy.
On 1/14/22, 11:26 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote: > On Sat, Jan 15, 2022 at 12:46 AM Bossart, Nathan <bossartn@amazon.com> wrote: >> I'd personally like to avoid creating two code paths for the same >> thing. Are there really cases when this one extra auxiliary process >> would be too many? And if so, how would a user know when to adjust >> this GUC? I understand the point that we should introduce new >> processes sparingly to avoid burdening low-end systems, but I don't >> think we should be afraid to add new ones when it is needed. > > IMO, having a GUC for enabling/disabling this new worker and its > related code would be a better idea. The reason is that if > postgres has no replication slots at all (which is quite possible in > real stand-alone production environments) or if the file enumeration > (directory traversal and file removal) is fast enough on the servers, > there's no point having this new worker; the checkpointer itself can > take care of the work as it is doing today. IMO introducing a GUC wouldn't be doing users many favors. Their cluster might work just fine for a long time before they begin encountering problems during startups/checkpoints. Once the user discovers the underlying reason, they have to then find a GUC for enabling a special background worker that makes this problem go away. Why not just fix the problem for everybody by default? I've been thinking about what other approaches we could take besides creating more processes. The root of the problem seems to be that there are a number of tasks that are performed synchronously that can take a long time. The process approach essentially makes these tasks asynchronous so that they do not block startup and checkpointing. But perhaps this can be done in an existing process, possibly even the checkpointer.
Like the current WAL pre-allocation patch, we could do this work when the checkpointer isn't checkpointing, and we could also do small amounts of work in CheckpointWriteDelay() (or a new function called in a similar way). In theory, this would help avoid delaying checkpoints too long while doing cleanup at every opportunity to lower the chances it falls far behind. Nathan
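As a rough sketch of that idea, a cleanup routine could be written to do a bounded amount of work per invocation, so that a caller such as a checkpoint-delay hook never stalls on it for long. Everything below (the helper name, the limit, plain POSIX calls) is illustrative only, not code from the patch set:

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Remove up to max_files regular entries from dirpath; return how many
 * were removed.  A return of 0 means the directory is already clean (or
 * could not be opened).  Callers can invoke this repeatedly in small
 * increments, e.g. once per CheckpointWriteDelay() cycle.
 */
static int
cleanup_some_files(const char *dirpath, int max_files)
{
	DIR		   *dir;
	struct dirent *de;
	int			removed = 0;

	dir = opendir(dirpath);
	if (dir == NULL)
		return 0;

	while (removed < max_files && (de = readdir(dir)) != NULL)
	{
		char		path[1024];

		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
			continue;
		snprintf(path, sizeof(path), "%s/%s", dirpath, de->d_name);
		if (remove(path) == 0)
			removed++;
	}
	closedir(dir);
	return removed;
}
```

The bound keeps each call cheap; the trade-off, as noted above, is that cleanup might not keep up if files accumulate faster than the increments can remove them.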
Here is a rebased patch set. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
- v4-0001-Introduce-custodian.patch
- v4-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
- v4-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v4-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v4-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch
- v4-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch
- v4-0007-Use-syncfs-in-CheckPointLogicalRewriteHeap-for-sh.patch
- v4-0008-Move-removal-of-spilled-logical-slot-data-to-cust.patch
Here is another rebased patch set. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
- v5-0001-Introduce-custodian.patch
- v5-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
- v5-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v5-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v5-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch
- v5-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch
- v5-0007-Use-syncfs-in-CheckPointLogicalRewriteHeap-for-sh.patch
- v5-0008-Move-removal-of-spilled-logical-slot-data-to-cust.patch
Hi, On 2022-02-16 16:50:57 -0800, Nathan Bossart wrote: > + * The custodian process is new as of Postgres 15. I think this kind of comment tends to age badly and not be very useful. > It's main purpose is to > + * offload tasks that could otherwise delay startup and checkpointing, but > + * it needn't be restricted to just those things. Offloaded tasks should > + * not be synchronous (e.g., checkpointing shouldn't need to wait for the > + * custodian to complete a task before proceeding). Also, ensure that any > + * offloaded tasks are either not required during single-user mode or are > + * performed separately during single-user mode. > + * > + * The custodian is not an essential process and can shutdown quickly when > + * requested. The custodian will wake up approximately once every 5 > + * minutes to perform its tasks, but backends can (and should) set its > + * latch to wake it up sooner. Hm. This kind of policy makes it easy to introduce bugs where the occasional runs mask forgotten notifications etc. > + * Normal termination is by SIGTERM, which instructs the bgwriter to > + * exit(0). s/bgwriter/.../ > Emergency termination is by SIGQUIT; like any backend, the > + * custodian will simply abort and exit on SIGQUIT. > + * > + * If the custodian exits unexpectedly, the postmaster treats that the same > + * as a backend crash: shared memory may be corrupted, so remaining > + * backends should be killed by SIGQUIT and then a recovery cycle started. This doesn't really seem useful stuff to me. > + /* > + * If an exception is encountered, processing resumes here. > + * > + * You might wonder why this isn't coded as an infinite loop around a > + * PG_TRY construct. The reason is that this is the bottom of the > + * exception stack, and so with PG_TRY there would be no exception handler > + * in force at all during the CATCH part. By leaving the outermost setjmp > + * always active, we have at least some chance of recovering from an error > + * during error recovery.
(If we get into an infinite loop thereby, it > + * will soon be stopped by overflow of elog.c's internal state stack.) > + * > + * Note that we use sigsetjmp(..., 1), so that the prevailing signal mask > + * (to wit, BlockSig) will be restored when longjmp'ing to here. Thus, > + * signals other than SIGQUIT will be blocked until we complete error > + * recovery. It might seem that this policy makes the HOLD_INTERRUPS() > + * call redundant, but it is not since InterruptPending might be set > + * already. > + */ I think it's bad to copy this comment into even more places. > + /* Since not using PG_TRY, must reset error stack by hand */ > + if (sigsetjmp(local_sigjmp_buf, 1) != 0) > + { I also think it's a bad idea to introduce even more copies of the error handling body. I think we need to unify this. And yes, it's unfair to stick you with it, but it's been a while since a new aux process has been added. > + /* > + * These operations are really just a minimal subset of > + * AbortTransaction(). We don't have very many resources to worry > + * about. > + */ Given what you're proposing this for, are you actually confident that we don't need more than this? > From d9826f75ad2259984d55fc04622f0b91ebbba65a Mon Sep 17 00:00:00 2001 > From: Nathan Bossart <bossartn@amazon.com> > Date: Sun, 5 Dec 2021 19:38:20 -0800 > Subject: [PATCH v5 2/8] Also remove pgsql_tmp directories during startup. > > Presently, the server only removes the contents of the temporary > directories during startup, not the directory itself. This changes > that to prepare for future commits that will move temporary file > cleanup to a separate auxiliary process. Is this actually safe? Is there a guarantee no process can access a temp table stored in one of these? Because without WAL guaranteeing consistency, we can't just access e.g. temp tables written before a crash. 
> +extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok, > + bool unlink_all); I don't like functions with multiple consecutive booleans, they tend to get swapped around. Why not just split unlink_all=true/false into different functions? > Subject: [PATCH v5 3/8] Split pgsql_tmp cleanup into two stages. > > First, pgsql_tmp directories will be renamed to stage them for > removal. What if the target name already exists? > Then, all files in pgsql_tmp are removed before removing > the staged directories themselves. This change is being made in > preparation for a follow-up change to offload most temporary file > cleanup to the new custodian process. > > Note that temporary relation files cannot be cleaned up via the > aforementioned strategy and will not be offloaded to the custodian. This should be in the prior commit message, otherwise people will ask the same question as I did. > + /* > + * Find a name for the stage directory. We just increment an integer at the > + * end of the name until we find one that doesn't exist. > + */ > + for (int n = 0; n <= INT_MAX; n++) > + { > + snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path, > + PG_TEMP_DIR_TO_REMOVE_PREFIX, n); Uninterruptible loops up to INT_MAX do not seem like a good idea. > + dir = AllocateDir(stage_path); > + if (dir == NULL) > + { Why not just use stat()? That's cheaper, and there's no time-to-check-time-to-use issue here, we're the only one writing. > + if (errno == ENOENT) > + break; > + > + ereport(LOG, > + (errcode_for_file_access(), > + errmsg("could not open directory \"%s\": %m", > + stage_path))); I think this kind of lenience is just hiding bugs. 
> File > PathNameCreateTemporaryFile(const char *path, bool error_on_failure) > @@ -3175,7 +3178,8 @@ RemovePgTempFiles(bool stage, bool remove_relation_files) > */ > spc_dir = AllocateDir("pg_tblspc"); > > - while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL) > + while (!ShutdownRequestPending && > + (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL) Uh, huh? It strikes me as a supremely bad idea to have functions *silently* not do their jobs when ShutdownRequestPending is set, particularly without a huge fat comment. > { > if (strcmp(spc_de->d_name, ".") == 0 || > strcmp(spc_de->d_name, "..") == 0) > @@ -3211,6 +3215,14 @@ RemovePgTempFiles(bool stage, bool remove_relation_files) > * would create a race condition. It's done separately, earlier in > * postmaster startup. > */ > + > + /* > + * If we just staged some pgsql_tmp directories for removal, wake up the > + * custodian process so that it deletes all the files in the staged > + * directories as well as the directories themselves. > + */ > + if (stage && ProcGlobal->custodianLatch) > + SetLatch(ProcGlobal->custodianLatch); Just signalling without letting the custodian know what it's expected to do strikes me as a bad idea. > From 9c2013d53cc5c857ef8aca3df044613e66215aee Mon Sep 17 00:00:00 2001 > From: Nathan Bossart <bossartn@amazon.com> > Date: Sun, 5 Dec 2021 22:02:40 -0800 > Subject: [PATCH v5 5/8] Move removal of old serialized snapshots to custodian. > > This was only done during checkpoints because it was a convenient > place to put it. However, if there are many snapshots to remove, > it can significantly extend checkpoint time. To avoid this, move > this work to the newly-introduced custodian process. 
--- > src/backend/access/transam/xlog.c | 2 -- > src/backend/postmaster/custodian.c | 11 +++++++++++ > src/backend/replication/logical/snapbuild.c | 13 +++++++------ > src/include/replication/snapbuild.h | 2 +- > 4 files changed, 19 insertions(+), 9 deletions(-) Why does this not open us up to new xid wraparound issues? Before there was a hard bound on how long these files could linger around. Now there's not anymore. > - while ((snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL) > + while (!ShutdownRequestPending && > + (snap_de = ReadDir(snap_dir, "pg_logical/snapshots")) != NULL) I really really strenuously object to these checks. > Subject: [PATCH v5 6/8] Move removal of old logical rewrite mapping files to > custodian. > If there are many such files to remove, checkpoints can take much > longer. To avoid this, move this work to the newly-introduced > custodian process. Same wraparound concerns. > +#include "postmaster/bgwriter.h" I think it's a bad idea to put these functions into bgwriter.h > From cfca62dd55d7be7e0025e5625f18d3ab9180029c Mon Sep 17 00:00:00 2001 > From: Nathan Bossart <bossartn@amazon.com> > Date: Mon, 13 Dec 2021 20:20:12 -0800 > Subject: [PATCH v5 7/8] Use syncfs() in CheckPointLogicalRewriteHeap() for > shutdown and end-of-recovery checkpoints. > > This may save quite a bit of time when there are many mapping files > to flush to disk. Seems like a mostly independent proposal. > +#ifdef HAVE_SYNCFS > + > + /* > + * If we are doing a shutdown or end-of-recovery checkpoint, let's use > + * syncfs() to flush the mappings to disk instead of flushing each one > + * individually. This may save us quite a bit of time when there are many > + * such files to flush. > + */ I am doubtful this is a good idea. This will cause all dirty files to be written back, even ones we don't need to be written back. At once. Very possibly *slowing down* the shutdown. What is even the theory of the case here?
That there's so many dirty mapping files that fsyncing them will take too long? That iterating would take too long? > From b5923b1b76a1fab6c21d6aec086219160473f464 Mon Sep 17 00:00:00 2001 > From: Nathan Bossart <nathandbossart@gmail.com> > Date: Fri, 11 Feb 2022 09:43:57 -0800 > Subject: [PATCH v5 8/8] Move removal of spilled logical slot data to > custodian. > > If there are many such files, startup can take much longer than > necessary. To handle this, startup creates a new slot directory, > copies the state file, and swaps the new directory with the old > one. The custodian then asynchronously cleans up the old slot > directory. You guessed it: I don't see what prevents wraparound issues. > 5 files changed, 317 insertions(+), 9 deletions(-) This seems like such an increase in complexity and fragility that I really doubt this is a good idea. > +/* > + * This function renames the given directory with a special suffix that the > + * custodian will know to look for. An integer is appended to the end of the > + * new directory name in case previously staged slot directories have not yet > + * been removed. > + */ > +static void > +StageSlotDirForRemoval(const char *slotname, const char *slotpath) > +{ > + char stage_path[MAXPGPATH]; > + > + /* > + * Find a name for the stage directory. We just increment an integer at the > + * end of the name until we find one that doesn't exist. > + */ > + for (int n = 0; n <= INT_MAX; n++) > + { > + DIR *dir; > + > + sprintf(stage_path, "pg_replslot/%s.to_remove_%d", slotname, n); > + > + dir = AllocateDir(stage_path); > + if (dir == NULL) > + { > + if (errno == ENOENT) > + break; > + > + ereport(ERROR, > + (errcode_for_file_access(), > + errmsg("could not open directory \"%s\": %m", > + stage_path))); > + } > + FreeDir(dir); > + > + stage_path[0] = '\0'; > + } Copy of "find free name" logic. Greetings, Andres Freund
Hi Andres, I appreciate the feedback. On Wed, Feb 16, 2022 at 05:50:52PM -0800, Andres Freund wrote: >> + /* Since not using PG_TRY, must reset error stack by hand */ >> + if (sigsetjmp(local_sigjmp_buf, 1) != 0) >> + { > > I also think it's a bad idea to introduce even more copies of the error > handling body. I think we need to unify this. And yes, it's unfair to stick > you with it, but it's been a while since a new aux process has been added. +1, I think this is useful refactoring. I might spin this off to its own thread. >> + /* >> + * These operations are really just a minimal subset of >> + * AbortTransaction(). We don't have very many resources to worry >> + * about. >> + */ > > Given what you're proposing this for, are you actually confident that we don't > need more than this? I will give this a closer look. >> +extern void RemovePgTempDir(const char *tmpdirname, bool missing_ok, >> + bool unlink_all); > > I don't like functions with multiple consecutive booleans, they tend to get > swapped around. Why not just split unlink_all=true/false into different > functions? Will do. >> Subject: [PATCH v5 3/8] Split pgsql_tmp cleanup into two stages. >> >> First, pgsql_tmp directories will be renamed to stage them for >> removal. > > What if the target name already exists? The integer at the end of the target name is incremented until we find a unique name. >> Note that temporary relation files cannot be cleaned up via the >> aforementioned strategy and will not be offloaded to the custodian. > > This should be in the prior commit message, otherwise people will ask the same > question as I did. Will do. >> + /* >> + * Find a name for the stage directory. We just increment an integer at the >> + * end of the name until we find one that doesn't exist. 
>> + */ >> + for (int n = 0; n <= INT_MAX; n++) >> + { >> + snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path, >> + PG_TEMP_DIR_TO_REMOVE_PREFIX, n); > > Uninterruptible loops up to INT_MAX do not seem like a good idea. I modeled this after ChooseRelationName() in indexcmds.c. Looking again, I see that it loops forever until a unique name is found. I suspect this is unlikely to be a problem in practice. What strategy would you recommend for choosing a unique name? Should we just append a couple of random characters? >> + dir = AllocateDir(stage_path); >> + if (dir == NULL) >> + { > > Why not just use stat()? That's cheaper, and there's no > time-to-check-time-to-use issue here, we're the only one writing. I'm not sure why I didn't use stat(). I will update this. >> - while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL) >> + while (!ShutdownRequestPending && >> + (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL) > > Uh, huh? It strikes me as a supremely bad idea to have functions *silently* > not do their jobs when ShutdownRequestPending is set, particularly without a > huge fat comment. The idea was to avoid delaying shutdown because we're waiting for the custodian to finish relatively nonessential tasks. Another option might be to just exit immediately when the custodian receives a shutdown request. >> + /* >> + * If we just staged some pgsql_tmp directories for removal, wake up the >> + * custodian process so that it deletes all the files in the staged >> + * directories as well as the directories themselves. >> + */ >> + if (stage && ProcGlobal->custodianLatch) >> + SetLatch(ProcGlobal->custodianLatch); > > Just signalling without letting the custodian know what it's expected to do > strikes me as a bad idea. Good point. I will work on that. 
>> From 9c2013d53cc5c857ef8aca3df044613e66215aee Mon Sep 17 00:00:00 2001 >> From: Nathan Bossart <bossartn@amazon.com> >> Date: Sun, 5 Dec 2021 22:02:40 -0800 >> Subject: [PATCH v5 5/8] Move removal of old serialized snapshots to custodian. >> >> This was only done during checkpoints because it was a convenient >> place to put it. However, if there are many snapshots to remove, >> it can significantly extend checkpoint time. To avoid this, move >> this work to the newly-introduced custodian process. >> --- >> src/backend/access/transam/xlog.c | 2 -- >> src/backend/postmaster/custodian.c | 11 +++++++++++ >> src/backend/replication/logical/snapbuild.c | 13 +++++++------ >> src/include/replication/snapbuild.h | 2 +- >> 4 files changed, 19 insertions(+), 9 deletions(-) > > Why does this not open us up to new xid wraparound issues? Before there was a > hard bound on how long these files could linger around. Now there's not > anymore. Sorry, I'm probably missing something obvious, but I'm not sure how this adds transaction ID wraparound risk. These files are tied to LSNs, and AFAIK they won't impact slots' xmins. >> +#ifdef HAVE_SYNCFS >> + >> + /* >> + * If we are doing a shutdown or end-of-recovery checkpoint, let's use >> + * syncfs() to flush the mappings to disk instead of flushing each one >> + * individually. This may save us quite a bit of time when there are many >> + * such files to flush. >> + */ > > I am doubtful this is a good idea. This will cause all dirty files to be > written back, even ones we don't need to be written back. At once. Very > possibly *slowing down* the shutdown. > > What is even the theory of the case here? That there's so many dirty mapping > files that fsyncing them will take too long? That iterating would take too > long? Well, yes. My idea was to model this after 61752af, which allows using syncfs() instead of individually fsync-ing every file in the data directory. 
However, I would likely need to introduce a GUC because 1) as you pointed out, it might be slower and 2) syncfs() doesn't report errors on older versions of Linux. TBH I do feel like this one is a bit of a stretch, so I am okay with leaving it out for now. >> 5 files changed, 317 insertions(+), 9 deletions(-) > > This seems such an increase in complexity and fragility that I really doubt > this is a good idea. I think that's a fair point. I'm okay with leaving this one out for now, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2022-02-16 20:14:04 -0800, Nathan Bossart wrote: > >> - while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL) > >> + while (!ShutdownRequestPending && > >> + (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL) > > > > Uh, huh? It strikes me as a supremely bad idea to have functions *silently* > > not do their jobs when ShutdownRequestPending is set, particularly without a > > huge fat comment. > > The idea was to avoid delaying shutdown because we're waiting for the > custodian to finish relatively nonessential tasks. Another option might be > to just exit immediately when the custodian receives a shutdown request. I think we should just not do either of these and let the functions finish. For the cases where shutdown really needs to be immediate there's, uhm, immediate mode shutdowns. > > Why does this not open us up to new xid wraparound issues? Before there was a > > hard bound on how long these files could linger around. Now there's not > > anymore. > > Sorry, I'm probably missing something obvious, but I'm not sure how this > adds transaction ID wraparound risk. These files are tied to LSNs, and > AFAIK they won't impact slots' xmins. They're accessed by xid. The LSN is just for cleanup. Accessing files left over from a previous transaction with the same xid wouldn't be good - we'd read wrong catalog state for decoding... Andres
On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote: > On 2022-02-16 20:14:04 -0800, Nathan Bossart wrote: >> >> - while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL) >> >> + while (!ShutdownRequestPending && >> >> + (spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL) >> > >> > Uh, huh? It strikes me as a supremely bad idea to have functions *silently* >> > not do their jobs when ShutdownRequestPending is set, particularly without a >> > huge fat comment. >> >> The idea was to avoid delaying shutdown because we're waiting for the >> custodian to finish relatively nonessential tasks. Another option might be >> to just exit immediately when the custodian receives a shutdown request. > > I think we should just not do either of these and let the functions > finish. For the cases where shutdown really needs to be immediate > there's, uhm, immediate mode shutdowns. Alright. >> > Why does this not open us up to new xid wraparound issues? Before there was a >> > hard bound on how long these files could linger around. Now there's not >> > anymore. >> >> Sorry, I'm probably missing something obvious, but I'm not sure how this >> adds transaction ID wraparound risk. These files are tied to LSNs, and >> AFAIK they won't impact slots' xmins. > > They're accessed by xid. The LSN is just for cleanup. Accessing files > left over from a previous transaction with the same xid wouldn't be > good - we'd read wrong catalog state for decoding... Okay, that part makes sense to me. However, I'm still confused about how this is handled today and why moving cleanup to a separate auxiliary process makes matters worse. I've done quite a bit of reading, and I haven't found anything that seems intended to prevent this problem. Do you have any pointers? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2022-02-17 10:23:37 -0800, Nathan Bossart wrote: > On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote: > > They're accessed by xid. The LSN is just for cleanup. Accessing files > > left over from a previous transaction with the same xid wouldn't be > > good - we'd read wrong catalog state for decoding... > > Okay, that part makes sense to me. However, I'm still confused about how > this is handled today and why moving cleanup to a separate auxiliary > process makes matters worse. Right now cleanup happens every checkpoint. So cleanup can't be deferred all that far. We currently include a bunch of 32bit xids inside checkpoints, so if they're rarer than 2^31-1, we're in trouble independent of logical decoding. But with this patch cleanup of logical decoding mapping files (and other pieces) can be *indefinitely* deferred, without being noticeable. One possible way to improve this would be to switch the on-disk filenames to be based on 64bit xids. But that might also present some problems (file name length, cost of converting 32bit xids to 64bit xids). > I've done quite a bit of reading, and I haven't found anything that seems > intended to prevent this problem. Do you have any pointers? I don't know if we have an iron-clad enforcement of checkpoints happening every 2^31-1 xids. It's very unlikely to happen - you'd run out of space etc. But it'd be good to have something better than that. Greetings, Andres Freund
On Thu, Feb 17, 2022 at 11:27:09AM -0800, Andres Freund wrote: > On 2022-02-17 10:23:37 -0800, Nathan Bossart wrote: >> On Wed, Feb 16, 2022 at 10:59:38PM -0800, Andres Freund wrote: >> > They're accessed by xid. The LSN is just for cleanup. Accessing files >> > left over from a previous transaction with the same xid wouldn't be >> > good - we'd read wrong catalog state for decoding... >> >> Okay, that part makes sense to me. However, I'm still confused about how >> this is handled today and why moving cleanup to a separate auxiliary >> process makes matters worse. > > Right now cleanup happens every checkpoint. So cleanup can't be deferred all > that far. We currently include a bunch of 32bit xids inside checkpoints, so > if they're rarer than 2^31-1, we're in trouble independent of logical > decoding. > > But with this patch cleanup of logical decoding mapping files (and other > pieces) can be *indefinitely* deferred, without being noticeable. I see. The custodian should ordinarily remove the files as quickly as possible. In fact, I bet it will typically line up with checkpoints for most users, as the checkpointer will set the latch. However, if there are many temporary files to clean up, removing the logical decoding files could be delayed for some time, as you said. > One possible way to improve this would be to switch the on-disk filenames to > be based on 64bit xids. But that might also present some problems (file name > length, cost of converting 32bit xids to 64bit xids). Okay. >> I've done quite a bit of reading, and I haven't found anything that seems > > intended to prevent this problem. Do you have any pointers? > > I don't know if we have an iron-clad enforcement of checkpoints happening > every 2^31-1 xids. It's very unlikely to happen - you'd run out of space > etc. But it'd be good to have something better than that. Okay. So IIUC the problem might already exist today, but offloading these tasks to a separate process could make it more likely. 
-- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2022-02-17 13:00:22 -0800, Nathan Bossart wrote: > Okay. So IIUC the problem might already exist today, but offloading these > tasks to a separate process could make it more likely. Vastly more, yes. Before, checkpoints not happening would be a form of backpressure (though not a great one). You can't cancel them without triggering a crash-restart. Whereas custodian can be cancelled etc. As I said before, I think this is tackling things from the wrong end. Instead of moving the sometimes-expensive task out of the way while leaving it expensive, the focus should be to make the expensive task cheaper. As far as I understand, the primary concern is logical decoding serialized snapshots, because a lot of them can accumulate if there e.g. is an old unused / far behind slot. It should be easy to reduce the number of those snapshots by e.g. eliding some redundant ones. Perhaps we could also make backends in logical decoding occasionally do a bit of cleanup themselves. I've not seen reports of the number of mapping files to be a real issue? The improvements around deleting temporary files and serialized snapshots afaict don't require a dedicated process - they're only relevant during startup. We could use the approach of renaming the directory out of the way as done in this patchset but perform the cleanup in the startup process after we're up. Greetings, Andres Freund
On Thu, Feb 17, 2022 at 02:28:29PM -0800, Andres Freund wrote: > As far as I understand, the primary concern is logical decoding serialized > snapshots, because a lot of them can accumulate if there e.g. is an old unused > / far behind slot. It should be easy to reduce the number of those snapshots > by e.g. eliding some redundant ones. Perhaps we could also make backends in > logical decoding occasionally do a bit of cleanup themselves. > > I've not seen reports of the number of mapping files to be a real issue? I routinely see all four of these tasks impacting customers, but I'd say the most common one is the temporary file cleanup. Besides eliminating some redundant files and having backends perform some cleanup, what do you think about skipping the logical decoding cleanup during end-of-recovery/shutdown checkpoints? This was something that Bharath brought up a while back [0]. As I noted in that thread, startup and shutdown could still take a while if checkpoints are regularly delayed due to logical decoding cleanup, but that might still help avoid a bit of downtime. > The improvements around deleting temporary files and serialized snapshots > afaict don't require a dedicated process - they're only relevant during > startup. We could use the approach of renaming the directory out of the way as > done in this patchset but perform the cleanup in the startup process after > we're up. Perhaps this is a good place to start. As I mentioned above, IME the temporary file cleanup is the most common problem, so I think even getting that one fixed would be a huge improvement. [0] https://postgr.es/m/CALj2ACXkkSL8EBpR7m%3DMt%3DyRGBhevcCs3x4fsp3Bc-D13yyHOg%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2022-02-17 14:58:38 -0800, Nathan Bossart wrote: > On Thu, Feb 17, 2022 at 02:28:29PM -0800, Andres Freund wrote: > > As far as I understand, the primary concern is logical decoding serialized > > snapshots, because a lot of them can accumulate if there e.g. is an old unused > > / far behind slot. It should be easy to reduce the number of those snapshots > > by e.g. eliding some redundant ones. Perhaps we could also make backends in > > logical decoding occasionally do a bit of cleanup themselves. > > > > I've not seen reports of the number of mapping files to be a real issue? > > I routinely see all four of these tasks impacting customers, but I'd say > the most common one is the temporary file cleanup. I took temp file cleanup and StartupReorderBuffer() "out of consideration" for custodian, because they're not needed during normal running. > Besides eliminating some redundant files and having backends perform some > cleanup, what do you think about skipping the logical decoding cleanup > during end-of-recovery/shutdown checkpoints? I strongly disagree with it. Then you might never get the cleanup done, but keep on operating until you hit corruption issues. > > The improvements around deleting temporary files and serialized snapshots > > afaict don't require a dedicated process - they're only relevant during > > startup. We could use the approach of renaming the directory out of the way as > > done in this patchset but perform the cleanup in the startup process after > > we're up. > > Perhaps this is a good place to start. As I mentioned above, IME the > temporary file cleanup is the most common problem, so I think even getting > that one fixed would be a huge improvement. Cool. Greetings, Andres Freund
On Thu, Feb 17, 2022 at 03:12:47PM -0800, Andres Freund wrote: >> > The improvements around deleting temporary files and serialized snapshots >> > afaict don't require a dedicated process - they're only relevant during >> > startup. We could use the approach of renaming the directory out of the way as >> > done in this patchset but perform the cleanup in the startup process after >> > we're up. >> >> Perhaps this is a good place to start. As I mentioned above, IME the >> temporary file cleanup is the most common problem, so I think even getting >> that one fixed would be a huge improvement. > > Cool. Hm. How should this work for standbys? I can think of the following options: 1. Do temporary file cleanup in the postmaster (as it does today). 2. Pause after allowing connections to clean up temporary files. 3. Do small amounts of temporary file cleanup whenever there is an opportunity during recovery. 4. Wait until recovery completes before cleaning up temporary files. I'm not too thrilled about any of these options. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
> On Thu, Feb 17, 2022 at 03:12:47PM -0800, Andres Freund wrote: >>> > The improvements around deleting temporary files and serialized snapshots >>> > afaict don't require a dedicated process - they're only relevant during >>> > startup. We could use the approach of renaming the directory out of the way as >>> > done in this patchset but perform the cleanup in the startup process after >>> > we're up. BTW I know you don't like the dedicated process approach, but one improvement to that approach could be to shut down the custodian process when it has nothing to do. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
It seems unlikely that anything discussed in this thread will be committed for v15, so I've adjusted the commitfest entry to v16 and moved it to the next commitfest. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Fri, 18 Feb 2022 at 20:51, Nathan Bossart <nathandbossart@gmail.com> wrote: > > > On Thu, Feb 17, 2022 at 03:12:47PM -0800, Andres Freund wrote: > >>> > The improvements around deleting temporary files and serialized snapshots > >>> > afaict don't require a dedicated process - they're only relevant during > >>> > startup. We could use the approach of renaming the directory out of the way as > >>> > done in this patchset but perform the cleanup in the startup process after > >>> > we're up. > > BTW I know you don't like the dedicated process approach, but one > improvement to that approach could be to shut down the custodian process > when it has nothing to do. Having a central cleanup process makes a lot of sense. There is a long list of potential tasks for such a process. My understanding is that autovacuum already has an interface for handling additional workload types, which is how BRIN indexes are handled. Do we really need a new process? If so, let's do this now. Nathan's point that certain tasks are blocking fast startup is a good one, and higher availability is a critical end goal. The thought that we should complete these tasks during checkpoint is a good one, but checkpoints should NOT be delayed by long running tasks that are secondary to availability. Andres' point that it would be better to avoid long running tasks is good, if that is possible. That can be done better over time. This point does not block the higher level goal of better availability asap, so I support Nathan's overall proposals. -- Simon Riggs http://www.EnterpriseDB.com/
On Thu, Jun 23, 2022 at 7:58 AM Simon Riggs <simon.riggs@enterprisedb.com> wrote: > Having a central cleanup process makes a lot of sense. There is a long > list of potential tasks for such a process. My understanding is that > autovacuum already has an interface for handling additional workload > types, which is how BRIN indexes are handled. Do we really need a new > process? It seems to me that if there's a long list of possible tasks for such a process, that's actually a trickier situation than if there were only one or two, because it may happen that when task X is really urgent, the process is already busy with task Y. I don't think that piggybacking more stuff onto autovacuum is a very good idea for this exact reason. We already know that autovacuum workers can get so busy that they can't keep up with the need to vacuum and analyze tables. If we give them more things to do, that figures to make it worse, at least on busy systems. I do agree that a general mechanism for getting cleanup tasks done in the background could be a useful thing to have, but I feel like it's hard to see exactly how to make it work well. We can't just allow it to spin up a million new processes, but at the same time, if it can't guarantee that time-critical tasks get performed relatively quickly, it's pretty worthless. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, 23 Jun 2022 at 14:46, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jun 23, 2022 at 7:58 AM Simon Riggs > <simon.riggs@enterprisedb.com> wrote: > > Having a central cleanup process makes a lot of sense. There is a long > > list of potential tasks for such a process. My understanding is that > > autovacuum already has an interface for handling additional workload > > types, which is how BRIN indexes are handled. Do we really need a new > > process? > > It seems to me that if there's a long list of possible tasks for such > a process, that's actually a trickier situation than if there were > only one or two, because it may happen that when task X is really > urgent, the process is already busy with task Y. > > I don't think that piggybacking more stuff onto autovacuum is a very > good idea for this exact reason. We already know that autovacuum > workers can get so busy that they can't keep up with the need to > vacuum and analyze tables. If we give them more things to do, that > figures to make it worse, at least on busy systems. > > I do agree that a general mechanism for getting cleanup tasks done in > the background could be a useful thing to have, but I feel like it's > hard to see exactly how to make it work well. We can't just allow it > to spin up a million new processes, but at the same time, if it can't > guarantee that time-critical tasks get performed relatively quickly, > it's pretty worthless. Most of the tasks mentioned aren't time critical. I have no objection to a new auxiliary process to execute those tasks, which can be spawned when needed. -- Simon Riggs http://www.EnterpriseDB.com/
On Thu, Jun 23, 2022 at 09:46:28AM -0400, Robert Haas wrote: > I do agree that a general mechanism for getting cleanup tasks done in > the background could be a useful thing to have, but I feel like it's > hard to see exactly how to make it work well. We can't just allow it > to spin up a million new processes, but at the same time, if it can't > guarantee that time-critical tasks get performed relatively quickly, > it's pretty worthless. My intent with this new auxiliary process is to offload tasks that aren't particularly time-critical. They are only time-critical in the sense that 1) you might eventually run out of space and 2) you might encounter wraparound with the logical replication files. But AFAICT these same risks exist today in the checkpointer approach, although maybe not to the same extent. In any case, 2 seems solvable to me outside of this patch set. I'm grateful for the discussion in this thread so far, but I'm not seeing a clear path forward. I'm glad to see threads like the one to stop doing end-of-recovery checkpoints [0], but I don't know if it will be possible to solve all of these availability concerns in a piecemeal fashion. I remain open to exploring other suggested approaches beyond creating a new auxiliary process. [0] https://postgr.es/m/CA%2BTgmobrM2jvkiccCS9NgFcdjNSgAvk1qcAPx5S6F%2BoJT3D2mQ%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Thu, 23 Jun 2022 at 18:15, Nathan Bossart <nathandbossart@gmail.com> wrote: > I'm grateful for the discussion in this thread so far, but I'm not seeing a > clear path forward. +1 to add the new auxiliary process. -- Simon Riggs http://www.EnterpriseDB.com/
On Fri, Jun 24, 2022 at 11:45:22AM +0100, Simon Riggs wrote: > On Thu, 23 Jun 2022 at 18:15, Nathan Bossart <nathandbossart@gmail.com> wrote: >> I'm grateful for the discussion in this thread so far, but I'm not seeing a >> clear path forward. > > +1 to add the new auxiliary process. I went ahead and put together a new patch set for this in which I've attempted to address most of the feedback from upthread. Notably, I've abandoned 0007 and 0008, added a way for processes to request specific tasks for the custodian, and removed all the checks for ShutdownRequestPending. I haven't addressed the existing transaction ID wraparound risk with the logical replication files. My instinct is that this deserves its own thread, and it might need to be considered a prerequisite to this change based on the prior discussion here. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
- v6-0001-Introduce-custodian.patch
- v6-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
- v6-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v6-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v6-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch
- v6-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch
Hi, On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote: > + /* Obtain requested tasks */ > + SpinLockAcquire(&CustodianShmem->cust_lck); > + flags = CustodianShmem->cust_flags; > + CustodianShmem->cust_flags = 0; > + SpinLockRelease(&CustodianShmem->cust_lck); Just resetting the flags to 0 is problematic. Consider what happens if there's two tasks and the one processed first errors out. You'll lose information about needing to run the second task. > + /* TODO: offloaded tasks go here */ Seems we're going to need some sorting of which tasks are most "urgent" / need to be processed next if we plan to make this into some generic facility. > +/* > + * RequestCustodian > + * Called to request a custodian task. > + */ > +void > +RequestCustodian(int flags) > +{ > + SpinLockAcquire(&CustodianShmem->cust_lck); > + CustodianShmem->cust_flags |= flags; > + SpinLockRelease(&CustodianShmem->cust_lck); > + > + if (ProcGlobal->custodianLatch) > + SetLatch(ProcGlobal->custodianLatch); > +} With this representation we can't really implement waiting for a task or such. And it doesn't seem like a great API for the caller to just specify a mix of flags. > + /* Calculate how long to sleep */ > + end_time = (pg_time_t) time(NULL); > + elapsed_secs = end_time - start_time; > + if (elapsed_secs >= CUSTODIAN_TIMEOUT_S) > + continue; /* no sleep for us */ > + cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs; > + > + (void) WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > + cur_timeout * 1000L /* convert to ms */ , > + WAIT_EVENT_CUSTODIAN_MAIN); > + } I don't think we should have this thing wake up on a regular basis. We're doing way too much of that already, and I don't think we should add more. Either we need a list of times when tasks need to be processed and wake up at that time, or just wake up if somebody requests a task. 
> From 5e95666efa31d6c8aa351e430c37ead6e27acb72 Mon Sep 17 00:00:00 2001 > From: Nathan Bossart <bossartn@amazon.com> > Date: Sun, 5 Dec 2021 21:16:44 -0800 > Subject: [PATCH v6 3/6] Split pgsql_tmp cleanup into two stages. > > First, pgsql_tmp directories will be renamed to stage them for > removal. Then, all files in pgsql_tmp are removed before removing > the staged directories themselves. This change is being made in > preparation for a follow-up change to offload most temporary file > cleanup to the new custodian process. > Note that temporary relation files cannot be cleaned up via the > aforementioned strategy and will not be offloaded to the custodian. > --- > src/backend/postmaster/postmaster.c | 8 +- > src/backend/storage/file/fd.c | 174 ++++++++++++++++++++++++---- > src/include/storage/fd.h | 2 +- > 3 files changed, 160 insertions(+), 24 deletions(-) > > diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c > index e67370012f..82aa0c6307 100644 > --- a/src/backend/postmaster/postmaster.c > +++ b/src/backend/postmaster/postmaster.c > @@ -1402,7 +1402,8 @@ PostmasterMain(int argc, char *argv[]) > * Remove old temporary files. At this point there can be no other > * Postgres processes running in this directory, so this should be safe. > */ > - RemovePgTempFiles(); > + RemovePgTempFiles(true, true); > + RemovePgTempFiles(false, false); This is imo hard to read and easy to get wrong. Make it multiple functions or pass named flags in. > + * StagePgTempDirForRemoval > + * > + * This function renames the given directory with a special prefix that > + * RemoveStagedPgTempDirs() will know to look for. An integer is appended to > + * the end of the new directory name in case previously staged pgsql_tmp > + * directories have not yet been removed. > + */ It doesn't seem great to need to iterate through a directory that contains other files, potentially a significant number. 
How about having a staged_for_removal/ directory, and then only scanning that? > +static void > +StagePgTempDirForRemoval(const char *tmp_dir) > +{ > + DIR *dir; > + char stage_path[MAXPGPATH * 2]; > + char parent_path[MAXPGPATH * 2]; > + struct stat statbuf; > + > + /* > + * If tmp_dir doesn't exist, there is nothing to stage. > + */ > + dir = AllocateDir(tmp_dir); > + if (dir == NULL) > + { > + if (errno != ENOENT) > + ereport(LOG, > + (errcode_for_file_access(), > + errmsg("could not open directory \"%s\": %m", tmp_dir))); > + return; > + } > + FreeDir(dir); > + > + strlcpy(parent_path, tmp_dir, MAXPGPATH * 2); > + get_parent_directory(parent_path); > + > + /* > + * get_parent_directory() returns an empty string if the input argument is > + * just a file name (see comments in path.c), so handle that as being the > + * current directory. > + */ > + if (strlen(parent_path) == 0) > + strlcpy(parent_path, ".", MAXPGPATH * 2); > + > + /* > + * Find a name for the stage directory. We just increment an integer at the > + * end of the name until we find one that doesn't exist. > + */ > + for (int n = 0; n <= INT_MAX; n++) > + { > + snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path, > + PG_TEMP_DIR_TO_REMOVE_PREFIX, n); > + > + if (stat(stage_path, &statbuf) != 0) > + { > + if (errno == ENOENT) > + break; > + > + ereport(LOG, > + (errcode_for_file_access(), > + errmsg("could not stat file \"%s\": %m", stage_path))); > + return; > + } > + > + stage_path[0] = '\0'; I still dislike this approach. Loops until INT_MAX, not interruptible... Can't we prevent conflicts by adding a timestamp or such? > + } > + > + /* > + * In the unlikely event that we couldn't find a name for the stage > + * directory, bail out. > + */ > + if (stage_path[0] == '\0') > + { > + ereport(LOG, > + (errmsg("could not stage \"%s\" for deletion", > + tmp_dir))); > + return; > + } That's imo very much not ok. 
Just continuing in unexpected situations is a recipe for introducing bugs / being hard to debug. > From 43042799b96b588a446c509637b5acf570e2a325 Mon Sep 17 00:00:00 2001 > From a58a6bb70785a557a150680b64cd8ce78ce1b73a Mon Sep 17 00:00:00 2001 > From: Nathan Bossart <bossartn@amazon.com> > Date: Sun, 5 Dec 2021 22:02:40 -0800 > Subject: [PATCH v6 5/6] Move removal of old serialized snapshots to custodian. > > This was only done during checkpoints because it was a convenient > place to put it. As mentioned before, having it done as part of checkpoints provides pretty decent wraparound protection - yes, it's not theoretically perfect, but in reality it's very unlikely you can have an xid wraparound within one checkpoint. I've mentioned this before, so at the very least I'd like to see this acknowledged in the commit message. > However, if there are many snapshots to remove, it can significantly extend > checkpoint time. I'd really like to see a reproducer or profile for this... > + /* > + * Remove serialized snapshots that are no longer required by any > + * logical replication slot. > + * > + * It is not important for these to be removed in single-user mode, so > + * we don't need any extra handling outside of the custodian process for > + * this. > + */ I don't think this claim is correct. > From 0add8bb19a4ee83c6a6ec1f313329d737bf304a5 Mon Sep 17 00:00:00 2001 > From: Nathan Bossart <bossartn@amazon.com> > Date: Sun, 12 Dec 2021 22:07:11 -0800 > Subject: [PATCH v6 6/6] Move removal of old logical rewrite mapping files to > custodian. > > If there are many such files to remove, checkpoints can take much > longer. To avoid this, move this work to the newly-introduced > custodian process. As above I'd like to know why this could take that long. What are you doing that there's so many mapping files (which only exist for catalog tables!) that this is a significant fraction of a checkpoint? 
> ---
>  src/backend/access/heap/rewriteheap.c | 79 +++++++++++++++++++++++----
>  src/backend/postmaster/custodian.c    | 44 +++++++++++++++
>  src/include/access/rewriteheap.h      |  1 +
>  src/include/postmaster/custodian.h    |  5 ++
>  4 files changed, 119 insertions(+), 10 deletions(-)
>
> diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
> index 2a53826736..edeab65e60 100644
> --- a/src/backend/access/heap/rewriteheap.c
> +++ b/src/backend/access/heap/rewriteheap.c
> @@ -116,6 +116,7 @@
>  #include "lib/ilist.h"
>  #include "miscadmin.h"
>  #include "pgstat.h"
> +#include "postmaster/custodian.h"
>  #include "replication/logical.h"
>  #include "replication/slot.h"
>  #include "storage/bufmgr.h"
> @@ -1182,7 +1183,8 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
>   * Perform a checkpoint for logical rewrite mappings
>   *
>   * This serves two tasks:
>- * 1) Remove all mappings not needed anymore based on the logical restart LSN
>+ * 1) Alert the custodian to remove all mappings not needed anymore based on the
>+ *    logical restart LSN
>   * 2) Flush all remaining mappings to disk, so that replay after a checkpoint
>   *    only has to deal with the parts of a mapping that have been written out
>   *    after the checkpoint started.
> @@ -1210,6 +1212,10 @@ CheckPointLogicalRewriteHeap(void)
>  	if (cutoff != InvalidXLogRecPtr && redo < cutoff)
>  		cutoff = redo;
>
> +	/* let the custodian know what it can remove */
> +	CustodianSetLogicalRewriteCutoff(cutoff);

Setting this variable in a custodian datastructure and then fetching it from there seems architecturally wrong to me.

> +	RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS);

What about single user mode?

ISTM that RequestCustodian() needs to either assert out if called in single user mode, or execute tasks immediately in that context.

> +
> +/*
> + * Remove all mappings not needed anymore based on the logical restart LSN saved
> + * by the checkpointer.  We use this saved value instead of calling
> + * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
> + * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
> + * disk.
> + */

What interference could there be?

> +void
> +RemoveOldLogicalRewriteMappings(void)
> +{
> +	XLogRecPtr	cutoff;
> +	DIR		   *mappings_dir;
> +	struct dirent *mapping_de;
> +	char		path[MAXPGPATH + 20];
> +	bool		value_set = false;
> +
> +	cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
> +	if (!value_set)
> +		return;

Afaics nothing clears value_set - is that a good idea?

Greetings,

Andres Freund
Hi Andres,

Thanks for the prompt review.

On Sat, Jul 02, 2022 at 03:54:56PM -0700, Andres Freund wrote:
> On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
>> +	/* Obtain requested tasks */
>> +	SpinLockAcquire(&CustodianShmem->cust_lck);
>> +	flags = CustodianShmem->cust_flags;
>> +	CustodianShmem->cust_flags = 0;
>> +	SpinLockRelease(&CustodianShmem->cust_lck);
>
> Just resetting the flags to 0 is problematic. Consider what happens if there's
> two tasks and the one processed first errors out. You'll lose information
> about needing to run the second task.

I think we also want to retry any failed tasks. The way v6 handles this is by requesting all tasks after an exception. Another way to handle this could be to reset each individual flag before the task is executed, and then we could surround each one with a PG_CATCH block that resets the flag. I'll do it this way in the next revision.

>> +/*
>> + * RequestCustodian
>> + *		Called to request a custodian task.
>> + */
>> +void
>> +RequestCustodian(int flags)
>> +{
>> +	SpinLockAcquire(&CustodianShmem->cust_lck);
>> +	CustodianShmem->cust_flags |= flags;
>> +	SpinLockRelease(&CustodianShmem->cust_lck);
>> +
>> +	if (ProcGlobal->custodianLatch)
>> +		SetLatch(ProcGlobal->custodianLatch);
>> +}
>
> With this representation we can't really implement waiting for a task or
> such. And it doesn't seem like a great API for the caller to just specify a
> mix of flags.

At the moment, the idea is that nothing should need to wait for a task because the custodian only handles things that are relatively non-critical. If that changes, this could probably be expanded to look more like RequestCheckpoint(). What would you suggest using instead of a mix of flags?
>> +		/* Calculate how long to sleep */
>> +		end_time = (pg_time_t) time(NULL);
>> +		elapsed_secs = end_time - start_time;
>> +		if (elapsed_secs >= CUSTODIAN_TIMEOUT_S)
>> +			continue;			/* no sleep for us */
>> +		cur_timeout = CUSTODIAN_TIMEOUT_S - elapsed_secs;
>> +
>> +		(void) WaitLatch(MyLatch,
>> +						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>> +						 cur_timeout * 1000L /* convert to ms */ ,
>> +						 WAIT_EVENT_CUSTODIAN_MAIN);
>> +	}
>
> I don't think we should have this thing wake up on a regular basis. We're
> doing way too much of that already, and I don't think we should add
> more. Either we need a list of times when tasks need to be processed and wake
> up at that time, or just wake up if somebody requests a task.

I agree. I will remove the timeout in the next revision.

>> diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
>> index e67370012f..82aa0c6307 100644
>> --- a/src/backend/postmaster/postmaster.c
>> +++ b/src/backend/postmaster/postmaster.c
>> @@ -1402,7 +1402,8 @@ PostmasterMain(int argc, char *argv[])
>>  	 * Remove old temporary files. At this point there can be no other
>>  	 * Postgres processes running in this directory, so this should be safe.
>>  	 */
>> -	RemovePgTempFiles();
>> +	RemovePgTempFiles(true, true);
>> +	RemovePgTempFiles(false, false);
>
> This is imo hard to read and easy to get wrong. Make it multiple functions or
> pass named flags in.

Will do.

>> + * StagePgTempDirForRemoval
>> + *
>> + * This function renames the given directory with a special prefix that
>> + * RemoveStagedPgTempDirs() will know to look for. An integer is appended to
>> + * the end of the new directory name in case previously staged pgsql_tmp
>> + * directories have not yet been removed.
>> + */
>
> It doesn't seem great to need to iterate through a directory that contains
> other files, potentially a significant number. How about having a
> staged_for_removal/ directory, and then only scanning that?
Yeah, that seems like a good idea. Will do.

>> +	/*
>> +	 * Find a name for the stage directory. We just increment an integer at the
>> +	 * end of the name until we find one that doesn't exist.
>> +	 */
>> +	for (int n = 0; n <= INT_MAX; n++)
>> +	{
>> +		snprintf(stage_path, sizeof(stage_path), "%s/%s%d", parent_path,
>> +				 PG_TEMP_DIR_TO_REMOVE_PREFIX, n);
>> +
>> +		if (stat(stage_path, &statbuf) != 0)
>> +		{
>> +			if (errno == ENOENT)
>> +				break;
>> +
>> +			ereport(LOG,
>> +					(errcode_for_file_access(),
>> +					 errmsg("could not stat file \"%s\": %m", stage_path)));
>> +			return;
>> +		}
>> +
>> +		stage_path[0] = '\0';
>
> I still dislike this approach. Loops until INT_MAX, not interruptible... Can't
> we prevent conflicts by adding a timestamp or such?

I suppose it's highly unlikely that we'd see a conflict if we used the timestamp instead. I'll do it this way in the next revision if that seems good enough.

>> From a58a6bb70785a557a150680b64cd8ce78ce1b73a Mon Sep 17 00:00:00 2001
>> From: Nathan Bossart <bossartn@amazon.com>
>> Date: Sun, 5 Dec 2021 22:02:40 -0800
>> Subject: [PATCH v6 5/6] Move removal of old serialized snapshots to custodian.
>>
>> This was only done during checkpoints because it was a convenient
>> place to put it.
>
> As mentioned before, having it done as part of checkpoints provides pretty
> decent wraparound protection - yes, it's not theoretically perfect, but in
> reality it's very unlikely you can have an xid wraparound within one
> checkpoint. I've mentioned this before, so at the very least I'd like to see
> this acknowledged in the commit message.

Will do.

>> +	/* let the custodian know what it can remove */
>> +	CustodianSetLogicalRewriteCutoff(cutoff);
>
> Setting this variable in a custodian datastructure and then fetching it from
> there seems architecturally wrong to me.

Where do you think it should go?
I previously had it in the checkpointer's shared memory, but you didn't like that the functions were declared in bgwriter.h (along with the other checkpoint stuff). If the checkpointer shared memory is the right place, should we create checkpointer.h and use that instead?

>> +	RequestCustodian(CUSTODIAN_REMOVE_REWRITE_MAPPINGS);
>
> What about single user mode?
>
> ISTM that RequestCustodian() needs to either assert out if called in single
> user mode, or execute tasks immediately in that context.

I like the idea of executing the tasks immediately since that's what happens today in single-user mode. I will try doing it that way.

>> +/*
>> + * Remove all mappings not needed anymore based on the logical restart LSN saved
>> + * by the checkpointer.  We use this saved value instead of calling
>> + * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
>> + * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
>> + * disk.
>> + */
>
> What interference could there be?

My concern is that the custodian could obtain a later cutoff than what the checkpointer does, which might cause files to be concurrently unlinked and fsync'd. If we always use the checkpointer's cutoff, that shouldn't be a problem. This could probably be better explained in this comment.

>> +void
>> +RemoveOldLogicalRewriteMappings(void)
>> +{
>> +	XLogRecPtr	cutoff;
>> +	DIR		   *mappings_dir;
>> +	struct dirent *mapping_de;
>> +	char		path[MAXPGPATH + 20];
>> +	bool		value_set = false;
>> +
>> +	cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
>> +	if (!value_set)
>> +		return;
>
> Afaics nothing clears value_set - is that a good idea?

I'm using value_set to differentiate the case where InvalidXLogRecPtr means the checkpointer hasn't determined a value yet versus the case where it has. In the former, we don't want to take any action. In the latter, we want to unlink all the files.
Since we're moving to a request model for the custodian, I might be able to remove this value_set stuff completely. If that's not possible, it probably deserves a better comment.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Hi,

On 2022-07-03 10:07:54 -0700, Nathan Bossart wrote:
> Thanks for the prompt review.
>
> On Sat, Jul 02, 2022 at 03:54:56PM -0700, Andres Freund wrote:
> > On 2022-07-02 15:05:54 -0700, Nathan Bossart wrote:
> >> +	/* Obtain requested tasks */
> >> +	SpinLockAcquire(&CustodianShmem->cust_lck);
> >> +	flags = CustodianShmem->cust_flags;
> >> +	CustodianShmem->cust_flags = 0;
> >> +	SpinLockRelease(&CustodianShmem->cust_lck);
> >
> > Just resetting the flags to 0 is problematic. Consider what happens if there's
> > two tasks and the one processed first errors out. You'll lose information
> > about needing to run the second task.
>
> I think we also want to retry any failed tasks.

I don't think so, at least not if it's just going to retry that task straight away - then we'll get stuck on that one task forever. If we had the ability to "queue" it at the end, to be processed after other already dequeued tasks, it'd be a different story.

> The way v6 handles this is by requesting all tasks after an exception.

Ick. That strikes me as a bad idea.

> >> +/*
> >> + * RequestCustodian
> >> + *		Called to request a custodian task.
> >> + */
> >> +void
> >> +RequestCustodian(int flags)
> >> +{
> >> +	SpinLockAcquire(&CustodianShmem->cust_lck);
> >> +	CustodianShmem->cust_flags |= flags;
> >> +	SpinLockRelease(&CustodianShmem->cust_lck);
> >> +
> >> +	if (ProcGlobal->custodianLatch)
> >> +		SetLatch(ProcGlobal->custodianLatch);
> >> +}
> >
> > With this representation we can't really implement waiting for a task or
> > such. And it doesn't seem like a great API for the caller to just specify a
> > mix of flags.
>
> At the moment, the idea is that nothing should need to wait for a task
> because the custodian only handles things that are relatively non-critical.

Which is just plainly not true as the patchset stands...
I think we're going to have to block if some cleanup as part of a checkpoint hasn't been completed by the next checkpoint - otherwise it'll just end up being way too confusing and there's absolutely no backpressure anymore.

> If that changes, this could probably be expanded to look more like
> RequestCheckpoint().
>
> What would you suggest using instead of a mix of flags?

I suspect an array of tasks with requested and completed counters or such? With a condition variable to wait on?

> >> +	/* let the custodian know what it can remove */
> >> +	CustodianSetLogicalRewriteCutoff(cutoff);
> >
> > Setting this variable in a custodian datastructure and then fetching it from
> > there seems architecturally wrong to me.
>
> Where do you think it should go? I previously had it in the checkpointer's
> shared memory, but you didn't like that the functions were declared in
> bgwriter.h (along with the other checkpoint stuff). If the checkpointer
> shared memory is the right place, should we create checkpointer.h and use
> that instead?

Well, so far I have not understood what the whole point of the shared state is, so I have a bit of a hard time answering this ;)

> >> +/*
> >> + * Remove all mappings not needed anymore based on the logical restart LSN saved
> >> + * by the checkpointer.  We use this saved value instead of calling
> >> + * ReplicationSlotsComputeLogicalRestartLSN() so that we don't interfere with an
> >> + * ongoing call to CheckPointLogicalRewriteHeap() that is flushing mappings to
> >> + * disk.
> >> + */
> >
> > What interference could there be?
>
> My concern is that the custodian could obtain a later cutoff than what the
> checkpointer does, which might cause files to be concurrently unlinked and
> fsync'd. If we always use the checkpointer's cutoff, that shouldn't be a
> problem. This could probably be better explained in this comment.

How about having a Datum argument to RequestCustodian() that is forwarded to the task?
> >> +void
> >> +RemoveOldLogicalRewriteMappings(void)
> >> +{
> >> +	XLogRecPtr	cutoff;
> >> +	DIR		   *mappings_dir;
> >> +	struct dirent *mapping_de;
> >> +	char		path[MAXPGPATH + 20];
> >> +	bool		value_set = false;
> >> +
> >> +	cutoff = CustodianGetLogicalRewriteCutoff(&value_set);
> >> +	if (!value_set)
> >> +		return;
> >
> > Afaics nothing clears value_set - is that a good idea?
>
> I'm using value_set to differentiate the case where InvalidXLogRecPtr means
> the checkpointer hasn't determined a value yet versus the case where it
> has. In the former, we don't want to take any action. In the latter, we
> want to unlink all the files. Since we're moving to a request model for
> the custodian, I might be able to remove this value_set stuff completely.
> If that's not possible, it probably deserves a better comment.

It would.

Greetings,

Andres Freund
Here's a new revision where I've attempted to address all the feedback I've received thus far. Notably, the custodian now uses a queue for registering tasks and determining which tasks to execute. Other changes include splitting the temporary file functions apart to avoid consecutive boolean flags, using a timestamp instead of an integer for the staging name for temporary directories, moving temporary directories to a dedicated directory so that the custodian doesn't need to scan relation files, ERROR-ing when something goes wrong when cleaning up temporary files, executing requested tasks immediately in single-user mode, and more. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
- v7-0001-Introduce-custodian.patch
- v7-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
- v7-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v7-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v7-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch
- v7-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch
On Wed, Jul 06, 2022 at 09:51:10AM -0700, Nathan Bossart wrote: > Here's a new revision where I've attempted to address all the feedback I've > received thus far. Notably, the custodian now uses a queue for registering > tasks and determining which tasks to execute. Other changes include > splitting the temporary file functions apart to avoid consecutive boolean > flags, using a timestamp instead of an integer for the staging name for > temporary directories, moving temporary directories to a dedicated > directory so that the custodian doesn't need to scan relation files, > ERROR-ing when something goes wrong when cleaning up temporary files, > executing requested tasks immediately in single-user mode, and more. Here is a rebased patch set for cfbot. There are no other differences between v7 and v8. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
- v8-0001-Introduce-custodian.patch
- v8-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
- v8-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v8-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v8-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch
- v8-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch
On Thu, Aug 11, 2022 at 04:09:21PM -0700, Nathan Bossart wrote: > Here is a rebased patch set for cfbot. There are no other differences > between v7 and v8. Another rebase for cfbot. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
- v9-0001-Introduce-custodian.patch
- v9-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
- v9-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v9-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v9-0005-Move-removal-of-old-serialized-snapshots-to-custo.patch
- v9-0006-Move-removal-of-old-logical-rewrite-mapping-files.patch
On Wed, Aug 24, 2022 at 09:46:24AM -0700, Nathan Bossart wrote: > Another rebase for cfbot. And another. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
- v10-0001-Introduce-custodian.patch
- v10-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
- v10-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v10-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v10-0005-Move-removal-of-old-serialized-snapshots-to-cust.patch
- v10-0006-Move-removal-of-old-logical-rewrite-mapping-file.patch
On Fri, Sep 02, 2022 at 03:07:44PM -0700, Nathan Bossart wrote: > And another. v11 adds support for building with meson. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
- v11-0001-Introduce-custodian.patch
- v11-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
- v11-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v11-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v11-0005-Move-removal-of-old-serialized-snapshots-to-cust.patch
- v11-0006-Move-removal-of-old-logical-rewrite-mapping-file.patch
On Fri, Sep 23, 2022 at 10:41:54AM -0700, Nathan Bossart wrote: > v11 adds support for building with meson. rebased -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
- v12-0001-Introduce-custodian.patch
- v12-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
- v12-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v12-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v12-0005-Move-removal-of-old-serialized-snapshots-to-cust.patch
- v12-0006-Move-removal-of-old-logical-rewrite-mapping-file.patch
On Sun, Nov 06, 2022 at 02:38:42PM -0800, Nathan Bossart wrote: > rebased another rebase for cfbot -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
- v13-0001-Introduce-custodian.patch
- v13-0002-Also-remove-pgsql_tmp-directories-during-startup.patch
- v13-0003-Split-pgsql_tmp-cleanup-into-two-stages.patch
- v13-0004-Move-pgsql_tmp-file-removal-to-custodian-process.patch
- v13-0005-Move-removal-of-old-serialized-snapshots-to-cust.patch
- v13-0006-Move-removal-of-old-logical-rewrite-mapping-file.patch
On Thu, 24 Nov 2022 at 00:19, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Sun, Nov 06, 2022 at 02:38:42PM -0800, Nathan Bossart wrote:
> > rebased
>
> another rebase for cfbot

0001 seems good to me
* I like that it sleeps forever until requested
* not sure I believe that everything it does can always be aborted out of and shut down - to achieve that you will need CHECK_FOR_INTERRUPTS() calls in the loops in patches 5 and 6 at least
* not sure why you want immediate execution of custodian tasks - I feel supporting two modes will be a lot harder. For me, I would run locally when !IsUnderPostmaster and also in an Assert build, so we can test it works right - i.e. running in its own process is just a production optimization for performance (which is the stated reason for having this)

0005 seems good from what I know
* There is no check to see if it worked in any sane time
* It seems possible that "Old" might change meaning - will that make it break/fail?

0006 seems good also
* same comments as for 5

Rather than explicitly use DEBUG1 everywhere I would have a
#define CUSTODIAN_LOG_LEVEL LOG
so we can run with it in LOG mode and then set it to DEBUG1 with a one line change in a later phase of Beta

I can't really comment with knowledge on sub-patches 0002 to 0004.

Perhaps you should aim to get 1, 5, 6 committed first and then return to the others in a later CF/separate thread?

--
Simon Riggs
http://www.EnterpriseDB.com/
Thanks for taking a look!

On Thu, Nov 24, 2022 at 05:31:02PM +0000, Simon Riggs wrote:
> * not sure I believe that everything it does can always be aborted out
> of and shutdown - to achieve that you will need a
> CHECK_FOR_INTERRUPTS() calls in the loops in patches 5 and 6 at least

I did something like this earlier, but was advised to simply let the functions finish as usual during shutdown [0]. I think this is what the checkpointer process does today, anyway.

> * not sure why you want immediate execution of custodian tasks - I
> feel supporting two modes will be a lot harder. For me, I would run
> locally when !IsUnderPostmaster and also in an Assert build, so we can
> test it works right - i.e. running in its own process is just a
> production optimization for performance (which is the stated reason
> for having this)

I added this because 0004 involves requesting a task from the postmaster, so checking for IsUnderPostmaster doesn't work. Those tasks would always run immediately. However, we could use IsPostmasterEnvironment instead, which would allow us to remove the "immediate" argument. I did it this way in v14.

I'm not sure about running locally in Assert builds. It's true that would help ensure there's test coverage for the task logic, but it would also reduce coverage for the custodian logic. And in general, I'm worried about having Assert builds use a different code path than production builds.

> 0005 seems good from what I know
> * There is no check to see if it worked in any sane time

What did you have in mind? Should the custodian begin emitting WARNINGs after a while?

> * It seems possible that "Old" might change meaning - will that make
> it break/fail?

I don't believe so.

> Rather than explicitly use DEBUG1 everywhere I would have a
> #define CUSTODIAN_LOG_LEVEL LOG
> so we can run with it in LOG mode and then set it to DEBUG1 with a one
> line change in a later phase of Beta

I can create a separate patch for this, but I don't think I've ever seen this sort of thing before. Is the idea just to help with debugging during the development phase?

> I can't really comment with knowledge on sub-patches 0002 to 0004.
>
> Perhaps you should aim to get 1, 5, 6 committed first and then return
> to the others in a later CF/separate thread?

That seems like a good idea since those are all relatively self-contained. I removed 0002-0004 in v14.

[0] https://postgr.es/m/20220217065938.x2esfdppzypegn5j%40alap3.anarazel.de

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachment
On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote: > > Thanks for taking a look! > > On Thu, Nov 24, 2022 at 05:31:02PM +0000, Simon Riggs wrote: > > * not sure I believe that everything it does can always be aborted out > > of and shutdown - to achieve that you will need a > > CHECK_FOR_INTERRUPTS() calls in the loops in patches 5 and 6 at least > > I did something like this earlier, but was advised to simply let the > functions finish as usual during shutdown [0]. I think this is what the > checkpointer process does today, anyway. If we say "The custodian is not an essential process and can shutdown quickly when requested.", and yet we know its not true in all cases, then that will lead to misunderstandings and bugs. If we perform a restart and the custodian is performing extra work that delays shutdown, then it also delays restart. Given the title of the thread, we should be looking to improve that, or at least know it occurred. > > * not sure why you want immediate execution of custodian tasks - I > > feel supporting two modes will be a lot harder. For me, I would run > > locally when !IsUnderPostmaster and also in an Assert build, so we can > > test it works right - i.e. running in its own process is just a > > production optimization for performance (which is the stated reason > > for having this) > > I added this because 0004 involves requesting a task from the postmaster, > so checking for IsUnderPostmaster doesn't work. Those tasks would always > run immediately. However, we could use IsPostmasterEnvironment instead, > which would allow us to remove the "immediate" argument. I did it this way > in v14. Thanks > > 0005 seems good from what I know > > * There is no check to see if it worked in any sane time > > What did you have in mind? Should the custodian begin emitting WARNINGs > after a while? I think it might be useful if it logged anything that took an "extended period", TBD. 
Maybe that is already covered by startup process logging. Please tell me that still works? > > Rather than explicitly use DEBUG1 everywhere I would have an > > #define CUSTODIAN_LOG_LEVEL LOG > > so we can run with it in LOG mode and then set it to DEBUG1 with a one > > line change in a later phase of Beta > > I can create a separate patch for this, but I don't think I've ever seen > this sort of thing before. Much of recovery is coded that way, for the same reason. > Is the idea just to help with debugging during > the development phase? "Just", yes. Tests would be desirable also, under src/test/modules. -- Simon Riggs http://www.EnterpriseDB.com/
On 2022-11-28 13:08:57 +0000, Simon Riggs wrote: > On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote: > > > Rather than explicitly use DEBUG1 everywhere I would have an > > > #define CUSTODIAN_LOG_LEVEL LOG > > > so we can run with it in LOG mode and then set it to DEBUG1 with a one > > > line change in a later phase of Beta > > > > I can create a separate patch for this, but I don't think I've ever seen > > this sort of thing before. > > Much of recovery is coded that way, for the same reason. I think that's not a good thing to copy without a lot more justification than "some old code also does it that way". It's sometimes justified, but also makes code harder to read (one doesn't know what it does without looking up the #define, line length).
On Mon, Nov 28, 2022 at 1:31 PM Andres Freund <andres@anarazel.de> wrote: > On 2022-11-28 13:08:57 +0000, Simon Riggs wrote: > > On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote: > > > > Rather than explicitly use DEBUG1 everywhere I would have an > > > > #define CUSTODIAN_LOG_LEVEL LOG > > > > so we can run with it in LOG mode and then set it to DEBUG1 with a one > > > > line change in a later phase of Beta > > > > > > I can create a separate patch for this, but I don't think I've ever seen > > > this sort of thing before. > > > > Much of recovery is coded that way, for the same reason. > > I think that's not a good thing to copy without a lot more justification than > "some old code also does it that way". It's sometimes justified, but also > makes code harder to read (one doesn't know what it does without looking up > the #define, line length). Yeah. If people need some of the log messages at a higher level during development, they can patch their own copies. I think there might be some argument for having a facility that lets you pick subsystems or even individual messages that you want to trace and pump up the log level for just those call sites. But I don't know exactly what that would look like, and I don't think inventing one-off mechanisms for particular cases is a good idea. -- Robert Haas EDB: http://www.enterprisedb.com
Okay, here is a new patch set. 0004 adds logic to prevent custodian tasks from delaying shutdown. I haven't added any logging for long-running tasks yet. Tasks might ordinarily take a while, so such logs wouldn't necessarily indicate something is wrong. Perhaps we could add a GUC for the amount of time to wait before logging. This feature would be off by default. Another option could be to create a log_custodian GUC that causes tasks to be logged when completed, similar to log_checkpoints. Thoughts? On Mon, Nov 28, 2022 at 01:37:01PM -0500, Robert Haas wrote: > On Mon, Nov 28, 2022 at 1:31 PM Andres Freund <andres@anarazel.de> wrote: >> On 2022-11-28 13:08:57 +0000, Simon Riggs wrote: >> > On Sun, 27 Nov 2022 at 23:34, Nathan Bossart <nathandbossart@gmail.com> wrote: >> > > > Rather than explicitly use DEBUG1 everywhere I would have an >> > > > #define CUSTODIAN_LOG_LEVEL LOG >> > > > so we can run with it in LOG mode and then set it to DEBUG1 with a one >> > > > line change in a later phase of Beta >> > > >> > > I can create a separate patch for this, but I don't think I've ever seen >> > > this sort of thing before. >> > >> > Much of recovery is coded that way, for the same reason. >> >> I think that's not a good thing to copy without a lot more justification than >> "some old code also does it that way". It's sometimes justified, but also >> makes code harder to read (one doesn't know what it does without looking up >> the #define, line length). > > Yeah. If people need some of the log messages at a higher level during > development, they can patch their own copies. > > I think there might be some argument for having a facility that lets > you pick subsystems or even individual messages that you want to trace > and pump up the log level for just those call sites. But I don't know > exactly what that would look like, and I don't think inventing one-off > mechanisms for particular cases is a good idea. 
Given this discussion, I haven't made any changes to the logging in the new patch set. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, 28 Nov 2022 at 23:40, Nathan Bossart <nathandbossart@gmail.com> wrote: > > Okay, here is a new patch set. 0004 adds logic to prevent custodian tasks > from delaying shutdown. That all seems good, thanks. The last important point for me is tests, in src/test/modules probably. It might be possible to reuse the final state of other modules' tests to test cleanup, or at least integrate a custodian test into each module. -- Simon Riggs http://www.EnterpriseDB.com/
On Tue, Nov 29, 2022 at 12:02:44PM +0000, Simon Riggs wrote: > The last important point for me is tests, in src/test/modules > probably. It might be possible to reuse the final state of other > modules' tests to test cleanup, or at least integrate a custodian test > into each module. Of course. I found some existing tests for the test_decoding plugin that appear to reliably generate the files we want the custodian to clean up, so I added them there. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Tue, Nov 29, 2022 at 07:56:53PM -0800, Nathan Bossart wrote: > On Tue, Nov 29, 2022 at 12:02:44PM +0000, Simon Riggs wrote: >> The last important point for me is tests, in src/test/modules >> probably. It might be possible to reuse the final state of other >> modules' tests to test cleanup, or at least integrate a custodian test >> into each module. > > Of course. I found some existing tests for the test_decoding plugin that > appear to reliably generate the files we want the custodian to clean up, so > I added them there. cfbot is not happy with v16. AFAICT this is just due to poor placement, so here is another attempt with the tests moved to a new location. Apologies for the noise. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, 30 Nov 2022 at 03:56, Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Tue, Nov 29, 2022 at 12:02:44PM +0000, Simon Riggs wrote: > > The last important point for me is tests, in src/test/modules > > probably. It might be possible to reuse the final state of other > > modules' tests to test cleanup, or at least integrate a custodian test > > into each module. > > Of course. I found some existing tests for the test_decoding plugin that > appear to reliably generate the files we want the custodian to clean up, so > I added them there. Thanks for adding the tests; I can see they run clean. The only minor thing I would personally add is a note in each piece of code to explain where the tests are for each one and/or something in the main custodian file that says tests exist within src/test/module. Otherwise, ready for committer. -- Simon Riggs http://www.EnterpriseDB.com/
On Wed, Nov 30, 2022 at 10:48 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> cfbot is not happy with v16. AFAICT this is just due to poor placement, so
> here is another attempt with the tests moved to a new location. Apologies
> for the noise.

Thanks for the patches. I spent some time reviewing the v17 patch set and here are my comments:

0001:
1. I think the custodian process needs documentation - it needs a definition in glossary.sgml and perhaps a dedicated page describing what tasks it takes care of.

2.
+	LWLockReleaseAll();
+	ConditionVariableCancelSleep();
+	AbortBufferIO();
+	UnlockBuffers();
+	ReleaseAuxProcessResources(false);
+	AtEOXact_Buffers(false);
+	AtEOXact_SMgr();
+	AtEOXact_Files(false);
+	AtEOXact_HashTables(false);
Do we need all of these in the exit path? Isn't the stuff that ShutdownAuxiliaryProcess() does enough for the custodian process? AFAICS, the custodian process uses LWLocks (which ShutdownAuxiliaryProcess() takes care of) and it doesn't access shared buffers and so on. Having said that, I'm fine to keep them for future use, and all of those cleanup functions exit early if nothing related occurs.

3.
+ * Advertise out latch that backends can use to wake us up while we're
Typo - %s/out/our

4. Is it a good idea to add log messages in the DoCustodianTasks() loop? Maybe at a debug level? The log message can say the current task the custodian is processing. And/Or setting the custodian's status on the ps display is also a good idea IMO.

0002 and 0003:
1.
+CHECKPOINT;
+DO $$
I think we need to ensure that there are some snapshot files before the checkpoint. Otherwise, it may happen that the above test case exits without the custodian process doing anything.

2. I think the best way to test the custodian process code is by adding a TAP test module to see if the custodian process actually kicks in. Perhaps, add elog(DEBUGX,...) messages to various custodian process functions and see if we see the logs in the server logs.
0004: I think the 0004 patch can be merged into 0001, 0002 and 0003 patches. Otherwise the patch LGTM. Few thoughts: 1. I think we can trivially extend the custodian process to remove any future WAL files on the old timeline, something like the attached 0001-Move-removal-of-future-WAL-files-on-the-old-timeline.text file). While this offloads the recovery a bit, the server may archive such WAL files before the custodian removes them. We can do a bit more to stop the server from archiving such WAL files, but that needs more coding. I don't think we need to do all that now, perhaps, we can give it a try once the basic custodian stuff gets in. 2. Moving RemovePgTempFiles() to the custodian can bring up the server soon. The idea is that the postmaster just renames the temp directories and informs the custodian so that it can go delete such temp files and directories. I have personally seen cases where the server spent a good amount of time cleaning up temp files. We can park it for later. 3. Moving RemoveOldXlogFiles() to the custodian can make checkpoints faster. 4. PreallocXlogFiles() - if we ever have plans to make pre-allocation more aggressive (pre-allocate more than 1 WAL file), perhaps letting custodian do that is a good idea. Again, too many tasks for a single process. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Nov 30, 2022 at 4:52 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Nov 30, 2022 at 10:48 AM Nathan Bossart > <nathandbossart@gmail.com> wrote: > > > > > > cfbot is not happy with v16. AFAICT this is just due to poor placement, so > > here is another attempt with the tests moved to a new location. Apologies > > for the noise. > > Thanks for the patches. I spent some time on reviewing v17 patch set > and here are my comments: > > 0001: > 1. I think the custodian process needs documentation - it needs a > definition in glossary.sgml and perhaps a dedicated page describing > what tasks it takes care of. > > 2. > + LWLockReleaseAll(); > + ConditionVariableCancelSleep(); > + AbortBufferIO(); > + UnlockBuffers(); > + ReleaseAuxProcessResources(false); > + AtEOXact_Buffers(false); > + AtEOXact_SMgr(); > + AtEOXact_Files(false); > + AtEOXact_HashTables(false); > Do we need all of these in the exit path? Isn't the stuff that > ShutdownAuxiliaryProcess() does enough for the custodian process? > AFAICS, the custodian process uses LWLocks (which the > ShutdownAuxiliaryProcess() takes care of) and it doesn't access shared > buffers and so on. > Having said that, I'm fine to keep them for future use and all of > those cleanup functions exit if nothing related occurs. > > 3. > + * Advertise out latch that backends can use to wake us up while we're > Typo - %s/out/our > > 4. Is it a good idea to add log messages in the DoCustodianTasks() > loop? Maybe at a debug level? The log message can say the current task > the custodian is processing. And/Or setting the custodian's status on > the ps display is also a good idea IMO. > > 0002 and 0003: > 1. > +CHECKPOINT; > +DO $$ > I think we need to ensure that there are some snapshot files before > the checkpoint. Otherwise, it may happen that the above test case > exits without the custodian process doing anything. > > 2. 
I think the best way to test the custodian process code is by > adding a TAP test module to see actually the custodian process kicks > in. Perhaps, add elog(DEBUGX,...) messages to various custodian > process functions and see if we see the logs in server logs. > > 0004: > I think the 0004 patch can be merged into 0001, 0002 and 0003 patches. > Otherwise the patch LGTM. > > Few thoughts: > 1. I think we can trivially extend the custodian process to remove any > future WAL files on the old timeline, something like the attached > 0001-Move-removal-of-future-WAL-files-on-the-old-timeline.text file). > While this offloads the recovery a bit, the server may archive such > WAL files before the custodian removes them. We can do a bit more to > stop the server from archiving such WAL files, but that needs more > coding. I don't think we need to do all that now, perhaps, we can give > it a try once the basic custodian stuff gets in. > 2. Moving RemovePgTempFiles() to the custodian can bring up the server > soon. The idea is that the postmaster just renames the temp > directories and informs the custodian so that it can go delete such > temp files and directories. I have personally seen cases where the > server spent a good amount of time cleaning up temp files. We can park > it for later. > 3. Moving RemoveOldXlogFiles() to the custodian can make checkpoints faster. > 4. PreallocXlogFiles() - if we ever have plans to make pre-allocation > more aggressive (pre-allocate more than 1 WAL file), perhaps letting > custodian do that is a good idea. Again, too many tasks for a single > process. Another comment: IIUC, there's no custodian_delay GUC as we want to avoid unnecessary wakeups for power savings (being discussed in the other thread). However, can it happen that the custodian missed to capture SetLatch wakeups by other backends? In other words, can the custodian process be sleeping when there's work to do? 
-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Nov 30, 2022 at 05:27:10PM +0530, Bharath Rupireddy wrote: > On Wed, Nov 30, 2022 at 4:52 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: >> Thanks for the patches. I spent some time on reviewing v17 patch set >> and here are my comments: Thanks for reviewing! >> 0001: >> 1. I think the custodian process needs documentation - it needs a >> definition in glossary.sgml and perhaps a dedicated page describing >> what tasks it takes care of. Good catch. I added this in v18. I stopped short of adding a dedicated page to describe the tasks because 1) there are no parameters for the custodian and 2) AFAICT none of its tasks are described in the docs today. >> 2. >> + LWLockReleaseAll(); >> + ConditionVariableCancelSleep(); >> + AbortBufferIO(); >> + UnlockBuffers(); >> + ReleaseAuxProcessResources(false); >> + AtEOXact_Buffers(false); >> + AtEOXact_SMgr(); >> + AtEOXact_Files(false); >> + AtEOXact_HashTables(false); >> Do we need all of these in the exit path? Isn't the stuff that >> ShutdownAuxiliaryProcess() does enough for the custodian process? >> AFAICS, the custodian process uses LWLocks (which the >> ShutdownAuxiliaryProcess() takes care of) and it doesn't access shared >> buffers and so on. >> Having said that, I'm fine to keep them for future use and all of >> those cleanup functions exit if nothing related occurs. Yeah, I don't think we need a few of these. In v18, I've kept the following: * LWLockReleaseAll() * ConditionVariableCancelSleep() * ReleaseAuxProcessResources(false) * AtEOXact_Files(false) >> 3. >> + * Advertise out latch that backends can use to wake us up while we're >> Typo - %s/out/our fixed >> 4. Is it a good idea to add log messages in the DoCustodianTasks() >> loop? Maybe at a debug level? The log message can say the current task >> the custodian is processing. And/Or setting the custodian's status on >> the ps display is also a good idea IMO. 
I'd like to pick these up in a new thread if/when this initial patch set is committed. The tasks already do some logging, and the checkpointer process doesn't update the ps display for these tasks today. >> 0002 and 0003: >> 1. >> +CHECKPOINT; >> +DO $$ >> I think we need to ensure that there are some snapshot files before >> the checkpoint. Otherwise, it may happen that the above test case >> exits without the custodian process doing anything. >> >> 2. I think the best way to test the custodian process code is by >> adding a TAP test module to see actually the custodian process kicks >> in. Perhaps, add elog(DEBUGX,...) messages to various custodian >> process functions and see if we see the logs in server logs. The test appears to reliably create snapshot and mapping files, so if the directories are empty at some point after the checkpoint at the end, we can be reasonably certain the custodian took action. I didn't add explicit checks that there are files in the directories before the checkpoint because a concurrent checkpoint could make such checks unreliable. >> 0004: >> I think the 0004 patch can be merged into 0001, 0002 and 0003 patches. >> Otherwise the patch LGTM. I'm keeping this one separate because I've received conflicting feedback about the idea. >> 1. I think we can trivially extend the custodian process to remove any >> future WAL files on the old timeline, something like the attached >> 0001-Move-removal-of-future-WAL-files-on-the-old-timeline.text file). >> While this offloads the recovery a bit, the server may archive such >> WAL files before the custodian removes them. We can do a bit more to >> stop the server from archiving such WAL files, but that needs more >> coding. I don't think we need to do all that now, perhaps, we can give >> it a try once the basic custodian stuff gets in. >> 2. Moving RemovePgTempFiles() to the custodian can bring up the server >> soon. 
The idea is that the postmaster just renames the temp >> directories and informs the custodian so that it can go delete such >> temp files and directories. I have personally seen cases where the >> server spent a good amount of time cleaning up temp files. We can park >> it for later. >> 3. Moving RemoveOldXlogFiles() to the custodian can make checkpoints faster. >> 4. PreallocXlogFiles() - if we ever have plans to make pre-allocation >> more aggressive (pre-allocate more than 1 WAL file), perhaps letting >> custodian do that is a good idea. Again, too many tasks for a single >> process. I definitely want to do #2. I have some patches for that upthread, but I removed them for now based on Simon's feedback. I intend to pick that up in a new thread. I haven't thought too much about the others yet. > Another comment: > IIUC, there's no custodian_delay GUC as we want to avoid unnecessary > wakeups for power savings (being discussed in the other thread). > However, can it happen that the custodian missed to capture SetLatch > wakeups by other backends? In other words, can the custodian process > be sleeping when there's work to do? I'm not aware of any way this could happen, but if there is one, I think we should treat it as a bug instead of relying on the custodian process to periodically wake up and check for work to do. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > >> 4. Is it a good idea to add log messages in the DoCustodianTasks() > >> loop? Maybe at a debug level? The log message can say the current task > >> the custodian is processing. And/Or setting the custodian's status on > >> the ps display is also a good idea IMO. > > I'd like to pick these up in a new thread if/when this initial patch set is > committed. The tasks already do some logging, and the checkpointer process > doesn't update the ps display for these tasks today. It'll be good to have some kind of dedicated monitoring for the custodian process as it can do a "good" amount of work at times and users will have a way to know what it currently is doing - it can be logs at debug level, progress reporting via ereport_startup_progress()-sort of mechanism, ps display, pg_stat_custodian or a special function that tells some details, or some other. In any case, I agree to park this for later. > >> 0002 and 0003: > >> 1. > >> +CHECKPOINT; > >> +DO $$ > >> I think we need to ensure that there are some snapshot files before > >> the checkpoint. Otherwise, it may happen that the above test case > >> exits without the custodian process doing anything. > >> > >> 2. I think the best way to test the custodian process code is by > >> adding a TAP test module to see actually the custodian process kicks > >> in. Perhaps, add elog(DEBUGX,...) messages to various custodian > >> process functions and see if we see the logs in server logs. > > The test appears to reliably create snapshot and mapping files, so if the > directories are empty at some point after the checkpoint at the end, we can > be reasonably certain the custodian took action. I didn't add explicit > checks that there are files in the directories before the checkpoint > because a concurrent checkpoint could make such checks unreliable. I think you're right. 
I added sqls to see if the snapshot and mapping files count > 0, see [1] and the cirrus-ci members are happy too - https://github.com/BRupireddy/postgres/tree/custodian_review_2. I think we can consider adding these count > 0 checks to tests. > >> 0004: > >> I think the 0004 patch can be merged into 0001, 0002 and 0003 patches. > >> Otherwise the patch LGTM. > > I'm keeping this one separate because I've received conflicting feedback > about the idea. If we classify custodian as a process doing non-critical tasks that have nothing to do with regular server functioning, then processing ShutdownRequestPending looks okay. However, delaying these non-critical tasks such as file removals which reclaims disk space might impact the server overall especially when it's reaching 100% disk usage and we want the custodian to do its job fully before we shutdown the server. If we delay processing shutdown requests, that can impact the server overall (might delay restarts, failovers etc.), because at times there can be a lot of tasks with a good amount of work pending in the custodian's task queue. Having said above, I'm okay to process ShutdownRequestPending as early as possible, however, should we also add CHECK_FOR_INTERRUPTS() alongside ShutdownRequestPending? Also, I think it's enough to just have ShutdownRequestPending check in DoCustodianTasks(void)'s main loop and we can let RemoveOldSerializedSnapshots() and RemoveOldLogicalRewriteMappings() do their jobs to the fullest as they do today. While thinking about this, one thing that really struck me is what happens if we let the custodian exit, say after processing ShutdownRequestPending immediately or after a restart, leaving other queued tasks? The custodian will never get to work on those tasks unless the requestors (say checkpoint or some other process) requests it to do so after restart. Maybe, we don't need to worry about it. Maybe we need to worry about it. 
Maybe it's an overkill to save the custodian's task state to disk so that it can come up and do the leftover tasks upon restart. > > Another comment: > > IIUC, there's no custodian_delay GUC as we want to avoid unnecessary > > wakeups for power savings (being discussed in the other thread). > > However, can it happen that the custodian missed to capture SetLatch > > wakeups by other backends? In other words, can the custodian process > > be sleeping when there's work to do? > > I'm not aware of any way this could happen, but if there is one, I think we > should treat it as a bug instead of relying on the custodian process to > periodically wake up and check for work to do. One possible scenario is that the requestor adds its task details to the queue and sets the latch, but the custodian is in the midst of processing a task and misses this SetLatch(). However, it is guaranteed to the requester that the added task will be processed after the current task completes. I don't know of other ways the custodian could miss SetLatch().

[1]
diff --git a/contrib/test_decoding/expected/rewrite.out b/contrib/test_decoding/expected/rewrite.out
index 214a514a0a..0029e48852 100644
--- a/contrib/test_decoding/expected/rewrite.out
+++ b/contrib/test_decoding/expected/rewrite.out
@@ -163,6 +163,20 @@ DROP FUNCTION iamalongfunction();
 DROP FUNCTION exec(text);
 DROP ROLE regress_justforcomments;
 -- make sure custodian cleans up files
+-- make sure snapshot files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalsnapdir();
+ ?column?
+----------
+ t
+(1 row)
+
+-- make sure rewrite mapping files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalmapdir();
+ ?column?
+----------
+ t
+(1 row)
+
 CHECKPOINT;
 DO $$
 DECLARE

diff --git a/contrib/test_decoding/sql/rewrite.sql b/contrib/test_decoding/sql/rewrite.sql
index d66f70f837..c076809f37 100644
--- a/contrib/test_decoding/sql/rewrite.sql
+++ b/contrib/test_decoding/sql/rewrite.sql
@@ -107,6 +107,13 @@ DROP FUNCTION exec(text);
 DROP ROLE regress_justforcomments;
 -- make sure custodian cleans up files
+
+-- make sure snapshot files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalsnapdir();
+
+-- make sure rewrite mapping files exist for custodian to clean up
+SELECT count(*) > 0 FROM pg_ls_logicalmapdir();
+
 CHECKPOINT;
 DO $$
 DECLARE

-- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote: > On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote: >> The test appears to reliably create snapshot and mapping files, so if the >> directories are empty at some point after the checkpoint at the end, we can >> be reasonably certain the custodian took action. I didn't add explicit >> checks that there are files in the directories before the checkpoint >> because a concurrent checkpoint could make such checks unreliable. > > I think you're right. I added sqls to see if the snapshot and mapping > files count > 0, see [1] and the cirrus-ci members are happy too - > https://github.com/BRupireddy/postgres/tree/custodian_review_2. I > think we can consider adding these count > 0 checks to tests. My worry about adding "count > 0" checks is that a concurrent checkpoint could make them unreliable. In other words, those checks might ordinarily work, but if an automatic checkpoint causes the files to be cleaned up just beforehand, they will fail. > Having said above, I'm okay to process ShutdownRequestPending as early > as possible, however, should we also add CHECK_FOR_INTERRUPTS() > alongside ShutdownRequestPending? I'm not seeing a need for CHECK_FOR_INTERRUPTS. Do you see one? > While thinking about this, one thing that really struck me is what > happens if we let the custodian exit, say after processing > ShutdownRequestPending immediately or after a restart, leaving other > queued tasks? The custodian will never get to work on those tasks > unless the requestors (say checkpoint or some other process) requests > it to do so after restart. Maybe, we don't need to worry about it. > Maybe we need to worry about it. Yes, tasks will need to be retried when the server starts again.
The ones in this patch set should be requested again during the next checkpoint. Temporary file cleanup would always be requested during server start, so that should be handled as well. Even today, the server might abruptly shut down while executing these tasks, and we don't have any special handling for that. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Sat, Dec 3, 2022 at 12:45 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Fri, Dec 02, 2022 at 12:11:35PM +0530, Bharath Rupireddy wrote: > > On Fri, Dec 2, 2022 at 3:10 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > >> The test appears to reliably create snapshot and mapping files, so if the > >> directories are empty at some point after the checkpoint at the end, we can > >> be reasonably certain the custodian took action. I didn't add explicit > >> checks that there are files in the directories before the checkpoint > >> because a concurrent checkpoint could make such checks unreliable. > > > > I think you're right. I added sqls to see if the snapshot and mapping > > files count > 0, see [1] and the cirrus-ci members are happy too - > > https://github.com/BRupireddy/postgres/tree/custodian_review_2. I > > think we can consider adding these count > 0 checks to tests. > > My worry about adding "count > 0" checks is that a concurrent checkpoint > could make them unreliable. In other words, those checks might ordinarily > work, but if an automatic checkpoint causes the files be cleaned up just > beforehand, they will fail. Hm. It would have been better with a TAP test module for testing the custodian code reliably. Anyway, that mustn't stop the patch getting in. If required, we can park the TAP test module for later - IMO. Others may have different thoughts here. > > Having said above, I'm okay to process ShutdownRequestPending as early > > as possible, however, should we also add CHECK_FOR_INTERRUPTS() > > alongside ShutdownRequestPending? > > I'm not seeing a need for CHECK_FOR_INTERRUPTS. Do you see one? Since the custodian has SignalHandlerForShutdownRequest as SIGINT and SIGTERM handlers, unlike StatementCancelHandler and die respectively, no need of CFI I guess. And also none of the CFI signal handler flags applies to the custodian. 
> > While thinking about this, one thing that really struck me is what > > happens if we let the custodian exit, say after processing > > ShutdownRequestPending immediately or after a restart, leaving other > > queued tasks? The custodian will never get to work on those tasks > > unless the requestors (say checkpoint or some other process) requests > > it to do so after restart. Maybe, we don't need to worry about it. > > Maybe we need to worry about it. Maybe it's an overkill to save the > > custodian's task state to disk so that it can come up and do the > > leftover tasks upon restart. > > Yes, tasks will need to be retried when the server starts again. The ones > in this patch set should be requested again during the next checkpoint. > Temporary file cleanup would always be requested during server start, so > that should be handled as well. Even today, the server might abruptly shut > down while executing these tasks, and we don't have any special handling > for that. Right. The v18 patch set posted upthread https://www.postgresql.org/message-id/20221201214026.GA1799688%40nathanxps13 looks good to me. I see the CF entry is marked RfC - https://commitfest.postgresql.org/41/3448/. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
rebased for cfbot -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
On Thu, Feb 02, 2023 at 09:48:08PM -0800, Nathan Bossart wrote: > rebased for cfbot another rebase for cfbot -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Attachment
Nathan Bossart <nathandbossart@gmail.com> writes: > another rebase for cfbot I took a brief look through v20, and generally liked what I saw, but there are a few things troubling me: * The comments for CustodianEnqueueTask claim that it won't enqueue an already-queued task, but I don't think I believe that, because it stops scanning as soon as it finds an empty slot. That data structure seems quite oddly designed in any case. Why isn't it simply an array of need-to-run-this-one booleans indexed by the CustodianTask enum? Fairness of dispatch could be ensured by the same state variable that CustodianGetNextTask already uses to track which array element to inspect next. While that wouldn't guarantee that tasks A and B are dispatched in the same order they were requested in, I'm not sure why we should care. * I don't much like cust_lck, mainly because you didn't bother to document what it protects (in general, CustodianShmemStruct deserves more than zero commentary). Do we need it at all? If the task-needed flags were sig_atomic_t not bool, we probably don't need it for the basic job of tracking which tasks remain to be run. I see that some of the tasks have possibly-non-atomically-assigned parameters to be transmitted, but restricting cust_lck to protect those seems like a better idea. * Not quite convinced about handle_arg_func, mainly because the Datum API would be pretty inconvenient for any task with more than one arg. Why do we need that at all, rather than saying that callers should set up any required parameters separately before invoking RequestCustodian? * Why does LookupCustodianFunctions think it needs to search the constant array? * The original proposal included moving RemovePgTempFiles into this mechanism, which I thought was probably the most useful bit of the whole thing. I'm sad to see that gone, what became of it? regards, tom lane
Hi, On 2023-04-02 13:40:05 -0400, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: > > another rebase for cfbot > > I took a brief look through v20, and generally liked what I saw, > but there are a few things troubling me: Just want to note that I've repeatedly objected to 0002 and 0003, i.e. moving serialized logical decoding snapshots and mapping files, to custodian, and still do. Without further work it increases wraparound risks (the filenames contain xids), and afaict nothing has been done to ameliorate that. Without those, the current patch series does not have any tasks: > * The original proposal included moving RemovePgTempFiles into this > mechanism, which I thought was probably the most useful bit of the > whole thing. I'm sad to see that gone, what became of it? Greetings, Andres Freund
On Sun, Apr 02, 2023 at 01:40:05PM -0400, Tom Lane wrote: > I took a brief look through v20, and generally liked what I saw, > but there are a few things troubling me: Thanks for taking a look. > * The comments for CustodianEnqueueTask claim that it won't enqueue an > already-queued task, but I don't think I believe that, because it stops > scanning as soon as it finds an empty slot. That data structure seems > quite oddly designed in any case. Why isn't it simply an array of > need-to-run-this-one booleans indexed by the CustodianTask enum? > Fairness of dispatch could be ensured by the same state variable that > CustodianGetNextTask already uses to track which array element to > inspect next. While that wouldn't guarantee that tasks A and B are > dispatched in the same order they were requested in, I'm not sure why > we should care. That works. Will update. > * I don't much like cust_lck, mainly because you didn't bother to > document what it protects (in general, CustodianShmemStruct deserves > more than zero commentary). Do we need it at all? If the task-needed > flags were sig_atomic_t not bool, we probably don't need it for the > basic job of tracking which tasks remain to be run. I see that some > of the tasks have possibly-non-atomically-assigned parameters to be > transmitted, but restricting cust_lck to protect those seems like a > better idea. Will do. > * Not quite convinced about handle_arg_func, mainly because the Datum > API would be pretty inconvenient for any task with more than one arg. > Why do we need that at all, rather than saying that callers should > set up any required parameters separately before invoking > RequestCustodian? I had done it this way earlier, but added the Datum argument based on feedback upthread [0]. It presently has only one proposed use, anyway, so I think it would be fine to switch it back for now. > * Why does LookupCustodianFunctions think it needs to search the > constant array? 
The order of the tasks in the array isn't guaranteed to match the order in the CustodianTask enum. > * The original proposal included moving RemovePgTempFiles into this > mechanism, which I thought was probably the most useful bit of the > whole thing. I'm sad to see that gone, what became of it? I postponed that based on advice from upthread [1]. I was hoping to start a dedicated thread for that immediately after the custodian infrastructure was committed. FWIW I agree that it's the most useful task of what's proposed thus far. [0] https://postgr.es/m/20220703172732.wembjsb55xl63vuw%40awork3.anarazel.de [1] https://postgr.es/m/CANbhV-EagKLoUH7tLEfg__VcLu37LY78F8gvLMzHrRZyZKm6sw%40mail.gmail.com -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Sun, Apr 02, 2023 at 11:42:26AM -0700, Andres Freund wrote: > Just want to note that I've repeatedly objected to 0002 and 0003, i.e. moving > serialized logical decoding snapshots and mapping files, to custodian, and > still do. Without further work it increases wraparound risks (the filenames > contain xids), and afaict nothing has been done to ameliorate that. From your feedback earlier [0], I was under the (perhaps false) impression that adding a note about this existing issue in the commit message was sufficient, at least initially. I did add such a note in 0003, but it's missing from 0002 for some reason. I suspect I left it out because the serialized snapshot file names do not contain XIDs. You cleared that up earlier [1], so this is my bad. It's been a little while since I dug into this, but I do see your point that the wraparound risk could be higher in some cases. For example, if you have a billion temp files to clean up, the custodian could be stuck on that task for a long time. I will give this some further thought. I'm all ears if anyone has ideas about how to reduce this risk. [0] https://postgr.es/m/20220702225456.zit5kjdtdfqmjujt%40alap3.anarazel.de [1] https://postgr.es/m/20220217065938.x2esfdppzypegn5j%40alap3.anarazel.de -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Nathan Bossart <nathandbossart@gmail.com> writes: > On Sun, Apr 02, 2023 at 01:40:05PM -0400, Tom Lane wrote: >> * Why does LookupCustodianFunctions think it needs to search the >> constant array? > The order of the tasks in the array isn't guaranteed to match the order in > the CustodianTask enum. Why not? It's a constant array, we can surely manage to make its order match the enum. >> * The original proposal included moving RemovePgTempFiles into this >> mechanism, which I thought was probably the most useful bit of the >> whole thing. I'm sad to see that gone, what became of it? > I postponed that based on advice from upthread [1]. I was hoping to start > a dedicated thread for that immediately after the custodian infrastructure > was committed. FWIW I agree that it's the most useful task of what's > proposed thus far. Hmm, given Andres' objections there's little point in moving forward without that task. regards, tom lane
Nathan Bossart <nathandbossart@gmail.com> writes: > It's been a little while since I dug into this, but I do see your point > that the wraparound risk could be higher in some cases. For example, if > you have a billion temp files to clean up, the custodian could be stuck on > that task for a long time. I will give this some further thought. I'm all > ears if anyone has ideas about how to reduce this risk. I wonder if a single long-lived custodian task is the right model at all. At least for RemovePgTempFiles, it'd make more sense to write it as a background worker that spawns, does its work, and then exits, independently of anything else. Of course, then you need some mechanism for ensuring that a bgworker slot is available when needed, but that doesn't seem horridly difficult --- we could have a few "reserved bgworker" slots, perhaps. An idle bgworker slot doesn't cost much. regards, tom lane
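A one-shot worker along these lines could use the existing dynamic background worker API. A rough sketch only — the entry point RemovePgTempFilesMain, the calling context, and the fallback behavior are hypothetical, not code from the patch:

```c
/* Hypothetical launcher for a spawn-work-exit temp-file cleaner,
 * called from a backend connected to shared memory. */
static void
LaunchTempFileCleaner(void)
{
    BackgroundWorker worker;
    BackgroundWorkerHandle *handle;

    memset(&worker, 0, sizeof(worker));
    snprintf(worker.bgw_name, BGW_MAXLEN, "temp file cleaner");
    snprintf(worker.bgw_type, BGW_MAXLEN, "temp file cleaner");
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    worker.bgw_start_time = BgWorkerStart_PostmasterStart;
    worker.bgw_restart_time = BGW_NEVER_RESTART;    /* one-shot: exit when done */
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "postgres");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "RemovePgTempFilesMain");
    worker.bgw_notify_pid = 0;

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        ereport(LOG,
                (errmsg("no background worker slot available for temp file cleanup")));
}
```

The "reserved slots" idea would address the failure branch here: with a few bgworker slots set aside for such maintenance tasks, the registration could not fail merely because user-defined workers have consumed max_worker_processes.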
On Sun, Apr 02, 2023 at 04:23:05PM -0400, Tom Lane wrote: > Nathan Bossart <nathandbossart@gmail.com> writes: >> On Sun, Apr 02, 2023 at 01:40:05PM -0400, Tom Lane wrote: >>> * Why does LookupCustodianFunctions think it needs to search the >>> constant array? > >> The order of the tasks in the array isn't guaranteed to match the order in >> the CustodianTask enum. > > Why not? It's a constant array, we can surely manage to make its > order match the enum. Alright. I'll change this. >>> * The original proposal included moving RemovePgTempFiles into this >>> mechanism, which I thought was probably the most useful bit of the >>> whole thing. I'm sad to see that gone, what became of it? > >> I postponed that based on advice from upthread [1]. I was hoping to start >> a dedicated thread for that immediately after the custodian infrastructure >> was committed. FWIW I agree that it's the most useful task of what's >> proposed thus far. > > Hmm, given Andres' objections there's little point in moving forward > without that task. Yeah. I should probably tackle that one first and leave the logical tasks for later, given there is some prerequisite work required. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Sun, Apr 02, 2023 at 04:37:38PM -0400, Tom Lane wrote:
> Nathan Bossart <nathandbossart@gmail.com> writes:
>> It's been a little while since I dug into this, but I do see your point
>> that the wraparound risk could be higher in some cases.  For example, if
>> you have a billion temp files to clean up, the custodian could be stuck on
>> that task for a long time.  I will give this some further thought.  I'm all
>> ears if anyone has ideas about how to reduce this risk.
>
> I wonder if a single long-lived custodian task is the right model at all.
> At least for RemovePgTempFiles, it'd make more sense to write it as a
> background worker that spawns, does its work, and then exits,
> independently of anything else.  Of course, then you need some mechanism
> for ensuring that a bgworker slot is available when needed, but that
> doesn't seem horridly difficult --- we could have a few "reserved
> bgworker" slots, perhaps.  An idle bgworker slot doesn't cost much.

This has crossed my mind.  Even if we use the custodian for several
different tasks, perhaps it could shut down while not in use.  For many
servers, the custodian process will be used sparingly, if at all.  And if
we introduce something like custodian_max_workers, perhaps we could dodge
the wraparound issue a bit by setting the default to the number of
supported tasks.  That being said, this approach adds some complexity.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
I sent this one to the next commitfest and marked it as waiting-on-author
and targeted for v17.  I'm aiming to have something that addresses the
latest feedback ready for the July commitfest.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
> On 4 Apr 2023, at 05:36, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> I sent this one to the next commitfest and marked it as waiting-on-author
> and targeted for v17.  I'm aiming to have something that addresses the
> latest feedback ready for the July commitfest.

Have you had a chance to look at this such that there is something ready?

-- 
Daniel Gustafsson
On Tue, Jul 04, 2023 at 09:30:43AM +0200, Daniel Gustafsson wrote:
>> On 4 Apr 2023, at 05:36, Nathan Bossart <nathandbossart@gmail.com> wrote:
>>
>> I sent this one to the next commitfest and marked it as waiting-on-author
>> and targeted for v17.  I'm aiming to have something that addresses the
>> latest feedback ready for the July commitfest.
>
> Have you had a chance to look at this such that there is something ready?

Not yet, sorry.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com