Thread: Race between KeepFileRestoredFromArchive() and restartpoint
A race with KeepFileRestoredFromArchive() can cause a restartpoint to fail,
as seen once on the buildfarm[1]. The attached patch adds a test case; it
applies atop the "stop events" patch[2].

We have two systems for adding long-term pg_wal directory entries.
KeepFileRestoredFromArchive() adds them during archive recovery, while
InstallXLogFileSegment() does so at all times. Unfortunately,
InstallXLogFileSegment() happens throughout archive recovery, via the
checkpointer recycling segments and calling PreallocXlogFiles(). Multiple
processes can run InstallXLogFileSegment(), which uses ControlFileLock to
represent the authority to modify the directory listing of pg_wal.
KeepFileRestoredFromArchive() just assumes it controls pg_wal.

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Notable alternatives:

- Release ControlFileLock at the end of XLogFileInit(), not at the end of
  InstallXLogFileSegment(). Add ControlFileLock acquisition to
  KeepFileRestoredFromArchive(). This provides adequate mutual exclusion, but
  XLogFileInit() could return a file descriptor for an unlinked file. That's
  fine for PreallocXlogFiles(), but it feels wrong.

- During restartpoints, never preallocate or recycle segments. (Just delete
  obsolete WAL.) By denying those benefits, this presumably makes streaming
  recovery less efficient.

- Make KeepFileRestoredFromArchive() call XLogFileInit() to open a segment,
  then copy bytes. This is simple, but it multiplies I/O. That might be
  negligible on account of caching, or it might not be. A variant, incurring
  extra fsyncs, would be to use durable_rename() to replace the segment we
  get from XLogFileInit().

- Make KeepFileRestoredFromArchive() rename without first unlinking. This
  avoids checkpoint failure, but a race could trigger noise from the LOG
  message in InstallXLogFileSegment -> durable_rename_excl.

Does anyone prefer some alternative? It's probably not worth back-patching
anything for a restartpoint failure this rare, because most restartpoint
outcomes are not user-visible.

Thanks,
nm

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-10-05%2023%3A02%3A17
[2] https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com
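For concreteness, the flag-based regime switch proposed above could look
roughly like the sketch below. The struct, field, and function names here are
placeholders chosen for illustration; they are not the attached patch.

#include <stdbool.h>
#include <unistd.h>

/*
 * Illustrative sketch only -- the names are placeholders, not the attached
 * patch.  A shared flag records which regime currently owns the right to add
 * long-term pg_wal directory entries.
 */
typedef struct XLogCtlSketch
{
    /* true when InstallXLogFileSegment() may add long-term pg_wal entries */
    bool        install_segment_active;
    /* ... the real XLogCtl has many more fields ... */
} XLogCtlSketch;

/*
 * Checkpointer side: while archive recovery owns pg_wal, skip preallocation
 * and behave like wal_recycle=off for old segments.
 */
static void
retire_old_segment_sketch(XLogCtlSketch *ctl, const char *path)
{
    if (!ctl->install_segment_active)
    {
        unlink(path);           /* archive recovery regime: just delete it */
        return;
    }
    /* normal regime: recycle the segment under ControlFileLock as today */
}

/*
 * Startup-process side: once archive recovery ends, hand ownership of
 * pg_wal back to InstallXLogFileSegment().
 */
static void
end_archive_recovery_sketch(XLogCtlSketch *ctl)
{
    ctl->install_segment_active = true;
}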
On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> Recycling and preallocation are wasteful during archive recovery, because
> KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
> fix the race by adding an XLogCtl flag indicating which regime currently owns
> the right to add long-term pg_wal directory entries. In the archive recovery
> regime, the checkpointer will not preallocate and will unlink old segments
> instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.
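To put rough numbers on those two comparisons (assuming the stock 16 MB WAL
segment size, which is an assumption here rather than something stated in the
message above):

\[
\frac{16\ \mathrm{MB}}{1\ \mathrm{GB}} = \frac{1}{64} \approx 1.6\%
\qquad\text{vs.}\qquad
\frac{1\ \mathrm{segment}}{\mathrm{checkpoint\_segments} = 3} = \frac{1}{3} \approx 33\%
\]

That is, a preallocated segment now covers under 2% of a default checkpoint
cycle, where it once covered a third of one.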
Hi Noah,

On 6/19/21 16:39, Noah Misch wrote:
> On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
>> Recycling and preallocation are wasteful during archive recovery, because
>> KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
>> fix the race by adding an XLogCtl flag indicating which regime currently owns
>> the right to add long-term pg_wal directory entries. In the archive recovery
>> regime, the checkpointer will not preallocate and will unlink old segments
>> instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
>
> Here's the implementation. Patches 1-4 suffice to stop the user-visible
> ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
> writes, and it provides some future-proofing.
>
> I was tempted to (but did not) just remove preallocation. Creating one file
> per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
> default, a preallocated segment covered a respectable third of the next
> checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

[1] https://www.postgresql.org/message-id/flat/CAKw-smBhLOGtRJTC5c%3DqKTPz8gz5%2BWPoVAXrHB6mY-1U4_N7-w%40mail.gmail.com
On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> On 6/19/21 16:39, Noah Misch wrote:
> >On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> >>Recycling and preallocation are wasteful during archive recovery, because
> >>KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
> >>fix the race by adding an XLogCtl flag indicating which regime currently owns
> >>the right to add long-term pg_wal directory entries. In the archive recovery
> >>regime, the checkpointer will not preallocate and will unlink old segments
> >>instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
> >
> >Here's the implementation. Patches 1-4 suffice to stop the user-visible
> >ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
> >writes, and it provides some future-proofing.
> >
> >I was tempted to (but did not) just remove preallocation. Creating one file
> >per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> >expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
> >default, a preallocated segment covered a respectable third of the next
> >checkpoint. Before commit 63653f7 (2002), preallocation created more files.
>
> This also seems like it would fix the link issues we are seeing in [1].
>
> I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

> [1] https://www.postgresql.org/message-id/flat/CAKw-smBhLOGtRJTC5c%3DqKTPz8gz5%2BWPoVAXrHB6mY-1U4_N7-w%40mail.gmail.com
On 7/31/22 02:17, Noah Misch wrote:
> On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
>> On 6/19/21 16:39, Noah Misch wrote:
>>> On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
>>>> Recycling and preallocation are wasteful during archive recovery, because
>>>> KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
>>>> fix the race by adding an XLogCtl flag indicating which regime currently owns
>>>> the right to add long-term pg_wal directory entries. In the archive recovery
>>>> regime, the checkpointer will not preallocate and will unlink old segments
>>>> instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
>>>
>>> Here's the implementation. Patches 1-4 suffice to stop the user-visible
>>> ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
>>> writes, and it provides some future-proofing.
>>>
>>> I was tempted to (but did not) just remove preallocation. Creating one file
>>> per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
>>> expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
>>> default, a preallocated segment covered a respectable third of the next
>>> checkpoint. Before commit 63653f7 (2002), preallocation created more files.
>>
>> This also seems like it would fix the link issues we are seeing in [1].
>>
>> I wonder if that would make it worth a back patch?
>
> Perhaps. It's sad to have multiple people deep-diving into something fixed on
> HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
> this. One alternative would be adding an errhint like "This is known to
> happen occasionally during archive recovery, where it is harmless." That has
> an unpolished look, but it's low-risk and may avoid deep-dive efforts.

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

However, there is another issue [1] that might argue for a back patch if this
patch (as I believe) would fix the issue.

Regards,
-David

[1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:
> On 7/31/22 02:17, Noah Misch wrote:
> >On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> >>On 6/19/21 16:39, Noah Misch wrote:
> >>>On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> >>>>Recycling and preallocation are wasteful during archive recovery, because
> >>>>KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
> >>>>fix the race by adding an XLogCtl flag indicating which regime currently owns
> >>>>the right to add long-term pg_wal directory entries. In the archive recovery
> >>>>regime, the checkpointer will not preallocate and will unlink old segments
> >>>>instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
> >>>
> >>>Here's the implementation. Patches 1-4 suffice to stop the user-visible
> >>>ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
> >>>writes, and it provides some future-proofing.
> >>>
> >>>I was tempted to (but did not) just remove preallocation. Creating one file
> >>>per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> >>>expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
> >>>default, a preallocated segment covered a respectable third of the next
> >>>checkpoint. Before commit 63653f7 (2002), preallocation created more files.
> >>
> >>This also seems like it would fix the link issues we are seeing in [1].
> >>
> >>I wonder if that would make it worth a back patch?
> >
> >Perhaps. It's sad to have multiple people deep-diving into something fixed on
> >HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
> >this. One alternative would be adding an errhint like "This is known to
> >happen occasionally during archive recovery, where it is harmless." That has
> >an unpolished look, but it's low-risk and may avoid deep-dive efforts.
>
> I think in this case a HINT might be sufficient to at least keep people from
> wasting time tracking down a problem that has already been fixed.
>
> However, there is another issue [1] that might argue for a back patch if
> this patch (as I believe) would fix the issue.

> [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?
On 8/2/22 10:37, Noah Misch wrote:
> On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:
>> On 7/31/22 02:17, Noah Misch wrote:
>>> On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
>>>> On 6/19/21 16:39, Noah Misch wrote:
>>>>> On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
>>>>>> Recycling and preallocation are wasteful during archive recovery, because
>>>>>> KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
>>>>>> fix the race by adding an XLogCtl flag indicating which regime currently owns
>>>>>> the right to add long-term pg_wal directory entries. In the archive recovery
>>>>>> regime, the checkpointer will not preallocate and will unlink old segments
>>>>>> instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
>>>>>
>>>>> Here's the implementation. Patches 1-4 suffice to stop the user-visible
>>>>> ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
>>>>> writes, and it provides some future-proofing.
>>>>>
>>>>> I was tempted to (but did not) just remove preallocation. Creating one file
>>>>> per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
>>>>> expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
>>>>> default, a preallocated segment covered a respectable third of the next
>>>>> checkpoint. Before commit 63653f7 (2002), preallocation created more files.
>>>>
>>>> This also seems like it would fix the link issues we are seeing in [1].
>>>>
>>>> I wonder if that would make it worth a back patch?
>>>
>>> Perhaps. It's sad to have multiple people deep-diving into something fixed on
>>> HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
>>> this. One alternative would be adding an errhint like "This is known to
>>> happen occasionally during archive recovery, where it is harmless." That has
>>> an unpolished look, but it's low-risk and may avoid deep-dive efforts.
>>
>> I think in this case a HINT might be sufficient to at least keep people from
>> wasting time tracking down a problem that has already been fixed.
>>
>> However, there is another issue [1] that might argue for a back patch if
>> this patch (as I believe) would fix the issue.
>
>> [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
>
> That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> chance of failing. Recovery adds >N files between restartpoints. Hence, the
> WAL directory grows without bound. Is that roughly the theory in mind?

Yes, though you have formulated it better than I had in my mind.

Let's see if Don can confirm that he is seeing the "could not link file"
messages.

Regards,
-David
On Tue, Aug 2, 2022 at 10:01 AM David Steele <david@pgmasters.net> wrote:
> > That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> > chance of failing. Recovery adds >N files between restartpoints. Hence, the
> > WAL directory grows without bound. Is that roughly the theory in mind?
>
> Yes, though you have formulated it better than I had in my mind.
>
> Let's see if Don can confirm that he is seeing the "could not link file"
> messages.
During my latest incident, there was only one occurrence:
could not link file "pg_wal/xlogtemp.18799" to "pg_wal/000000010000D45300000010": File exists
WAL restore/recovery seemed to continue on just fine then. And it would continue on until the pg_wal volume ran out of space unless I was manually rm'ing already-recovered WAL files from the side.
Don Seiler
www.seiler.us
At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in
> On Tue, Aug 2, 2022 at 10:01 AM David Steele <david@pgmasters.net> wrote:
>
> > > That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> > > chance of failing. Recovery adds >N files between restartpoints. Hence, the
> > > WAL directory grows without bound. Is that roughly the theory in mind?
> >
> > Yes, though you have formulated it better than I had in my mind.

I'm not sure I understand it correctly, but isn't the cause of the issue in
the other thread due to skipping many checkpoint records within the
checkpoint_timeout? I remember that I proposed a GUC variable to disable that
checkpoint skipping. As another measure for that issue, we could force
replaying a checkpoint if max_wal_size is already filled up or known to be
filled in the next checkpoint cycle. If this is correct, this patch is
irrelevant to the issue.

> > Let's see if Don can confirm that he is seeing the "could not link file"
> > messages.
>
> During my latest incident, there was only one occurrence:
>
> could not link file "pg_wal/xlogtemp.18799" to
> "pg_wal/000000010000D45300000010": File exists

(I noticed that the patch in the other thread is broken :()

Hmm. It seems like a race condition between StartupXLOG() and
RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
taking ControlFileLock before deciding the target xlog file name in
RemoveXlogFile() seems to prevent this from happening. (If this is correct,
this is a live issue on the master branch.)

> WAL restore/recovery seemed to continue on just fine then. And it would
> continue on until the pg_wal volume ran out of space unless I was manually
> rm'ing already-recovered WAL files from the side.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in
> > could not link file "pg_wal/xlogtemp.18799" to
> > "pg_wal/000000010000D45300000010": File exists

> Hmm. It seems like a race condition between StartupXLOG() and
> RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
> taking ControlFileLock before deciding the target xlog file name in
> RemoveXlogFile() seems to prevent this from happening. (If this is correct,
> this is a live issue on the master branch.)

RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true. The
intent of find_free=true is to make it okay to pass a target xlog file that
ceases to be a good target. (InstallXLogFileSegment() searches for a good
target while holding ControlFileLock.) Can you say more about how that proved
to be insufficient?
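For readers not following along in xlog.c: the find_free=true behavior
described above works roughly like the sketch below. It is a simplified
stand-in, not the actual InstallXLogFileSegment() source;
segment_path_exists() is a hypothetical helper standing in for the stat() of
the computed segment path.

/*
 * Simplified stand-in for InstallXLogFileSegment()'s find_free handling,
 * for illustration only.  The key point: the target segment number is
 * chosen while ControlFileLock is held, so two installers cannot both
 * claim the same free slot.
 */
static bool
install_segment_sketch(XLogSegNo *segno, bool find_free, XLogSegNo max_segno)
{
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

    while (segment_path_exists(*segno))     /* hypothetical helper */
    {
        if (!find_free || *segno >= max_segno)
        {
            /* target occupied and we may not slide forward: give up */
            LWLockRelease(ControlFileLock);
            return false;
        }
        (*segno)++;                         /* try the next segment number */
    }

    /* link/rename the temp file into place as segment *segno here */

    LWLockRelease(ControlFileLock);
    return true;
}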
At Wed, 3 Aug 2022 00:28:47 -0700, Noah Misch <noah@leadboat.com> wrote in
> On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in
> > > could not link file "pg_wal/xlogtemp.18799" to
> > > "pg_wal/000000010000D45300000010": File exists
>
> > Hmm. It seems like a race condition between StartupXLOG() and
> > RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
> > taking ControlFileLock before deciding the target xlog file name in
> > RemoveXlogFile() seems to prevent this from happening. (If this is correct,
> > this is a live issue on the master branch.)
>
> RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true. The
> intent of find_free=true is to make it okay to pass a target xlog file that
> ceases to be a good target. (InstallXLogFileSegment() searches for a good
> target while holding ControlFileLock.) Can you say more about how that proved
> to be insufficient?

Ugh... No, I can't. I was confused by something. Sorry.
PreallocXlogFiles() and the checkpointer are mutually excluded by the same
lock, too.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Aug 02, 2022 at 07:37:27AM -0700, Noah Misch wrote:
> On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:
> > On 7/31/22 02:17, Noah Misch wrote:
> > >On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> > >>On 6/19/21 16:39, Noah Misch wrote:
> > >>>On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> > >>>>Recycling and preallocation are wasteful during archive recovery, because
> > >>>>KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
> > >>>>fix the race by adding an XLogCtl flag indicating which regime currently owns
> > >>>>the right to add long-term pg_wal directory entries. In the archive recovery
> > >>>>regime, the checkpointer will not preallocate and will unlink old segments
> > >>>>instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
> > >>>
> > >>>Here's the implementation. Patches 1-4 suffice to stop the user-visible
> > >>>ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
> > >>>writes, and it provides some future-proofing.
> > >>>
> > >>>I was tempted to (but did not) just remove preallocation. Creating one file
> > >>>per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> > >>>expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
> > >>>default, a preallocated segment covered a respectable third of the next
> > >>>checkpoint. Before commit 63653f7 (2002), preallocation created more files.
> > >>
> > >>This also seems like it would fix the link issues we are seeing in [1].
> > >>
> > >>I wonder if that would make it worth a back patch?
> > >
> > >Perhaps. It's sad to have multiple people deep-diving into something fixed on
> > >HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
> > >this. One alternative would be adding an errhint like "This is known to
> > >happen occasionally during archive recovery, where it is harmless." That has
> > >an unpolished look, but it's low-risk and may avoid deep-dive efforts.
> >
> > I think in this case a HINT might be sufficient to at least keep people from
> > wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT. I figure it's better to do this before next
week's minor releases. In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

> > However, there is another issue [1] that might argue for a back patch if
> > this patch (as I believe) would fix the issue.
>
> > [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
>
> That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> chance of failing. Recovery adds >N files between restartpoints. Hence, the
> WAL directory grows without bound. Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way. Each failure
message is LOG-level, so the remaining recycles still happen. (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)
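The patch itself is attached to the message above; for readers of the
archive, the shape of the change is roughly the following: the LOG message
Don quoted gains an errhint. The variable names and exact call site here are
illustrative, not necessarily what the patch does.

/*
 * Rough illustration of the proposed change, not the patch itself: the
 * existing LOG message (the one Don quoted) gains an errhint().  Variable
 * names and placement are illustrative.
 */
ereport(LOG,
        (errcode_for_file_access(),
         errmsg("could not link file \"%s\" to \"%s\": %m",
                tmppath, path),
         errhint("This is known to happen occasionally during archive recovery, "
                 "where it is harmless.")));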
At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in
> > > I think in this case a HINT might be sufficient to at least keep people from
> > > wasting time tracking down a problem that has already been fixed.
>
> Here's a patch to add that HINT. I figure it's better to do this before next
> week's minor releases. In the absence of objections, I'll push this around
> 2022-08-05 14:00 UTC.

+1

> > > However, there is another issue [1] that might argue for a back patch if
> > > this patch (as I believe) would fix the issue.
> >
> > > [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> >
> > That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> > chance of failing. Recovery adds >N files between restartpoints. Hence, the
> > WAL directory grows without bound. Is that roughly the theory in mind?
>
> On further reflection, I don't expect it to happen that way. Each failure
> message is LOG-level, so the remaining recycles still happen. (The race
> condition can yield an ERROR under PreallocXlogFiles(), but recycling is
> already done at that point.)

I agree with this.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 8/4/22 04:06, Kyotaro Horiguchi wrote:
> At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in
>>>> I think in this case a HINT might be sufficient to at least keep people from
>>>> wasting time tracking down a problem that has already been fixed.
>>
>> Here's a patch to add that HINT. I figure it's better to do this before next
>> week's minor releases. In the absence of objections, I'll push this around
>> 2022-08-05 14:00 UTC.
>
> +1

Looks good to me as well.

>>>> However, there is another issue [1] that might argue for a back patch if
>>>> this patch (as I believe) would fix the issue.
>>>
>>>> [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
>>>
>>> That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
>>> chance of failing. Recovery adds >N files between restartpoints. Hence, the
>>> WAL directory grows without bound. Is that roughly the theory in mind?
>>
>> On further reflection, I don't expect it to happen that way. Each failure
>> message is LOG-level, so the remaining recycles still happen. (The race
>> condition can yield an ERROR under PreallocXlogFiles(), but recycling is
>> already done at that point.)
>
> I agree with this.

Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing
without bound. Even if we are not sure what is causing it, how confident are
we that the patches applied to v15 would fix it?

Regards,
-David
On Thu, Aug 04, 2022 at 06:32:38AM -0400, David Steele wrote:
> On 8/4/22 04:06, Kyotaro Horiguchi wrote:
> >At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in
> >>>>I think in this case a HINT might be sufficient to at least keep people from
> >>>>wasting time tracking down a problem that has already been fixed.
> >>
> >>Here's a patch to add that HINT. I figure it's better to do this before next
> >>week's minor releases. In the absence of objections, I'll push this around
> >>2022-08-05 14:00 UTC.
> >
> >+1
>
> Looks good to me as well.

Thanks for reviewing.

> >>>>However, there is another issue [1] that might argue for a back patch if
> >>>>this patch (as I believe) would fix the issue.
> >>>
> >>>>[1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> >>>
> >>>That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> >>>chance of failing. Recovery adds >N files between restartpoints. Hence, the
> >>>WAL directory grows without bound. Is that roughly the theory in mind?
> >>
> >>On further reflection, I don't expect it to happen that way. Each failure
> >>message is LOG-level, so the remaining recycles still happen. (The race
> >>condition can yield an ERROR under PreallocXlogFiles(), but recycling is
> >>already done at that point.)
> >
> >I agree with this.
>
> Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing
> without bound. Even if we are not sure what is causing it, how confident are
> we that the patches applied to v15 would fix it?

I'll say 7%; certainly too low to stop investigating. The next things I'd
want to see are:

- select name, setting, source from pg_settings where setting <> boot_val;
- log_checkpoints log entries, and other log entries having the same PID

If the theory about checkpoint skipping holds, there should be a period where
the volume of WAL replayed greatly exceeds max_wal_size, yet 0-1 restartpoints
begin and 0 restartpoints complete.