Thread: Race between KeepFileRestoredFromArchive() and restartpoint
A race with KeepFileRestoredFromArchive() can cause a restartpoint to fail,
as seen once on the buildfarm[1]. The attached patch adds a test case; it
applies atop the "stop events" patch[2].

We have two systems for adding long-term pg_wal directory entries.
KeepFileRestoredFromArchive() adds them during archive recovery, while
InstallXLogFileSegment() does so at all times. Unfortunately,
InstallXLogFileSegment() happens throughout archive recovery, via the
checkpointer recycling segments and calling PreallocXlogFiles(). Multiple
processes can run InstallXLogFileSegment(), which uses ControlFileLock to
represent the authority to modify the directory listing of pg_wal.
KeepFileRestoredFromArchive() just assumes it controls pg_wal.

Recycling and preallocation are wasteful during archive recovery, because
KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
fix the race by adding an XLogCtl flag indicating which regime currently owns
the right to add long-term pg_wal directory entries. In the archive recovery
regime, the checkpointer will not preallocate and will unlink old segments
instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Notable alternatives:

- Release ControlFileLock at the end of XLogFileInit(), not at the end of
  InstallXLogFileSegment(). Add ControlFileLock acquisition to
  KeepFileRestoredFromArchive(). This provides adequate mutual exclusion, but
  XLogFileInit() could return a file descriptor for an unlinked file. That's
  fine for PreallocXlogFiles(), but it feels wrong.

- During restartpoints, never preallocate or recycle segments. (Just delete
  obsolete WAL.) By denying those benefits, this presumably makes streaming
  recovery less efficient.

- Make KeepFileRestoredFromArchive() call XLogFileInit() to open a segment,
  then copy bytes. This is simple, but it multiplies I/O. That might be
  negligible on account of caching, or it might not be. A variant, incurring
  extra fsyncs, would be to use durable_rename() to replace the segment we
  get from XLogFileInit().

- Make KeepFileRestoredFromArchive() rename without first unlinking. This
  avoids checkpoint failure, but a race could trigger noise from the LOG
  message in InstallXLogFileSegment -> durable_rename_excl.

Does anyone prefer some alternative? It's probably not worth back-patching
anything for a restartpoint failure this rare, because most restartpoint
outcomes are not user-visible.

Thanks,
nm

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2020-10-05%2023%3A02%3A17
[2] https://postgr.es/m/CAPpHfdtSEOHX8dSk9Qp%2BZ%2B%2Bi4BGQoffKip6JDWngEA%2Bg7Z-XmQ%40mail.gmail.com
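For concreteness, the flag-based regime switch proposed above could look
roughly like the sketch below. The struct, field, and function names here are
placeholders chosen for illustration; they are not the attached patch.

#include <stdbool.h>
#include <unistd.h>

/*
 * Illustrative sketch only -- the names are placeholders, not the attached
 * patch.  A shared flag records which regime currently owns the right to add
 * long-term pg_wal directory entries.
 */
typedef struct XLogCtlSketch
{
    /* true when InstallXLogFileSegment() may add long-term pg_wal entries */
    bool        install_segment_active;
    /* ... the real XLogCtl has many more fields ... */
} XLogCtlSketch;

/*
 * Checkpointer side: while archive recovery owns pg_wal, skip preallocation
 * and behave like wal_recycle=off for old segments.
 */
static void
retire_old_segment_sketch(XLogCtlSketch *ctl, const char *path)
{
    if (!ctl->install_segment_active)
    {
        unlink(path);           /* archive recovery regime: just delete it */
        return;
    }
    /* normal regime: recycle the segment under ControlFileLock as today */
}

/*
 * Startup-process side: once archive recovery ends, hand ownership of
 * pg_wal back to InstallXLogFileSegment().
 */
static void
end_archive_recovery_sketch(XLogCtlSketch *ctl)
{
    ctl->install_segment_active = true;
}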
On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> Recycling and preallocation are wasteful during archive recovery, because
> KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
> fix the race by adding an XLogCtl flag indicating which regime currently owns
> the right to add long-term pg_wal directory entries. In the archive recovery
> regime, the checkpointer will not preallocate and will unlink old segments
> instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.

Here's the implementation. Patches 1-4 suffice to stop the user-visible
ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
writes, and it provides some future-proofing.

I was tempted to (but did not) just remove preallocation. Creating one file
per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
default, a preallocated segment covered a respectable third of the next
checkpoint. Before commit 63653f7 (2002), preallocation created more files.
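To put rough numbers on those two comparisons (assuming the stock 16 MB WAL
segment size, which is an assumption here rather than something stated in the
message above):

\[
\frac{16\ \mathrm{MB}}{1\ \mathrm{GB}} = \frac{1}{64} \approx 1.6\%
\qquad\text{vs.}\qquad
\frac{1\ \mathrm{segment}}{\mathrm{checkpoint\_segments} = 3} = \frac{1}{3} \approx 33\%
\]

That is, a preallocated segment now covers under 2% of a default checkpoint
cycle, where it once covered a third of one.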
Hi Noah,

On 6/19/21 16:39, Noah Misch wrote:
> On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
>> Recycling and preallocation are wasteful during archive recovery, because
>> KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
>> fix the race by adding an XLogCtl flag indicating which regime currently owns
>> the right to add long-term pg_wal directory entries. In the archive recovery
>> regime, the checkpointer will not preallocate and will unlink old segments
>> instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
>
> Here's the implementation. Patches 1-4 suffice to stop the user-visible
> ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
> writes, and it provides some future-proofing.
>
> I was tempted to (but did not) just remove preallocation. Creating one file
> per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
> default, a preallocated segment covered a respectable third of the next
> checkpoint. Before commit 63653f7 (2002), preallocation created more files.

This also seems like it would fix the link issues we are seeing in [1].

I wonder if that would make it worth a back patch?

[1] https://www.postgresql.org/message-id/flat/CAKw-smBhLOGtRJTC5c%3DqKTPz8gz5%2BWPoVAXrHB6mY-1U4_N7-w%40mail.gmail.com
On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> On 6/19/21 16:39, Noah Misch wrote:
> >On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> >>Recycling and preallocation are wasteful during archive recovery, because
> >>KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
> >>fix the race by adding an XLogCtl flag indicating which regime currently owns
> >>the right to add long-term pg_wal directory entries. In the archive recovery
> >>regime, the checkpointer will not preallocate and will unlink old segments
> >>instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
> >
> >Here's the implementation. Patches 1-4 suffice to stop the user-visible
> >ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
> >writes, and it provides some future-proofing.
> >
> >I was tempted to (but did not) just remove preallocation. Creating one file
> >per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> >expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
> >default, a preallocated segment covered a respectable third of the next
> >checkpoint. Before commit 63653f7 (2002), preallocation created more files.
>
> This also seems like it would fix the link issues we are seeing in [1].
>
> I wonder if that would make it worth a back patch?

Perhaps. It's sad to have multiple people deep-diving into something fixed on
HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
this. One alternative would be adding an errhint like "This is known to
happen occasionally during archive recovery, where it is harmless." That has
an unpolished look, but it's low-risk and may avoid deep-dive efforts.

> [1] https://www.postgresql.org/message-id/flat/CAKw-smBhLOGtRJTC5c%3DqKTPz8gz5%2BWPoVAXrHB6mY-1U4_N7-w%40mail.gmail.com
On 7/31/22 02:17, Noah Misch wrote:
> On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
>> On 6/19/21 16:39, Noah Misch wrote:
>>> On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
>>>> Recycling and preallocation are wasteful during archive recovery, because
>>>> KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
>>>> fix the race by adding an XLogCtl flag indicating which regime currently owns
>>>> the right to add long-term pg_wal directory entries. In the archive recovery
>>>> regime, the checkpointer will not preallocate and will unlink old segments
>>>> instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
>>>
>>> Here's the implementation. Patches 1-4 suffice to stop the user-visible
>>> ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
>>> writes, and it provides some future-proofing.
>>>
>>> I was tempted to (but did not) just remove preallocation. Creating one file
>>> per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
>>> expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
>>> default, a preallocated segment covered a respectable third of the next
>>> checkpoint. Before commit 63653f7 (2002), preallocation created more files.
>>
>> This also seems like it would fix the link issues we are seeing in [1].
>>
>> I wonder if that would make it worth a back patch?
>
> Perhaps. It's sad to have multiple people deep-diving into something fixed on
> HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
> this. One alternative would be adding an errhint like "This is known to
> happen occasionally during archive recovery, where it is harmless." That has
> an unpolished look, but it's low-risk and may avoid deep-dive efforts.

I think in this case a HINT might be sufficient to at least keep people from
wasting time tracking down a problem that has already been fixed.

However, there is another issue [1] that might argue for a back patch if this
patch (as I believe) would fix the issue.

Regards,
-David

[1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:
> On 7/31/22 02:17, Noah Misch wrote:
> >On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> >>On 6/19/21 16:39, Noah Misch wrote:
> >>>On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> >>>>Recycling and preallocation are wasteful during archive recovery, because
> >>>>KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
> >>>>fix the race by adding an XLogCtl flag indicating which regime currently owns
> >>>>the right to add long-term pg_wal directory entries. In the archive recovery
> >>>>regime, the checkpointer will not preallocate and will unlink old segments
> >>>>instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
> >>>
> >>>Here's the implementation. Patches 1-4 suffice to stop the user-visible
> >>>ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
> >>>writes, and it provides some future-proofing.
> >>>
> >>>I was tempted to (but did not) just remove preallocation. Creating one file
> >>>per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> >>>expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
> >>>default, a preallocated segment covered a respectable third of the next
> >>>checkpoint. Before commit 63653f7 (2002), preallocation created more files.
> >>
> >>This also seems like it would fix the link issues we are seeing in [1].
> >>
> >>I wonder if that would make it worth a back patch?
> >
> >Perhaps. It's sad to have multiple people deep-diving into something fixed on
> >HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
> >this. One alternative would be adding an errhint like "This is known to
> >happen occasionally during archive recovery, where it is harmless." That has
> >an unpolished look, but it's low-risk and may avoid deep-dive efforts.
>
> I think in this case a HINT might be sufficient to at least keep people from
> wasting time tracking down a problem that has already been fixed.
>
> However, there is another issue [1] that might argue for a back patch if
> this patch (as I believe) would fix the issue.

> [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com

That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
chance of failing. Recovery adds >N files between restartpoints. Hence, the
WAL directory grows without bound. Is that roughly the theory in mind?
On 8/2/22 10:37, Noah Misch wrote:
> On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:
>> On 7/31/22 02:17, Noah Misch wrote:
>>> On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
>>>> On 6/19/21 16:39, Noah Misch wrote:
>>>>> On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
>>>>>> Recycling and preallocation are wasteful during archive recovery, because
>>>>>> KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
>>>>>> fix the race by adding an XLogCtl flag indicating which regime currently owns
>>>>>> the right to add long-term pg_wal directory entries. In the archive recovery
>>>>>> regime, the checkpointer will not preallocate and will unlink old segments
>>>>>> instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
>>>>>
>>>>> Here's the implementation. Patches 1-4 suffice to stop the user-visible
>>>>> ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
>>>>> writes, and it provides some future-proofing.
>>>>>
>>>>> I was tempted to (but did not) just remove preallocation. Creating one file
>>>>> per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
>>>>> expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
>>>>> default, a preallocated segment covered a respectable third of the next
>>>>> checkpoint. Before commit 63653f7 (2002), preallocation created more files.
>>>>
>>>> This also seems like it would fix the link issues we are seeing in [1].
>>>>
>>>> I wonder if that would make it worth a back patch?
>>>
>>> Perhaps. It's sad to have multiple people deep-diving into something fixed on
>>> HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
>>> this. One alternative would be adding an errhint like "This is known to
>>> happen occasionally during archive recovery, where it is harmless." That has
>>> an unpolished look, but it's low-risk and may avoid deep-dive efforts.
>>
>> I think in this case a HINT might be sufficient to at least keep people from
>> wasting time tracking down a problem that has already been fixed.
>>
>> However, there is another issue [1] that might argue for a back patch if
>> this patch (as I believe) would fix the issue.
>
>> [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
>
> That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> chance of failing. Recovery adds >N files between restartpoints. Hence, the
> WAL directory grows without bound. Is that roughly the theory in mind?

Yes, though you have formulated it better than I had in my mind.

Let's see if Don can confirm that he is seeing the "could not link file"
messages.

Regards,
-David
On Tue, Aug 2, 2022 at 10:01 AM David Steele <david@pgmasters.net> wrote:
> > That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> > chance of failing. Recovery adds >N files between restartpoints. Hence, the
> > WAL directory grows without bound. Is that roughly the theory in mind?
>
> Yes, though you have formulated it better than I had in my mind.
>
> Let's see if Don can confirm that he is seeing the "could not link file"
> messages.
During my latest incident, there was only one occurrence:
could not link file "pg_wal/xlogtemp.18799" to "pg_wal/000000010000D45300000010": File exists
WAL restore/recovery seemed to continue on just fine then. And it would continue on until the pg_wal volume ran out of space unless I was manually rm'ing already-recovered WAL files from the side.
Don Seiler
www.seiler.us
At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in
> On Tue, Aug 2, 2022 at 10:01 AM David Steele <david@pgmasters.net> wrote:
>
> > > That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> > > chance of failing. Recovery adds >N files between restartpoints. Hence, the
> > > WAL directory grows without bound. Is that roughly the theory in mind?
> >
> > Yes, though you have formulated it better than I had in my mind.

I'm not sure I understand it correctly, but isn't the cause of the issue in
the other thread due to skipping many checkpoint records within the
checkpoint_timeout? I remember that I proposed a GUC variable to disable that
checkpoint skipping. As another measure for that issue, we could force
replaying a checkpoint if max_wal_size is already filled up or known to be
filled in the next checkpoint cycle. If this is correct, this patch is
irrelevant to the issue.

> > Let's see if Don can confirm that he is seeing the "could not link file"
> > messages.
>
> During my latest incident, there was only one occurrence:
>
> could not link file "pg_wal/xlogtemp.18799" to
> "pg_wal/000000010000D45300000010": File exists

(I noticed that the patch in the other thread is broken :()

Hmm. It seems like a race condition between StartupXLOG() and
RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
taking ControlFileLock before deciding the target xlog file name in
RemoveXlogFile() seems to prevent this from happening. (If this is correct,
this is a live issue on the master branch.)

> WAL restore/recovery seemed to continue on just fine then. And it would
> continue on until the pg_wal volume ran out of space unless I was manually
> rm'ing already-recovered WAL files from the side.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in
> > could not link file "pg_wal/xlogtemp.18799" to
> > "pg_wal/000000010000D45300000010": File exists

> Hmm. It seems like a race condition between StartupXLOG() and
> RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
> taking ControlFileLock before deciding the target xlog file name in
> RemoveXlogFile() seems to prevent this from happening. (If this is correct,
> this is a live issue on the master branch.)

RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true. The
intent of find_free=true is to make it okay to pass a target xlog file that
ceases to be a good target. (InstallXLogFileSegment() searches for a good
target while holding ControlFileLock.) Can you say more about how that proved
to be insufficient?
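For readers not following along in xlog.c: the find_free=true behavior
described above works roughly like the sketch below. It is a simplified
stand-in, not the actual InstallXLogFileSegment() source;
segment_path_exists() is a hypothetical helper standing in for the stat() of
the computed segment path.

/*
 * Simplified stand-in for InstallXLogFileSegment()'s find_free handling,
 * for illustration only.  The key point: the target segment number is
 * chosen while ControlFileLock is held, so two installers cannot both
 * claim the same free slot.
 */
static bool
install_segment_sketch(XLogSegNo *segno, bool find_free, XLogSegNo max_segno)
{
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

    while (segment_path_exists(*segno))     /* hypothetical helper */
    {
        if (!find_free || *segno >= max_segno)
        {
            /* target occupied and we may not slide forward: give up */
            LWLockRelease(ControlFileLock);
            return false;
        }
        (*segno)++;                         /* try the next segment number */
    }

    /* link/rename the temp file into place as segment *segno here */

    LWLockRelease(ControlFileLock);
    return true;
}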
At Wed, 3 Aug 2022 00:28:47 -0700, Noah Misch <noah@leadboat.com> wrote in
> On Wed, Aug 03, 2022 at 11:24:17AM +0900, Kyotaro Horiguchi wrote:
> > At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in
> > > could not link file "pg_wal/xlogtemp.18799" to
> > > "pg_wal/000000010000D45300000010": File exists
>
> > Hmm. It seems like a race condition between StartupXLOG() and
> > RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
> > taking ControlFileLock before deciding the target xlog file name in
> > RemoveXlogFile() seems to prevent this from happening. (If this is correct,
> > this is a live issue on the master branch.)
>
> RemoveXlogFile() calls InstallXLogFileSegment() with find_free=true. The
> intent of find_free=true is to make it okay to pass a target xlog file that
> ceases to be a good target. (InstallXLogFileSegment() searches for a good
> target while holding ControlFileLock.) Can you say more about how that proved
> to be insufficient?

Ugh... No, I can't. I was confused by something. Sorry.
PreallocXlogFiles() and the checkpointer are mutually excluded by the same
lock, too.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Aug 02, 2022 at 07:37:27AM -0700, Noah Misch wrote:
> On Tue, Aug 02, 2022 at 10:14:22AM -0400, David Steele wrote:
> > On 7/31/22 02:17, Noah Misch wrote:
> > >On Tue, Jul 26, 2022 at 07:21:29AM -0400, David Steele wrote:
> > >>On 6/19/21 16:39, Noah Misch wrote:
> > >>>On Tue, Feb 02, 2021 at 07:14:16AM -0800, Noah Misch wrote:
> > >>>>Recycling and preallocation are wasteful during archive recovery, because
> > >>>>KeepFileRestoredFromArchive() unlinks every entry in its path. I propose to
> > >>>>fix the race by adding an XLogCtl flag indicating which regime currently owns
> > >>>>the right to add long-term pg_wal directory entries. In the archive recovery
> > >>>>regime, the checkpointer will not preallocate and will unlink old segments
> > >>>>instead of recycling them (like wal_recycle=off). XLogFileInit() will fail.
> > >>>
> > >>>Here's the implementation. Patches 1-4 suffice to stop the user-visible
> > >>>ERROR. Patch 5 avoids a spurious LOG-level message and wasted filesystem
> > >>>writes, and it provides some future-proofing.
> > >>>
> > >>>I was tempted to (but did not) just remove preallocation. Creating one file
> > >>>per checkpoint seems tiny relative to the max_wal_size=1GB default, so I
> > >>>expect it's hard to isolate any benefit. Under the old checkpoint_segments=3
> > >>>default, a preallocated segment covered a respectable third of the next
> > >>>checkpoint. Before commit 63653f7 (2002), preallocation created more files.
> > >>
> > >>This also seems like it would fix the link issues we are seeing in [1].
> > >>
> > >>I wonder if that would make it worth a back patch?
> > >
> > >Perhaps. It's sad to have multiple people deep-diving into something fixed on
> > >HEAD. On the other hand, I'm not eager to spend risk-of-backpatch points on
> > >this. One alternative would be adding an errhint like "This is known to
> > >happen occasionally during archive recovery, where it is harmless." That has
> > >an unpolished look, but it's low-risk and may avoid deep-dive efforts.
> >
> > I think in this case a HINT might be sufficient to at least keep people from
> > wasting time tracking down a problem that has already been fixed.

Here's a patch to add that HINT. I figure it's better to do this before next
week's minor releases. In the absence of objections, I'll push this around
2022-08-05 14:00 UTC.

> > However, there is another issue [1] that might argue for a back patch if
> > this patch (as I believe) would fix the issue.
>
> > [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
>
> That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> chance of failing. Recovery adds >N files between restartpoints. Hence, the
> WAL directory grows without bound. Is that roughly the theory in mind?

On further reflection, I don't expect it to happen that way. Each failure
message is LOG-level, so the remaining recycles still happen. (The race
condition can yield an ERROR under PreallocXlogFiles(), but recycling is
already done at that point.)
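The patch itself is attached to the message above; for readers of the
archive, the shape of the change is roughly the following: the LOG message
Don quoted gains an errhint. The variable names and exact call site here are
illustrative, not necessarily what the patch does.

/*
 * Rough illustration of the proposed change, not the patch itself: the
 * existing LOG message (the one Don quoted) gains an errhint().  Variable
 * names and placement are illustrative.
 */
ereport(LOG,
        (errcode_for_file_access(),
         errmsg("could not link file \"%s\" to \"%s\": %m",
                tmppath, path),
         errhint("This is known to happen occasionally during archive recovery, "
                 "where it is harmless.")));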
At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in
> > > I think in this case a HINT might be sufficient to at least keep people from
> > > wasting time tracking down a problem that has already been fixed.
>
> Here's a patch to add that HINT. I figure it's better to do this before next
> week's minor releases. In the absence of objections, I'll push this around
> 2022-08-05 14:00 UTC.

+1

> > > However, there is another issue [1] that might argue for a back patch if
> > > this patch (as I believe) would fix the issue.
> >
> > > [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> >
> > That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> > chance of failing. Recovery adds >N files between restartpoints. Hence, the
> > WAL directory grows without bound. Is that roughly the theory in mind?
>
> On further reflection, I don't expect it to happen that way. Each failure
> message is LOG-level, so the remaining recycles still happen. (The race
> condition can yield an ERROR under PreallocXlogFiles(), but recycling is
> already done at that point.)

I agree with this.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On 8/4/22 04:06, Kyotaro Horiguchi wrote:
> At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in
>>>> I think in this case a HINT might be sufficient to at least keep people from
>>>> wasting time tracking down a problem that has already been fixed.
>>
>> Here's a patch to add that HINT. I figure it's better to do this before next
>> week's minor releases. In the absence of objections, I'll push this around
>> 2022-08-05 14:00 UTC.
>
> +1

Looks good to me as well.

>>>> However, there is another issue [1] that might argue for a back patch if
>>>> this patch (as I believe) would fix the issue.
>>>
>>>> [1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
>>>
>>> That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
>>> chance of failing. Recovery adds >N files between restartpoints. Hence, the
>>> WAL directory grows without bound. Is that roughly the theory in mind?
>>
>> On further reflection, I don't expect it to happen that way. Each failure
>> message is LOG-level, so the remaining recycles still happen. (The race
>> condition can yield an ERROR under PreallocXlogFiles(), but recycling is
>> already done at that point.)
>
> I agree with this.

Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing
without bound. Even if we are not sure what is causing it, how confident are
we that the patches applied to v15 would fix it?

Regards,
-David
On Thu, Aug 04, 2022 at 06:32:38AM -0400, David Steele wrote:
> On 8/4/22 04:06, Kyotaro Horiguchi wrote:
> >At Wed, 3 Aug 2022 23:24:56 -0700, Noah Misch <noah@leadboat.com> wrote in
> >>>>I think in this case a HINT might be sufficient to at least keep people from
> >>>>wasting time tracking down a problem that has already been fixed.
> >>
> >>Here's a patch to add that HINT. I figure it's better to do this before next
> >>week's minor releases. In the absence of objections, I'll push this around
> >>2022-08-05 14:00 UTC.
> >
> >+1
>
> Looks good to me as well.

Thanks for reviewing.

> >>>>However, there is another issue [1] that might argue for a back patch if
> >>>>this patch (as I believe) would fix the issue.
> >>>
> >>>>[1] https://www.postgresql.org/message-id/CAHJZqBDxWfcd53jm0bFttuqpK3jV2YKWx%3D4W7KxNB4zzt%2B%2BqFg%40mail.gmail.com
> >>>
> >>>That makes sense. Each iteration of the restartpoint recycle loop has a 1/N
> >>>chance of failing. Recovery adds >N files between restartpoints. Hence, the
> >>>WAL directory grows without bound. Is that roughly the theory in mind?
> >>
> >>On further reflection, I don't expect it to happen that way. Each failure
> >>message is LOG-level, so the remaining recycles still happen. (The race
> >>condition can yield an ERROR under PreallocXlogFiles(), but recycling is
> >>already done at that point.)
> >
> >I agree with this.
>
> Hmmm, OK. We certainly have a fairly serious issue here, i.e. pg_wal growing
> without bound. Even if we are not sure what is causing it, how confident are
> we that the patches applied to v15 would fix it?

I'll say 7%; certainly too low to stop investigating. The next things I'd
want to see are:

- select name, setting, source from pg_settings where setting <> boot_val;
- log_checkpoints log entries, and other log entries having the same PID

If the theory about checkpoint skipping holds, there should be a period where
the volume of WAL replayed greatly exceeds max_wal_size, yet 0-1 restartpoints
begin and 0 restartpoints complete.