archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
Hi hackers,

I believe I've uncovered a bug that may cause archive status ".ready"
files to be created too early, which in turn may cause an incorrect
version of the corresponding WAL segment to be archived.

The crux of the issue seems to be that XLogWrite() does not wait for
the entire record to be written to disk before creating the ".ready"
file.  Instead, it just waits for the last page of the segment to be
written before notifying the archiver.  If PostgreSQL crashes before
it is able to write the rest of the record, it will end up reusing the
".ready" segment at the end of crash recovery.  In the meantime, the
archiver process may have already processed the old version of the
segment.
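
To put rough numbers on that window, here is a tiny standalone sketch
(the sizes and positions are illustrative only, not taken from the
server code):

    /* sketch.c: how far .ready can run ahead of the record's last byte */
    #include <stdint.h>
    #include <stdio.h>

    #define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)  /* default WAL segment size */

    int
    main(void)
    {
        /* a record that starts near the end of segment 23 and spills over */
        uint64_t rec_start = 24 * SEG_SIZE - 100;
        uint64_t rec_end = rec_start + 300;      /* continues into segment 24 */

        /* XLogWrite() notifies once the last page of segment 23 is written, */
        uint64_t notify_point = 24 * SEG_SIZE;

        /* ...which happens before the whole record is safely on disk */
        if (notify_point < rec_end)
            printf(".ready appears %llu bytes before the record is fully written\n",
                   (unsigned long long) (rec_end - notify_point));
        return 0;
    }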

This issue seems to most often manifest as WAL corruption on standby
servers after the primary server has crashed because it ran out of
disk space.  I have attached a proof-of-concept patch
(ready_file_fix.patch) that waits to create any ".ready" files until
closer to the end of XLogWrite().  The patch is incorrect for a few
reasons, but I hope it helps illustrate the problem.  I have also
attached another patch (repro_helper.patch) to be used in conjunction
with the following steps to reproduce the issue:

        initdb .
        pg_ctl -D . -o "-c archive_mode=on -c archive_command='exit 0'" -l log.log start
        pgbench -i -s 1000 postgres
        psql postgres -c "SELECT pg_switch_wal();"

With just repro_helper.patch applied, these commands should produce
both of the following log statements:

        PANIC:  failing at inconvenient time
        LOG:  status file already exists for "000000010000000000000017"

With both patches applied, the commands will only produce the first
PANIC statement.

Another thing I am exploring is whether a crash in between writing the
last page of a segment and creating the ".ready" file could cause the
archiver process to skip processing it altogether.  In the scenario I
mention earlier, the server seems to recreate the ".ready" file since
it rewrites a portion of the segment.  However, if a WAL record fits
perfectly into the last section of the segment, I am not sure whether
the ".ready" file would be created after restart.

I am admittedly in the early stages of working on this problem, but I
thought it would be worth reporting to the community early on in case
anyone has any thoughts on or past experiences with this issue.

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
Hello.

At Thu, 12 Dec 2019 22:50:20 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> Hi hackers,
> 
> I believe I've uncovered a bug that may cause archive status ".ready"
> files to be created too early, which in turn may cause an incorrect
> version of the corresponding WAL segment to be archived.
> 
> The crux of the issue seems to be that XLogWrite() does not wait for
> the entire record to be written to disk before creating the ".ready"
> file.  Instead, it just waits for the last page of the segment to be
> written before notifying the archiver.  If PostgreSQL crashes before
> it is able to write the rest of the record, it will end up reusing the
> ".ready" segment at the end of crash recovery.  In the meantime, the
> archiver process may have already processed the old version of the
> segment.

Yeah, that can happen if the server is restarted after the crash.

> This issue seems to most often manifest as WAL corruption on standby
> servers after the primary server has crashed because it ran out of
> disk space.

In the first place, it's quite bad to set restart_after_crash to on,
or to simply restart a crashed master in a replication setup. The
standby can be inconsistent at the time of the master crash, so it
should be fixed using pg_rewind or recreated from a base backup.

Even without that archiving behavior, a standby may receive WAL bytes
inconsistent with the bytes the same master wrote just before the
crash. This is not limited to segment boundaries: it can happen on
every block boundary, and could happen anywhere with more complicated
steps.

What you are calling a "problem" seems to come from allowing the
restart_after_crash behavior. On the other hand, as recommended in the
documentation, archive_command can refuse to overwrite an existing
segment, but we don't enforce that.

As a result, the patch doesn't seem to buy anything beyond setting
things up and operating correctly.

> Another thing I am exploring is whether a crash in between writing the
> last page of a segment and creating the ".ready" file could cause the
> archiver process to skip processing it altogether.  In the scenario I
> mention earlier, the server seems to recreate the ".ready" file since
> it rewrites a portion of the segment.  However, if a WAL record fits
> perfectly into the last section of the segment, I am not sure whether
> the ".ready" file would be created after restart.

Why would that segment need .ready after restart, even though nothing
more could be written to the old segment?

> I am admittedly in the early stages of working on this problem, but I
> thought it would be worth reporting to the community early on in case
> anyone has any thoughts on or past experiences with this issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 12/12/19, 8:08 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> At Thu, 12 Dec 2019 22:50:20 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> Hi hackers,
>>
>> I believe I've uncovered a bug that may cause archive status ".ready"
>> files to be created too early, which in turn may cause an incorrect
>> version of the corresponding WAL segment to be archived.
>>
>> The crux of the issue seems to be that XLogWrite() does not wait for
>> the entire record to be written to disk before creating the ".ready"
>> file.  Instead, it just waits for the last page of the segment to be
>> written before notifying the archiver.  If PostgreSQL crashes before
>> it is able to write the rest of the record, it will end up reusing the
>> ".ready" segment at the end of crash recovery.  In the meantime, the
>> archiver process may have already processed the old version of the
>> segment.
>
> Yeah, that can happen if the server is restarted after the crash.
>
>> This issue seems to most often manifest as WAL corruption on standby
>> servers after the primary server has crashed because it ran out of
>> disk space.
>
> In the first place, it's quite bad to set restart_after_crash to on,
> or to simply restart a crashed master in a replication setup. The
> standby can be inconsistent at the time of the master crash, so it
> should be fixed using pg_rewind or recreated from a base backup.
>
> Even without that archiving behavior, a standby may receive WAL bytes
> inconsistent with the bytes the same master wrote just before the
> crash. This is not limited to segment boundaries: it can happen on
> every block boundary, and could happen anywhere with more complicated
> steps.
>
> What you are calling a "problem" seems to come from allowing the
> restart_after_crash behavior. On the other hand, as recommended in the
> documentation, archive_command can refuse to overwrite an existing
> segment, but we don't enforce that.
>
> As a result, the patch doesn't seem to buy anything beyond setting
> things up and operating correctly.

Disregarding the behavior of standby servers for a minute, I think
that what I've described is still a problem for archiving.  If the
segment is archived too early, point-in-time restores that require it
will fail.  If the server refuses to overwrite existing archive files,
the archiver process may fail to process the "good" version of the
segment until someone takes action to fix it.  I think this is
especially troubling for backup utilities like pgBackRest that check
the archive_status directory independently since it is difficult to
know if the segment is truly ".ready".

I've attached a slightly improved patch to show how this might be
fixed.  I am curious what concerns there are about doing something
like it to prevent this scenario.

>> Another thing I am exploring is whether a crash in between writing the
>> last page of a segment and creating the ".ready" file could cause the
>> archiver process to skip processing it altogether.  In the scenario I
>> mention earlier, the server seems to recreate the ".ready" file since
>> it rewrites a portion of the segment.  However, if a WAL record fits
>> perfectly into the last section of the segment, I am not sure whether
>> the ".ready" file would be created after restart.
>
> Why would that segment need .ready after restart, even though nothing
> more could be written to the old segment?

If a ".ready" file is never created for a segment, I don't think it
will be archived.

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
Alvaro Herrera
Date:
On 2019-Dec-13, Kyotaro Horiguchi wrote:

> At Thu, 12 Dec 2019 22:50:20 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 

> > The crux of the issue seems to be that XLogWrite() does not wait for
> > the entire record to be written to disk before creating the ".ready"
> > file.  Instead, it just waits for the last page of the segment to be
> > written before notifying the archiver.  If PostgreSQL crashes before
> > it is able to write the rest of the record, it will end up reusing the
> > ".ready" segment at the end of crash recovery.  In the meantime, the
> > archiver process may have already processed the old version of the
> > segment.
> 
> Yeah, that can happen if the server is restarted after the crash.

... which is the normal way to run things, no?

> > servers after the primary server has crashed because it ran out of
> > disk space.
> 
> In the first place, it's quite bad to set restart_after_crash to on,
> or to simply restart a crashed master in a replication setup.

Why is it bad?  It's the default value.

> The standby can be inconsistent at the time of the master crash, so it
> should be fixed using pg_rewind or recreated from a base backup.

Surely the master will just come up and replay its WAL, and there should
be no inconsistency.

You seem to be thinking that a standby is promoted immediately on crash
of the master, but this is not a given.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
Ugh. I must apologize for the last bogus comment.

At Fri, 13 Dec 2019 21:24:36 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 12/12/19, 8:08 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > As a result, the patch doesn't seem to buy anything beyond setting
> > things up and operating correctly.
> 
> Disregarding the behavior of standby servers for a minute, I think

I'm sorry. A continuation record split across a segment boundary
doesn't seem to harm replication. Please forget it.

> that what I've described is still a problem for archiving.  If the

Yeah, I think that can happen, and it seems like a problem.

> segment is archived too early, point-in-time restores that require it
> will fail.  If the server refuses to overwrite existing archive files,
> the archiver process may fail to process the "good" version of the
> segment until someone takes action to fix it.  I think this is
> especially troubling for backup utilities like pgBackRest that check
> the archive_status directory independently since it is difficult to
> know if the segment is truly ".ready".
> 
> I've attached a slightly improved patch to show how this might be
> fixed.  I am curious what concerns there are about doing something
> like it to prevent this scenario.

Basically, I agree to the direction, where the .ready notification is
delayed until all requested WAL bytes are written out.

But I think I found a corner case where the patch doesn't work. As I
mentioned in another message, if the WAL buffer is full,
AdvanceXLInsertBuffer calls XLogWrite to write out the victim buffer
regardless of whether the last record on the page is the first half of
a continuation record. XLogWrite can mark the segment as .ready even
with the patch.

Is that correct? And do you think the corner case is worth amending?

If so, we could also amend that case by marking the last segment as
.ready when XLogWrite writes the first bytes of the next segment. (As
a further corner case, it still doesn't work if a continuation record
spans three or more segments.. But I don't think, or don't want to
think, we need to consider that case..)


Thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
Thank you Alvaro for the comment (on my comment).

At Fri, 13 Dec 2019 18:33:44 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in 
> On 2019-Dec-13, Kyotaro Horiguchi wrote:
> 
> > At Thu, 12 Dec 2019 22:50:20 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> 
> > > The crux of the issue seems to be that XLogWrite() does not wait for
> > > the entire record to be written to disk before creating the ".ready"
> > > file.  Instead, it just waits for the last page of the segment to be
> > > written before notifying the archiver.  If PostgreSQL crashes before
> > > it is able to write the rest of the record, it will end up reusing the
> > > ".ready" segment at the end of crash recovery.  In the meantime, the
> > > archiver process may have already processed the old version of the
> > > segment.
> > 
> > Yeah, that can happen if the server is restarted after the crash.
> 
> ... which is the normal way to run things, no?

Yes. In older versions (< 10), the default value for wal_level was
minimal. In 10, only the default for wal_level was changed, to
replica. Still, I'm not sure if restart_after_crash can be recommended
for streaming replication...

> Why is it bad?  It's the default value.

I reconsidered it more deeply and concluded that it does not harm
replication as I had thought.

A WAL-buffer overflow may write a partial continuation record, and it
can be flushed immediately. That made me misunderstand: I thought a
standby could receive only the first half of a continuation record.
Actually, that write doesn't advance LogwrtResult.Flush, so a standby
doesn't receive a record split on a page boundary. (The cases where a
crashed master is reused as-is as a new standby might have
contaminated my thinking..)

Sorry for the bogus comment.  My conclusion here is that
restart_after_crash doesn't seem to harm the standby immediately.

> > The standby can be inconsistent at the time of the master crash, so
> > it should be fixed using pg_rewind or recreated from a base backup.
> 
> Surely the master will just come up and replay its WAL, and there should
> be no inconsistency.
> 
> You seem to be thinking that a standby is promoted immediately on crash
> of the master, but this is not a given.

Basically no, though my thinking may have been a bit muddled. Anyway,
returning to the proposal: XLogWrite can be called when the WAL buffer
is full, and it can reach the last page of a segment. The proposed
patch doesn't work there, since that XLogWrite call doesn't write the
whole continuation record. But I'm not sure that corner case is worth
amending..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive status ".ready" files may be created too early

From
Fujii Masao
Date:
On Fri, Dec 13, 2019 at 7:50 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> Hi hackers,
>
> I believe I've uncovered a bug that may cause archive status ".ready"
> files to be created too early, which in turn may cause an incorrect
> version of the corresponding WAL segment to be archived.
>
> The crux of the issue seems to be that XLogWrite() does not wait for
> the entire record to be written to disk before creating the ".ready"
> file.  Instead, it just waits for the last page of the segment to be
> written before notifying the archiver.  If PostgreSQL crashes before
> it is able to write the rest of the record, it will end up reusing the
> ".ready" segment at the end of crash recovery.  In the meantime, the
> archiver process may have already processed the old version of the
> segment.

Maybe I'm missing something... But since XLogWrite() seems to
call issue_xlog_fsync() before XLogArchiveNotifySeg(), ISTM that
this trouble shouldn't happen. No?

Regards,

-- 
Fujii Masao



Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
Hello.

At Wed, 18 Dec 2019 14:10:04 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in 
> On Fri, Dec 13, 2019 at 7:50 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> >
> > Hi hackers,
> >
> > I believe I've uncovered a bug that may cause archive status ".ready"
> > files to be created too early, which in turn may cause an incorrect
> > version of the corresponding WAL segment to be archived.
> >
> > The crux of the issue seems to be that XLogWrite() does not wait for
> > the entire record to be written to disk before creating the ".ready"
> > file.  Instead, it just waits for the last page of the segment to be
> > written before notifying the archiver.  If PostgreSQL crashes before
> > it is able to write the rest of the record, it will end up reusing the
> > ".ready" segment at the end of crash recovery.  In the meantime, the
> > archiver process may have already processed the old version of the
> > segment.
> 
> Maybe I'm missing something... But since XLogWrite() seems to
> call issue_xlog_fsync() before XLogArchiveNotifySeg(), ISTM that
> this trouble shouldn't happen. No?

The trouble happens after the synced file is archived.  If the last
record in the archived segment was the first half of a continuation
record and the server crashes before writing the last half, crash
recovery stops just before the first half, and a different record can
be written over it. As a result, the archived version of the segment
diverges from the server's WAL.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 12/17/19, 2:26 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> But I think I found a corner case where the patch doesn't work. As I
> mentioned in another message, if the WAL buffer is full,
> AdvanceXLInsertBuffer calls XLogWrite to write out the victim buffer
> regardless of whether the last record on the page is the first half of
> a continuation record. XLogWrite can mark the segment as .ready even
> with the patch.
>
> Is that correct? And do you think the corner case is worth amending?

I certainly think it is worth trying to prevent potential WAL archive
corruption in known corner cases.  Your comment highlights a potential
shortcoming of my patch.  AFAICT there is no guarantee that
XLogWrite() is called with a complete WAL record.  Even if that
assumption is true at the moment, it might not hold up over time.

> If so, we could also amend that case by marking the last segment as
> .ready when XLogWrite writes the first bytes of the next segment. (As
> a further corner case, it still doesn't work if a continuation record
> spans three or more segments.. But I don't think, or don't want to
> think, we need to consider that case..)

I'm working on a new version of the patch that will actually look at
the WAL page metadata to determine when it is safe to mark a segment
as ready for archival.  It seems relatively easy to figure out whether
a page is the last one for the current WAL record.
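
Roughly, the check I have in mind looks like the sketch below. It is
only an illustration: the struct mirrors the fields of
XLogPageHeaderData from access/xlog_internal.h, but the helper and the
example values are made up, not code from my patch.

    /* sketch: has the record continued from the previous segment finished? */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define XLP_FIRST_IS_CONTRECORD 0x0001      /* as in xlog_internal.h */

    typedef struct                              /* mirrors XLogPageHeaderData */
    {
        uint16_t xlp_magic;
        uint16_t xlp_info;
        uint32_t xlp_tli;
        uint64_t xlp_pageaddr;
        uint32_t xlp_rem_len;   /* remaining bytes of a record begun earlier */
    } SketchPageHeader;

    /*
     * The previous segment is only safe to archive once any record that
     * continues onto this first page of the next segment has been fully
     * written.  bytes_written counts what is on disk past the page header.
     */
    static bool
    prev_segment_record_complete(const SketchPageHeader *first_page,
                                 uint64_t bytes_written)
    {
        if ((first_page->xlp_info & XLP_FIRST_IS_CONTRECORD) == 0)
            return true;        /* nothing spills into this segment */
        return bytes_written >= first_page->xlp_rem_len;
    }

    int
    main(void)
    {
        SketchPageHeader h = {0};

        h.xlp_info = XLP_FIRST_IS_CONTRECORD;
        h.xlp_rem_len = 200;    /* 200 bytes of the old record continue here */

        printf("safe after 100 bytes? %d\n", prev_segment_record_complete(&h, 100));
        printf("safe after 200 bytes? %d\n", prev_segment_record_complete(&h, 200));
        return 0;
    }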

Nathan


Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 12/18/19, 8:34 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 12/17/19, 2:26 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>> If so, we could also amend that case by marking the last segment as
>> .ready when XLogWrite writes the first bytes of the next segment. (As
>> a further corner case, it still doesn't work if a continuation record
>> spans three or more segments.. But I don't think, or don't want to
>> think, we need to consider that case..)
>
> I'm working on a new version of the patch that will actually look at
> the WAL page metadata to determine when it is safe to mark a segment
> as ready for archival.  It seems relatively easy to figure out whether
> a page is the last one for the current WAL record.

I stand corrected.  My attempts to add logic to check the WAL records
added quite a bit more complexity than seemed reasonable to maintain
in this code path.  For example, I didn't anticipate things like
XLOG_SWITCH records.

I am still concerned about the corner case you noted, but I have yet
to find a practical way to handle it.  You suggested waiting until
writing the first bytes of the next segment before marking a segment
as ready, but I'm not sure that fixes this problem either, and I
wonder if it could result in waiting arbitrarily long before creating
a ".ready" file in some cases.  Perhaps I am misunderstanding your
suggestion.

Another thing I noticed is that any changes in this area could impact
archive_timeout.  If we reset the archive_timeout timer when we mark
the segments ready, we could force WAL switches more often.  If we do
not move the timer logic, we could be resetting it before the file is
ready for the archiver.  However, these differences might be subtle
enough to be okay.

Nathan


Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
At Sat, 21 Dec 2019 01:18:24 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 12/18/19, 8:34 AM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> > On 12/17/19, 2:26 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> >> If so, we could also amend that case by marking the last segment as
> >> .ready when XLogWrite writes the first bytes of the next segment. (As
> >> a further corner case, it still doesn't work if a continuation record
> >> spans three or more segments.. But I don't think, or don't want to
> >> think, we need to consider that case..)
> >
> > I'm working on a new version of the patch that will actually look at
> > the WAL page metadata to determine when it is safe to mark a segment
> > as ready for archival.  It seems relatively easy to figure out whether
> > a page is the last one for the current WAL record.
> 
> I stand corrected.  My attempts to add logic to check the WAL records
> added quite a bit more complexity than seemed reasonable to maintain
> in this code path.  For example, I didn't anticipate things like
> XLOG_SWITCH records.
> 
> I am still concerned about the corner case you noted, but I have yet
> to find a practical way to handle it.  You suggested waiting until
> writing the first bytes of the next segment before marking a segment
> as ready, but I'm not sure that fixes this problem either, and I
> wonder if it could result in waiting arbitrarily long before creating
> a ".ready" file in some cases.  Perhaps I am misunderstanding your
> suggestion.
> 
> Another thing I noticed is that any changes in this area could impact
> archive_timeout.  If we reset the archive_timeout timer when we mark
> the segments ready, we could force WAL switches more often.  If we do
> not move the timer logic, we could be resetting it before the file is
> ready for the archiver.  However, these differences might be subtle
> enough to be okay.

You're right. That doesn't seem to work. Another thing I had in mind
was giving XLogWrite an additional flag so that AdvanceXLInsertBuffer
can tell it not to mark .ready. That function is called while a
complete record *is* being written, so even if AdvanceXLInsertBuffer
inhibits marking .ready, the succeeding bytes come soon and we can
mark the old segment as .ready at that time.

..
+ * If record_write == false, we don't mark the last segment as .ready
+ * even if the caller requested a write up to a segment boundary.
..
  static void
- XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
+ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool record_write)

When XLogWrite is called with record_write = false, we don't mark
.ready and don't advance lastSegSwitchTime/LSN. The next time
XLogWrite is called with record_write = true, if lastSegSwitchLSN is
behind the latest segment boundary at or before LogwrtResult.Write, we
mark the skipped segments as .ready and update
lastSegSwitchTime/LSN.
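
A rough standalone sketch of that catch-up step (names simplified;
this is not the real xlog.c code):

    /* sketch: notify segments skipped while record_write was false */
    #include <stdint.h>
    #include <stdio.h>

    #define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

    static uint64_t lastSegSwitchLSN;   /* stand-in for the shared value */

    static void
    mark_ready(uint64_t segno)
    {
        printf("create .ready for segment %llu\n", (unsigned long long) segno);
    }

    /* called only when XLogWrite runs with record_write = true */
    static void
    catch_up_notify(uint64_t write_ptr)
    {
        /* latest segment boundary at or before LogwrtResult.Write */
        uint64_t boundary = (write_ptr / SEG_SIZE) * SEG_SIZE;
        uint64_t seg;

        if (lastSegSwitchLSN >= boundary)
            return;             /* no segment was skipped */

        for (seg = lastSegSwitchLSN / SEG_SIZE; seg < boundary / SEG_SIZE; seg++)
            mark_ready(seg);

        lastSegSwitchLSN = boundary;    /* also update lastSegSwitchTime */
    }

    int
    main(void)
    {
        lastSegSwitchLSN = 3 * SEG_SIZE;        /* segments 3 and 4 were... */
        catch_up_notify(5 * SEG_SIZE + 1234);   /* ...written, record_write=false */
        return 0;
    }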

Does the above make sense?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 12/23/19, 6:09 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> You're right. That doesn't seem to work. Another thing I had in mind
> was giving XLogWrite an additional flag so that AdvanceXLInsertBuffer
> can tell it not to mark .ready. That function is called while a
> complete record *is* being written, so even if AdvanceXLInsertBuffer
> inhibits marking .ready, the succeeding bytes come soon and we can
> mark the old segment as .ready at that time.
>
> ..
> + * If record_write == false, we don't mark the last segment as .ready
> + * even if the caller requested a write up to a segment boundary.
> ..
>   static void
> - XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
> + XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool record_write)
>
> When XLogWrite is called with record_write = false, we don't mark
> .ready and don't advance lastSegSwitchTime/LSN. The next time
> XLogWrite is called with record_write = true, if lastSegSwitchLSN is
> behind the latest segment boundary at or before LogwrtResult.Write, we
> mark the skipped segments as .ready and update
> lastSegSwitchTime/LSN.

Thanks for the suggestion.  I explored this proposal a bit today.
It looks like there are three places where XLogWrite() is called:
AdvanceXLInsertBuffer(), XLogFlush(), and XLogBackgroundFlush().  IIUC
while XLogFlush() generally seems to be used to write complete records
to disk, this might not be true for XLogBackgroundFlush(), and we're
reasonably sure we cannot make such an assumption for
AdvanceXLInsertBuffer().  Therefore, we would likely only set
record_write to true for XLogFlush() and for certain calls to
XLogBackgroundFlush() (e.g., flushing asynchronous commits).

I'm worried that this approach could be fragile and that we could end
up waiting an arbitrarily long time before marking segments as ready
for archival.  Even if we pay very close attention to the latest
flushed LSN, it seems possible that a non-record_write call to
XLogWrite() advances things such that we avoid ever calling it with
record_write = true.  For example, XLogBackgroundFlush() may have
flushed the completed blocks, which we cannot assume are complete
records.  Then, XLogFlush() would skip calling XLogWrite() if
LogwrtResult.Flush is sufficiently far ahead.  In this scenario, I
don't think we would mark any eligible segments as ".ready" until the
next call to XLogWrite() with record_write = true, which may never
happen.

The next approach I'm going to try is having the callers of
XLogWrite() manage marking segments ready.  That might make it easier
to mitigate some of my concerns above, but I'm not tremendously
optimistic that this approach will fare any better.

Nathan


Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
Sorry for the long delay.

I've finally gotten to a new approach that I think is promising.  My
previous attempts to fix this within XLogWrite() or within the
associated code paths all seemed to miss corner cases or to add far
too much complexity.  The new proof-of-concept patch that I have
attached is much different.  Instead of trying to adjust the ready-
for-archive logic in the XLogWrite() code paths, I propose relocating
the ready-for-archive logic to a separate process.

The v3 patch is a proof-of-concept patch that moves the ready-for-
archive logic to the WAL writer process.  We mark files as ready-for-
archive when the WAL flush pointer has advanced beyond a known WAL
record boundary.  In this patch, I am using the WAL insert location as
the known WAL record boundary.  The main idea is that it should be
safe to archive a segment once we know the last WAL record for the WAL
segment, which may overflow into the following segment, has been
completely written to disk.
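
Expressed as a standalone sketch, the rule the separate process
applies is roughly the following (the names are invented for
illustration and do not appear in the patch):

    /* sketch: only segments ending before a known record boundary are safe */
    #include <stdint.h>
    #include <stdio.h>

    #define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

    int
    main(void)
    {
        /* values the process would sample from shared memory */
        uint64_t insert_boundary = 24 * SEG_SIZE + 200; /* known record boundary */
        uint64_t flush_ptr = 24 * SEG_SIZE + 500;       /* WAL flush pointer */
        uint64_t last_notified_seg = 22;

        if (flush_ptr >= insert_boundary)
        {
            /* every segment strictly before the boundary's segment is stable */
            uint64_t safe_up_to = insert_boundary / SEG_SIZE - 1;
            uint64_t seg;

            for (seg = last_notified_seg + 1; seg <= safe_up_to; seg++)
                printf("mark segment %llu .ready\n", (unsigned long long) seg);
        }
        return 0;
    }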

There are many things missing from this proof-of-concept patch that
will need to be handled if this approach seems reasonable.  For
example, I haven't looked into any adjustments needed for the
archive_timeout parameter, I haven't added a way to persist the
"latest segment marked ready-for-archive" through crashes, I haven't
tried reducing the frequency of retrieving the WAL locations, and I'm
not sure the WAL writer process is even the right location for this
logic.  However, these remaining problems all seem tractable to me.

I would appreciate your feedback on whether you believe this approach
is worth pursuing.

Nathan


Attachment

RE: archive status ".ready" files may be created too early

From
"matsumura.ryo@fujitsu.com"
Date:
2020-03-26 18:50:24 Bossart, Nathan <bossartn(at)amazon(dot)com> wrote:
> The v3 patch is a proof-of-concept patch that moves the ready-for-
> archive logic to the WAL writer process.  We mark files as ready-for-
> archive when the WAL flush pointer has advanced beyond a known WAL
> record boundary.


I like such a simple resolution, but I cannot agree with it.

1.
This patch makes wal_writer_delay have two meanings. For example, a
user who sets the parameter to a bigger value gets archived files
later.

2.
Even if we create a new parameter, we and users cannot determine the
best value.

3.
PostgreSQL guarantees that if a database cluster is stopped smartly,
the cluster has flushed and archived all WAL records, as described in
this xlog.c comment:

 [xlog.c]
  * If archiving is enabled, rotate the last XLOG file so that all the
  * remaining records are archived (postmaster wakes up the archiver
  * process one more time at the end of shutdown). The checkpoint
  * record will go to the next XLOG file and won't be archived (yet).

Therefore, the idea may require shutdown synchronization between the
WAL writer and the archiver (pgarch).  I cannot agree with that:
shutdown processing is inherently complex, and the synchronization
would make it more complicated.  Your idea trades timeliness of
notification for simplicity, but I think the synchronization may ruin
that merit.

4.
I found that the patch can miss a chance to notify.  We have to be
more careful.  In the following case, the WAL writer notifies after a
little less than three times wal_writer_delay in the worst case.  That
may not be acceptable depending on the value of wal_writer_delay, and
if we create a new parameter, we cannot explain this behavior to
users.

Premise:
- Seg1 has already been notified.
- FlushedPtr is 0/2D00000 (= all WAL records are flushed).

-----
Step 1.
Backend-A updates InsertPtr to 0/2E00000, but has not yet copied its
WAL record to the buffer.

Step 2. (sleep)
The WAL writer memorizes InsertPtr 0/2E00000 in a local variable
(LocalInsertPtr) and sleeps because FlushedPtr has not passed
InsertPtr.

Step 3.
Backend-A copies its WAL record to the buffer.

Step 4.
Backend-B updates InsertPtr to 0/3100000, copies its record to the
buffer, commits (flushing by itself), and updates FlushedPtr to
0/3100000.

Step 5.
The WAL writer detects that FlushedPtr (0/3100000) has passed
LocalInsertPtr (0/2E00000), but it cannot notify Seg2 even though Seg2
should be notified.

This is caused by the WAL writer not knowing which record crosses the
segment boundary.

Then, in the worst case, Seg2 is notified only after two more sleeps
spent checking whether FlushedPtr has passed InsertPtr again.

Step 6. (sleep)
The WAL writer sleeps.

Step 7.
Backend-C inserts a WAL record, flushes it, and updates as follows:
InsertPtr --> 0/3200000
FlushedPtr --> 0/3200000

Step 8.
Backend-D updates InsertPtr to 0/3300000, but has not yet copied its
record to the buffer.

Step 9. (sleep)
The WAL writer memorizes InsertPtr 0/3300000 in LocalInsertPtr and
sleeps because FlushedPtr is still 0/3200000.

Step 10.
Backend-D copies its record.

Step 11.
Someone (Backend-X or the WAL writer) flushes and updates FlushedPtr
to 0/3300000.

Step 12.
The WAL writer detects that FlushedPtr (0/3300000) has reached
LocalInsertPtr (0/3300000) and notifies Seg2.
-----


I'm preparing a patch in which a backend inserting a segment-crossing
WAL record leaves its EndRecPtr, and whoever flushes past it checks
the EndRecPtr and notifies..


Regards
Ryo Matsumura

Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 5/28/20, 11:42 PM, "matsumura.ryo@fujitsu.com" <matsumura.ryo@fujitsu.com> wrote:
> I'm preparing a patch in which a backend inserting a segment-crossing
> WAL record leaves its EndRecPtr, and whoever flushes past it checks
> the EndRecPtr and notifies..

Thank you for sharing your thoughts.  I will be happy to take a look
at your patch.

Nathan


RE: archive status ".ready" files may be created too early

From
"matsumura.ryo@fujitsu.com"
Date:
> On 5/28/20, 11:42 PM, "matsumura.ryo@fujitsu.com" <matsumura.ryo@fujitsu.com>
> wrote:
> > I'm preparing a patch in which a backend inserting a segment-crossing
> > WAL record leaves its EndRecPtr, and whoever flushes past it checks
> > the EndRecPtr and notifies..


I'm sorry for my slow work.

I attach a patch.
I also attach a simple target test for primary.


1. Description of the primary side

[Basic problem]
  A process flushing WAL doesn't know whether the flushed RecPtr is the
  EndRecPtr of a cross-segment-boundary WAL record, because only the
  process inserting the record knows that, and it never records the
  information anywhere.

[Basic concept of the patch on the primary]
  A process inserting a cross-segment-boundary WAL record memorizes its
  EndRecPtr (I call it CrossBoundaryEndRecPtr) in a new structure in
  XLogCtl.
  A flushing process creates .ready (later I call this just 'notify')
  for the segment preceding the one containing a CrossBoundaryEndRecPtr
  only when its flushed RecPtr is equal to or greater than that
  CrossBoundaryEndRecPtr.

[Detail of the implementation on the primary]
* Structure of the CrossBoundaryEndRecPtrs
  The requirements for the structure are as follows:
  - The system must memorize multiple CrossBoundaryEndRecPtrs.
  - A flushing process must be able to decide quickly, from the flushed
    RecPtr alone, whether to notify.

  Therefore, I implemented the structure as an array (I call it the
  CrossBoundaryEndRecPtr array) parallel to the xlblocks array.
  Strictly, a length of 'xbuffers/wal_segment_size' would be enough, but
  I chose 'xbuffers' for simplicity, which lets the flushing process use
  XLogRecPtrToBufIdx().
  See also the definition of XLogCtl, XLOGShmemSize(), and
  XLOGShmemInit() in my patch.

* Action of the inserting process
  An inserting process memorizes its CrossBoundaryEndRecPtr in the
  CrossBoundaryEndRecPtr array element computed by XLogRecPtrToBufIdx()
  from that CrossBoundaryEndRecPtr.
  If the WAL record crosses many segments, only the element for the last
  segment, the one containing the EndRecPtr, is set; the others are not.
  See also CopyXLogRecordToWAL() in my patch.

* Action of the flushing process
  The overview was already given above:
    A flushing process creates .ready for the segment preceding the one
    containing a CrossBoundaryEndRecPtr only when its flushed RecPtr is
    equal to or greater than that CrossBoundaryEndRecPtr.

  An additional detail: the flushing process may notify many segments if
  the record crosses many segments, so the process memorizes the latest
  notified segment number in latestArchiveNotifiedSegNo in XLogCtl.
  The process notifies from latestArchiveNotifiedSegNo + 1 up to the
  flushing segment number - 1.

  latestArchiveNotifiedSegNo is set to EndOfLog after the Startup
  process exits the replay loop.  A standby sets it at the same point
  (= before promoting).

  Mutual exclusion for latestArchiveNotifiedSegNo is not required,
  because all accesses to it already happen inside the WALWriteLock
  critical section.
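
A condensed standalone sketch of the inserter/flusher interaction
described above (simplified for illustration; the indexing imitates
XLogRecPtrToBufIdx(), and the other names are made up):

    /* sketch: CrossBoundaryEndRecPtr array shared by inserter and flusher */
    #include <stdint.h>
    #include <stdio.h>

    #define BLCKSZ   8192
    #define NBUFFERS 512                    /* illustrative wal_buffers */
    #define SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

    static uint64_t cross_end[NBUFFERS];    /* one slot per WAL buffer page */

    static int
    buf_idx(uint64_t ptr)                   /* like XLogRecPtrToBufIdx() */
    {
        return (ptr / BLCKSZ) % NBUFFERS;
    }

    int
    main(void)
    {
        /* inserter: a record crossing a segment boundary ends at end_ptr */
        uint64_t end_ptr = 24 * SEG_SIZE + 300;
        cross_end[buf_idx(end_ptr)] = end_ptr;

        /* flusher: flushed up to flush_ptr; may it notify segment 23? */
        uint64_t flush_ptr = 24 * SEG_SIZE + 100;
        uint64_t marker = cross_end[buf_idx(flush_ptr)];

        if (marker != 0 && flush_ptr < marker)
            printf("segment 23 not safe yet: record ends at +300, flushed to +100\n");
        return 0;
    }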


2. Description of the standby side

* Who notifies?
  walreceiver also doesn't know whether the flushed RecPtr is the
  EndRecPtr of a cross-segment-boundary WAL record.  On a standby, only
  the Startup process knows that, because the information is hidden in
  the WAL record itself and only the Startup process reads and
  assembles WAL records.

* Action of the Startup process
  Therefore, I implemented it so that walreceiver never notifies and the
  Startup process does.  In detail, when the Startup process reads one
  full-length WAL record, it notifies from the segment containing the
  head (ReadRecPtr) of the record up to the segment preceding the one
  containing the EndRecPtr of the record.

  We must also pay attention to timeline switches.
  The last segment of the previous TimeLineID must be notified before
  switching.  This case is handled when RM_XLOG_ID is detected.


3. Other segment notifications
Two other notification paths remain.  They do not need to be fixed.

(1) Notifying a partial segment
It does not need to be complete, so it's OK to notify without special
consideration.

(2) Re-notifying
Currently, the checkpointer notifies through XLogArchiveCheckDone().
It is a safety net for failures to notify by a backend or the WAL
writer.  A backend or the WAL writer doesn't retry if notifying fails,
but the checkpointer retries when it removes old segments: if it fails
to notify, it does not remove the segment, which makes the
checkpointer retry until the notification succeeds.
Also, in this case, we can just notify without special consideration,
because the checkpointer guarantees that all WAL records included in
the segment have already been flushed.


I would appreciate your review and comments.


Regards
Ryo Matsumura

Attachment

Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
Hello, Matsumura-san.

I agree that the WAL writer is not the place to notify segments, and
the direction you suggested would work.

At Fri, 19 Jun 2020 10:18:34 +0000, "matsumura.ryo@fujitsu.com" <matsumura.ryo@fujitsu.com> wrote in 
> 1. Description of the primary side
> 
> [Basic problem]
>   A process flushing WAL doesn't know whether the flushed RecPtr is the
>   EndRecPtr of a cross-segment-boundary WAL record, because only the
>   process inserting the record knows that, and it never records the
>   information anywhere.
> 
> [Basic concept of the patch on the primary]
>   A process inserting a cross-segment-boundary WAL record memorizes its
>   EndRecPtr (I call it CrossBoundaryEndRecPtr) in a new structure in
>   XLogCtl.
>   A flushing process creates .ready (later I call this just 'notify')
>   for the segment preceding the one containing a CrossBoundaryEndRecPtr
>   only when its flushed RecPtr is equal to or greater than that
>   CrossBoundaryEndRecPtr.
...
>   See also the definition of XLogCtl, XLOGShmemSize(), and
>   XLOGShmemInit() in my patch.

I think we don't need most of that shmem stuff.  XLogWrite is called
after the WAL buffer is filled up to the requested position, so when
it crosses a segment boundary we know that all past
cross-segment-boundary records are stable. That means all we need to
remember is the position of the latest cross-boundary record.

> * Action of the inserting process
>   An inserting process memorizes its CrossBoundaryEndRecPtr in the
>   CrossBoundaryEndRecPtr array element computed by XLogRecPtrToBufIdx()
>   from that CrossBoundaryEndRecPtr.
>   If the WAL record crosses many segments, only the element for the last
>   segment, the one containing the EndRecPtr, is set; the others are not.
>   See also CopyXLogRecordToWAL() in my patch.

If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
every time a record is written, which is mostly wasteful.
XLogInsertRecord already has a code block executed only at page
boundaries.

> * Action of the flushing process
>   The overview was already given above:
>     A flushing process creates .ready for the segment preceding the one
>     containing a CrossBoundaryEndRecPtr only when its flushed RecPtr is
>     equal to or greater than that CrossBoundaryEndRecPtr.
> 
>   An additional detail: the flushing process may notify many segments if
>   the record crosses many segments, so the process memorizes the latest
>   notified segment number in latestArchiveNotifiedSegNo in XLogCtl.
>   The process notifies from latestArchiveNotifiedSegNo + 1 up to the
>   flushing segment number - 1.
>
>   latestArchiveNotifiedSegNo is set to EndOfLog after the Startup
>   process exits the replay loop.  A standby sets it at the same point
>   (= before promoting).
> 
>   Mutual exclusion for latestArchiveNotifiedSegNo is not required,
>   because all accesses to it already happen inside the WALWriteLock
>   critical section.

Looks reasonable.

> 2. Description of the standby side
> 
> * Who notifies?
>   walreceiver also doesn't know whether the flushed RecPtr is the
>   EndRecPtr of a cross-segment-boundary WAL record.  On a standby, only
>   the Startup process knows that, because the information is hidden in
>   the WAL record itself and only the Startup process reads and
>   assembles WAL records.

A standby doesn't write its own WAL records.  Even if the primary sent
an immature record at a segment boundary, the standby would just
promote to a new TLI and start its own history. Nothing breaks.  It
could be a problem if a standby that crashed in the problematic way
were started as-is as a primary, but such a scenario is out of our
scope.

Now we can identify the stable portion of the WAL stream. That is
enough to prevent walsender from sending data that could be
overwritten afterwards. GetReplicationTargetRecPtr() in the attached
patch does that.

> * Action of the Startup process
>   Therefore, I implemented it so that walreceiver never notifies and the
>   Startup process does.  In detail, when the Startup process reads one
>   full-length WAL record, it notifies from the segment containing the
>   head (ReadRecPtr) of the record up to the segment preceding the one
>   containing the EndRecPtr of the record.

I don't like that archiving on a standby relies on replay progress.
We should avoid that, and fortunately I think we don't need it.

>   We must also pay attention to timeline switches.
>   The last segment of the previous TimeLineID must be notified before
>   switching.  This case is handled when RM_XLOG_ID is detected.

That segment is archived later, after being renamed with a ".partial"
suffix. We don't archive the last incomplete segment of the previous
timeline as-is.

> 3. Other segment notifications
> Two other notification paths remain.  They do not need to be fixed.
> 
> (1) Notifying a partial segment
> It does not need to be complete, so it's OK to notify without special
> consideration.
> 
> (2) Re-notifying
> Currently, the checkpointer notifies through XLogArchiveCheckDone().
> It is a safety net for failures to notify by a backend or the WAL
> writer.  A backend or the WAL writer doesn't retry if notifying fails,
> but the checkpointer retries when it removes old segments: if it fails
> to notify, it does not remove the segment, which makes the
> checkpointer retry until the notification succeeds.
> Also, in this case, we can just notify without special consideration,
> because the checkpointer guarantees that all WAL records included in
> the segment have already been flushed.


So it can be simplified as in the attached patch. Any thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 6a2475aec9a871def5f194058f62f3f6991777e9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Thu, 25 Jun 2020 08:50:54 +0900
Subject: [PATCH] Avoid archiving immature records

For a segment-spanning record, if the primary crashes after the first
segment is archived and before the full record is finished, crash
recovery causes the last record of the first segment to be overwritten,
and history diverges between pg_wal and the archive. Avoid that
corruption by preventing immature records from being archived.  Prevent
walsender from sending immature records for the same reason.
---
 src/backend/access/transam/xlog.c   | 124 +++++++++++++++++++++++++++-
 src/backend/replication/walsender.c |  14 ++--
 src/include/access/xlog.h           |   1 +
 3 files changed, 131 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1256a103b..b3d49fbc8b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -724,6 +724,16 @@ typedef struct XLogCtlData
      */
     XLogRecPtr    lastFpwDisableRecPtr;
 
+    /* The last segment notified to be archived. Protected by WALWriteLock */
+    XLogSegNo    lastNotifiedSeg;
+
+    /*
+     * Remember the range of the last segment-spanning record. Protected by
+     * info_lck
+     */
+    XLogRecPtr    lastSegContRecStart;
+    XLogRecPtr    lastSegContRecEnd;
+
     slock_t        info_lck;        /* locks shared variables shown above */
 } XLogCtlData;
 
@@ -1158,6 +1168,9 @@ XLogInsertRecord(XLogRecData *rdata,
      */
     if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
     {
+        XLogSegNo startseg;
+        XLogSegNo endseg;
+
         SpinLockAcquire(&XLogCtl->info_lck);
         /* advance global request to include new block(s) */
         if (XLogCtl->LogwrtRqst.Write < EndPos)
@@ -1165,6 +1178,21 @@ XLogInsertRecord(XLogRecData *rdata,
         /* update local result copy while I have the chance */
         LogwrtResult = XLogCtl->LogwrtResult;
         SpinLockRelease(&XLogCtl->info_lck);
+
+        /* Remember the range of the record if it spans over segments */
+        XLByteToSeg(StartPos, startseg, wal_segment_size);
+        XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
+
+        if (startseg != endseg)
+        {
+            SpinLockAcquire(&XLogCtl->info_lck);
+            if (XLogCtl->lastSegContRecEnd < StartPos)
+            {
+                XLogCtl->lastSegContRecStart = StartPos;
+                XLogCtl->lastSegContRecEnd = EndPos;
+            }
+            SpinLockRelease(&XLogCtl->info_lck);
+        }
     }
 
     /*
@@ -2396,6 +2424,56 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
     return false;
 }
 
+/*
+ * Notify segments that are surely stable.
+ *
+ * If the last segment in pg_wal is complete and ended with a continuation
+ * record, crash recovery results in a diverged history from archive.  Don't
+ * archive a segment until the whole record is finished writing.
+ */
+static void
+NotifyStableSegments(XLogSegNo notifySegNo)
+{
+    XLogRecPtr    archiveTargetRecPtr;
+    XLogSegNo i;
+
+    if (XLogCtl->lastNotifiedSeg < notifySegNo)
+    {
+        XLogRecPtr lastSegContRecStart;
+        XLogRecPtr lastSegContRecEnd;
+        XLogSegNo    notifyUpTo = 0;
+
+        SpinLockAcquire(&XLogCtl->info_lck);
+        lastSegContRecStart = XLogCtl->lastSegContRecStart;
+        lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+        SpinLockRelease(&XLogCtl->info_lck);
+
+        /*
+         * Use start position of the last segment-spanning continuation record
+         * when the record is not flushed completely.
+         */
+        if (lastSegContRecStart < LogwrtResult.Flush &&
+            LogwrtResult.Flush <= lastSegContRecEnd)
+            archiveTargetRecPtr = lastSegContRecStart;
+        else
+            archiveTargetRecPtr = LogwrtResult.Flush;
+
+        XLByteToSeg(archiveTargetRecPtr, notifyUpTo, wal_segment_size);
+
+        /* back to the last complete segment */
+        notifyUpTo--;
+
+        /* cap by given segment */
+        if (notifyUpTo > notifySegNo)
+            notifyUpTo = notifySegNo;
+
+        for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifyUpTo ; i++)
+            XLogArchiveNotifySeg(i);
+
+        XLogCtl->lastNotifiedSeg = notifyUpTo;
+    }
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2583,7 +2661,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
                 LogwrtResult.Flush = LogwrtResult.Write;    /* end of page */
 
                 if (XLogArchivingActive())
-                    XLogArchiveNotifySeg(openLogSegNo);
+                    NotifyStableSegments(openLogSegNo);
 
                 XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
                 XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
@@ -2653,6 +2731,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
         WalSndWakeupRequest();
 
         LogwrtResult.Flush = LogwrtResult.Write;
+
+        /* Now the record is fully written, try to notify stable segments */
+        if (XLogArchivingActive())
+            NotifyStableSegments(openLogSegNo - 1);
     }
 
     /*
@@ -7703,6 +7785,18 @@ StartupXLOG(void)
     XLogCtl->LogwrtRqst.Write = EndOfLog;
     XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+    /*
+     * We have archived up to the previous segment of EndOfLog so far.
+     * Initialize lastNotifiedSeg if needed.
+     */
+    if (XLogArchivingActive())
+    {
+        XLogSegNo    endLogSegNo;
+
+        XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size);
+        XLogCtl->lastNotifiedSeg = endLogSegNo - 1;
+    }
+
     /*
      * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
      * record before resource manager writes cleanup WAL records or checkpoint
@@ -8426,6 +8520,34 @@ GetFlushRecPtr(void)
     return LogwrtResult.Flush;
 }
 
+/*
+ * GetReplicationTargetRecPtr -- Returns the latest position that can be
+ * replicated.  WAL records up to this position won't be overwritten even after
+ * a crash of primary.
+ */
+XLogRecPtr
+GetReplicationTargetRecPtr(void)
+{
+    XLogRecPtr lastSegContRecStart;
+    XLogRecPtr lastSegContRecEnd;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    LogwrtResult = XLogCtl->LogwrtResult;
+    lastSegContRecStart = XLogCtl->lastSegContRecStart;
+    lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    /*
+     * Use start position of the last segment-spanning continuation record
+     * when the record is not flushed completely.
+     */
+    if (lastSegContRecStart < LogwrtResult.Flush &&
+        LogwrtResult.Flush <= lastSegContRecEnd)
+        return lastSegContRecStart;
+
+    return LogwrtResult.Flush;
+}
+
 /*
  * GetLastImportantRecPtr -- Returns the LSN of the last important record
  * inserted. All records not explicitly marked as unimportant are considered
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e2477c47e0..c5682a8836 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2630,14 +2630,14 @@ XLogSendPhysical(void)
         /*
          * Streaming the current timeline on a master.
          *
-         * Attempt to send all data that's already been written out and
-         * fsync'd to disk.  We cannot go further than what's been written out
-         * given the current implementation of WALRead().  And in any case
-         * it's unsafe to send WAL that is not securely down to disk on the
-         * master: if the master subsequently crashes and restarts, standbys
-         * must not have applied any WAL that got lost on the master.
+         * Attempt to send all data that can be replicated.  We cannot go
+         * further than what's been written out given the current
+         * implementation of WALRead().  And in any case it's unsafe to send
+         * WAL that is not securely down to disk on the master: if the master
+         * subsequently crashes and restarts, standbys must not have applied
+         * any WAL that got lost on the master.
          */
-        SendRqstPtr = GetFlushRecPtr();
+        SendRqstPtr = GetReplicationTargetRecPtr();
     }
 
     /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 347a38f57c..ef21418093 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -335,6 +335,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetReplicationTargetRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void RemovePromoteSignalFiles(void);
 
-- 
2.18.4


RE: archive status ".ready" files may be created too early

From
"matsumura.ryo@fujitsu.com"
Date:
Hello, Horiguchi-san

Thank you for your comment and patch.

At Thursday, June 25, 2020 3:36 PM(JST),  "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> I think we don't need most of that shmem stuff.  XLogWrite is called

I didn't want more shmem stuff either, but the other ideas I had need
additional locking that excludes the inserter and the writer from each
other.

> after the WAL buffer is filled up to the requested position, so when
> it crosses a segment boundary we know that all past
> cross-segment-boundary records are stable. That means all we need to
> remember is the position of the latest cross-boundary record.

I could not agree. In the following case, it may not work well:
- record-A and record-B (record-B is the newer one) have been copied, and
- lastSegContRecStart/End point to record-B's positions, and
- FlushPtr has advanced to the middle of record-A.

In the above case, the writer should notify the segments before
record-A, but it notifies the ones before record-B. If the writer
notifies only when it has flushed the latest record completely, it
works well. But then the writer may never be able to notify any
segment when cross-segment-boundary WAL records are inserted
continuously.

So I think we must remember the EndRecPtrs of all such
cross-segment-boundary records in the buffer.


> If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
> every time a record is written, most of which are wasteful.
> XLogInsertRecord already has a code block executed only at every page
> boundary.

I agree.
XLogMarkEndRecPtrIfNeeded() is moved into the code block before updating
LogwrtRqst.Write for avoiding passing-each-other with writer.


> Now we can identify stable portion of WAL stream. It's enough to
> prevent walsender from sending data that can be overwritten
> afterwards. GetReplicationTargetRecPtr() in the attached does that.

I didn't notice it.
I agree basically, but it is based on lastSegContRecStart/End.

So, first of all, we have to agree what should be remebered.


Regards
Ryo Matsumura



Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
Hello, Matsumura-san.

At Mon, 6 Jul 2020 04:02:23 +0000, "matsumura.ryo@fujitsu.com" <matsumura.ryo@fujitsu.com> wrote in 
> Hello, Horiguchi-san
> 
> Thank you for your comment and patch.
> 
> At Thursday, June 25, 2020 3:36 PM(JST),  "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> > I think we don't need most of that shmem stuff.  XLogWrite is called
> 
> I didn't want more shmem stuff either, but the other ideas I had need
> additional locking that excludes the inserter and the writer from
> each other.
> 
> > after the WAL buffer is filled up to the requested position, so when
> > it crosses a segment boundary we know that all past
> > cross-segment-boundary records are stable. That means all we need to
> > remember is the position of the latest cross-boundary record.
> 
> I could not agree. In the following case, it may not work well:
> - record-A and record-B (record-B is the newer one) have been copied, and
> - lastSegContRecStart/End point to record-B's positions, and
> - FlushPtr has advanced to the middle of record-A.

IIUC, that means record-B is a cross-segment-border record and we have
flushed beyond record-B. In that case crash recovery afterwards can
read the complete record-B and will finish recovery *after* record-B.
That's what we need here.

> In the above case, the writer should notify the segments before
> record-A, but it notifies the ones before record-B. If the writer

If you mean that NotifyStableSegments notifies up to the segment
preceding the one where record-A is placed, that's wrong. The issue
here is that crash recovery sees an incomplete record at a segment
border, so it is sufficient that crash recovery can read the last
record by looking at pg_wal.

> notifies only when it has flushed the latest record completely, it works well.

It checks that "lastSegContRecEnd < LogwrtResult.Flush", which means
the last record (B) has been completely flushed out, doesn't it? So it
works well.

> But then the writer may never be able to notify any segment when
> cross-segment-boundary WAL records are inserted continuously.

No. As I mentioned in the previous mail, if we see a
cross-segment-boundary record, the previous cross-segment-boundary
record has been flushed completely, and the segment containing the
first half of the previous cross-segment-boundary record has already
been flushed.  I didn't do that, but we could put an assertion in
XLogInsertRecord like this:

 +      /* Remember the range of the record if it spans over segments */
 +      XLByteToSeg(StartPos, startseg, wal_segment_size);
 +      XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
 +
 +      if (startseg != endseg)
 +      {
++          /* we shouldn't have a record spanning over three or more segments */
 ++          Assert(endseg == startseg + 1);
 +          SpinLockAcquire(&XLogCtl->info_lck);
 +          if (XLogCtl->lastSegContRecEnd < StartPos)
 +          {
 +              XLogCtl->lastSegContRecStart = StartPos;
 +              XLogCtl->lastSegContRecEnd = EndPos;

> So I think that we must remember the EndRecPtr of all such cross-segment-boundary records in a buffer.
> 
> 
> > If we call XLogMarkEndRecPtrIfNeeded() there, the function is called
> > every time a record is written, and most of those calls are wasteful.
> > XLogInsertRecord already has a code block executed only at every page
> > boundary.
> 
> I agree.
> XLogMarkEndRecPtrIfNeeded() has been moved into the code block before updating
> LogwrtRqst.Write to avoid racing with the writer.
> 
> 
> > Now we can identify the stable portion of the WAL stream. It's enough to
> > prevent walsender from sending data that can be overwritten
> > afterwards. GetReplicationTargetRecPtr() in the attached does that.
> 
> I hadn't noticed it.
> I basically agree, but it is based on lastSegContRecStart/End.
> 
> So, first of all, we have to agree on what should be remembered.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: archive status ".ready" files may be created too early

From
"matsumura.ryo@fujitsu.com"
Date:
Hello, Horiguchi-san

At Monday, July 6, 2020 05:13:40 +0000,  "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> > > after WAL buffer is filled up to the requested position. So when it
> > > crosses a segment boundary we know that all past cross-segment-boundary
> > > records are stable. That means all we need to remember is only the
> > > position of the latest cross-boundary record.
> >
> > I cannot agree. In the following case, it may not work well.
> > - record-A and record-B (record-B is the newer one) are copied, and
> > - lastSegContRecStart/End points to record-B's range, and
> > - FlushPtr has advanced only to the middle of record-A.
>
> IIUC, that means record-B is a cross-segment-border record and we have
> flushed beyond record-B. In that case crash recovery afterwards
> can read the complete record-B and will finish recovery *after*
> record-B. That's what we need here.

I'm sorry I didn't explain enough.

Record-A and Record-B are cross-segment-border records.
Record-A spans segment X and X+1.
Record-B spans segment X+2 and X+3.
If both records have been inserted into the WAL buffer, lastSegContRecStart/End points to Record-B.
If a writer flushes up to the middle of segment-X+1, NotifyStableSegments() allows the writer to notify segment-X.
Is my understanding correct?

Regards
Ryo Matsumura



Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
Hello.

# Sorry, I wrongly thought that I replied to this thread..

At Tue, 7 Jul 2020 09:02:56 +0000, "matsumura.ryo@fujitsu.com" <matsumura.ryo@fujitsu.com> wrote in 
> At Monday, July 6, 2020 05:13:40 +0000,  "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> > > > after WAL buffer is filled up to the requested position. So when it
> > > > crosses a segment boundary we know that all past cross-segment-boundary
> > > > records are stable. That means all we need to remember is only the
> > > > position of the latest cross-boundary record.
> > > 
> > > I cannot agree. In the following case, it may not work well.
> > > - record-A and record-B (record-B is the newer one) are copied, and
> > > - lastSegContRecStart/End points to record-B's range, and
> > > - FlushPtr has advanced only to the middle of record-A.
> >
> > IIUC, that means record-B is a cross-segment-border record and we have
> > flushed beyond record-B. In that case crash recovery afterwards
> > can read the complete record-B and will finish recovery *after*
> > record-B. That's what we need here.
> 
> I'm sorry I didn't explain enough.
> 
> Record-A and Record-B are cross-segment-border records.
> Record-A spans segment X and X+1.
> Record-B spans segment X+2 and X+3.

Ok.


> If both records have been inserted into the WAL buffer, lastSegContRecStart/End points to Record-B.
> If a writer flushes up to the middle of segment-X+1, NotifyStableSegments() allows the writer to notify segment-X.
> Is my understanding correct?

I think that that cannot happen, since segment X must have been
flushed by the time Record-A is completely flushed out. When we write
to the next segment, we have already flushed and closed the whole
previous segment. If that were not the case, we would archive segment
files that are not fully flushed and would get broken archive files.

Am I missing something here?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: archive status ".ready" files may be created too early

From
"matsumura.ryo@fujitsu.com"
Date:
Hello,

> At Mon, 13 Jul 2020 01:57:36 +0000, "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> Am I missing something here?

Let me give more detail (see the line marked with * below).

  Record-A and Record-B are cross-segment-border records.
  Record-A spans segment X and X+1.
  Record-B spans segment X+2 and X+3.
  If both records have been inserted into the WAL buffer, lastSegContRecStart/End points to Record-B.
* If a writer flushes segment X and a part of X+1 but record-A is not flushed completely,
  NotifyStableSegments() allows the writer to notify segment-X.

Then Record-A may be invalidated by crash recovery and overwritten by a new WAL record.
Segment X on disk is then no longer the same as the archived one.

Regards
Ryo Matsumura



Re: archive status ".ready" files may be created too early

From
Michael Paquier
Date:
On Wed, Jul 22, 2020 at 02:53:49AM +0000, matsumura.ryo@fujitsu.com wrote:
> Then Record-A may be invalidated by crash recovery and overwritten by a new WAL record.
> Segment X on disk is then no longer the same as the archived one.

Please note that the latest patch fails to apply per the CF bot, so a
rebase would be in order to have at least some automated tests for the
last patch.
--
Michael

Attachment

Re: archive status ".ready" files may be created too early

From
Heikki Linnakangas
Date:
On 07/07/2020 12:02, matsumura.ryo@fujitsu.com wrote:
> At Monday, July 6, 2020 05:13:40 +0000,  "Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
>>>> after WAL buffer is filled up to the requested position. So when it
>>>> crosses a segment boundary we know that all past cross-segment-boundary
>>>> records are stable. That means all we need to remember is only the
>>>> position of the latest cross-boundary record.
>>>
>>> I cannot agree. In the following case, it may not work well.
>>> - record-A and record-B (record-B is the newer one) are copied, and
>>> - lastSegContRecStart/End points to record-B's range, and
>>> - FlushPtr has advanced only to the middle of record-A.
>>
>> IIUC, that means record-B is a cross-segment-border record and we have
>> flushed beyond record-B. In that case crash recovery afterwards
>> can read the complete record-B and will finish recovery *after*
>> record-B. That's what we need here.
> 
> I'm sorry I didn't explain enough.
> 
> Record-A and Record-B are cross-segment-border records.
> Record-A spans segment X and X+1.
> Record-B spans segment X+2 and X+3.
> If both records have been inserted into the WAL buffer, lastSegContRecStart/End points to Record-B.
> If a writer flushes up to the middle of segment-X+1, NotifyStableSegments() allows the writer to notify segment-X.
> Is my understanding correct?

I think this little ASCII drawing illustrates the above scenario:

         AAAAA  F  BBBBB
|---------|---------|---------|
    seg X    seg X+1   seg X+2

AAAAA and BBBBB are Record-A and Record-B. F is the current flush pointer.

In this case, it would be OK to notify segment X, as long as F is 
greater than the end of record A. And if I'm reading Kyotaro's patch 
correctly, that's what would happen with the patch.
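
Expressed as code, that safety condition is roughly the following (a
sketch only, reusing the variable names from the patch):

    /*
     * It is safe to notify the just-finished segment only if the flush
     * pointer is not inside the last segment-spanning record: either we
     * have not reached the record's start, or we have flushed past its
     * end.
     */
    static bool
    SafeToNotify(XLogRecPtr flushPtr,
                 XLogRecPtr lastSegContRecStart,
                 XLogRecPtr lastSegContRecEnd)
    {
        return flushPtr < lastSegContRecStart ||    /* before the record */
               lastSegContRecEnd <= flushPtr;       /* record fully flushed */
    }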

The patch seems correct to me. I'm a bit sad that we have to track yet 
another WAL position (two, actually) to fix this, but I don't see a 
better way.

I wonder if we should arrange things so that XLogwrtResult.Flush never 
points in the middle of a record? I'm not totally convinced that all the 
current callers of GetFlushRecPtr() are OK with a middle-of-WAL-record 
value. Could we get into similar trouble if a standby replicates half of 
a cross-segment record to a cascaded standby, and the cascaded standby 
has WAL archiving enabled?

- Heikki



Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
Thanks for visiting this thread.

At Mon, 12 Oct 2020 15:04:40 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in 
> On 07/07/2020 12:02, matsumura.ryo@fujitsu.com wrote:
> > At Monday, July 6, 2020 05:13:40 +0000, "Kyotaro Horiguchi
> > <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> >>>> after WAL buffer is filled up to the requested position. So when it
> >>>> crosses a segment boundary we know that all past cross-segment-boundary
> >>>> records are stable. That means all we need to remember is only the
> >>>> position of the latest cross-boundary record.
> >>>
> >>> I cannot agree. In the following case, it may not work well.
> >>> - record-A and record-B (record-B is the newer one) are copied, and
> >>> - lastSegContRecStart/End points to record-B's range, and
> >>> - FlushPtr has advanced only to the middle of record-A.
> >>
> >> IIUC, that means record-B is a cross-segment-border record and we have
> >> flushed beyond record-B. In that case crash recovery afterwards
> >> can read the complete record-B and will finish recovery *after*
> >> record-B. That's what we need here.
> > I'm sorry I didn't explain enough.
> > Record-A and Record-B are cross-segment-border records.
> > Record-A spans segment X and X+1.
> > Record-B spans segment X+2 and X+3.
> > If both records have been inserted into the WAL buffer,
> > lastSegContRecStart/End points to Record-B.
> > If a writer flushes up to the middle of segment-X+1,
> > NotifyStableSegments() allows the writer to notify segment-X.
> > Is my understanding correct?
> 
> I think this little ASCII drawing illustrates the above scenario:
> 
>         AAAAA  F  BBBBB
> |---------|---------|---------|
>    seg X    seg X+1   seg X+2
> 
> AAAAA and BBBBB are Record-A and Record-B. F is the current flush
> pointer.

I modified the figure a bit for the explanation below.

          F0        F1
        AAAAA  F  BBBBB
|---------|---------|---------|
   seg X    seg X+1   seg X+2

Matsumura-san has a concern about the case where there are two (or
more) partially-flushed segment-spanning records at the same time.

This patch remembers only the last cross-segment record. If we were
going to flush up to F0 after Record-B had been written, we would fail
to hold off archiving seg-X. This patch is based on the assumption that
that case cannot happen, because we don't leave a pending page at the
time of a segment switch and no record spans over three or more
segments.
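
Stated as code, the invariant that the single-pair bookkeeping relies
on is roughly the following (a sketch only, with a hypothetical helper
name; whether the invariant always holds is questioned later in the
thread):

    /*
     * Assumed invariant: when a new record crosses a segment boundary,
     * the previous boundary-crossing record is already fully flushed,
     * because a segment is flushed and closed before the next one is
     * written.
     */
    static void
    RememberLastContRecord(XLogRecPtr startPos, XLogRecPtr endPos)
    {
        SpinLockAcquire(&XLogCtl->info_lck);
        Assert(XLogCtl->lastSegContRecEnd <= XLogCtl->LogwrtResult.Flush);
        XLogCtl->lastSegContRecStart = startPos;
        XLogCtl->lastSegContRecEnd = endPos;
        SpinLockRelease(&XLogCtl->info_lck);
    }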

> In this case, it would be OK to notify segment X, as long as F is
> greater than the end of record A. And if I'm reading Kyotaro's patch
> correctly, that's what would happen with the patch.
> 
> The patch seems correct to me. I'm a bit sad that we have to track yet
> another WAL position (two, actually) to fix this, but I don't see a
> better way.

Does "the two" mean Record-A and B? Is it needed even with the
assumption above?

> I wonder if we should arrange things so that XLogwrtResult.Flush never
> points in the middle of a record? I'm not totally convinced that all

That happens at a good percentage of page boundaries. And a record can
span over three or more pages. Do we need to avoid all such cases?

I did that only for the cross-segment case.

> the current callers of GetFlushRecPtr() are OK with a middle-of-WAL-record
> value. Could we get into similar trouble if a standby
> replicates half of a cross-segment record to a cascaded standby, and
> the cascaded standby has WAL archiving enabled?

The patch includes a fix for the primary->standby case. But I'm not sure
we can do that in the cascaded case. A standby is not aware of the
structure of a WAL blob and has no idea of up to where to send the
received blobs. However, if we can rely on the behavior of CopyData
that we always receive a blob as a whole, sent from the sender at once,
cascaded standbys are free from the trouble (as long as the
cascaded standby doesn't crash just before writing the last half of a
record into pg_wal and after archiving the last full segment, which
seems unlikely).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From e8320f7486c057ccb97be56ce1859296b8072256 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 13 Oct 2020 20:41:33 +0900
Subject: [PATCH v2] Avoid archiving a WAL segment that continues to the next
 segment

If the last record of a finished segment continues to the next segment,
we need to defer archiving of the segment until the record is flushed
to the end. Otherwise crash recovery can overwrite the last record of
a segment and history diverges between archive and pg_wal.
---
 src/backend/access/transam/xlog.c   | 160 +++++++++++++++++++++++++++-
 src/backend/replication/walsender.c |  14 +--
 src/include/access/xlog.h           |   1 +
 3 files changed, 163 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 52a67b1170..8fd0fdb598 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -723,6 +723,16 @@ typedef struct XLogCtlData
      */
     XLogRecPtr    lastFpwDisableRecPtr;
 
+    /* The last segment notified to be archived. Protected by WALWriteLock */
+    XLogSegNo    lastNotifiedSeg;
+
+    /*
+     * Remember the range of the last segment-spanning record. Protected by
+     * info_lck
+     */
+    XLogRecPtr    lastSegContRecStart;
+    XLogRecPtr    lastSegContRecEnd;
+
     slock_t        info_lck;        /* locks shared variables shown above */
 } XLogCtlData;
 
@@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata,
      */
     if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
     {
+        XLogSegNo startseg;
+        XLogSegNo endseg;
+
         SpinLockAcquire(&XLogCtl->info_lck);
         /* advance global request to include new block(s) */
         if (XLogCtl->LogwrtRqst.Write < EndPos)
@@ -1167,6 +1180,21 @@ XLogInsertRecord(XLogRecData *rdata,
         /* update local result copy while I have the chance */
         LogwrtResult = XLogCtl->LogwrtResult;
         SpinLockRelease(&XLogCtl->info_lck);
+
+        /* Remember the range of the record if it spans over segments */
+        XLByteToSeg(StartPos, startseg, wal_segment_size);
+        XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
+
+        if (startseg != endseg)
+        {
+            SpinLockAcquire(&XLogCtl->info_lck);
+            if (XLogCtl->lastSegContRecEnd < StartPos)
+            {
+                XLogCtl->lastSegContRecStart = StartPos;
+                XLogCtl->lastSegContRecEnd = EndPos;
+            }
+            SpinLockRelease(&XLogCtl->info_lck);
+        }
     }
 
     /*
@@ -2399,6 +2427,43 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
     return false;
 }
 
+/*
+ * Returns last notified segment.
+ */
+static XLogSegNo
+GetLastNotifiedSegment(void)
+{
+    XLogSegNo last_notified;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    last_notified = XLogCtl->lastNotifiedSeg;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    return last_notified;
+}
+
+/*
+ * Notify segments that are not yet notified.
+ */
+static void
+NotifySegmentsUpTo(XLogSegNo notifySegNo)
+{
+    XLogSegNo last_notified = GetLastNotifiedSegment();
+    XLogSegNo i;
+
+    if (notifySegNo <= last_notified)
+        return;
+
+    for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifySegNo ; i++)
+        XLogArchiveNotifySeg(i);
+
+    /* Don't go back in the case someone else has made it go further. */
+    SpinLockAcquire(&XLogCtl->info_lck);
+    if (XLogCtl->lastNotifiedSeg < notifySegNo)
+        XLogCtl->lastNotifiedSeg = notifySegNo;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2422,6 +2487,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
     int            npages;
     int            startidx;
     uint32        startoffset;
+    bool        extended = false;
 
     /* We should always be inside a critical section here */
     Assert(CritSectionCount > 0);
@@ -2578,15 +2644,48 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
              */
             if (finishing_seg)
             {
+                XLogRecPtr lastSegContRecStart;
+                XLogRecPtr lastSegContRecEnd;
+
                 issue_xlog_fsync(openLogFile, openLogSegNo);
 
                 /* signal that we need to wakeup walsenders later */
                 WalSndWakeupRequest();
 
-                LogwrtResult.Flush = LogwrtResult.Write;    /* end of page */
+                SpinLockAcquire(&XLogCtl->info_lck);
+                lastSegContRecStart = XLogCtl->lastSegContRecStart;
+                lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+                SpinLockRelease(&XLogCtl->info_lck);
 
-                if (XLogArchivingActive())
-                    XLogArchiveNotifySeg(openLogSegNo);
+                /*
+                 * Don't expose a flush location in the middle of a cross-segment
+                 * WAL record.
+                 */
+                if (LogwrtResult.Write < lastSegContRecStart ||
+                    lastSegContRecEnd <= LogwrtResult.Write)
+                {
+                    LogwrtResult.Flush = LogwrtResult.Write; /* end of page */
+
+                    if (XLogArchivingActive())
+                        NotifySegmentsUpTo(openLogSegNo);
+
+                    extended = false;
+                }
+                else
+                {
+                    /*
+                     * Sometimes we are told to flush not up to the end of a
+                     * record but to a WAL buffer page boundary. We advance the
+                     * request LSN to the end of the record in case the record
+                     * at the requested LSN continues into the next segment.
+                     */
+                    if (lastSegContRecStart <= WriteRqst.Write &&
+                        WriteRqst.Write <= lastSegContRecEnd)
+                        WriteRqst.Write = lastSegContRecEnd;
+
+                    /* Continue to the next iteration ignoring flexible flag */
+                    extended = true;
+                }
 
                 XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
                 XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
@@ -2615,11 +2714,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
         }
         curridx = NextBufIdx(curridx);
 
-        /* If flexible, break out of loop as soon as we wrote something */
-        if (flexible && npages == 0)
+        /*
+         * If flexible, break out of loop as soon as we wrote something.
+         * However, we don't leave the loop if the last record in the just
+         * finished segment needs to be finished.
+         */
+        if (flexible && !extended && npages == 0)
             break;
     }
 
+    /*
+     * We have extended the write request to the next segment if the record at
+     * the initial WriteRqst.Write continues to the next segment.  In that case
+     * we need to notify the last segment here.
+     */
+    if (extended)
+        NotifySegmentsUpTo(openLogSegNo - 1);
+
     Assert(npages == 0);
 
     /*
@@ -7710,6 +7821,18 @@ StartupXLOG(void)
     XLogCtl->LogwrtRqst.Write = EndOfLog;
     XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+    /*
+     * We have archived up to the previous segment of EndOfLog so far.
+     * Initialize lastNotifiedSeg if needed.
+     */
+    if (XLogArchivingActive())
+    {
+        XLogSegNo    endLogSegNo;
+
+        XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size);
+        XLogCtl->lastNotifiedSeg = endLogSegNo - 1;
+    }
+
     /*
      * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
      * record before resource manager writes cleanup WAL records or checkpoint
@@ -8434,6 +8557,33 @@ GetFlushRecPtr(void)
     return LogwrtResult.Flush;
 }
 
+/*
+ * GetReplicationTargetRecPtr -- Returns the latest position that is safe to
+ * replicate.
+ */
+XLogRecPtr
+GetReplicationTargetRecPtr(void)
+{
+    XLogRecPtr lastSegContRecStart;
+    XLogRecPtr lastSegContRecEnd;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    LogwrtResult = XLogCtl->LogwrtResult;
+    lastSegContRecStart = XLogCtl->lastSegContRecStart;
+    lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    /*
+     * Use the start position of the last segment-spanning continuation record
+     * when the record is not flushed completely.
+     */
+    if (lastSegContRecStart < LogwrtResult.Flush &&
+        LogwrtResult.Flush <= lastSegContRecEnd)
+        return lastSegContRecStart;
+
+    return LogwrtResult.Flush;
+}
+
 /*
  * GetLastImportantRecPtr -- Returns the LSN of the last important record
  * inserted. All records not explicitly marked as unimportant are considered
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7c9d1b67df..e9eaca5ffa 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2642,14 +2642,14 @@ XLogSendPhysical(void)
         /*
          * Streaming the current timeline on a primary.
          *
-         * Attempt to send all data that's already been written out and
-         * fsync'd to disk.  We cannot go further than what's been written out
-         * given the current implementation of WALRead().  And in any case
-         * it's unsafe to send WAL that is not securely down to disk on the
-         * primary: if the primary subsequently crashes and restarts, standbys
-         * must not have applied any WAL that got lost on the primary.
+         * Attempt to send all data that can be replicated.  We cannot go
+         * further than what's been written out given the current
+         * implementation of WALRead().  And in any case it's unsafe to send
+         * WAL that is not securely down to disk on the primary: if the primary
+         * subsequently crashes and restarts, standbys must not have applied
+         * any WAL that got lost on the primary.
          */
-        SendRqstPtr = GetFlushRecPtr();
+        SendRqstPtr = GetReplicationTargetRecPtr();
     }
 
     /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..94876f628c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -338,6 +338,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetReplicationTargetRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void RemovePromoteSignalFiles(void);
 
-- 
2.18.4


Re: archive status ".ready" files may be created too early

From
Anastasia Lubennikova
Date:
On 14.10.2020 03:06, Kyotaro Horiguchi wrote:
> The patch includes a fix for the primary->standby case. But I'm not sure
> we can do that in the cascaded case. A standby is not aware of the
> structure of a WAL blob and has no idea of up to where to send the
> received blobs. However, if we can rely on the behavior of CopyData
> that we always receive a blob as a whole, sent from the sender at once,
> cascaded standbys are free from the trouble (as long as the
> cascaded standby doesn't crash just before writing the last half of a
> record into pg_wal and after archiving the last full segment, which
> seems unlikely).
>
> regards.
>

Status update for a commitfest entry.

This entry was "Waiting on author" during this CF. As I see, the latest 
message contains a new version of the patch.
Does it need more work? Are you going to continue working on it?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
Apologies for the long delay.

I've spent a good amount of time thinking about this bug and trying
out a few different approaches for fixing it.  I've attached a work-
in-progress patch for my latest attempt.

On 10/13/20, 5:07 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>           F0        F1
>         AAAAA  F  BBBBB
> |---------|---------|---------|
>    seg X    seg X+1   seg X+2
>
> Matsumura-san has a concern about the case where there are two (or
> more) partially-flushed segment-spanning records at the same time.
>
> This patch remembers only the last cross-segment record. If we were
> going to flush up to F0 after Record-B had been written, we would fail
> to hold off archiving seg-X. This patch is based on the assumption that
> that case cannot happen, because we don't leave a pending page at the
> time of a segment switch and no record spans over three or more
> segments.

I wonder if these are safe assumptions to make.  For your example, if
we've written record B to the WAL buffers, but neither record A nor B
have been written to disk or flushed, aren't we still in trouble?
Also, is there actually any limit on WAL record length that means that
it is impossible for a record to span over three or more segments?
Perhaps these assumptions are true, but it doesn't seem obvious to me
that they are, and they might be pretty fragile.

The attached patch doesn't make use of these assumptions.  Instead, we
track the positions of the records that cross segment boundaries in a
small hash map, and we use that to determine when it is safe to mark a
segment as ready for archival.  I think this approach resembles
Matsumura-san's patch from June.
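
To illustrate the shape of that approach (a sketch only; the real code
is in the attached patch, and the names below are hypothetical), such a
map can key on the segment whose archiving must be held back:

    /* Hypothetical entry: a record starting in "seg" spills into seg + 1. */
    typedef struct RecordBoundaryEntry
    {
        XLogSegNo   seg;        /* hash key: the segment to hold back */
        XLogRecPtr  endPos;     /* where the boundary-crossing record ends */
    } RecordBoundaryEntry;

    static HTAB *RecordBoundaryMap;

    static void
    InitRecordBoundaryMap(void)
    {
        HASHCTL     ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(XLogSegNo);
        ctl.entrysize = sizeof(RecordBoundaryEntry);
        RecordBoundaryMap = hash_create("record boundary map", 16, &ctl,
                                        HASH_ELEM | HASH_BLOBS);
    }

    /*
     * A segment may be marked ".ready" once any record that starts in it
     * and crosses into the next segment has been flushed to its end.
     */
    static bool
    SegmentSafeToArchive(XLogSegNo seg, XLogRecPtr flushPtr)
    {
        RecordBoundaryEntry *entry;
        bool        found;

        entry = (RecordBoundaryEntry *)
            hash_search(RecordBoundaryMap, &seg, HASH_FIND, &found);
        return !found || entry->endPos <= flushPtr;
    }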

As before, I'm not handling replication, archive_timeout, and
persisting latest-marked-ready through crashes yet.  For persisting
the latest-marked-ready segment through crashes, I was thinking of
using a new file that stores the segment number.

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
At Mon, 14 Dec 2020 18:25:23 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> Apologies for the long delay.
> 
> I've spent a good amount of time thinking about this bug and trying
> out a few different approaches for fixing it.  I've attached a work-
> in-progress patch for my latest attempt.
> 
> On 10/13/20, 5:07 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> >           F0        F1
> >         AAAAA  F  BBBBB
> > |---------|---------|---------|
> >    seg X    seg X+1   seg X+2
> >
> > Matsumura-san has a concern about the case where there are two (or
> > more) partially-flushed segment-spanning records at the same time.
> >
> > This patch remembers only the last cross-segment record. If we were
> > going to flush up to F0 after Record-B had been written, we would fail
> > to hold off archiving seg-X. This patch is based on the assumption that
> > that case cannot happen, because we don't leave a pending page at the
> > time of a segment switch and no record spans over three or more
> > segments.
> 
> I wonder if these are safe assumptions to make.  For your example, if
> we've written record B to the WAL buffers, but neither record A nor B
> have been written to disk or flushed, aren't we still in trouble?

You're right in that regard. There's a window where a partial record is
written when the write location passes F0 after the insertion location
passes F1. However, remembering all spanning records seems overkill to me.

I modified the previous patch so that it remembers the start LSN of the
*oldest* cross-segment continuation record in the last consecutive
bonded segments, and the end LSN of the latest cross-segment
continuation record. This doesn't forget past segment boundaries.
The region is cleared when the WAL-write LSN goes beyond the remembered end
LSN.  So the region may contain several WAL segments that are not
connected to the next one, but that doesn't matter so much.
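
Distilled to a sketch (the attached v3 patch has the real code), the
bookkeeping amounts to:

    /* At insert time, for a record spanning a segment boundary: */
    if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
        XLogCtl->firstSegContRecStart = StartPos;   /* keep the oldest start */
    XLogCtl->lastSegContRecEnd = EndPos;            /* keep the newest end */

    /* At write time, once the flush has passed the whole region: */
    if (XLogCtl->lastSegContRecEnd <= LogwrtResult.Write)
    {
        XLogCtl->firstSegContRecStart = InvalidXLogRecPtr;
        XLogCtl->lastSegContRecEnd = InvalidXLogRecPtr;
    }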


> Also, is there actually any limit on WAL record length that means that
> it is impossible for a record to span over three or more segments?

Even though it is not a hard limit, AFAICS, as mentioned before, the
longest possible record is what log_newpages() emits, which is up to
about 500 kB for now. I think we don't want to make the length
longer. If we set wal_segment_size to 1MB and set the block size
to 16kB or more, we could have a record spanning over three or more
segments, but I don't think that is a sane configuration, and that kind
of issue could happen elsewhere.

> Perhaps these assumptions are true, but it doesn't seem obvious to me
> that they are, and they might be pretty fragile.

I added an assertion that a record must be shorter than a WAL segment
to XLogRecordAssemble(). Does this guarantee that the assumption holds?
(The condition is tentative and would need to be adjusted.)

> The attached patch doesn't make use of these assumptions.  Instead, we
> track the positions of the records that cross segment boundaries in a
> small hash map, and we use that to determine when it is safe to mark a
> segment as ready for archival.  I think this approach resembles
> Matsumura-san's patch from June.
> 
> As before, I'm not handling replication, archive_timeout, and
> persisting latest-marked-ready through crashes yet.  For persisting
> the latest-marked-ready segment through crashes, I was thinking of
> using a new file that stores the segment number.


Also, the attached is a PoC.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 5714bd064d61135c41a64ecc39aeff74c25a0e74 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 15 Dec 2020 16:24:13 +0900
Subject: [PATCH v3] PoC: Avoid archiving a WAL segment that continues to the
 next segment

If the last record of a finished segment continues to the next segment,
we need to defer archiving of the segment until the record is flushed
to the end. Otherwise crash recovery can overwrite the last record of
a segment and history diverges between archive and pg_wal.
---
 src/backend/access/transam/xlog.c       | 187 +++++++++++++++++++++++-
 src/backend/access/transam/xloginsert.c |   3 +
 src/backend/replication/walsender.c     |  14 +-
 src/include/access/xlog.h               |   1 +
 4 files changed, 193 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8dd225c2e1..98da521601 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -723,6 +723,16 @@ typedef struct XLogCtlData
      */
     XLogRecPtr    lastFpwDisableRecPtr;
 
+    /* The last segment notified to be archived. Protected by WALWriteLock */
+    XLogSegNo    lastNotifiedSeg;
+
+    /*
+     * Remember the oldest and newest known segment that ends with a
+     * continuation record.
+     */
+    XLogRecPtr    firstSegContRecStart;
+    XLogRecPtr    lastSegContRecEnd;
+    
     slock_t        info_lck;        /* locks shared variables shown above */
 } XLogCtlData;
 
@@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata,
      */
     if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
     {
+        XLogSegNo startseg;
+        XLogSegNo endseg;
+
         SpinLockAcquire(&XLogCtl->info_lck);
         /* advance global request to include new block(s) */
         if (XLogCtl->LogwrtRqst.Write < EndPos)
@@ -1167,6 +1180,19 @@ XLogInsertRecord(XLogRecData *rdata,
         /* update local result copy while I have the chance */
         LogwrtResult = XLogCtl->LogwrtResult;
         SpinLockRelease(&XLogCtl->info_lck);
+
+        /* Remember the range of segments that end with a continuation record */
+        XLByteToSeg(StartPos, startseg, wal_segment_size);
+        XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
+
+        if (startseg != endseg)
+        {
+            SpinLockAcquire(&XLogCtl->info_lck);
+            if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
+                XLogCtl->firstSegContRecStart = StartPos;
+            XLogCtl->lastSegContRecEnd = EndPos;
+            SpinLockRelease(&XLogCtl->info_lck);
+        }
     }
 
     /*
@@ -2400,6 +2426,43 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
     return false;
 }
 
+/*
+ * Returns last notified segment.
+ */
+static XLogSegNo
+GetLastNotifiedSegment(void)
+{
+    XLogSegNo last_notified;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    last_notified = XLogCtl->lastNotifiedSeg;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    return last_notified;
+}
+
+/*
+ * Notify segments that are not yet notified.
+ */
+static void
+NotifySegmentsUpTo(XLogSegNo notifySegNo)
+{
+    XLogSegNo last_notified = GetLastNotifiedSegment();
+    XLogSegNo i;
+
+    if (notifySegNo <= last_notified)
+        return;
+
+    for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifySegNo ; i++)
+        XLogArchiveNotifySeg(i);
+
+    /* Don't go back in the case someone else has made it go further. */
+    SpinLockAcquire(&XLogCtl->info_lck);
+    if (XLogCtl->lastNotifiedSeg < notifySegNo)
+        XLogCtl->lastNotifiedSeg = notifySegNo;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2423,6 +2486,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
     int            npages;
     int            startidx;
     uint32        startoffset;
+    bool        extended = false;
 
     /* We should always be inside a critical section here */
     Assert(CritSectionCount > 0);
@@ -2579,15 +2643,73 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
              */
             if (finishing_seg)
             {
+                XLogRecPtr firstSegContRecStart;
+                XLogRecPtr lastSegContRecEnd;
+
                 issue_xlog_fsync(openLogFile, openLogSegNo);
 
                 /* signal that we need to wakeup walsenders later */
                 WalSndWakeupRequest();
 
-                LogwrtResult.Flush = LogwrtResult.Write;    /* end of page */
+                SpinLockAcquire(&XLogCtl->info_lck);
+                firstSegContRecStart = XLogCtl->firstSegContRecStart;
+                lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+                SpinLockRelease(&XLogCtl->info_lck);
 
-                if (XLogArchivingActive())
-                    XLogArchiveNotifySeg(openLogSegNo);
+                /*
+                 * If we're on a continuation record that spans over segments,
+                 * don't expose the flush location until the next record is
+                 * written. This prevents exposing a flush location in the
+                 * middle of a cross-segment WAL record.
+                 */
+                if (LogwrtResult.Write < firstSegContRecStart ||
+                    lastSegContRecEnd <= LogwrtResult.Write)
+                {
+                    LogwrtResult.Flush = LogwrtResult.Write; /* end of page */
+
+                    if (XLogArchivingActive())
+                        NotifySegmentsUpTo(openLogSegNo);
+
+                    extended = false;
+                }
+                else
+                {
+                    /*
+                     * There's a case where we are told to flush not up to the
+                     * end of a record but to a WAL buffer page boundary. We
+                     * advance the request LSN to the end of the record in case
+                     * the record at the requested LSN continues into the next
+                     * segment.
+                     */
+                    XLogSegNo oldseg;
+                    XLogSegNo currseg;
+
+                    XLByteToPrevSeg(WriteRqst.Write, currseg, wal_segment_size);
+                    XLByteToPrevSeg(lastSegContRecEnd, oldseg,
+                                    wal_segment_size);
+
+                    if (oldseg == currseg &&
+                        WriteRqst.Write <= lastSegContRecEnd)
+                        WriteRqst.Write = lastSegContRecEnd;
+
+                    /*
+                     * We need to finish writing at least the current record so
+                     * that the old segment can be safely archived.
+                     */
+                    extended = true;
+                }
+
+                if (lastSegContRecEnd <= LogwrtResult.Write)
+                {
+                    /*
+                     * We got out of the continuation region, reset the
+                     * locations.
+                     */
+                    SpinLockAcquire(&XLogCtl->info_lck);
+                    XLogCtl->firstSegContRecStart = InvalidXLogRecPtr;
+                    XLogCtl->lastSegContRecEnd = InvalidXLogRecPtr;
+                    SpinLockRelease(&XLogCtl->info_lck);
+                }
 
                 XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
                 XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
@@ -2616,11 +2738,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
         }
         curridx = NextBufIdx(curridx);
 
-        /* If flexible, break out of loop as soon as we wrote something */
-        if (flexible && npages == 0)
+        /*
+         * If flexible, break out of loop as soon as we wrote something.
+         * However, we don't leave the loop if the last record in the just
+         * finished segment needs to be finished.
+         */
+        if (flexible && !extended && npages == 0)
             break;
     }
 
+    /*
+     * We have extended the write request to the next segment if the record at
+     * the initial WriteRqst.Write continues to the next segment.  In that case
+     * we need to notify the last segment here.
+     */
+    if (extended)
+        NotifySegmentsUpTo(openLogSegNo - 1);
+
     Assert(npages == 0);
 
     /*
@@ -7705,6 +7839,18 @@ StartupXLOG(void)
     XLogCtl->LogwrtRqst.Write = EndOfLog;
     XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+    /*
+     * We have archived up to the previous segment of EndOfLog so far.
+     * Initialize lastNotifiedSeg if needed.
+     */
+    if (XLogArchivingActive())
+    {
+        XLogSegNo    endLogSegNo;
+
+        XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size);
+        XLogCtl->lastNotifiedSeg = endLogSegNo - 1;
+    }
+
     /*
      * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
      * record before resource manager writes cleanup WAL records or checkpoint
@@ -8429,6 +8575,37 @@ GetFlushRecPtr(void)
     return LogwrtResult.Flush;
 }
 
+/*
+ * GetReplicationTargetRecPtr -- Returns the latest position that is safe to
+ * replicate.
+ */
+XLogRecPtr
+GetReplicationTargetRecPtr(void)
+{
+    static XLogRecPtr    lastTargetRecPtr = InvalidXLogRecPtr;
+    XLogRecPtr    firstSegContRecStart;
+    XLogRecPtr    lastSegContRecEnd;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    LogwrtResult = XLogCtl->LogwrtResult;
+    firstSegContRecStart = XLogCtl->firstSegContRecStart;
+    lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    /*
+     * Don't move forward if the current flush position may be within a
+     * continuation record that spans over segments.
+     */
+    if (lastTargetRecPtr == InvalidXLogRecPtr ||
+        firstSegContRecStart == InvalidXLogRecPtr ||
+        XLogSegmentOffset(LogwrtResult.Flush, wal_segment_size) != 0 ||
+        LogwrtResult.Flush < firstSegContRecStart ||
+        lastSegContRecEnd <= LogwrtResult.Flush)
+        lastTargetRecPtr = LogwrtResult.Flush;
+
+    return lastTargetRecPtr;
+}
+
 /*
  * GetLastImportantRecPtr -- Returns the LSN of the last important record
  * inserted. All records not explicitly marked as unimportant are considered
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 1f0e4e01e6..af53d1f514 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -815,6 +815,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
     rechdr->xl_prev = InvalidXLogRecPtr;
     rechdr->xl_crc = rdata_crc;
 
+    /* we shouldn't have a record longer than a segment */
+    Assert(total_len < wal_segment_size);
+
     return &hdr_rdt;
 }
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2eb19ad293..00d8701a60 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2637,14 +2637,14 @@ XLogSendPhysical(void)
         /*
          * Streaming the current timeline on a primary.
          *
-         * Attempt to send all data that's already been written out and
-         * fsync'd to disk.  We cannot go further than what's been written out
-         * given the current implementation of WALRead().  And in any case
-         * it's unsafe to send WAL that is not securely down to disk on the
-         * primary: if the primary subsequently crashes and restarts, standbys
-         * must not have applied any WAL that got lost on the primary.
+         * Attempt to send all data that can be replicated.  We cannot go
+         * further than what's been written out given the current
+         * implementation of WALRead().  And in any case it's unsafe to send
+         * WAL that is not securely down to disk on the primary: if the primary
+         * subsequently crashes and restarts, standbys must not have applied
+         * any WAL that got lost on the primary.
          */
-        SendRqstPtr = GetFlushRecPtr();
+        SendRqstPtr = GetReplicationTargetRecPtr();
     }
 
     /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..94876f628c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -338,6 +338,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetReplicationTargetRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void RemovePromoteSignalFiles(void);
 
-- 
2.27.0


Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
At Tue, 15 Dec 2020 19:32:57 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Mon, 14 Dec 2020 18:25:23 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> > I wonder if these are safe assumptions to make.  For your example, if
> > we've written record B to the WAL buffers, but neither record A nor B
> > have been written to disk or flushed, aren't we still in trouble?
> 
> You're right in that regard. There's a window where a partial record is
> written when the write location passes F0 after the insertion location
> passes F1. However, remembering all spanning records seems overkill to me.
> 
> I modified the previous patch so that it remembers the start LSN of the
> *oldest* cross-segment continuation record in the last consecutive
> bonded segments, and the end LSN of the latest cross-segment
> continuation record. This doesn't forget past segment boundaries.
> The region is cleared when the WAL-write LSN goes beyond the remembered end
> LSN.  So the region may contain several WAL segments that are not
> connected to the next one, but that doesn't matter so much.

Mmm. Even though it's a PoC, it was too bogus. I fixed it to work in a
saner way.

- Record the beginning LSN of the first cross-seg record and the end
  LSN of the last cross-seg record in a run of consecutive segments
  bonded by cross-seg records. Specifically X and Y below.

       X                 Z         Y    
       [recA]  [recB]         [recC]
  [seg A] [seg B] [seg C] [seg D] [seg E]
(1)    (2.2)    (2.2)  (2.1)   (2.1)   (1)

1. If we wrote up to before X or beyond Y at a segment boundary, notify
  the finished segment immediately.

  1.1. If we have written beyond Y, clear the recorded region.

2. Otherwise we don't notify the segment immediately:

  2.1. If the write request was up to exactly the current segment boundary
    and we know the end LSN of the record there (that is, it is recC
    above), extend the request to the end LSN. Then notify the segment
    after the record is written to the end.

  2.2. Otherwise (that is, recA or recB), we don't know whether the
    last record of the last segment ends just at the segment boundary
    (Z) or a record spans between segments (recB). Anyway, even if there
    is such a record there, we don't know where it ends.  As a result,
    all we can do there is refrain from notifying. It doesn't matter so
    much, since we have already inserted recC, so we will soon reach
    recC and will notify up to seg C.

There might be a case where we insert up to Y before writing up to Z;
in that case the segment region X-Y contains a non-connected segment
boundary. It is processed as if it were a connected segment
boundary. However, as in 2.2 above, it doesn't matter, since we write up
to Y soon. The whole decision logic is condensed in the sketch below.
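
Condensed to a sketch (the helper names here are hypothetical; the real
code is in the attached v4 patch), the decision at a finished segment
looks like:

    /* X = firstSegContRecStart, Y = lastSegContRecEnd */
    if (LogwrtResult.Write < firstSegContRecStart ||
        lastSegContRecEnd <= LogwrtResult.Write)
    {
        /* case 1: before X or beyond Y -- notify immediately */
        NotifySegmentsUpTo(openLogSegNo);
        if (lastSegContRecEnd <= LogwrtResult.Write)
            ClearContRecRegion();       /* case 1.1: forget X..Y */
    }
    else if (WriteRqstStopsAtBoundaryOfSegContaining(lastSegContRecEnd))
    {
        /* case 2.1: the request stops exactly at the boundary of recC's
         * segment and we know where recC ends; extend the request to the
         * end of the record and notify afterwards */
        WriteRqst.Write = lastSegContRecEnd;
    }
    else
    {
        /* case 2.2: we don't know where the record at the boundary ends
         * (recA/recB); refrain from notifying -- a later write up to recC
         * will catch up */
    }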

At Tue, 15 Dec 2020 19:32:57 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
me> I added an assertion that a record must be shorter than a WAL segment
me> to XLogRecordAssemble(). Does this guarantee that the assumption holds?
me> (The condition is tentative and would need to be adjusted.)

Changed the assertion to a more direct form.

me> Also, the attached is a PoC.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 50b3b05dd0eed79cd0b97991e82090b9d569cbac Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 16 Dec 2020 10:36:14 +0900
Subject: [PATCH v4 1/2] Avoid archiving a WAL segment that continues to the
 next  segment

If the last record of a finished segment continues to the next segment,
we need to defer archiving of the segment until the record is flushed
to the end. Otherwise crash recovery can overwrite the last record of
a segment and history diverges between archive and pg_wal.
---
 src/backend/access/transam/xlog.c   | 214 +++++++++++++++++++++++++++-
 src/backend/replication/walsender.c |  14 +-
 src/include/access/xlog.h           |   1 +
 3 files changed, 217 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8dd225c2e1..8705809160 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -723,6 +723,16 @@ typedef struct XLogCtlData
      */
     XLogRecPtr    lastFpwDisableRecPtr;
 
+    /* The last segment notified to be archived. Protected by WALWriteLock */
+    XLogSegNo    lastNotifiedSeg;
+
+    /*
+     * Remember the oldest and newest known segment that ends with a
+     * continuation record.
+     */
+    XLogRecPtr    firstSegContRecStart;
+    XLogRecPtr    lastSegContRecEnd;
+    
     slock_t        info_lck;        /* locks shared variables shown above */
 } XLogCtlData;
 
@@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata,
      */
     if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
     {
+        XLogSegNo startseg;
+        XLogSegNo endseg;
+
         SpinLockAcquire(&XLogCtl->info_lck);
         /* advance global request to include new block(s) */
         if (XLogCtl->LogwrtRqst.Write < EndPos)
@@ -1167,6 +1180,24 @@ XLogInsertRecord(XLogRecData *rdata,
         /* update local result copy while I have the chance */
         LogwrtResult = XLogCtl->LogwrtResult;
         SpinLockRelease(&XLogCtl->info_lck);
+
+        /* Remember the range of segments that end with a continuation record */
+        XLByteToSeg(StartPos, startseg, wal_segment_size);
+        XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
+
+        if (startseg != endseg)
+        {
+            /*
+             * We shouldn't have a record spanning over more than two segments
+             */
+            Assert (startseg + 1 == endseg);
+
+            SpinLockAcquire(&XLogCtl->info_lck);
+            if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
+                XLogCtl->firstSegContRecStart = StartPos;
+            XLogCtl->lastSegContRecEnd = EndPos;
+            SpinLockRelease(&XLogCtl->info_lck);
+        }
     }
 
     /*
@@ -2400,6 +2431,43 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
     return false;
 }
 
+/*
+ * Returns last notified segment.
+ */
+static XLogSegNo
+GetLastNotifiedSegment(void)
+{
+    XLogSegNo last_notified;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    last_notified = XLogCtl->lastNotifiedSeg;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    return last_notified;
+}
+
+/*
+ * Notify segments that are not yet notified.
+ */
+static void
+NotifySegmentsUpTo(XLogSegNo notifySegNo)
+{
+    XLogSegNo last_notified = GetLastNotifiedSegment();
+    XLogSegNo i;
+
+    if (notifySegNo <= last_notified)
+        return;
+
+    for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifySegNo ; i++)
+        XLogArchiveNotifySeg(i);
+
+    /* Don't go back in the case someone else has made it go further. */
+    SpinLockAcquire(&XLogCtl->info_lck);
+    if (XLogCtl->lastNotifiedSeg < notifySegNo)
+        XLogCtl->lastNotifiedSeg = notifySegNo;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2423,6 +2491,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
     int            npages;
     int            startidx;
     uint32        startoffset;
+    bool        seg_notify = false;
 
     /* We should always be inside a critical section here */
     Assert(CritSectionCount > 0);
@@ -2579,15 +2648,95 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
              */
             if (finishing_seg)
             {
+                XLogRecPtr firstSegContRecStart;
+                XLogRecPtr lastSegContRecEnd;
+
                 issue_xlog_fsync(openLogFile, openLogSegNo);
 
                 /* signal that we need to wakeup walsenders later */
                 WalSndWakeupRequest();
 
-                LogwrtResult.Flush = LogwrtResult.Write;    /* end of page */
+                SpinLockAcquire(&XLogCtl->info_lck);
+                firstSegContRecStart = XLogCtl->firstSegContRecStart;
+                lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+                SpinLockRelease(&XLogCtl->info_lck);
 
-                if (XLogArchivingActive())
-                    XLogArchiveNotifySeg(openLogSegNo);
+                /*
+                 * If we may be on a continuation record that spans over
+                 * segments, don't archive the segment until the record is
+                 * written to the end. If we do, we could end up with a corrupt
+                 * archive having different records at the boundary after a
+                 * server crash around here.  For the same reason, also for
+                 * replication, don't expose the flush location until the
+                 * record is written to the end, so that an incomplete record
+                 * at a segment boundary won't be sent to a standby.
+                 */
+                if (LogwrtResult.Write < firstSegContRecStart ||
+                    lastSegContRecEnd <= LogwrtResult.Write)
+                {
+                    LogwrtResult.Flush = LogwrtResult.Write; /* end of page */
+
+                    if (XLogArchivingActive())
+                        NotifySegmentsUpTo(openLogSegNo);
+
+                    if (lastSegContRecEnd <= LogwrtResult.Write)
+                    {
+                        /*
+                         * We got out of the continuation region, reset the
+                         * locations.
+                         */
+                        SpinLockAcquire(&XLogCtl->info_lck);
+                        XLogCtl->firstSegContRecStart = InvalidXLogRecPtr;
+                        XLogCtl->lastSegContRecEnd = InvalidXLogRecPtr;
+                        SpinLockRelease(&XLogCtl->info_lck);
+                    }
+
+                    /* already notified */
+                    seg_notify = false;
+                }
+                else
+                {
+                    seg_notify = true;
+
+                    /*
+                     * There's a case where we are told to flush not up to the
+                     * end of a record but to a WAL buffer page boundary. We
+                     * advance the request LSN to the end of the record in case
+                     * the record at the requested LSN continues into the next
+                     * segment.
+                     */
+                    if (XLogSegmentOffset(WriteRqst.Write, wal_segment_size)
+                        == 0)
+                    {
+                        XLogSegNo oldseg;
+                        XLogSegNo currseg;
+
+                        XLByteToSeg(WriteRqst.Write, currseg,
+                                        wal_segment_size);
+                        XLByteToPrevSeg(lastSegContRecEnd, oldseg,
+                                        wal_segment_size);
+
+                        /*
+                         * If we know the exact placement of the record, extend
+                         * the request to the end of the record.
+                         */
+                        if (oldseg == currseg &&
+                            WriteRqst.Write < lastSegContRecEnd)
+                            WriteRqst.Write = lastSegContRecEnd;
+                        else
+                        {
+                            /*
+                             * We forgot the exact placement of the record
+                             * around the requested write LSN. Refrain from
+                             * notifying the segment to avoid archiving a
+                             * segment with an incomplete record at the end. In
+                             * this case we are going to write a couple of
+                             * further segments, thus we will soon reach the
+                             * next segment boundary.  */
+                            seg_notify = false;
+                        }
+                    }
+                }
 
                 XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
                 XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
@@ -2616,11 +2765,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
         }
         curridx = NextBufIdx(curridx);
 
-        /* If flexible, break out of loop as soon as we wrote something */
-        if (flexible && npages == 0)
+        /*
+         * If flexible, break out of loop as soon as we wrote something.
+         * However, we don't leave the loop if the last record in the just
+         * finished segment needs to be finished.
+         */
+        if (flexible && seg_notify && npages == 0)
             break;
     }
 
+    /*
+     * We have extended the write request to the next segment if the record at
+     * the initial WriteRqst.Write continues to the next segment.  In that case
+     * we need to notify the last segment here.
+     */
+    if (seg_notify)
+        NotifySegmentsUpTo(openLogSegNo - 1);
+
     Assert(npages == 0);
 
     /*
@@ -7705,6 +7866,18 @@ StartupXLOG(void)
     XLogCtl->LogwrtRqst.Write = EndOfLog;
     XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+    /*
+     * We have archived up to the previous segment of EndOfLog so far.
+     * Initialize lastNotifiedSeg if needed.
+     */
+    if (XLogArchivingActive())
+    {
+        XLogSegNo    endLogSegNo;
+
+        XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size);
+        XLogCtl->lastNotifiedSeg = endLogSegNo - 1;
+    }
+
     /*
      * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
      * record before resource manager writes cleanup WAL records or checkpoint
@@ -8429,6 +8602,37 @@ GetFlushRecPtr(void)
     return LogwrtResult.Flush;
 }
 
+/*
+ * GetReplicationTargetRecPtr -- Returns the latest position that is safe to
+ * replicate.
+ */
+XLogRecPtr
+GetReplicationTargetRecPtr(void)
+{
+    static XLogRecPtr    lastTargetRecPtr = InvalidXLogRecPtr;
+    XLogRecPtr    firstSegContRecStart;
+    XLogRecPtr    lastSegContRecEnd;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    LogwrtResult = XLogCtl->LogwrtResult;
+    firstSegContRecStart = XLogCtl->firstSegContRecStart;
+    lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    /*
+     * Don't move forward if the current flush position may be within a
+     * continuation record that spans segments.
+     */
+    if (lastTargetRecPtr == InvalidXLogRecPtr ||
+        firstSegContRecStart == InvalidXLogRecPtr ||
+        XLogSegmentOffset(LogwrtResult.Flush, wal_segment_size) != 0 ||
+        LogwrtResult.Flush < firstSegContRecStart ||
+        lastSegContRecEnd <= LogwrtResult.Flush)
+        lastTargetRecPtr = LogwrtResult.Flush;
+
+    return lastTargetRecPtr;
+}
+
 /*
  * GetLastImportantRecPtr -- Returns the LSN of the last important record
  * inserted. All records not explicitly marked as unimportant are considered
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2eb19ad293..00d8701a60 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2637,14 +2637,14 @@ XLogSendPhysical(void)
         /*
          * Streaming the current timeline on a primary.
          *
-         * Attempt to send all data that's already been written out and
-         * fsync'd to disk.  We cannot go further than what's been written out
-         * given the current implementation of WALRead().  And in any case
-         * it's unsafe to send WAL that is not securely down to disk on the
-         * primary: if the primary subsequently crashes and restarts, standbys
-         * must not have applied any WAL that got lost on the primary.
+         * Attempt to send all data that can be replicated.  We cannot go
+         * further than what's been written out given the current
+         * implementation of WALRead().  And in any case it's unsafe to send
+         * WAL that is not securely down to disk on the primary: if the primary
+         * subsequently crashes and restarts, standbys must not have applied
+         * any WAL that got lost on the primary.
          */
-        SendRqstPtr = GetFlushRecPtr();
+        SendRqstPtr = GetReplicationTargetRecPtr();
     }
 
     /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..94876f628c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -338,6 +338,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetReplicationTargetRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void RemovePromoteSignalFiles(void);
 
-- 
2.27.0

From 6367ea9ffac7b6e6e692ddacfb8dc997c222c296 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 16 Dec 2020 10:36:42 +0900
Subject: [PATCH v4 2/2] debug print

---
 src/backend/access/transam/xlog.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8705809160..a1bd00ae1b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1192,6 +1192,11 @@ XLogInsertRecord(XLogRecData *rdata,
              */
             Assert (startseg + 1 == endseg);
 
+            if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
+                ereport(LOG, (errmsg("REG-REG: (%lX, %lX)", StartPos, EndPos), errhidestmt(true), errhidecontext(true)));
+            else
+                ereport(LOG, (errmsg("UPD-REG: (%lX, %lX)", XLogCtl->firstSegContRecStart, EndPos), errhidestmt(true), errhidecontext(true)));
+
             SpinLockAcquire(&XLogCtl->info_lck);
             if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
                 XLogCtl->firstSegContRecStart = StartPos;
@@ -2674,6 +2679,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
                 if (LogwrtResult.Write < firstSegContRecStart ||
                     lastSegContRecEnd <= LogwrtResult.Write)
                 {
+                    ereport(LOG, (errmsg("NOTIFY1 %lX-%lX: (%lX, %lX) %lX(/%lX) %lX", GetLastNotifiedSegment() + 1, openLogSegNo, firstSegContRecStart, lastSegContRecEnd, WriteRqst.Write, XLogSegmentOffset(WriteRqst.Write, wal_segment_size), LogwrtResult.Write), errhidestmt(true), errhidecontext(true)));
 
+
                     LogwrtResult.Flush = LogwrtResult.Write; /* end of page */
 
                     if (XLogArchivingActive())
@@ -2681,6 +2688,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 
                     if (lastSegContRecEnd <= LogwrtResult.Write)
                     {
+                        ereport(LOG, (errmsg("CLEAR-REG: (%lX, %lX) %lX, %lX", firstSegContRecStart, lastSegContRecEnd, WriteRqst.Write, LogwrtResult.Write), errhidestmt(true), errhidecontext(true)));
 
                         /*
                          * We got out of the continuation region, reset the
                          * locations.
@@ -2705,6 +2713,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
                      * case the record at the requested LSN continues to the
                      * next segment.
                      */
+                    ereport(LOG, (errmsg("NOT-NOTIFY: (%lX, %lX) %lX(/%lX) %lX(/%lX)", firstSegContRecStart, lastSegContRecEnd, LogwrtResult.Write, XLogSegmentOffset(LogwrtResult.Write, wal_segment_size), WriteRqst.Write, XLogSegmentOffset(WriteRqst.Write, wal_segment_size)), errhidestmt(true), errhidecontext(true)));
 
                     if (XLogSegmentOffset(WriteRqst.Write, wal_segment_size)
                         == 0)
                     {
@@ -2722,9 +2731,13 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
                          */
                         if (oldseg == currseg &&
                             WriteRqst.Write < lastSegContRecEnd)
+                        {
+                            ereport(LOG, (errmsg("EXTEND-RQST: (%lX, %lX(%lX)) %lX(%lX) => %d", firstSegContRecStart, lastSegContRecEnd, oldseg, WriteRqst.Write, currseg, (oldseg == currseg && WriteRqst.Write <= lastSegContRecEnd)), errhidestmt(true), errhidecontext(true)));
                             WriteRqst.Write = lastSegContRecEnd;
+                        }
                         else
                         {
+                            ereport(LOG, (errmsg("SKIP-NOTIFY: (%lX, %lX(%lX)) %lX(%lX) => %d", firstSegContRecStart, lastSegContRecEnd, oldseg, WriteRqst.Write, currseg, (oldseg == currseg && WriteRqst.Write <= lastSegContRecEnd)), errhidestmt(true), errhidecontext(true)));
                             /*
                              * We forgot the exact placement of the record
                              * around requested write LSN. Refrain from
@@ -2780,7 +2793,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
      * need to notify the last segment here.
      */
     if (seg_notify)
+    {
+        ereport(LOG, (errmsg("NOTIFY2 %lX-%lX: (%lX, %lX) %lX, seg %lX", GetLastNotifiedSegment() + 1, openLogSegNo - 1, XLogCtl->firstSegContRecStart, XLogCtl->lastSegContRecEnd, LogwrtResult.Write, openLogSegNo - 1), errhidestmt(true), errhidecontext(true)));
+
         NotifySegmentsUpTo(openLogSegNo - 1);
+    }
 
     Assert(npages == 0);
 
-- 
2.27.0


Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> You're right in that regard. There's a window where partial record is
> written when write location passes F0 after insertion location passes
> F1. However, remembering all spanning records seems overkilling to me.

I'm curious why you feel that recording all cross-segment records is
overkill.  IMO it seems far simpler to just do that rather than try to
reason about all these different scenarios and rely on various
(and possibly fragile) assumptions.  You only need to record the end
location of records that cross into the next segment (or that fit
perfectly into the end of the current one) and to evaluate which
segments to mark .ready as the "flushed" LSN advances.  I'd expect
that in most cases we wouldn't need to store more than a couple of
record boundaries, so it's not like we'd normally be storing dozens of
boundaries.  Even if we did need to store several boundaries, AFAICT
the approach I'm proposing should still work well enough.
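
To illustrate, here is a rough sketch of the kind of tracking I have
in mind (the names below are hypothetical, not actual patch code):

    /* one entry per record that crosses a segment boundary */
    typedef struct RecordBoundaryEntry
    {
        XLogSegNo   seg;        /* hash key: segment the record begins in */
        XLogRecPtr  endptr;     /* LSN where that record ends */
    } RecordBoundaryEntry;

    static XLogSegNo last_notified;     /* hypothetical bookkeeping */

    /* evaluate which segments to mark .ready as the flushed LSN advances */
    static void
    NotifySegmentsSketch(XLogRecPtr flushed)
    {
        XLogSegNo   flushed_seg;
        XLogSegNo   seg;

        /* last segment that is completely flushed */
        XLByteToPrevSeg(flushed, flushed_seg, wal_segment_size);

        for (seg = last_notified + 1; seg <= flushed_seg; seg++)
        {
            /* lookup_boundary() is a stand-in for the shared hash lookup */
            RecordBoundaryEntry *entry = lookup_boundary(seg);

            /*
             * A record crossing out of this segment whose end is not yet
             * flushed blocks notification of this and all later segments.
             */
            if (entry != NULL && flushed < entry->endptr)
                break;

            XLogArchiveNotifySeg(seg);
            last_notified = seg;
        }
    }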

Nathan


Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
At Thu, 17 Dec 2020 22:20:35 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > You're right in that regard. There's a window where partial record is
> > written when write location passes F0 after insertion location passes
> > F1. However, remembering all spanning records seems overkilling to me.
> 
> I'm curious why you feel that recording all cross-segment records is
> overkill.  IMO it seems far simpler to just do that rather than try to

Sorry, my words were not enough. Remembering all spanning records in
*shared memory* seems to be overkill, much more so if it is stored in a
shared hash table.  Even though it is rarely the case, it can fail in a
hard way when reaching the limit. If we could do well by remembering
just two locations, we wouldn't need to worry about such a limitation.

> reason about all these different scenarios and rely on various
> (and possibly fragile) assumptions.  You only need to record the end

After the previous mail was sent, I noticed that the assumption on
record length was not needed. So that approach no longer needs any of
the assumptions^^;

> location of records that cross into the next segment (or that fit
> perfectly into the end of the current one) and to evaluate which
> segments to mark .ready as the "flushed" LSN advances.  I'd expect
> that in most cases we wouldn't need to store more than a couple of
> record boundaries, so it's not like we'd normally be storing dozens of
> boundaries.  Even if we did need to store several boundaries, AFAICT
> the approach I'm proposing should still work well enough.

I didn't say it doesn't work, just overkill.

Another concern about the concrete patch:

NotifySegmentsReadyForArchive() searches the shared hash, acquiring an
LWLock, every time XLogWrite is called while segment archiving is being
held off.  I don't think that is acceptable, and it could be a problem
when many backends are competing on WAL.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
At Wed, 16 Dec 2020 11:01:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> - Record the beginning LSN of the first cross-seg record and the end
>   LSN of the last cross-seg record in consecutive segments bonded by
>   cross-seg records. Specifically, X and Y below.
> 
>        X                 Z         Y    
>        [recA]  [recB]         [recC]
>   [seg A] [seg B] [seg C] [seg D] [seg E]
> (1)    (2.2)    (2.2)  (2.1)   (2.1)   (1)
> 
> 1. If we wrote up to before X or beyond Y at a segment boundary, notify
>   the finished segment immediately.
> 
>   1.1. If we have written beyond Y, clear the recorded region.
> 
> 2. Otherwise we don't notify the segment immediately:
> 
>   2.1. If write request was up to exactly the current segment boundary
>     and we know the end LSN of the record there (that is, it is recC
>     above), extend the request to the end LSN. Then notify the segment
>     after the record is written to the end.
> 
>   2.2. Otherwise (that is, recA or recB), we don't know whether the
>     last record of the last segment ends just at the segment boundary
>     (Z) or a record spans between segments (recB). Anyway, even if there
>     is such a record there, we don't know where it ends.  As a result,
>     all we can do there is refrain from notifying. It doesn't
>     matter so much since we have already inserted recC, so we will soon
>     reach recC and will notify up to seg C.
> 
> There might be a case where we insert up to Y before writing up to Z;
> in that case the segment region X-Y contains a non-connected segment
> boundary. It is processed as if it were a connected segment
> boundary. However, as in 2.2 above, it doesn't matter since we write up
> to Y soon.

I noticed that we can cause the continuation record to be flushed
immediately. So in the attached:

1. If there's no remembered cross-segment boundary or we're out of the
  region X-Y, notify the finished segment immediately.

2. Otherwise we don't notify the segment immediately

 2.1. If we are finishing the last segment known to continue into the
   next segment, extend the write request to the end of the record *and*
   force writing and then flushing up to there (see the sketch after
   this list).

 2.2. (the same as above)

3. In the case of 2.1, we can flush the previous segment immediately,
  so do that.

X. When we notify a segment, clear the remembered region if we have got
  out of the region.
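
To put 1, 2.1 and 2.2 into rough code form (just a sketch with the
decision simplified; the real changes are in the attached patch):

    /* at a segment boundary in XLogWrite(); [X, Y) is the remembered region */
    if (firstSegContRecStart == InvalidXLogRecPtr ||
        LogwrtResult.Write < firstSegContRecStart ||
        lastSegContRecEnd <= LogwrtResult.Write)
    {
        /* 1: outside the region, notify the finished segment now */
        NotifySegmentsUpTo(openLogSegNo);
    }
    else if (reqendseg == contendseg)   /* finishing Y's segment? */
    {
        /*
         * 2.1: extend the request so the record is both written and
         * flushed to its end; the segment is notified afterwards (3).
         */
        if (WriteRqst.Write < lastSegContRecEnd)
            WriteRqst.Write = lastSegContRecEnd;
        if (WriteRqst.Flush < lastSegContRecEnd)
            WriteRqst.Flush = lastSegContRecEnd;
    }
    else
    {
        /* 2.2: exact placement unknown; refrain from notifying for now */
    }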


The attached is changed in the following points:

- Fixed some bugs where I confusedly referred to the write LSN instead of the flush LSN.

- Changed to urge flushing up to the end of a continuation record, not
  only waiting for the record to be written.

- More aggressively clear the remembered region.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 95e438ee448c1686c946909f1fc84ec95ee6c7d4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 18 Dec 2020 13:45:29 +0900
Subject: [PATCH v5 1/2] Avoid archiving a WAL segment that continues to the
 next segment

If the last record of a finished segment continues to the next segment,
we need to defer archiving of the segment until the record is flushed
to the end. Otherwise crash recovery can overwrite the last record of
a segment and history diverges between archive and pg_wal.
---
 src/backend/access/transam/xlog.c   | 221 +++++++++++++++++++++++++++-
 src/backend/replication/walsender.c |  14 +-
 src/include/access/xlog.h           |   1 +
 3 files changed, 224 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b1e5d2dbff..b0d3ba2c5a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -733,6 +733,16 @@ typedef struct XLogCtlData
      */
     XLogRecPtr    lastFpwDisableRecPtr;
 
+    /* The last segment notified to be archived. Protected by WALWriteLock */
+    XLogSegNo    lastNotifiedSeg;
+
+    /*
+     * Remember the region in which we need to consider refraining from
+     * archiving finished segments immediately. Protected by info_lck.
+     */
+    XLogRecPtr    firstSegContRecStart;
+    XLogRecPtr    lastSegContRecEnd;
+    
     slock_t        info_lck;        /* locks shared variables shown above */
 } XLogCtlData;
 
@@ -1170,6 +1180,9 @@ XLogInsertRecord(XLogRecData *rdata,
      */
     if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
     {
+        XLogSegNo startseg;
+        XLogSegNo endseg;
+
         SpinLockAcquire(&XLogCtl->info_lck);
         /* advance global request to include new block(s) */
         if (XLogCtl->LogwrtRqst.Write < EndPos)
@@ -1177,6 +1190,22 @@ XLogInsertRecord(XLogRecData *rdata,
         /* update local result copy while I have the chance */
         LogwrtResult = XLogCtl->LogwrtResult;
         SpinLockRelease(&XLogCtl->info_lck);
+
+        /*
+         * Remember the range of segment boundaries that are connected by a
+         * continuation record.
+         */
+        XLByteToSeg(StartPos, startseg, wal_segment_size);
+        XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
+
+        if (startseg != endseg)
+        {
+            SpinLockAcquire(&XLogCtl->info_lck);
+            if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
+                XLogCtl->firstSegContRecStart = StartPos;
+            XLogCtl->lastSegContRecEnd = EndPos;
+            SpinLockRelease(&XLogCtl->info_lck);
+        }
     }
 
     /*
@@ -2410,6 +2439,50 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
     return false;
 }
 
+/*
+ * Returns last notified segment.
+ */
+static XLogSegNo
+GetLastNotifiedSegment(void)
+{
+    XLogSegNo last_notified;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    last_notified = XLogCtl->lastNotifiedSeg;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    return last_notified;
+}
+
+/*
+ * Notify segments that are not yet notified.
+ */
+static void
+NotifySegmentsUpTo(XLogSegNo notifySegNo)
+{
+    XLogSegNo last_notified = GetLastNotifiedSegment();
+    XLogSegNo i;
+
+    if (notifySegNo <= last_notified)
+        return;
+
+    for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifySegNo ; i++)
+        XLogArchiveNotifySeg(i);
+
+    /* Don't go backwards in case someone else has already advanced it further. */
+    SpinLockAcquire(&XLogCtl->info_lck);
+    if (XLogCtl->lastNotifiedSeg < notifySegNo)
+        XLogCtl->lastNotifiedSeg = notifySegNo;
+
+    /* Reset the locations if we have got out of the continuation region. */
+    if (XLogCtl->lastSegContRecEnd <= XLogCtl->LogwrtRqst.Flush)
+    {
+        XLogCtl->firstSegContRecStart = InvalidXLogRecPtr;
+        XLogCtl->lastSegContRecEnd = InvalidXLogRecPtr;
+    }
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2433,6 +2506,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
     int            npages;
     int            startidx;
     uint32        startoffset;
+    bool        notify_seg = false;
+    bool        force_continue = false;
 
     /* We should always be inside a critical section here */
     Assert(CritSectionCount > 0);
@@ -2589,15 +2664,97 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
              */
             if (finishing_seg)
             {
+                XLogRecPtr firstSegContRecStart;
+                XLogRecPtr lastSegContRecEnd;
+
                 issue_xlog_fsync(openLogFile, openLogSegNo);
 
                 /* signal that we need to wakeup walsenders later */
                 WalSndWakeupRequest();
 
-                LogwrtResult.Flush = LogwrtResult.Write;    /* end of page */
+                SpinLockAcquire(&XLogCtl->info_lck);
+                firstSegContRecStart = XLogCtl->firstSegContRecStart;
+                lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+                SpinLockRelease(&XLogCtl->info_lck);
 
-                if (XLogArchivingActive())
-                    XLogArchiveNotifySeg(openLogSegNo);
+                /*
+                 * If we may be on a continuation record that spans segments,
+                 * don't archive the segment until the record is written to the
+                 * end. If we did, we could end up with a corrupt archive having
+                 * different records at the boundary after a server crash
+                 * around here.  For the same reason, also for replication,
+                 * don't expose the flush location until the record is written
+                 * to the end so that an incomplete record at a segment boundary
+                 * won't be sent to a standby.
+                 */
+                if (firstSegContRecStart == InvalidXLogRecPtr ||
+                    LogwrtResult.Write < firstSegContRecStart ||
+                    lastSegContRecEnd <= LogwrtResult.Write)
+                {
+                    LogwrtResult.Flush = LogwrtResult.Write; /* end of page */
+
+                    if (XLogArchivingActive())
+                        NotifySegmentsUpTo(openLogSegNo);
+
+                    /* Already notified */
+                    notify_seg = false;
+                    force_continue = false;
+                }
+                else
+                {
+                    /*
+                     * The last record in the finishing segment may be
+                     * continuing into the next segment. Don't archive the
+                     * segment until we are sure that the continuation record
+                     * is completely written out.
+                     */
+
+                    XLogSegNo reqendseg;
+                    XLogSegNo contendseg;
+
+                    force_continue = true;
+
+                    XLByteToSeg(WriteRqst.Write, reqendseg, wal_segment_size);
+                    XLByteToSeg(lastSegContRecEnd, contendseg,
+                                wal_segment_size);
+
+                    /*
+                     * There's a case where we are told to flush not up to the
+                     * end of a record but to a page boundary. Advance the
+                     * request LSN to the end of the record at the boundary,
+                     * if any.
+                     */
+                    if (reqendseg == contendseg)
+                    {
+                        /*
+                         * the write request to the end of the record.
+                         * the write request to the end of the recrod.
+                         */
+                        if (WriteRqst.Write < lastSegContRecEnd)
+                            WriteRqst.Write = lastSegContRecEnd;
+
+                        /*
+                         * Flush up to at least the same location immediately
+                         * so that the segment can be archived.
+                         */
+                        if (WriteRqst.Flush < lastSegContRecEnd)
+                            WriteRqst.Flush = lastSegContRecEnd;
+
+                        /* We know the record will be finished */
+                        notify_seg = true;
+                    }
+                    else
+                    {
+                        /*
+                         * We don't know the exact placement of the record
+                         * around the requested write LSN. Do not archive this
+                         * segment since it might end with an incomplete
+                         * record. In this case we will end up archiving
+                         * the segment soon since we have written up to further
+                         * segments.
+                         */
+                        notify_seg = false;
+                    }
+                }
 
                 XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
                 XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
@@ -2626,8 +2783,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
         }
         curridx = NextBufIdx(curridx);
 
-        /* If flexible, break out of loop as soon as we wrote something */
-        if (flexible && npages == 0)
+        /*
+         * If flexible, break out of loop as soon as we wrote something.
+         * However, we don't leave the loop if we have to write a bit more.
+         */
+        if (flexible && !force_continue && npages == 0)
             break;
     }
 
@@ -2685,6 +2845,14 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
             XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush;
         SpinLockRelease(&XLogCtl->info_lck);
     }
+
+    /*
+     * We have extended the write request to the next segment if the record at
+     * the initial WriteRqst.Write continues to the next segment.  In that case we
+     * need to notify the last segment here.
+     */
+    if (notify_seg)
+        NotifySegmentsUpTo(openLogSegNo - 1);
 }
 
 /*
@@ -7716,6 +7884,18 @@ StartupXLOG(void)
     XLogCtl->LogwrtRqst.Write = EndOfLog;
     XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+    /*
+     * We have archived up to the previous segment of EndOfLog so far.
+     * Initialize lastNotifiedSeg if needed.
+     */
+    if (XLogArchivingActive())
+    {
+        XLogSegNo    endLogSegNo;
+
+        XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size);
+        XLogCtl->lastNotifiedSeg = endLogSegNo - 1;
+    }
+
     /*
      * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
      * record before resource manager writes cleanup WAL records or checkpoint
@@ -8440,6 +8620,37 @@ GetFlushRecPtr(void)
     return LogwrtResult.Flush;
 }
 
+/*
+ * GetReplicationTargetRecPtr -- Returns the latest position that is safe to
+ * replicate.
+ */
+XLogRecPtr
+GetReplicationTargetRecPtr(void)
+{
+    static XLogRecPtr    lastTargetRecPtr = InvalidXLogRecPtr;
+    XLogRecPtr    firstSegContRecStart;
+    XLogRecPtr    lastSegContRecEnd;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    LogwrtResult = XLogCtl->LogwrtResult;
+    firstSegContRecStart = XLogCtl->firstSegContRecStart;
+    lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    /*
+     * Don't move forward if the current flush position may be within a
+     * continuation record that spans segments.
+     */
+    if (lastTargetRecPtr == InvalidXLogRecPtr ||
+        firstSegContRecStart == InvalidXLogRecPtr ||
+        XLogSegmentOffset(LogwrtResult.Flush, wal_segment_size) != 0 ||
+        LogwrtResult.Flush < firstSegContRecStart ||
+        lastSegContRecEnd <= LogwrtResult.Flush)
+        lastTargetRecPtr = LogwrtResult.Flush;
+
+    return lastTargetRecPtr;
+}
+
 /*
  * GetLastImportantRecPtr -- Returns the LSN of the last important record
  * inserted. All records not explicitly marked as unimportant are considered
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d5c9bc31d8..c13cca1e64 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2631,14 +2631,14 @@ XLogSendPhysical(void)
         /*
          * Streaming the current timeline on a primary.
          *
-         * Attempt to send all data that's already been written out and
-         * fsync'd to disk.  We cannot go further than what's been written out
-         * given the current implementation of WALRead().  And in any case
-         * it's unsafe to send WAL that is not securely down to disk on the
-         * primary: if the primary subsequently crashes and restarts, standbys
-         * must not have applied any WAL that got lost on the primary.
+         * Attempt to send all data that can be replicated.  We cannot go
+         * further than what's been written out given the current
+         * implementation of WALRead().  And in any case it's unsafe to send
+         * WAL that is not securely down to disk on the primary: if the primary
+         * subsequently crashes and restarts, standbys must not have applied
+         * any WAL that got lost on the primary.
          */
-        SendRqstPtr = GetFlushRecPtr();
+        SendRqstPtr = GetReplicationTargetRecPtr();
     }
 
     /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..94876f628c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -338,6 +338,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetReplicationTargetRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void RemovePromoteSignalFiles(void);
 
-- 
2.27.0

From db3b7a536199202539a29af50fa55cc46cdff1db Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 18 Dec 2020 13:49:43 +0900
Subject: [PATCH v5 2/2] debug print

---
 src/backend/access/transam/xlog.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b0d3ba2c5a..ee5494701c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1200,6 +1200,11 @@ XLogInsertRecord(XLogRecData *rdata,
 
         if (startseg != endseg)
         {
+            if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
+                ereport(LOG, (errmsg("REG-REG: (%lX, %lX)", StartPos, EndPos), errhidestmt(true), errhidecontext(true)));
+            else
+                ereport(LOG, (errmsg("UPD-REG: (%lX, %lX)", XLogCtl->firstSegContRecStart, EndPos), errhidestmt(true), errhidecontext(true)));
+
             SpinLockAcquire(&XLogCtl->info_lck);
             if (XLogCtl->firstSegContRecStart == InvalidXLogRecPtr)
                 XLogCtl->firstSegContRecStart = StartPos;
@@ -2477,6 +2482,8 @@ NotifySegmentsUpTo(XLogSegNo notifySegNo)
     /* Reset the locations if we have got out of the continuation region. */
     if (XLogCtl->lastSegContRecEnd <= XLogCtl->LogwrtRqst.Flush)
     {
+        ereport(LOG, (errmsg("CLEAR-REG: (%lX, %lX) %lX, %lX", XLogCtl->firstSegContRecStart, XLogCtl->lastSegContRecEnd, XLogCtl->LogwrtRqst.Write, XLogCtl->LogwrtRqst.Flush), errhidestmt(true), errhidecontext(true)));
+
         XLogCtl->firstSegContRecStart = InvalidXLogRecPtr;
         XLogCtl->lastSegContRecEnd = InvalidXLogRecPtr;
     }
@@ -2691,6 +2698,8 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
                     LogwrtResult.Write < firstSegContRecStart ||
                     lastSegContRecEnd <= LogwrtResult.Write)
                 {
+                    ereport(LOG, (errmsg("NOTIFY1 %lX-%lX: (%lX, %lX) %lX(/%lX) %lX", GetLastNotifiedSegment() + 1, openLogSegNo, firstSegContRecStart, lastSegContRecEnd, WriteRqst.Write, XLogSegmentOffset(WriteRqst.Write, wal_segment_size), LogwrtResult.Write), errhidestmt(true), errhidecontext(true)));
 
+
                     LogwrtResult.Flush = LogwrtResult.Write; /* end of page */
 
                     if (XLogArchivingActive())
@@ -2731,7 +2740,12 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
                         * the write request to the end of the record.
                          */
                         if (WriteRqst.Write < lastSegContRecEnd)
+                        {
+                            ereport(LOG, (errmsg("EXTEND-RQST: (%lX, %lX(%lX)) %lX(%lX)", firstSegContRecStart, lastSegContRecEnd, contendseg, WriteRqst.Write, reqendseg), errhidestmt(true), errhidecontext(true)));
 
                             WriteRqst.Write = lastSegContRecEnd;
+                        }
+                        else
+                            ereport(LOG, (errmsg("NOT-EXTEND-RQST: (%lX, %lX(%lX)) %lX(%lX)", firstSegContRecStart, lastSegContRecEnd, contendseg, WriteRqst.Write, reqendseg), errhidestmt(true), errhidecontext(true)));
 
 
                         /*
                          * Flush up to at least the same location immediately
@@ -2752,6 +2766,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
                          * the segment soon since we have written up to further
                          * segments.
                          */
+                        ereport(LOG, (errmsg("SKIP-NOTIFY: (%lX, %lX(%lX)) %lX(%lX) => %d", firstSegContRecStart, lastSegContRecEnd, contendseg, WriteRqst.Write, reqendseg, (contendseg == reqendseg && WriteRqst.Write < lastSegContRecEnd)), errhidestmt(true), errhidecontext(true)));
 
                         notify_seg = false;
                     }
                 }
@@ -2852,7 +2867,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
      * need to notify the last segment here.
      */
     if (notify_seg)
+    {
+        ereport(LOG, (errmsg("NOTIFY2 %lX-%lX: (%lX, %lX) %lX, seg %lX", GetLastNotifiedSegment() + 1, openLogSegNo - 1, XLogCtl->firstSegContRecStart, XLogCtl->lastSegContRecEnd, LogwrtResult.Flush, openLogSegNo - 1), errhidestmt(true), errhidecontext(true)));
+
         NotifySegmentsUpTo(openLogSegNo - 1);
+    }
 }
 
 /*
-- 
2.27.0


Re: archive status ".ready" files may be created too early

From
Andrey Borodin
Date:
Hi!

I was looking to review something in CF. This seems like a thread of some interest to me.

Recently we had a somewhat related incident. Do I understand correctly that this incident is related to the bug discussed in this thread?

Primary instance was killed by OOM
[ 2020-11-12 15:27:03.732 MSK ,,,739,00000 ]:LOG:  server process (PID 40189) was terminated by signal 9: Killed
after recovery it archived some WAL segments.
[ 2020-11-12 15:27:31.477 MSK ,,,739,00000 ]:LOG:  database system is ready to accept connections
INFO: 2020/11/12 15:27:32.059541 FILE PATH: 0000000E0001C02F000000AF.br
INFO: 2020/11/12 15:27:32.114319 FILE PATH: 0000000E0001C02F000000B3.br

then PITR failed on another host
[ 2020-11-12 16:26:33.024 MSK ,,,51414,00000 ]:LOG:  restored log file "0000000E0001C02F000000B3" from archive
[ 2020-11-12 16:26:33.042 MSK ,,,51414,00000 ]:LOG:  invalid record length at 1C02F/B3FFF778: wanted 24, got 0
[ 2020-11-12 16:26:33.042 MSK ,,,51414,00000 ]:LOG:  invalid record length at 1C02F/B3FFF778: wanted 24, got 0

archived segment has some zeroes at the end
rmgr: XLOG        len (rec/tot):     51/  1634, tx:          0, lsn: 1C02F/B3FFF058, prev 1C02F/B3FFEFE8, desc: FPI_FOR_HINT, blkref #0: rel 1663/14030/16384 blk 140 FPW
rmgr: Heap        len (rec/tot):    129/   129, tx: 3890578935, lsn: 1C02F/B3FFF6C0, prev 1C02F/B3FFF058, desc: HOT_UPDATE off 34 xmax 3890578935 ; new off 35 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot):     46/    46, tx: 3890578935, lsn: 1C02F/B3FFF748, prev 1C02F/B3FFF6C0, desc: COMMIT 2020-11-12 15:27:31.507363 MSK
pg_waldump: FATAL:  error in WAL record at 1C02F/**B3FFF748**: invalid record length at 1C02F/**B3FFF778**: wanted 24, got 0

Meanwhile next segment points to previous record at **B3FFF748**
postgres@man-odszl7u4361o8m3z:/tmp$ pg_waldump 0000000E0001C02F000000B4 | head
rmgr: Heap        len (rec/tot):    129/   129, tx: 3890578936, lsn: 1C02F/B4000A68, prev 1C02F/**B3FFF778**, desc: HOT_UPDATE off 35 xmax 3890578936 ; new off 36 xmax 0, blkref #0: rel 1663/14030/16384 blk 140
rmgr: Transaction len (rec/tot):     46/    46, tx: 3890578936, lsn: 1C02F/B4000AF0, prev 1C02F/B4000A68, desc: COMMIT 2020-11-12 15:27:32.509443 MSK

Best regards, Andrey Borodin.


Re: archive status ".ready" files may be created too early

From
Andrey Borodin
Date:
Hi!

Thanks for working on this.

> 18 дек. 2020 г., в 10:42, Kyotaro Horiguchi <horikyota.ntt@gmail.com> написал(а):
>
> I noticed that we can cause the continuation record flushed
> immedately.

I've took a look into the code and want to share some thoughts.

1. Maybe we could avoid interleaving fields protected by different locks in XLogCtlData? We could place
lastNotifiedSeg somewhere near the fields that are protected by WALWriteLock. I'm not sure it's a useful idea.
2. In XLogInsertRecord() we release &XLogCtl->info_lck just to compute a few bytes, and possibly acquire
it back. Maybe just hold the lock a little longer?
3. Could what GetLastNotifiedSegment() does just be an atomic read? I'm not sure it's common practice
(a sketch follows below).
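
For point 3, I mean something like this (just a sketch, assuming
lastNotifiedSeg were changed to a pg_atomic_uint64):

    /* in XLogCtlData: pg_atomic_uint64 lastNotifiedSeg; */

    static XLogSegNo
    GetLastNotifiedSegment(void)
    {
        /* no spinlock needed for a single 8-byte atomic read */
        return (XLogSegNo) pg_atomic_read_u64(&XLogCtl->lastNotifiedSeg);
    }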

Thanks!

Best regards, Andrey Borodin.


Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> At Thu, 17 Dec 2020 22:20:35 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>> > You're right in that regard. There's a window where partial record is
>> > written when write location passes F0 after insertion location passes
>> > F1. However, remembering all spanning records seems overkilling to me.
>>
>> I'm curious why you feel that recording all cross-segment records is
>> overkill.  IMO it seems far simpler to just do that rather than try to
>
> Sorry, my words are not enough. Remembering all spanning records in
> *shared memory* seems to be overkilling.  Much more if it is stored in
> shared hash table.  Even though it rarely the case, it can fail hard
> way when reaching the limit. If we could do well by remembering just
> two locations, we wouldn't need to worry about such a limitation.

I don't think it will fail if we reach max_size for the hash table.
The comment above ShmemInitHash() has this note:

    * max_size is the estimated maximum number of hashtable entries.  This is
    * not a hard limit, but the access efficiency will degrade if it is
    * exceeded substantially (since it's used to compute directory size and
    * the hash table buckets will get overfull).
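
For illustration, such a table can be set up roughly like this (a
simplified sketch, not the exact patch code; RecordBoundaryEntry is an
illustrative entry type):

    static HTAB *boundary_table;

    static void
    BoundaryTableShmemInit(void)
    {
        HASHCTL     info;

        memset(&info, 0, sizeof(info));
        info.keysize = sizeof(XLogSegNo);
        info.entrysize = sizeof(RecordBoundaryEntry);

        /*
         * Exceeding max_size (16 here) degrades lookup efficiency but
         * does not fail outright, per the comment quoted above.
         */
        boundary_table = ShmemInitHash("Archive notify record boundaries",
                                       16, 16, &info,
                                       HASH_ELEM | HASH_BLOBS);
    }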

> Another concern about the concrete patch:
>
> NotifySegmentsReadyForArchive() searches the shared hash, acquiring an
> LWLock, every time XLogWrite is called while segment archiving is being
> held off.  I don't think that is acceptable, and it could be a
> problem when many backends are competing on WAL.

This is a fair point.  I did some benchmarking with a few hundred
connections all doing writes, and I was not able to discern any
noticeable performance impact.  My guess is that contention on this
new lock is unlikely because callers of XLogWrite() must already hold
WALWriteLock.  Otherwise, I believe we acquire ArchNotifyLock no
more than once per segment to record new record boundaries.

Nathan


Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 1/2/21, 8:55 AM, "Andrey Borodin" <x4mmm@yandex-team.ru> wrote:
> Recently we had somewhat related incident. Do I understand correctly that this incident is related to the bug
discussedin this thread?
 

I'm not sure that we can rule it out, but the log pattern I've
typically seen for this is "invalid contrecord length."  The issue is
that we're marking segments as ready for archive when the segment is
fully written versus when its WAL records are fully written (since its
WAL records may cross into the next segment).  The fact that you're
seeing zeroes at the end of your archived segments leads me to think
it is unlikely that you are experiencing this bug.

Nathan


Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
At Tue, 26 Jan 2021 19:13:57 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > At Thu, 17 Dec 2020 22:20:35 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> >> > You're right in that regard. There's a window where partial record is
> >> > written when write location passes F0 after insertion location passes
> >> > F1. However, remembering all spanning records seems overkilling to me.
> >>
> >> I'm curious why you feel that recording all cross-segment records is
> >> overkill.  IMO it seems far simpler to just do that rather than try to
> >
> > Sorry, my words are not enough. Remembering all spanning records in
> > *shared memory* seems to be overkilling.  Much more if it is stored in
> > shared hash table.  Even though it rarely the case, it can fail hard
> > way when reaching the limit. If we could do well by remembering just
> > two locations, we wouldn't need to worry about such a limitation.
> 
> I don't think it will fail if we reach max_size for the hash table.
> The comment above ShmemInitHash() has this note:
> 
>     * max_size is the estimated maximum number of hashtable entries.  This is
>     * not a hard limit, but the access efficiency will degrade if it is
>     * exceeded substantially (since it's used to compute directory size and
>     * the hash table buckets will get overfull).

That description means that a shared hash has a directory with a fixed
size, thus there may be synonyms, which causes degradation. Even though
buckets are preallocated with the specified number, since the minimum
directory size is 256, at least 256 buckets end up being allocated in
the long run. The minimum on-the-fly allocation size is 32. I haven't
calculated it precisely, but I'm worried about the amount of spare
shared memory the hash can allocate.

> > Another concern about the concrete patch:
> >
> > NotifySegmentsReadyForArchive() searches the shared hash, acquiring an
> > LWLock, every time XLogWrite is called while segment archiving is being
> > held off.  I don't think that is acceptable, and it could be a
> > problem when many backends are competing on WAL.
> 
> This is a fair point.  I did some benchmarking with a few hundred
> connections all doing writes, and I was not able to discern any
> noticeable performance impact.  My guess is that contention on this
> new lock is unlikely because callers of XLogWrite() must already hold
> WALWriteLock.  Otherwise, I believe we only acquire ArchNotifyLock no
> more than once per segment to record new record boundaries.

Thanks. I agree that the reader-reader contention is not a problem due
to existing serialization by WALWriteLock. Adding an entry happens
only at segment boundary so the ArchNotifyLock doesn't seem to be a
problem.

However, the function prolongs the WALWriteLock section. Couldn't we
somehow move the call to NotifySegmentsReadyForArchive in XLogWrite
out of the WALWriteLock section?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 1/26/21, 6:36 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> At Tue, 26 Jan 2021 19:13:57 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> On 12/17/20, 9:15 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>> > At Thu, 17 Dec 2020 22:20:35 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> >> On 12/15/20, 2:33 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
>> >> > You're right in that regard. There's a window where partial record is
>> >> > written when write location passes F0 after insertion location passes
>> >> > F1. However, remembering all spanning records seems overkilling to me.
>> >>
>> >> I'm curious why you feel that recording all cross-segment records is
>> >> overkill.  IMO it seems far simpler to just do that rather than try to
>> >
>> > Sorry, my words are not enough. Remembering all spanning records in
>> > *shared memory* seems to be overkilling.  Much more if it is stored in
>> > shared hash table.  Even though it rarely the case, it can fail hard
>> > way when reaching the limit. If we could do well by remembering just
>> > two locations, we wouldn't need to worry about such a limitation.
>>
>> I don't think it will fail if we reach max_size for the hash table.
>> The comment above ShmemInitHash() has this note:
>>
>>     * max_size is the estimated maximum number of hashtable entries.  This is
>>     * not a hard limit, but the access efficiency will degrade if it is
>>     * exceeded substantially (since it's used to compute directory size and
>>     * the hash table buckets will get overfull).
>
> That description means that a shared hash has a directory with fixed
> size thus there may be synonyms, which causes degradation. Even though
> buckets are preallocated with the specified number, since the minimum
> directory size is 256, buckets are allocated at least 256 in a long
> run. Minimum on-the-fly allocation size is 32. I haven't calcuated
> further precicely, but I'm worried about the amount of spare shared
> memory the hash can allocate.

On my machine, hash_estimate_size() for the table returns 5,968 bytes.
That estimate is for a max_size of 16.  In my testing, I've been able
to need up to 6 elements in this table, but that required turning off
synchronous_commit, adding a long sleep at the end of XLogWrite(), and
increasing wal_buffers substantially.  This leads me to think that a
max_size of 16 elements is typically sufficient.  (I may have also
accidentally demonstrated that only storing two record boundaries
could be insufficient.)
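
(For reference, that number comes from something along these lines,
with RecordBoundaryEntry again being an illustrative entry type:)

    Size        sz;

    sz = hash_estimate_size(16, sizeof(RecordBoundaryEntry));
    elog(LOG, "record boundary table: %zu bytes", sz);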

>> > Another concern about the concrete patch:
>> >
>> > NotifySegmentsReadyForArchive() searches the shared hash, acquiring an
>> > LWLock, every time XLogWrite is called while segment archiving is being
>> > held off.  I don't think that is acceptable, and it could be a
>> > problem when many backends are competing on WAL.
>>
>> This is a fair point.  I did some benchmarking with a few hundred
>> connections all doing writes, and I was not able to discern any
>> noticeable performance impact.  My guess is that contention on this
>> new lock is unlikely because callers of XLogWrite() must already hold
>> WALWriteLock.  Otherwise, I believe we only acquire ArchNotifyLock no
>> more than once per segment to record new record boundaries.
>
> Thanks. I agree that the reader-reader contention is not a problem due
> to existing serialization by WALWriteLock. Adding an entry happens
> only at segment boundary so the ArchNotifyLock doesn't seem to be a
> problem.
>
> However the function prolongs the WALWriteLock section. Couldn't we
> somehow move the call to NotifySegmentsReadyForArchive in XLogWrite
> out of the WALWriteLock section?

I don't see a clean way to do that.  XLogWrite() assumes that
WALWriteLock is held when it is called, and it doesn't release it at
any point.  I think we'd have to move NotifySegmentsReadyForArchive()
to the callers of XLogWrite() if we wanted to avoid holding onto
WALWriteLock for longer.  Unless we can point to a measurable
performance penalty, I'm not sure this is worth it.

Nathan


Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
Here is a rebased version of my patch.  As before, I'm not yet
handling replication, archive_timeout, and persisting latest-marked-
ready through crashes.  If this approach seems reasonable to others,
I'll go ahead and start working on these items.

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
Alright, I've attached a new patch set for this.

0001 is similar to the last patch I sent in this thread, although it
contains a few fixes.  The main difference is that we no longer
initialize lastNotifiedSeg in StartupXLOG().  Instead, we initialize
it in XLogWrite() where we previously were creating the archive status
files.  This ensures that standby servers do not create many
unnecessary archive status files after promotion.

0002 adds logic for persisting the last notified segment through
crashes.  This is needed because a poorly-timed crash could otherwise
cause us to skip marking segments as ready-for-archival altogether.
This file is only used for primary servers, as there exists a separate
code path for marking segments as ready-for-archive for standbys.
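
For what it's worth, the idea is along these lines (a sketch only; the
file name and helper here are made up, and the actual patch differs in
details):

    /* hypothetical location under the data directory */
    #define LAST_NOTIFIED_FILE      "pg_wal/last_notified"
    #define LAST_NOTIFIED_FILE_TMP  LAST_NOTIFIED_FILE ".tmp"

    static void
    PersistLastNotifiedSeg(XLogSegNo seg)
    {
        FILE       *f = AllocateFile(LAST_NOTIFIED_FILE_TMP, "w");

        if (f == NULL)
            ereport(ERROR,
                    (errmsg("could not create \"%s\"",
                            LAST_NOTIFIED_FILE_TMP)));
        fprintf(f, "%lu\n", (unsigned long) seg);
        FreeFile(f);

        /* write to a temporary file, then durably rename into place */
        durable_rename(LAST_NOTIFIED_FILE_TMP, LAST_NOTIFIED_FILE, ERROR);
    }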

I considered attempting to prevent this bug from affecting standby
servers by withholding WAL for a segment until the previous segment
has been marked ready-for-archival.  However, that would require us to
track record boundaries even with archiving turned off.  Also, my
patch relied on the assumption that the flush pointer advances along
record boundaries except for records that span multiple segments.
This assumption is likely not always true, and even if it is, it seems
pretty fragile.  Furthermore, I suspect that there are still problems
with standbys since the code path responsible for creating archive
status files on standbys has even less context about the WAL record
boundaries.  IMO patches 0001 and 0002 should be the focus for now,
and related bugs for standby servers should be picked up in a new
thread.

I ended up not touching archive_timeout at all.  The documentation for
this parameter seems to be written ambiguously enough such that any
small differences in behavior with these patches is still acceptable.
I don't expect that users will see much change.  In the worst case,
the timer for archive_timeout may get reset a bit before the segment's
archive status file is created.

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
Alvaro Herrera
Date:
On 2021-Feb-19, Bossart, Nathan wrote:

> 0002 adds logic for persisting the last notified segment through
> crashes.  This is needed because a poorly-timed crash could otherwise
> cause us to skip marking segments as ready-for-archival altogether.
> This file is only used for primary servers, as there exists a separate
> code path for marking segments as ready-for-archive for standbys.

I'm not sure I understand what's the reason not to store this value in
pg_control; I feel like I'm missing something.  Can you please explain?

There were some comments earlier in the thread about the maximum size of
a record.  As I recall, you can have records of arbitrary size if you
have COMMIT with a large number of relation invalidation messages being
included in the xlog record, or a large number of XIDs of
subtransactions in the transaction.  Spanning several segments is
possible, AFAIU.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 7/27/21, 6:05 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Feb-19, Bossart, Nathan wrote:
>
>> 0002 adds logic for persisting the last notified segment through
>> crashes.  This is needed because a poorly-timed crash could otherwise
>> cause us to skip marking segments as ready-for-archival altogether.
>> This file is only used for primary servers, as there exists a separate
>> code path for marking segments as ready-for-archive for standbys.
>
> I'm not sure I understand what's the reason not to store this value in
> pg_control; I feel like I'm missing something.  Can you please explain?

Thanks for taking a look.

The only reason I can think of is that it could make back-patching
difficult.  I don't mind working on a version of the patch that uses
pg_control.  Back-patching this fix might be a stretch, anyway.

> There were some comments earlier in the thread about the maximum size of
> a record.  As I recall, you can have records of arbitrary size if you
> have COMMIT with a large number of relation invalidation messages being
> included in the xlog record, or a large number of XIDs of
> subtransactions in the transaction.  Spanning several segments is
> possible, AFAIU.

This is my understanding, too.

Nathan


Re: archive status ".ready" files may be created too early

From
Alvaro Herrera
Date:
On 2021-Jul-28, Bossart, Nathan wrote:

> On 7/27/21, 6:05 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:

> > I'm not sure I understand what's the reason not to store this value in
> > pg_control; I feel like I'm missing something.  Can you please explain?
> 
> Thanks for taking a look.
> 
> The only reason I can think of is that it could make back-patching
> difficult.  I don't mind working on a version of the patch that uses
> pg_control.  Back-patching this fix might be a stretch, anyway.

Hmm ... I'm not sure we're prepared to backpatch this kind of change.
It seems a bit too disruptive to how replay works.  I think patch we
should be focusing solely on patch 0001 to surgically fix the precise
bug you see.  Does patch 0002 exist because you think that a system with
only 0001 will not correctly deal with a crash at the right time?


Now, the reason I'm looking at this patch series is that we're seeing a
related problem with walsender/walreceiver, which apparently are capable
of creating a file in the replica that ends up not existing in the
primary after a crash, for a reason closely related to what you
describe for WAL archival.  I'm not sure what is going on just yet, so
I'm not going to try and explain because I'm likely to get it wrong.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 7/30/21, 11:34 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> Hmm ... I'm not sure we're prepared to backpatch this kind of change.
> It seems a bit too disruptive to how replay works.  I think patch we
> should be focusing solely on patch 0001 to surgically fix the precise
> bug you see.  Does patch 0002 exist because you think that a system with
> only 0001 will not correctly deal with a crash at the right time?

Yes, that was what I was worried about.  However, I just performed a
variety of tests with just 0001 applied, and I am beginning to suspect
my concerns were unfounded.  With wal_buffers set very high,
synchronous_commit set to off, and a long sleep at the end of
XLogWrite(), I can reliably cause the archive status files to lag far
behind the current open WAL segment.  However, even if I crash at this
time, the .ready files are created when the server restarts (albeit
out of order).  This appears to be due to the call to
XLogArchiveCheckDone() in RemoveOldXlogFiles().  Therefore, we can
likely abandon 0002.

> Now, the reason I'm looking at this patch series is that we're seeing a
> related problem with walsender/walreceiver, which apparently are capable
> of creating a file in the replica that ends up not existing in the
> primary after a crash, for a reason closely related to what you
> describe for WAL archival.  I'm not sure what is going on just yet, so
> I'm not going to try and explain because I'm likely to get it wrong.

I've suspected that this is due to the use of the flushed location for
the send pointer, which AFAICT needn't align with a WAL record
boundary.

                /*
                 * Streaming the current timeline on a primary.
                 *
                 * Attempt to send all data that's already been written out and
                 * fsync'd to disk.  We cannot go further than what's been written out
                 * given the current implementation of WALRead().  And in any case
                 * it's unsafe to send WAL that is not securely down to disk on the
                 * primary: if the primary subsequently crashes and restarts, standbys
                 * must not have applied any WAL that got lost on the primary.
                 */
                 SendRqstPtr = GetFlushRecPtr();

Nathan


Re: archive status ".ready" files may be created too early

From
Alvaro Herrera
Date:
On 2021-Jul-30, Bossart, Nathan wrote:

> On 7/30/21, 11:34 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> > Hmm ... I'm not sure we're prepared to backpatch this kind of change.
> > It seems a bit too disruptive to how replay works.  I think patch we
> > should be focusing solely on patch 0001 to surgically fix the precise
> > bug you see.  Does patch 0002 exist because you think that a system with
> > only 0001 will not correctly deal with a crash at the right time?
> 
> Yes, that was what I was worried about.  However, I just performed a
> variety of tests with just 0001 applied, and I am beginning to suspect
> my concerns were unfounded.  With wal_buffers set very high,
> synchronous_commit set to off, and a long sleep at the end of
> XLogWrite(), I can reliably cause the archive status files to lag far
> behind the current open WAL segment.  However, even if I crash at this
> time, the .ready files are created when the server restarts (albeit
> out of order).  This appears to be due to the call to
> XLogArchiveCheckDone() in RemoveOldXlogFiles().  Therefore, we can
> likely abandon 0002.

That's great to hear.  I'll give 0001 a look again.

> > Now, the reason I'm looking at this patch series is that we're seeing a
> > related problem with walsender/walreceiver, which apparently are capable
> > of creating a file in the replica that ends up not existing in the
> > primary after a crash, for a reason closely related to what you
> > describe for WAL archival.  I'm not sure what is going on just yet, so
> > I'm not going to try and explain because I'm likely to get it wrong.
> 
> I've suspected that this is due to the use of the flushed location for
> the send pointer, which AFAICT needn't align with a WAL record
> boundary.

Yeah, I had gotten as far as the GetFlushRecPtr but haven't tracked down
what happens with a contrecord.


-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 7/30/21, 3:23 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> That's great to hear.  I'll give 0001 a look again.

Much appreciated.

Nathan


Re: archive status ".ready" files may be created too early

From
Alvaro Herrera
Date:
On 2021-Jul-30, Bossart, Nathan wrote:

> Yes, that was what I was worried about.  However, I just performed a
> variety of tests with just 0001 applied, and I am beginning to suspect
> my concerns were unfounded.  With wal_buffers set very high,
> synchronous_commit set to off, and a long sleep at the end of
> XLogWrite(), I can reliably cause the archive status files to lag far
> behind the current open WAL segment.  However, even if I crash at this
> time, the .ready files are created when the server restarts (albeit
> out of order).

I think that creating files out of order might be problematic.  On the
archiver side, pgarch_readyXlog() expects to return the oldest
archivable file; but if we create a newer segment's .ready file first,
it is possible that a directory scan would return that newer file before
the older segment's .ready file appears.

However, the comments in pgarch_readyXlog() aren't super convincing that
processing the files in order is actually a correctness requirement, so
perhaps it doesn't matter all that much.
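
For readers following along, that oldest-first expectation is implemented
as a plain directory scan that keeps the smallest file name.  A condensed
sketch of that part of pgarch_readyXlog() (history-file handling and
other details trimmed):

    DIR        *rldir;
    struct dirent *rlde;
    char        newxlog[MAX_XFN_CHARS + 6 + 1];
    bool        found = false;

    rldir = AllocateDir(XLOGDIR "/archive_status");
    while ((rlde = ReadDir(rldir, XLOGDIR "/archive_status")) != NULL)
    {
        int         basenamelen = (int) strlen(rlde->d_name) - 6;

        if (basenamelen <= 0 || basenamelen > MAX_XFN_CHARS ||
            strcmp(rlde->d_name + basenamelen, ".ready") != 0)
            continue;

        /* smallest name == oldest segment, given the WAL naming scheme */
        if (!found || strcmp(rlde->d_name, newxlog) < 0)
        {
            strcpy(newxlog, rlde->d_name);
            found = true;
        }
    }
    FreeDir(rldir);

So a .ready file that appears late is simply picked up on a later scan;
nothing in this loop requires the files to appear in order.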


I noticed that XLogCtl->lastNotifiedSeg is protected by both the
info_lck and ArchNotifyLock.  I think if it's going to be protected by
the lwlock, then we should drop the use of the spinlock.


We set archiver's latch on each XLogArchiveNotify(), but if we're doing
it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is
better to create all the .ready files first and do PgArchWakeup() at the
end.  I'm not convinced that this is useful but let's at least discard
the idea explicitly if not.
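
To make the two shapes concrete, a sketch (the no-wakeup variant below is
hypothetical, standing in for whatever mechanism would suppress the
per-file latch set; first_pending/last_pending are placeholder bounds):

    /* current shape: every .ready file also pokes the archiver */
    for (seg = first_pending; seg <= last_pending; seg++)
        XLogArchiveNotifySeg(seg);          /* creates .ready, sets latch */

    /* batched alternative: create all .ready files, then one wakeup */
    for (seg = first_pending; seg <= last_pending; seg++)
        XLogArchiveNotifySegNoWakeup(seg);  /* hypothetical variant */
    PgArchWakeup();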

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Always assume the user will do much worse than the stupidest thing
you can imagine."                                (Julien PUYDT)

Attachment

Re: archive status ".ready" files may be created too early

From
Alvaro Herrera
Date:
On 2021-Jul-30, Alvaro Herrera wrote:

> We set archiver's latch on each XLogArchiveNotify(), but if we're doing
> it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is
> better to create all the .ready files first and do PgArchWakeup() at the
> end.  I'm not convinced that this is useful but let's at least discard
> the idea explicitly if not.

hm, this causes an ABI change so it's not backpatchable.


-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"This is what I like so much about PostgreSQL.  Most of the surprises
are of the "oh wow!  That's cool" Not the "oh shit!" kind.  :)"
Scott Marlowe, http://archives.postgresql.org/pgsql-admin/2008-10/msg00152.php



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 7/30/21, 4:52 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> I think that creating files out of order might be problematic.  On the
> archiver side, pgarch_readyXlog() expects to return the oldest
> archivable file; but if we create a newer segment's .ready file first,
> it is possible that a directory scan would return that newer file before
> the older segment's .ready file appears.
>
> However, the comments in pgarch_readyXlog() aren't super convincing that
> processing the files in order is actually a correctness requirement, so
> perhaps it doesn't matter all that much.

I can't think of a reason it'd be needed from a correctness
perspective.  After a quick scan, I couldn't find any promises about
archival order in the documentation, either.  In any case, it doesn't
look like there's a risk that the archiver will skip files when the
.ready files are created out of order.

> I noticed that XLogCtl->lastNotifiedSeg is protected by both the
> info_lck and ArchNotifyLock.  I think if it's going to be protected by
> the lwlock, then we should drop the use of the spinlock.

That seems reasonable to me.  This means that the lock is acquired at
the end of every XLogWrite(), but the other places that acquire the
lock only do so once per WAL segment.  Plus, the call to
NotifySegmentsReadyForArchive() at the end of every XLogWrite() should
usually only need the lock for a short amount of time to retrieve a
value from shared memory.

> We set archiver's latch on each XLogArchiveNotify(), but if we're doing
> it in a loop such as in NotifySegmentsReadyForArchive() perhaps it is
> better to create all the .ready files first and do PgArchWakeup() at the
> end.  I'm not convinced that this is useful but let's at least discard
> the idea explicitly if not.

I don't have a terribly strong opinion, but I would lean towards
setting the latch for each call to XLogArchiveNotify() so that the
archiver process can get started as soon as a segment is ready.
However, I doubt that holding off until the end of the loop has any
discernible effect in most cases.

Nathan


Re: archive status ".ready" files may be created too early

From
Alvaro Herrera
Date:
On 2021-Jul-31, Bossart, Nathan wrote:

> On 7/30/21, 4:52 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:

> > I noticed that XLogCtl->lastNotifiedSeg is protected by both the
> > info_lck and ArchNotifyLock.  I think if it's going to be protected by
> > the lwlock, then we should drop the use of the spinlock.
> 
> That seems reasonable to me.  This means that the lock is acquired at
> the end of every XLogWrite(),

Uhh, actually that there sounds really bad; it's going to be a serious
contention point.

Another option might be to make it an atomic.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 7/31/21, 9:12 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Jul-31, Bossart, Nathan wrote:
>
>> On 7/30/21, 4:52 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
>
>> > I noticed that XLogCtl->lastNotifiedSeg is protected by both the
>> > info_lck and ArchNotifyLock.  I think if it's going to be protected by
>> > the lwlock, then we should drop the use of the spinlock.
>>
>> That seems reasonable to me.  This means that the lock is acquired at
>> the end of every XLogWrite(),
>
> Uhh, actually that there sounds really bad; it's going to be a serious
> contention point.
>
> Another option might be to make it an atomic.

This is why I was trying to get away with just using info_lck for
reading lastNotifiedSeg.  ArchNotifyLock is mostly intended to protect
RecordBoundaryMap.  However, since lastNotifiedSeg is used in
functions like GetLatestRecordBoundarySegment() that access the map, I
found it easier to reason about things if I knew that it wouldn't
change as long as I held ArchNotifyLock.

I think the main downside of making lastNotifiedSeg an atomic is that
the value we first read in NotifySegmentsReadyForArchive() might not
be up-to-date, which means we might hold off creating .ready files
longer than necessary.

Nathan


Re: archive status ".ready" files may be created too early

From
Alvaro Herrera
Date:
On 2021-Jul-31, Bossart, Nathan wrote:

> This is why I was trying to get away with just using info_lck for
> reading lastNotifiedSeg.  ArchNotifyLock is mostly intended to protect
> RecordBoundaryMap.  However, since lastNotifiedSeg is used in
> functions like GetLatestRecordBoundarySegment() that access the map, I
> found it easier to reason about things if I knew that it wouldn't
> change as long as I held ArchNotifyLock.

I think it's okay to make lastNotifiedSeg protected by just info_lck,
and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
acquire the spinlock inside the lwlock-protected area, as long as we
make sure never to do the opposite.  (And we sure don't want to hold
info_lck long enough that a LWLock acquisition would occur in the
meantime).  So I modified things that way, and also added another
function to set the seg if it's unset, with a single spinlock
acquisition (rather than acquire, read, release, acquire, set,
release, which sounds like it would be prone to misbehaving.)
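
For concreteness, a minimal sketch of that set-if-unset helper (the
function name matches the patch under discussion; the body here is
illustrative, assuming lastNotifiedSeg lives in XLogCtl under info_lck):

    /* Initialize lastNotifiedSeg, but only if it has no valid value yet. */
    static void
    SetLastNotifiedSegmentIfInvalid(XLogSegNo seg)
    {
        SpinLockAcquire(&XLogCtl->info_lck);
        if (XLogCtl->lastNotifiedSeg == InvalidXLogSegNo)
            XLogCtl->lastNotifiedSeg = seg;
        SpinLockRelease(&XLogCtl->info_lck);
    }

A single acquire/release means no other backend can sneak a store in
between the test and the set.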

I haven't tried your repro with this yet.

I find it highly suspicious that the patch does an archiver notify (i.e.
creation of the .ready file) in XLogInsertRecord().  Is that a sane
thing to do?  Sounds to me like that should be attempted in XLogFlush
only.  This appeared after Kyotaro's patch at [1] and before your patch
at [2].

[1] https://postgr.es/m/20201014.090628.839639906081252194.horikyota.ntt@gmail.com
[2] https://postgr.es/m/EFF40306-8E8A-4259-B181-C84F3F06636C@amazon.com

I also just realized that Kyotaro's patch there also tried to handle the
streaming replication issue I was talking about.

> I think the main downside of making lastNotifiedSeg an atomic is that
> the value we first read in NotifySegmentsReadyForArchive() might not
> be up-to-date, which means we might hold off creating .ready files
> longer than necessary.

I'm not sure I understand how this would be a problem.  If we block
somebody from setting a newer value, they'll just set the value
immediately after we release the lock.  Will we reread the value
afterwards to see if it changed?

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/

Attachment

Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/2/21, 2:42 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> I think it's okay to make lastNotifiedSeg protected by just info_lck,
> and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
> acquire the spinlock inside the lwlock-protected area, as long as we
> make sure never to do the opposite.  (And we sure don't want to hold
> info_lck long enough that a LWLock acquisition would occur in the
> meantime).  So I modified things that way, and also added another
> function to set the seg if it's unset, with a single spinlock
> acquisition (rather than acquire, read, release, acquire, set,
> release, which sounds like it would be prone to misbehaving.)

The patch looks good to me.

> I find it highly suspicious that the patch does an archiver notify (i.e.
> creation of the .ready file) in XLogInsertRecord().  Is that a sane
> thing to do?  Sounds to me like that should be attempted in XLogFlush
> only.  This appeared after Kyotaro's patch at [1] and before your patch
> at [2].

I believe my worry was that we'd miss notifying a segment as soon as
possible if the record was somehow flushed prior to registering the
record boundary in the map.  If that's actually impossible, then I
would agree that the extra call to NotifySegmentsReadyForArchive() is
unnecessary.

>> I think the main downside of making lastNotifiedSeg an atomic is that
>> the value we first read in NotifySegmentsReadyForArchive() might not
>> be up-to-date, which means we might hold off creating .ready files
>> longer than necessary.
>
> I'm not sure I understand how this would be a problem.  If we block
> somebody from setting a newer value, they'll just set the value
> immediately after we release the lock.  Will we reread the value
> afterwards to see if it changed?

I think you are right.  If we see an old value for lastNotifiedSeg,
the worst case is that we take the ArchNotifyLock, read
lastNotifiedSeg again (which should then be up-to-date), and then
basically do nothing.  I suspect initializing lastNotifiedSeg might
still be a little tricky, though.  Do you think it is important to try
to avoid this spinlock for lastNotifiedSeg?  IIUC it's acquired at the
end of every call to XLogWrite() already, and we'd still need to
acquire it for the flush pointer, anyway.

Nathan


Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
At Mon, 2 Aug 2021 23:28:19 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 8/2/21, 2:42 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> > I think it's okay to make lastNotifiedSeg protected by just info_lck,
> > and RecordBoundaryMap protected by just ArchNotifyLock.  It's okay to
> > acquire the spinlock inside the lwlock-protected area, as long as we
> > make sure never to do the opposite.  (And we sure don't want to hold
> > info_lck long enough that a LWLock acquisition would occur in the
> > meantime).  So I modified things that way, and also added another
> > function to set the seg if it's unset, with a single spinlock
> > acquisition (rather than acquire, read, release, acquire, set,
> > release, which sounds like it would be prone to misbehaving.)
> 
> The patch looks good to me.

+    for (seg = flushed_seg; seg > last_notified; seg--)
+    {
+        RecordBoundaryEntry *entry;
+
+        entry = (RecordBoundaryEntry *) hash_search(RecordBoundaryMap,
+                                                    (void *) &seg, HASH_FIND,

I'm afraid that using a hash to store boundary info is too much.  Isn't a
ring buffer enough for this use?  In that case it is enough to
remember only the end LSN of the segment spanning records.  It is easy
to expand the buffer if needed.
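
For illustration, such a ring could be as simple as this (hypothetical
layout; the capacity is arbitrary and could be grown when full):

    #define MAX_PENDING_BOUNDARIES  64      /* arbitrary initial capacity */

    typedef struct SegmentBoundaryRing
    {
        XLogRecPtr  endpos[MAX_PENDING_BOUNDARIES];     /* record end LSNs */
        int         head;                   /* next free slot */
        int         tail;                   /* oldest pending entry */
    } SegmentBoundaryRing;

Entries are inserted in LSN order, so a consumer can walk from tail and
stop at the first end LSN beyond the flush pointer, with no full-table
iteration.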

+    if (!XLogSegNoIsInvalid(latest_boundary_seg))

It is a matter of taste, but I see latest_boundary_seg !=
InvalidXLogSegNo more frequently, maybe to avoid double negation.


@@ -1167,10 +1195,33 @@ XLogInsertRecord(XLogRecData *rdata,
         SpinLockRelease(&XLogCtl->info_lck);
     }
 
+    /*
+     * Record the record boundary if we crossed the segment boundary.  This is
...
+    XLByteToSeg(StartPos, StartSeg, wal_segment_size);
+    XLByteToSeg(EndPos, EndSeg, wal_segment_size);
+
+    if (StartSeg != EndSeg && XLogArchivingActive())
+    {

The immediately preceding if block is for cross-page records, so we
can reduce the overhead of the above calculations by moving them into
that if block.


+RegisterRecordBoundaryEntry(XLogSegNo seg, XLogRecPtr pos)

The seg argument is restricted to the segment that pos resides on.
The caller is freed from caring about that restriction if the function
takes only pos.  It adds a small overhead to calculate the segment
number from the LSN, but I think it doesn't matter much.  (Or if we
don't use a hash, that calculation is not required at all.)


@@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
                 LogwrtResult.Flush = LogwrtResult.Write;    /* end of page */
 
                 if (XLogArchivingActive())
-                    XLogArchiveNotifySeg(openLogSegNo);
+                    SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);

Is it safe?  If the server didn't notify of the WAL files for, say,
the three most recently finished segments in its previous life, they
need to be archived in this lifetime.  But this may omit all three of
them.  (I didn't confirm that behavior..)

> > I find it highly suspicious that the patch does an archiver notify (i.e.
> > creation of the .ready file) in XLogInsertRecord().  Is that a sane
> > thing to do?  Sounds to me like that should be attempted in XLogFlush
> > only.  This appeared after Kyotaro's patch at [1] and before your patch
> > at [2].
> 
> I believe my worry was that we'd miss notifying a segment as soon as
> possible if the record was somehow flushed prior to registering the
> record boundary in the map.  If that's actually impossible, then I
> would agree that the extra call to NotifySegmentsReadyForArchive() is
> unnecessary.

I don't think that XLogWrite(up to LSN=X) can happen before
XLogInsert(endpos = X) ends.

> >> I think the main downside of making lastNotifiedSeg an atomic is that
> >> the value we first read in NotifySegmentsReadyForArchive() might not
> >> be up-to-date, which means we might hold off creating .ready files
> >> longer than necessary.
> >
> > I'm not sure I understand how this would be a problem.  If we block
> > somebody from setting a newer value, they'll just set the value
> > immediately after we release the lock.  Will we reread the value
> > afterwards to see if it changed?
> 
> I think you are right.  If we see an old value for lastNotifiedSeg,
> the worst case is that we take the ArchNotifyLock, read
> lastNotifiedSeg again (which should then be up-to-date), and then

Agreed.

> basically do nothing.  I suspect initializing lastNotifiedSeg might
> still be a little tricky, though.  Do you think it is important to try
> to avoid this spinlock for lastNotifiedSeg?  IIUC it's acquired at the
> end of every call to XLogWrite() already, and we'd still need to
> acquire it for the flush pointer, anyway.

As mentioned above, I think it needs more consideration.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/2/21, 7:37 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> I'm afraid that using a hash to store boundary info is too much.  Isn't a
> ring buffer enough for this use?  In that case it is enough to
> remember only the end LSN of the segment spanning records.  It is easy
> to expand the buffer if needed.

I agree that the hash table requires a bit more memory than what is
probably necessary, but I'm not sure I agree that maintaining a custom
data structure to save a few kilobytes of memory is worth the effort.

> @@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
>                                 LogwrtResult.Flush = LogwrtResult.Write;        /* end of page */
>
>                                 if (XLogArchivingActive())
> -                                       XLogArchiveNotifySeg(openLogSegNo);
> +                                       SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);
>
> Is it safe?  If the server didn't notify of the WAL files for, say,
> the three most recently finished segments in its previous life, they
> need to be archived in this lifetime.  But this may omit all three of
> them.  (I didn't confirm that behavior..)

I tested this scenario out earlier [0].  It looks like the call to
XLogArchiveCheckDone() in RemoveOldXlogFiles() will take care of
creating any .ready files we missed.

>> I believe my worry was that we'd miss notifying a segment as soon as
>> possible if the record was somehow flushed prior to registering the
>> record boundary in the map.  If that's actually impossible, then I
>> would agree that the extra call to NotifySegmentsReadyForArchive() is
>> unnecessary.
>
> I don't think that XLogWrite(up to LSN=X) can happen before
> XLogInsert(endpos = X) ends.

Is there anything preventing that from happening?  At the location
where we are registering the record boundary, we've already called
CopyXLogRecordToWAL(), and neither the WAL insertion lock nor the
WALWriteLock are held.  Even if we register the boundary before
updating the shared LogwrtRqst.Write, there's a chance that someone
else has already moved it ahead and called XLogWrite().  I think the
worst case scenario is that we hold off creating .ready files longer
than necessary, but IMO that's still a worthwhile thing to do.

Nathan

[0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com


Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
At Tue, 3 Aug 2021 21:32:18 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 8/2/21, 7:37 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > I'm afraid that using a hash to store boundary info is too much.  Isn't a
> > ring buffer enough for this use?  In that case it is enough to
> > remember only the end LSN of the segment spanning records.  It is easy
> > to expand the buffer if needed.
> 
> I agree that the hash table requires a bit more memory than what is
> probably necessary, but I'm not sure I agree that maintaining a custom
> data structure to save a few kilobytes of memory is worth the effort.

Memory is one of my concerns, but the more significant point was the
CPU cycles required by GetLatestRecordBoundarySegment.  So I don't mind
using a hash if the loop over the hash didn't block other backends.

In addition to that, while NotifySegmentsReadyForArchive() is notifying
pending segments, other backends that simultaneously reach there are
blocked until the notification, including file creation, finishes.  I
don't think that's great.  Couldn't we set lastNotifiedSegment before
the loop?  At the moment a backend decides to notify some segments,
others no longer need to consider those segments.  Even if the backend
crashes meanwhile, as you mentioned below, it's safe since the
unnotified segments are notified after restart.

> > @@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
> >                                 LogwrtResult.Flush = LogwrtResult.Write;        /* end of page */
> >
> >                                 if (XLogArchivingActive())
> > -                                       XLogArchiveNotifySeg(openLogSegNo);
> > +                                       SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);
> >
> > Is it safe?  If the server didn't notify of the WAL files for, say,
> > the three most recently finished segments in its previous life, they
> > need to be archived in this lifetime.  But this may omit all three of
> > them.  (I didn't confirm that behavior..)
> 
> I tested this scenario out earlier [0].  It looks like the call to
> XLogArchiveCheckDone() in RemoveOldXlogFiles() will take care of
> creating any .ready files we missed.

Yeah, I recalled that behavior.  In that case crash recovery reads up
to just before the last (continued) record in the last finished
segment.  On the other hand, if crash recovery was able to read that
record, it's safe to archive the last segment immediately after
recovery.  So that behavior is safe.  Thanks!

> >> I believe my worry was that we'd miss notifying a segment as soon as
> >> possible if the record was somehow flushed prior to registering the
> >> record boundary in the map.  If that's actually impossible, then I
> >> would agree that the extra call to NotifySegmentsReadyForArchive() is
> >> unnecessary.
> >
> > I don't think that XLogWrite(up to LSN=X) can happen before
> > XLogInsert(endpos = X) ends.
> 
> Is there anything preventing that from happening?  At the location
> where we are registering the record boundary, we've already called
> CopyXLogRecordToWAL(), and neither the WAL insertion lock nor the
> WALWriteLock are held.  Even if we register the boundary before
> updating the shared LogwrtRqst.Write, there's a chance that someone
> else has already moved it ahead and called XLogWrite().  I think the
> worst case scenario is that we hold off creating .ready files longer
> than necessary, but IMO that's still a worthwhile thing to do.

Oh, boundary registration actually happens after an insertion ends
(but before XLogInsert ends).  The missed segment is never processed
due to the qualification by lastNotifiedSeg.

Would it work for RegisterRecordBoundaryEntry to omit registering the
boundary if it finds lastNotifiedSeg has gone too far?

> Nathan
> 
> [0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com
> 

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
By the way about the v3 patch,

+#define InvalidXLogSegNo    ((XLogSegNo) 0xFFFFFFFFFFFFFFFF)

Like InvalidXLogRecPtr, the first valid segment number is 1 so we can
use 0 as InvalidXLogSegNo.

BootStrapXLOG():
    /* Create first XLOG segment file */
    openLogFile = XLogFileInit(1);

KeepLogSeg():
            /* avoid underflow, don't go below 1 */
            if (currSegNo <= keep_segs)
                segno = 1;

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/4/21, 6:58 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> In addition to that, while NotifySegmentsReadyForArchive() is notifying
> pending segments, other backends that simultaneously reach there are
> blocked until the notification, including file creation, finishes.  I
> don't think that's great.  Couldn't we set lastNotifiedSegment before
> the loop?  At the moment a backend decides to notify some segments,
> others no longer need to consider those segments.  Even if the backend
> crashes meanwhile, as you mentioned below, it's safe since the
> unnotified segments are notified after restart.

That seems reasonable to me.  It looks like we rely on
RemoveOldXlogFiles() even today for when XLogArchiveNotify() fails.  I
updated this in v4 of the patch.

In addition to this change, I also addressed your other feedback by
changing XLogSegNoIsInvalid() to XLogSegNoIsValid() and by moving
record boundary registration to the "if" block for cross-page records.

> Would it work for RegisterRecordBoundaryEntry to omit registering the
> boundary if it finds lastNotifiedSeg has gone too far?

Yeah, there's no reason to add a record boundary if we've already
notified the prior segment.  For that to happen, another cross-segment
record would have to be flushed to disk and
NotifySegmentsReadyForArchive() would have to be called before
registering the boundary.  With that being said, I don't expect an
extra map entry here and there to impact performance enough for us to
worry about it.

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/4/21, 9:05 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> By the way about the v3 patch,
>
> +#define InvalidXLogSegNo       ((XLogSegNo) 0xFFFFFFFFFFFFFFFF)
>
> Like InvalidXLogRecPtr, the first valid segment number is 1 so we can
> use 0 as InvalidXLogSegNo.

It's been a while since I wrote this, but if I remember correctly, the
issue with using 0 is that we could end up initializing
lastNotifiedSeg to InvalidXLogSegNo in XLogWrite().  Eventually, we'd
initialize it to 1, but we will have skipped creating the .ready file
for the first segment.

Nathan


Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
At Thu, 5 Aug 2021 05:15:04 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 8/4/21, 9:05 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > By the way about the v3 patch,
> >
> > +#define InvalidXLogSegNo       ((XLogSegNo) 0xFFFFFFFFFFFFFFFF)
> >
> > Like InvalidXLogRecPtr, the first valid segment number is 1 so we can
> > use 0 as InvalidXLogSegNo.
> 
> It's been a while since I wrote this, but if I remember correctly, the
> issue with using 0 is that we could end up initializing
> lastNotifiedSeg to InvalidXLogSegNo in XLogWrite().  Eventually, we'd
> initialize it to 1, but we will have skipped creating the .ready file
> for the first segment.

Maybe this?

+                    SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);

Hmm.  Theoretically 0 is invalid as a segment number, so we'd better
not use 0 as a valid value of lastNotifiedSeg.

Honestly I don't like having this initialization in XLogWrite.  We
should, and I think can, initialize it earlier.  It seems to me the
most appropriate time to initialize the variable is just before running
the end-of-recovery checkpoint, since StartupXLOG knows the first
segment to write.  If it were set to 0, that wouldn't matter at all.
We can get rid of the new symbol by doing this.

Maybe something like this:

>    {
>        /*
>         * There is no partial block to copy. Just set InitializedUpTo, and
>         * let the first attempt to insert a log record to initialize the next
>         * buffer.
>         */
>        XLogCtl->InitializedUpTo = EndOfLog;
>    }
> 
+    /*
+     * EndOfLog resides on the segment after the last finished one.  Set
+     * the last finished segment as lastNotifiedSeg now.  In case the last
+     * crash left the last several segments not marked as .ready, the
+     * checkpoint just after takes care of all finished segments.  There's
+     * a corner case where the checkpoint advances the segment, but that
+     * ends up at most with a duplicate archive notification.
+     */
+    XLByteToSeg(EndOfLog, EndOfLogSeg, wal_segment_size);
+    Assert(EndOfLogSeg > 0);
+    SetLastNotifiedSegment(EndOfLogSeg - 1);
+
>     LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;

Does this makes sense?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/5/21, 12:39 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> Honestly I don't like having this initialization in XLogWrite.  We
> should, and I think can, initialize it earlier.  It seems to me the
> most appropriate time to initialize the variable is just before running
> the end-of-recovery checkpoint, since StartupXLOG knows the first
> segment to write.  If it were set to 0, that wouldn't matter at all.
> We can get rid of the new symbol by doing this.

This seems like a good idea to me.  I made this change in v5.  I
performed some basic testing, and it seems to reliably initialize
lastNotifiedSeg correctly.

> +    /*
> +     * EndOfLog resides on the segment after the last finished one.  Set
> +     * the last finished segment as lastNotifiedSeg now.  In case the last
> +     * crash left the last several segments not marked as .ready, the
> +     * checkpoint just after takes care of all finished segments.  There's
> +     * a corner case where the checkpoint advances the segment, but that
> +     * ends up at most with a duplicate archive notification.
> +     */

I'm not quite following the corner case you've described here.  Is it
possible that the segment that EndOfLog points to will be eligible for
removal after the checkpoint?

In v5 of the patch, I've also added an extra call to
NotifySegmentsReadyForArchive() in the same place we previously
created the .ready files.  I think this helps notify archiver sooner
in certain cases (e.g., asynchronous commit).

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
Kyotaro Horiguchi
Date:
At Fri, 6 Aug 2021 00:21:34 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 8/5/21, 12:39 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > Honestly I don't like having this initialization in XLogWrite.  We
> > should, and I think can, initialize it earlier.  It seems to me the
> > most appropriate time to initialize the variable is just before running
> > the end-of-recovery checkpoint, since StartupXLOG knows the first
> > segment to write.  If it were set to 0, that wouldn't matter at all.
> > We can get rid of the new symbol by doing this.
> 
> This seems like a good idea to me.  I made this change in v5.  I
> performed some basic testing, and it seems to reliably initialize
> lastNotifiedSeg correctly.
> 
> > +    /*
> > +     * EndOfLog resides on the segment after the last finished one.  Set
> > +     * the last finished segment as lastNotifiedSeg now.  In case the last
> > +     * crash left the last several segments not marked as .ready, the
> > +     * checkpoint just after takes care of all finished segments.  There's
> > +     * a corner case where the checkpoint advances the segment, but that
> > +     * ends up at most with a duplicate archive notification.
> > +     */
> 
> I'm not quite following the corner case you've described here.  Is it
> possible that the segment that EndOfLog points to will be eligible for
> removal after the checkpoint?

Archiving doesn't immediately mean removal.  A finished segment ought
to be archived right away.  Since the EndOfLog segment must not get
marked .ready, setting lastNotifiedSeg to the previous segment is quite
right; but if the end-of-recovery checkpoint advances the segment,
EndOfLog is marked .ready at the XLogFlush just after.  But, sorry,
what I forgot at the time was that the checkpoint also moves
lastNotifiedSeg.  So that corner case does not exist.

> In v5 of the patch, I've also added an extra call to
> NotifySegmentsReadyForArchive() in the same place we previously
> created the .ready files.  I think this helps notify archiver sooner
> in certain cases (e.g., asynchronous commit).

In v5, NotifySegmentsReadyForArchive() still holds ArchNotifyLock
across the .ready file creations.  Since the notification loop doesn't
need the hash itself, couldn't the loop be taken out of the locked
section?

current:
    LWLockAcquire(ArchNotifyLock, LW_EXCLUSIVE);
    last_notified = GetLastNotifiedSegment();
    latest_boundary_seg = GetLatestRecordBoundarySegment(last_notified, flushed, &found);

    if (found)
    {
        SetLastNotifiedSegment(latest_boundary_seg - 1);
        for (seg = last_notified + 1; seg < latest_boundary_seg; seg++)
            XLogArchiveNotifySeg(seg, false);

        RemoveRecordBoundariesUpTo(latest_boundary_seg);

        PgArchWakeup();
    }
    LWLockRelease(ArchNotifyLock);

But we can release the lock earlier.

    LWLockAcquire(ArchNotifyLock, LW_EXCLUSIVE);
    last_notified = GetLastNotifiedSegment();
    latest_boundary_seg = GetLatestRecordBoundarySegment(last_notified, flushed, &found);

    if (found)
    {
        SetLastNotifiedSegment(latest_boundary_seg - 1);
        RemoveRecordBoundariesUpTo(latest_boundary_seg);
    }
    LWLockRelease(ArchNotifyLock);

    if (found)
    {
        for (seg = last_notified + 1; seg < latest_boundary_seg; seg++)
            XLogArchiveNotifySeg(seg, false);

        PgArchWakeup();
    }

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/6/21, 12:42 AM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> In v5, NotifySegmentsReadyForArchive() still holds ArchNotifyLock
> across the .ready file creations.  Since the notification loop doesn't
> need the hash itself, couldn't the loop be taken out of the locked
> section?

I think that works.  This creates another opportunity for archive
status files to be created out of order, but as discussed elsewhere, I
think we have to be prepared for that regardless.  I moved the
notification loop out of the lock section in v6.

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
So why do we call this structure "record boundary map", when the
boundaries it refers to are segment boundaries?  I think we should call
it "segment boundary map" instead ... and also I think we should use
that name in the lock that protects it, so instead of ArchNotifyLock, it
could be SegmentBoundaryLock or perhaps WalSegmentBoundaryLock.

The reason for the latter is that I suspect the segment boundary map
will also have a use to fix the streaming replication issue I mentioned
earlier in the thread.  This also makes me think that we'll want the wal
record *start* address to be in the hash table too, not just its *end*
address.  So we'll use the start-1 as position to send, rather than the
end-of-segment when GetFlushRecPtr() returns that.

As for 0xFFFFFFFFFFFFFFFF, I think it would be cleaner to do a
#define MaxXLogSegNo with that value in the line immediately after
typedef XLogSegNo, rather than use the numerical value directly in the
assignment.

Typo in comment atop RemoveRecordBoundariesUpTo: it reads "up to an",
should read "up to and".

I think the API of GetLatestRecordBoundarySegment would be better
returning the boolean and taking the segment as an out argument.  Then
you could write the caller more cleanly:

if (GetLatestRecordBoundarySegment(last_notified, flushed, &latest_boundary_segment))
{
   SetLastNotified( ... );
   RemoveRecordBoundaries( ... );
   LWLockRelease( ... );
   for (..)
     XLogArchiveNotifySeg(...);
   PgArchWakeup();
}
else
   LWLockRelease(...);

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
Attached is a new version of the patch with all feedback addressed.

On 8/16/21, 5:09 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> The reason for the latter is that I suspect the segment boundary map
> will also have a use to fix the streaming replication issue I mentioned
> earlier in the thread.  This also makes me think that we'll want the wal
> record *start* address to be in the hash table too, not just its *end*
> address.  So we'll use the start-1 as position to send, rather than the
> end-of-segment when GetFlushRecPtr() returns that.

I've been thinking about the next steps for this one, too.  ISTM we'll
need to basically assume that the flush pointer jumps along record
boundaries except for the cross-segment records.  I don't know if that
is the safest assumption, but I think the alternative involves
recording every record boundary in the map.

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-17, Bossart, Nathan wrote:

> On 8/16/21, 5:09 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> > The reason for the latter is that I suspect the segment boundary map
> > will also have a use to fix the streaming replication issue I mentioned
> > earlier in the thread.  This also makes me think that we'll want the wal
> > record *start* address to be in the hash table too, not just its *end*
> > address.  So we'll use the start-1 as position to send, rather than the
> > end-of-segment when GetFlushRecPtr() returns that.
> 
> I've been thinking about the next steps for this one, too.  ISTM we'll
> need to basically assume that the flush pointer jumps along record
> boundaries except for the cross-segment records.  I don't know if that
> is the safest assumption, but I think the alternative involves
> recording every record boundary in the map.

I'm not sure I understand your idea correctly.  Perhaps another solution
is to assume that the flush pointer jumps along record boundaries
*including* for cross-segment records.  The problem stems precisely from
the fact that we set the flush pointer at segment boundaries, even when
they aren't record boundaries.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree.              (Don Knuth)



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/17/21, 10:44 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-17, Bossart, Nathan wrote:
>> I've been thinking about the next steps for this one, too.  ISTM we'll
>> need to basically assume that the flush pointer jumps along record
>> boundaries except for the cross-segment records.  I don't know if that
>> is the safest assumption, but I think the alternative involves
>> recording every record boundary in the map.
>
> I'm not sure I understand your idea correctly.  Perhaps another solution
> is to assume that the flush pointer jumps along record boundaries
> *including* for cross-segment records.  The problem stems precisely from
> the fact that we set the flush pointer at segment boundaries, even when
> they aren't record boundaries.

I think we are in agreement.  If we assume that the flush pointer
jumps along record boundaries and segment boundaries, the solution
would be to avoid using the flush pointer when it points to a segment
boundary (given that the segment boundary is not also a record
boundary).  Instead, we'd only send up to the start position of the
last record in the segment to standbys.

Nathan


Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-17, Bossart, Nathan wrote:

> I think we are in agreement.  If we assume that the flush pointer
> jumps along record boundaries and segment boundaries, the solution
> would be to avoid using the flush pointer when it points to a segment
> boundary (given that the segment boundary is not also a record
> boundary).  Instead, we'd only send up to the start position of the
> last record in the segment to standbys.

Agreed.

An implementation of that would be to test whether the flush pointer
sits on a segment boundary, and in that case (acquire the segment
boundary lock and) test for its presence in the segment boundary map.
If present, then retreat the pointer to the record's start address.

This means that we acquire the segment boundary lock rarely.  I was
concerned that we'd need to acquire it every time we read the flush
pointer, which would have been a disaster.
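
In walsender terms that could look roughly like this (a sketch;
LookupSegmentBoundary() is a hypothetical accessor for the boundary map,
and the lock name follows the renaming proposed upthread):

    SendRqstPtr = GetFlushRecPtr();

    /* Only consult the map when we stopped exactly on a segment boundary. */
    if (XLogSegmentOffset(SendRqstPtr, wal_segment_size) == 0)
    {
        XLogRecPtr  recstart;

        LWLockAcquire(SegmentBoundaryLock, LW_SHARED);
        if (LookupSegmentBoundary(SendRqstPtr, &recstart))
            SendRqstPtr = recstart;     /* retreat to the record's start */
        LWLockRelease(SegmentBoundaryLock);
    }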

Thanks

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
The thing I still don't understand about this patch is why we call
RegisterSegmentBoundaryEntry and NotifySegmentsReadyForArchive in
XLogInsertRecord.

My model concept of this would have these routines called only in
XLogFlush / XLogWrite, which are conceptually about transferring data
from in-memory WAL buffers into the on-disk WAL store (pg_xlog files).
As I understand, XLogInsertRecord is about copying bytes from the
high-level operation (heap insert etc) into WAL buffers.  So doing the
register/notify dance in both places should be redundant and
unnecessary.


In the NotifySegmentsReadyForArchive() call at the bottom of XLogWrite,
we use flushed=InvalidXLogRecPtr.  Why is that?  Surely we can use
LogwrtResult.Flush, just like in the other callsite there?  (If we're
covering for somebody advancing FlushRecPtr concurrently, I think we
should add a comment to explain that.  But why do we need to do that in the
first place?)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/17/21, 1:24 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> The thing I still don't understand about this patch is why we call
> RegisterSegmentBoundaryEntry and NotifySegmentsReadyForArchive in
> XLogInsertRecord.
>
> My model concept of this would have these routines called only in
> XLogFlush / XLogWrite, which are conceptually about transferring data
> from in-memory WAL buffers into the on-disk WAL store (pg_xlog files).
> As I understand, XLogInsertRecord is about copying bytes from the
> high-level operation (heap insert etc) into WAL buffers.  So doing the
> register/notify dance in both places should be redundant and
> unnecessary.

The main reason for registering the boundaries in XLogInsertRecord()
is that it has the required information about the WAL record
boundaries.  I do not think that XLogWrite() has this information.
If we assumed that write requests always pointed to record boundaries,
we could probably just move the XLogArchiveNotifySeg() calls to the
end of XLogWrite(), which is what my original patch [0] did.

> In the NotifySegmentsReadyForArchive() call at the bottom of XLogWrite,
> we use flushed=InvalidXLogRecPtr.  Why is that?  Surely we can use
> LogwrtResult.Flush, just like in the other callsite there?  (If we're
> covering for somebody advancing FlushRecPtr concurrently, I think we
> add a comment to explain that.  But why do we need to do that in the
> first place?)

Good point.  I did this in the new version of the patch.

Nathan

[0] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com


Attachment

Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-17, Bossart, Nathan wrote:

> The main reason for registering the boundaries in XLogInsertRecord()
> is that it has the required information about the WAL record
> boundaries.  I do not think that XLogWrite() has this information.

Doh, of course.  So, why isn't it that we call Register in
XLogInsertRecord, and Notify in XLogWrite?

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/17/21, 2:13 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-17, Bossart, Nathan wrote:
>
>> The main reason for registering the boundaries in XLogInsertRecord()
>> is that it has the required information about the WAL record
>> boundaries.  I do not think that XLogWrite() has this information.
>
> Doh, of course.  So, why isn't it that we call Register in
> XLogInsertRecord, and Notify in XLogWrite?

We do.  However, we also call NotifySegmentsReadyForArchive() in
XLogInsertRecord() to handle the probably-unlikely scenario that the
flush pointer has already advanced past the to-be-registered boundary.
This ensures that the .ready files are created as soon as possible.

Nathan


Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-17, Bossart, Nathan wrote:
> On 8/17/21, 2:13 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
>
> > So, why isn't it that we call Register in XLogInsertRecord, and
> > Notify in XLogWrite?
> 
> We do.  However, we also call NotifySegmentsReadyForArchive() in
> XLogInsertRecord() to handle the probably-unlikely scenario that the
> flush pointer has already advanced past the to-be-registered boundary.
> This ensures that the .ready files are created as soon as possible.

I see.

I have two thoughts on that.  First, why not do it outside the block
that tests for crossing a segment boundary?  If that's a good thing to
do, then we should do it always.

However, why do it in a WAL-producing client-connected backend?  It
strikes me as a bad thing to do, because you are possibly causing delays
for client-connected backends.  I suggest that we should give this task
to the WAL writer process -- say, have XLogBackgroundFlush do it.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"All rings of power are equal,
But some rings of power are more equal than others."
                                 (George Orwell's The Lord of the Rings)



Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-17, alvherre@alvh.no-ip.org wrote:

> However, why do it in a WAL-producing client-connected backend?  It
> strikes me as a bad thing to do, because you are possibly causing delays
> for client-connected backends.  I suggest that we should give this task
> to the WAL writer process -- say, have XLogBackgroundFlush do it.

Reading the comments in walwriter.c, I am hesitant to have walwriter
do it:

>  * Because the walwriter's cycle is directly linked to the maximum delay
>  * before async-commit transactions are guaranteed committed, it's probably
>  * unwise to load additional functionality onto it.  For instance, if you've
>  * got a yen to create xlog segments further in advance, that'd be better done
>  * in bgwriter than in walwriter.

So that comment suggests that we should give the responsibility to bgwriter.
This seems good enough to me.  I suppose if bgwriter has a long run of
buffers to write it could take a little bit of time (a few hundred
milliseconds?) but I think that should be okay.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No renuncies a nada. No te aferres a nada."



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/18/21, 10:06 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> So that comment suggests that we should give the responsibility to bgwriter.
> This seems good enough to me.  I suppose if bgwriter has a long run of
> buffers to write it could take a little bit of time (a few hundred
> milliseconds?) but I think that should be okay.

Do you think bgwriter should be the only caller of
NotifySegmentsReadyForArchive(), or should we still have XLogWrite()
call it?

Nathan


Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-18, Bossart, Nathan wrote:

> On 8/18/21, 10:06 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> > So that comment suggests that we should give the responsibility to bgwriter.
> > This seems good enough to me.  I suppose if bgwriter has a long run of
> > buffers to write it could take a little bit of time (a few hundred
> > milliseconds?) but I think that should be okay.
> 
> Do you think bgwriter should be the only caller of
> NotifySegmentsReadyForArchive(), or should we still have XLogWrite()
> call it?

I think XLogWrite should absolutely be the primary caller.  The one in
bgwriter should be a backstop for the case you describe where the flush
pointer advanced past the registration point in XLogInsertRecord.

I realize this means there's a contradiction with my previous argument,
in that synchronous transaction commit calls XLogWrite at some point, so
we *are* putting the client-connected backend in charge of creating the
notify files.  However, that only happens on transaction commit, where
we already accept responsibility for the WAL flush, not on each
individual XLOG record insert; also, the WAL writer will take care of it
sometimes, for transactions that are long-enough lived.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"How strange it is to find the words "Perl" and "saner" in such close
proximity, with no apparent sense of irony. I doubt that Larry himself
could have managed it."         (ncm, http://lwn.net/Articles/174769/)



Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-18, alvherre@alvh.no-ip.org wrote:

> I realize this means there's a contradiction with my previous argument,
> in that synchronous transaction commit calls XLogWrite at some point, so
> we *are* putting the client-connected backend in charge of creating the
> notify files.  However, that only happens on transaction commit, where
> we already accept responsibility for the WAL flush, not on each
> individual XLOG record insert; also, the WAL writer will take care of it
> sometimes, for transactions that are long-enough lived.

Eh.  I just said WAL writer will sometimes do it, and that's true
because it'll occur in XLogBackgroundFlush.  But upthread I wimped out
of having WAL writer call NotifySegmentsReadyForArchive() and instead
opined to give responsibility to bgwriter.  However, thinking about it
again, maybe it does make sense to have walwriter do it directly too.
This causes no harm to walwriter's time constraints, since *it will have
to do it via XLogBackgroundFlush anyway*.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/18/21, 10:48 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-18, alvherre@alvh.no-ip.org wrote:
>
>> I realize this means there's a contradiction with my previous argument,
>> in that synchronous transaction commit calls XLogWrite at some point, so
>> we *are* putting the client-connected backend in charge of creating the
>> notify files.  However, that only happens on transaction commit, where
>> we already accept responsibility for the WAL flush, not on each
>> individual XLOG record insert; also, the WAL writer will take care of it
>> sometimes, for transactions that are long-enough lived.
>
> Eh.  I just said WAL writer will sometimes do it, and that's true
> because it'll occur in XLogBackgroundFlush.  But upthread I wimped out
> of having WAL writer call NotifySegmentsReadyForArchive() and instead
> opined to give responsibility to bgwriter.  However, thinking about it
> again, maybe it does make sense to have walwriter do it too directly.
> This causes no harm to walwriter's time constraints, since *it will have
> to do it via XLogBackgroundFlush anyway*.

I'll add it after XLogBackgroundFlush().  I think we'll also want to
set the WAL writer's latch in case it is hibernating.

Another approach could be to keep the NotifySegmentsReadyForArchive()
call in XLogInsertRecord(), but only call it if the flush pointer is
beyond the boundary we just registered.  Or we could only set the
latch in XLogInsertRecord() if we detect that the flush pointer has
advanced.

Nathan


Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-18, Bossart, Nathan wrote:

> I'll add it after XLogBackgroundFlush().

I was wondering which would be better: before or after.
XLogBackgroundFlush would do it anyway, so if you do it after then it's
not clear to me that it'd do anything (I mean we should not do any new
calls of NotifySegmentsReadyForArchive and just rely on the one in
XLogBackgroundFlush -> XLogWrite).

The advantage of doing NotifySegmentsReadyForArchive before
XLogBackgroundFlush is that the files would be created sooner, so the
archiver can be working in parallel while walwriter does its other
thing; then we'd reach the NotifySegmentsReadyForArchive in
XLogBackgroundFlush and it'd find nothing to do most of the time, which
is just fine.

> I think we'll also want to set the WAL writer's latch in case it is
> hibernating.

Yeah.  (That's another advantage of doing it in walwriter rather than
bgwriter: we don't publish bgwriter's latch anywhere AFAICS).

> Another approach could be to keep the NotifySegmentsReadyForArchive()
> call in XLogInsertRecord(), but only call it if the flush pointer is
> beyond the boundary we just registered.  Or we could only set the
> latch in XLogInsertRecord() if we detect that the flush pointer has
> advanced.

Hmm.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/18/21, 11:31 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> I was wondering which would be better: before or after.
> XLogBackgroundFlush would do it anyway, so if you do it after then it's
> not clear to me that it'd do anything (I mean we should not do any new
> calls of NotifySegmentsReadyForArchive and just rely on the one in
> XLogBackgroundFlush -> XLogWrite).
>
> The advantage of doing NotifySegmentsReadyForArchive before
> XLogBackgroundFlush is that the files would be created sooner, so the
> archiver can be working in parallel while walwriter does its other
> thing; then we'd reach the NotifySegmentsReadyForArchive in
> XLogBackgroundFlush and it'd find nothing to do most of the time, which
> is just fine.

As long as XLogBackgroundFlush() found work to do, it would take care
of notifying, but I don't think we can depend on that.  However, since
we're primarily using the WAL writer to take care of the case when the
record has already been flushed, notifying beforehand seems fine to
me.  If XLogBackgroundFlush() does end up calling XLogWrite(), it'll
call it again, anyway.

In the attached patch, I modified XLogInsertRecord() to simply set the
latch if we detect that flushRecPtr has advanced.
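
The shape of that change, roughly (a sketch: StartSeg/EndSeg are the
patch's cross-segment test, the latch pointer is the one
XLogSetAsyncXactLSN already uses to wake the WAL writer, and the
follow-up messages refine which flush value and comparison to use):

    /* In XLogInsertRecord(), after registering a cross-segment record. */
    if (StartSeg != EndSeg && XLogArchivingActive())
    {
        RegisterSegmentBoundaryEntry(EndSeg, EndPos);

        /*
         * If the record is already flushed, no later flush will notice
         * the boundary we just registered; nudge the WAL writer so it
         * runs NotifySegmentsReadyForArchive() for us.
         */
        if (GetFlushRecPtr() >= EndPos && ProcGlobal->walwriterLatch)
            SetLatch(ProcGlobal->walwriterLatch);
    }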

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-18, Bossart, Nathan wrote:

> As long as XLogBackgroundFlush() found work to do, it would take care
> of notifying, but I don't think we can depend on that.  However, since
> we're primarily using the WAL writer to take care of the case when the
> record has already been flushed, notifying beforehand seems fine to
> me.  If XLogBackgroundFlush() does end up calling XLogWrite(), it'll
> call it again, anyway.

Agreed.

> In the attached patch, I modified XLogInsertRecord() to simply set the
> latch if we detect that flushRecPtr has advanced.

Right, that's what I was thinking.  I modified that slightly to use
LogwrtResult.Flush (which should be fresh enough) instead of calling
GetFlushRecPtr again, which saves a bit.  I also changed it to > instead
of >=, because if I understand you correctly we only care to notify if
the flush pointer advanced, not in the case it stayed the same.

I made a few other cosmetic tweaks -- added comment to
SegmentBoundaryEntry and renamed the 'pos' to 'endpos'; renamed argument
'notify' of XLogArchiveNotify to 'nudge' (because having two different
"notify" concepts in that function seemed confusing).

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there.  Did
somebody make percentages logarithmic while I wasn't looking?"
                http://smylers.hates-software.com/2005/09/08/1995c749.html

Attachment

Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/18/21, 4:47 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-18, Bossart, Nathan wrote:
>> In the attached patch, I modified XLogInsertRecord() to simply set the
>> latch if we detect that flushRecPtr has advanced.
>
> Right, that's what I was thinking.  I modified that slightly to use
> LogwrtResult.Flush (which should be fresh enough) instead of calling
> GetFlushRecPtr again, which saves a bit.  I also changed it to > instead
> of >=, because if I understand you correctly we only care to notify if
> the flush pointer advanced, not in the case it stayed the same.

My thinking was that we needed to read flushRecPtr after registering
the boundary in case it advanced just before registration.  And I used
>= because if flushRecPtr points to the end of the record, we should
be able to create the .ready file for the segment.

We can avoid acquiring the spinlock an extra time if we move the first
part of the cross-segment logic to before we update the local copy of
LogwrtResult.  I attached a new version of the patch that does this.

The rest looks good to me.

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
In v12 I moved the code around a bit and reworded some comments.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/

Attachment

Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
Two things.

1. We use a hash table in shared memory.  That's great.  The part that's
   not so great is that in both places where we read items from it, we
   have to iterate in some way.  This seems a bit silly.  An array would
   serve us better, if only we could expand it as needed.  However, in
   shared memory we can't do that.  (I think the list of elements we
   need to memoize is arbitrarily long, if enough processes can be writing
   WAL at the same time.)

   Now that I think about this again, maybe it's limited by
   NUM_XLOGINSERT_LOCKS, since there can only be that many records being
   written down at a time ...

2. There is a new LWLock acquisition that may be a new contention point.
   We acquire the lock in these cases:
   - WAL record insert, when a segment boundary is crossed (rare
     enough).
   - XLogWrite, when a segment needs to be notified.

   Looking again, I think the last point might be a problem actually,
   because XLogWrite is called with WALWriteLock held.  Maybe we should
   take the NotifySegmentsReadyForArchive() call outside the section
   locked by WALWriteLock (so put it in XLogWrite callers instead of
   XLogWrite itself).

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Cómo ponemos nuestros dedos en la arcilla del otro. Eso es la amistad; jugar
al alfarero y ver qué formas se pueden sacar del otro" (C. Halloway en
La Feria de las Tinieblas, R. Bradbury)



Re: archive status ".ready" files may be created too early

From
Robert Haas
Date:
On Fri, Aug 20, 2021 at 10:50 AM alvherre@alvh.no-ip.org
<alvherre@alvh.no-ip.org> wrote:
> 1. We use a hash table in shared memory.  That's great.  The part that's
>    not so great is that in both places where we read items from it, we
>    have to iterate in some way.  This seems a bit silly.  An array would
>    serve us better, if only we could expand it as needed.  However, in
>    shared memory we can't do that.  (I think the list of elements we
>    need to memoize is arbitrarily long, if enough processes can be writing
>    WAL at the same time.)

We can't expand the hash table either. It has an initial and maximum
size of 16 elements, which means it's basically an expensive array,
and which also means that it imposes a new limit of 16 *
wal_segment_size on the size of WAL records. If you exceed that limit,
I think things just go boom... which I think is not acceptable. I
think we can have records in the multi-GB range if wal_level=logical
and someone chooses a stupid replica identity setting.

It's actually not clear to me why we need to track multiple entries
anyway. The scenario postulated by Horiguchi-san in
https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota.ntt@gmail.com
seems to require that the write position be multiple segments ahead of
the flush position, but that seems impossible with the present code,
because XLogWrite() calls issue_xlog_fsync() at once if the segment is
filled. So I think, at least with the present code, any record that
isn't completely flushed to disk has to be at least partially in the
current segment. And there can be only one record that starts in some
earlier segment and ends in this one.

I will be the first to admit that the forced end-of-segment syncs
suck. They often stall every backend in the entire system at the same
time. Everyone fills up the xlog segment really fast and then stalls
HARD while waiting for that sync to happen. So it's arguably better
not to do more things that depend on that being how it works, but I
think needing a variable-size amount of shared memory is even worse.
If we're going to track multiple entries here we need some rule that
bounds how many of them we can need to track. If the number of entries
is defined by the number of segment boundaries that a particular
record crosses, it's effectively unbounded, because right now WAL
records can be pretty much arbitrarily big.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/20/21, 8:29 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Fri, Aug 20, 2021 at 10:50 AM alvherre@alvh.no-ip.org
> <alvherre@alvh.no-ip.org> wrote:
>> 1. We use a hash table in shared memory.  That's great.  The part that's
>>    not so great is that in both places where we read items from it, we
>>    have to iterate in some way.  This seems a bit silly.  An array would
>>    serve us better, if only we could expand it as needed.  However, in
>>    shared memory we can't do that.  (I think the list of elements we
>>    need to memoize is arbitrarily long, if enough processes can be writing
>>    WAL at the same time.)
>
> We can't expand the hash table either. It has an initial and maximum
> size of 16 elements, which means it's basically an expensive array,
> and which also means that it imposes a new limit of 16 *
> wal_segment_size on the size of WAL records. If you exceed that limit,
> I think things just go boom... which I think is not acceptable. I
> think we can have records in the multi-GB range of wal_level=logical
> and someone chooses a stupid replica identity setting.

If a record spans multiple segments, we only register one segment
boundary.  For example, if I insert a record that starts at segment
number 1 and stops at 10, I'll insert one segment boundary for segment
10.  We'll only create .ready files for segments 1 through 9 once this
record is completely flushed to disk.
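
In code form, the registration is keyed by the segment containing the
end of the record, something like this sketch (RegisterSegmentBoundary
being the patch's registration function):

    XLByteToSeg(StartPos, StartSeg, wal_segment_size);
    XLByteToSeg(EndPos, EndSeg, wal_segment_size);

    if (StartSeg != EndSeg && XLogArchivingActive())
        RegisterSegmentBoundary(EndSeg, EndPos);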

I was under the impression that shared hash tables could be expanded
as necessary, but from your note and the following comment, that does
not seem to be true:

* Note: for a shared-memory hashtable, nelem needs to be a pretty good
* estimate, since we can't expand the table on the fly.  But an unshared
* hashtable can be expanded on-the-fly, so it's better for nelem to be
* on the small side and let the table grow if it's exceeded.  An overly
* large nelem will penalize hash_seq_search speed without buying much.
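
For reference, a fixed-size shared table of this sort would be set up
along these lines (a sketch; the names follow the thread, the sizes
follow Robert's note, and the flags are illustrative):

    static HTAB *SegmentBoundaryTable;

    HASHCTL        info;

    memset(&info, 0, sizeof(info));
    info.keysize = sizeof(XLogSegNo);
    info.entrysize = sizeof(SegmentBoundaryEntry);

    SegmentBoundaryTable = ShmemInitHash("Segment Boundary Table",
                                         16, 16,    /* init == max */
                                         &info,
                                         HASH_ELEM | HASH_BLOBS);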

> It's actually not clear to me why we need to track multiple entries
> anyway. The scenario postulated by Horiguchi-san in
> https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota.ntt@gmail.com
> seems to require that the write position be multiple segments ahead of
> the flush position, but that seems impossible with the present code,
> because XLogWrite() calls issue_xlog_fsync() at once if the segment is
> filled. So I think, at least with the present code, any record that
> isn't completely flushed to disk has to be at least partially in the
> current segment. And there can be only one record that starts in some
> earlier segment and ends in this one.

We register the boundaries in XLogInsertRecord(), which AFAICT just bumps
the global write request pointer ahead, so I'm not sure we can make
any assumptions about what is written/flushed at that time.  (I see
that we do end up calling XLogFlush() for XLOG_SWITCH records in
XLogInsertRecord(), but I don't see any other cases where we actually
write anything in this function.)  Am I missing something?

> I will be the first to admit that the forced end-of-segment syncs
> suck. They often stall every backend in the entire system at the same
> time. Everyone fills up the xlog segment really fast and then stalls
> HARD while waiting for that sync to happen. So it's arguably better
> not to do more things that depend on that being how it works, but I
> think needing a variable-size amount of shared memory is even worse.
> If we're going to track multiple entries here we need some rule that
> bounds how many of them we can need to track. If the number of entries
> is defined by the number of segment boundaries that a particular
> record crosses, it's effectively unbounded, because right now WAL
> records can be pretty much arbitrarily big.

If there isn't a way to ensure that the number of entries we need to
store is bounded, I'm tempted to propose my original patch [0], which
just moves .ready file creation to the very end of XLogWrite().  It's
probably not a complete solution, but it might be better than what's
there today.

Nathan

[0] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com


Re: archive status ".ready" files may be created too early

From
Robert Haas
Date:
On Fri, Aug 20, 2021 at 12:36 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> If a record spans multiple segments, we only register one segment
> boundary.  For example, if I insert a record that starts at segment
> number 1 and stops at 10, I'll insert one segment boundary for segment
> 10.  We'll only create .ready files for segments 1 through 9 once this
> record is completely flushed to disk.

Oh ... OK. So is there any experimental scenario in which the hash
table ends up with more than 1 entry? And if so, how does that happen?

> > It's actually not clear to me why we need to track multiple entries
> > anyway. The scenario postulated by Horiguchi-san in
> > https://www.postgresql.org/message-id/20201014.090628.839639906081252194.horikyota.ntt@gmail.com
> > seems to require that the write position be multiple segments ahead of
> > the flush position, but that seems impossible with the present code,
> > because XLogWrite() calls issue_xlog_fsync() at once if the segment is
> > filled. So I think, at least with the present code, any record that
> > isn't completely flushed to disk has to be at least partially in the
> > current segment. And there can be only one record that starts in some
> > earlier segment and ends in this one.
>
> We register the boundaries in XLogInsertRecord(), which AFAICT just bumps
> the global write request pointer ahead, so I'm not sure we can make
> any assumptions about what is written/flushed at that time.  (I see
> that we do end up calling XLogFlush() for XLOG_SWITCH records in
> XLogInsertRecord(), but I don't see any other cases where we actually
> write anything in this function.)  Am I missing something?

Well, I'm not sure. But I *think* that the code as it exists today is
smart enough not to try to archive a segment that hasn't been
completely flushed, and the gap is only that even though the segment
might be completely flushed, some portion of the record that is part
of a later segment might not be flushed, and thus after a crash we
might overwrite the already-flushed contents. The patch can make an
implementation choice to do some work at XLogInsertRecord() time if it
likes, but there's no real hazard at that point. The hazard only
exists, or so I think, once a segment that contains part of the record
is fully on disk. But that means, if my previous logic is correct,
that the hazard can only exist for at most 1 record at any point in
time.

> If there isn't a way to ensure that the number of entries we need to
> store is bounded, I'm tempted to propose my original patch [0], which
> just moves .ready file creation to the very end of XLogWrite().  It's
> probably not a complete solution, but it might be better than what's
> there today.

Doesn't that allocate memory inside a critical section? I would have
thought it would cause an immediate assertion failure.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/20/21, 10:08 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Fri, Aug 20, 2021 at 12:36 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> If a record spans multiple segments, we only register one segment
>> boundary.  For example, if I insert a record that starts at segment
>> number 1 and stops at 10, I'll insert one segment boundary for segment
>> 10.  We'll only create .ready files for segments 1 through 9 once this
>> record is completely flushed to disk.
>
> Oh ... OK. So is there any experimental scenario in which the hash
> table ends up with more than 1 entry? And if so, how does that happen?

I was able to do this by turning synchronous_commit off, increasing
wal_buffers substantially, and adding sleeps to XLogWrite().

>> If there isn't a way to ensure that the number of entries we need to
>> store is bounded, I'm tempted to propose my original patch [0], which
>> just moves .ready file creation to the very end of XLogWrite().  It's
>> probably not a complete solution, but it might be better than what's
>> there today.
>
> Doesn't that allocate memory inside a critical section? I would have
> thought it would cause an immediate assertion failure.

I could probably replace the list with two local variables (start and
end segments).

Thinking about this stuff further, I was wondering if one way to
handle the bounded shared hash table problem would be to replace the
latest boundary in the map whenever it was full.  But at that point,
do we even need a hash table?  This led me to revisit the two-element
approach that was discussed upthread.  What if we only stored the
earliest and latest segment boundaries at any given time?  Once the
earliest boundary is added, it never changes until the segment is
flushed and it is removed.  The latest boundary, however, will be
updated any time we register another segment.  Once the earliest
boundary is removed, we replace it with the latest boundary.  This
strategy could cause us to miss intermediate boundaries, but AFAICT
the worst case scenario is that we hold off creating .ready files a
bit longer than necessary.

I'll work on a patch to illustrate what I'm thinking.
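
In the meantime, here is a tiny standalone simulation of the
two-element scheme to make the idea concrete (plain integers stand in
for segment numbers and LSNs, 100 units per segment; this is an
illustration, not the patch):

    #include <stdio.h>
    #include <stdint.h>

    #define UNSET UINT64_MAX

    static uint64_t earliestSeg = UNSET, earliestEnd;
    static uint64_t latestSeg = UNSET, latestEnd;

    static void
    register_boundary(uint64_t seg, uint64_t endpos)
    {
        if (earliestSeg == UNSET)
        {
            earliestSeg = seg;
            earliestEnd = endpos;
        }
        else
        {
            /* overwrite the latest slot; intermediate boundaries are lost */
            latestSeg = seg;
            latestEnd = endpos;
        }
    }

    static void
    notify_up_to(uint64_t flushed)
    {
        if (earliestSeg != UNSET && flushed >= earliestEnd)
        {
            printf("boundary at segment %llu satisfied; earlier segments "
                   "can be marked .ready\n", (unsigned long long) earliestSeg);
            /* promote the latest boundary into the earliest slot */
            earliestSeg = latestSeg;
            earliestEnd = latestEnd;
            latestSeg = UNSET;
        }
        else
            printf("flush to %llu: nothing to notify yet\n",
                   (unsigned long long) flushed);
    }

    int
    main(void)
    {
        register_boundary(2, 210);      /* record crosses into segment 2 */
        register_boundary(3, 310);      /* record crosses into segment 3 */
        register_boundary(4, 410);      /* overwrites the seg-3 boundary */

        notify_up_to(250);      /* seg-2 boundary satisfied */
        notify_up_to(350);      /* seg-3 boundary was lost: held back */
        notify_up_to(450);      /* seg-4 boundary satisfied */
        return 0;
    }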

Nathan


Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-20, Bossart, Nathan wrote:

> On 8/20/21, 8:29 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:

> > We can't expand the hash table either. It has an initial and maximum
> > size of 16 elements, which means it's basically an expensive array,
> > and which also means that it imposes a new limit of 16 *
> > wal_segment_size on the size of WAL records. If you exceed that limit,
> > I think things just go boom... which I think is not acceptable. I
> > think we can have records in the multi-GB range if wal_level=logical
> > and someone chooses a stupid replica identity setting.
> 
> I was under the impression that shared hash tables could be expanded
> as necessary, but from your note and the following comment, that does
> not seem to be true:

Actually, you were right.  Hash tables in shared memory can be expanded.
There are some limitations (the hash "directory" is fixed size, which
means the hash table gets less efficient if it grows too much), but you
can definitely create more hash entries than the initial size.  See for
example element_alloc(), which covers the case of a hash table being
IS_PARTITIONED -- something that only shmem hash tables can be.  Note
that ShmemInitHash passes the HASH_ALLOC flag and uses ShmemAllocNoError
as allocation function, which acquires memory from the shared segment.

This is a minor thing -- it doesn't affect the fact that the hash table
is possibly being misused and inefficient -- but I thought it was worth
pointing out.


As an example, consider the LOCK / PROCLOCK hash tables.  These can
contain more elements than max_backends * max_locks_per_transaction.
Those elements consume shared memory from the "allocation slop" in the
shared memory segment.  It's tough when it happens (as far as I know the
memory is never "returned" once such a hash table grows to use that
space), but it does work.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
Robert Haas
Date:
On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan <bossartn@amazon.com> wrote:
> Thinking about this stuff further, I was wondering if one way to
> handle the bounded shared hash table problem would be to replace the
> latest boundary in the map whenever it was full.  But at that point,
> do we even need a hash table?  This led me to revisit the two-element
> approach that was discussed upthread.  What if we only stored the
> earliest and latest segment boundaries at any given time?  Once the
> earliest boundary is added, it never changes until the segment is
> flushed and it is removed.  The latest boundary, however, will be
> updated any time we register another segment.  Once the earliest
> boundary is removed, we replace it with the latest boundary.  This
> strategy could cause us to miss intermediate boundaries, but AFAICT
> the worst case scenario is that we hold off creating .ready files a
> bit longer than necessary.

I think this is a promising approach. We could also have a small
fixed-size array, so that we only have to risk losing track of
anything when we overflow the array. But I guess I'm still unconvinced
that there's a real possibility of genuinely needing multiple
elements. Suppose we are thinking of adding a second element to the
array (or the hash table). I feel like it's got to be safe to just
remove the first one. If not, then apparently the WAL record that
caused us to make the first entry isn't totally flushed yet - which I
still think is impossible.
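
A sketch of that fixed-size-array idea (sizes and names illustrative):

    #define MAX_SEG_BOUNDARIES 8

    static XLogSegNo    boundarySeg[MAX_SEG_BOUNDARIES];
    static XLogRecPtr   boundaryEnd[MAX_SEG_BOUNDARIES];
    static int          nBoundaries = 0;

    static void
    register_boundary(XLogSegNo seg, XLogRecPtr endpos)
    {
        /*
         * Append while there's room; on overflow, overwrite the last
         * slot.  Losing an intermediate boundary only delays .ready
         * creation, it never wrongly advances it.
         */
        if (nBoundaries < MAX_SEG_BOUNDARIES)
            nBoundaries++;
        boundarySeg[nBoundaries - 1] = seg;
        boundaryEnd[nBoundaries - 1] = endpos;
    }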

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: archive status ".ready" files may be created too early

From
Robert Haas
Date:
On Fri, Aug 20, 2021 at 1:52 PM alvherre@alvh.no-ip.org
<alvherre@alvh.no-ip.org> wrote:
> Actually, you were right.  Hash tables in shared memory can be expanded.
> There are some limitations (the hash "directory" is fixed size, which
> means the hash table gets less efficient if it grows too much), but you
> can definitely create more hash entries than the initial size.  See for
> example element_alloc(), which covers the case of a hash table being
> IS_PARTITIONED -- something that only shmem hash tables can be.  Note
> that ShmemInitHash passes the HASH_ALLOC flag and uses ShmemAllocNoError
> as allocation function, which acquires memory from the shared segment.

I realize that the code supports this ... but I thought we had
established a policy that only the main lock manager's shared hash
tables, and not any others, are actually allowed to make use of this
functionality.  See commit 7c797e7194d969f974abf579cacf30ffdccdbb95.

It seems like a dangerous thing to rely on in any case, since we can't
predict how much extra shared memory might actually be available.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/20/21, 11:20 AM, "Robert Haas" <robertmhaas@gmail.com> wrote:
> On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan <bossartn@amazon.com> wrote:
>> Thinking about this stuff further, I was wondering if one way to
>> handle the bounded shared hash table problem would be to replace the
>> latest boundary in the map whenever it was full.  But at that point,
>> do we even need a hash table?  This led me to revisit the two-element
>> approach that was discussed upthread.  What if we only stored the
>> earliest and latest segment boundaries at any given time?  Once the
>> earliest boundary is added, it never changes until the segment is
>> flushed and it is removed.  The latest boundary, however, will be
>> updated any time we register another segment.  Once the earliest
>> boundary is removed, we replace it with the latest boundary.  This
>> strategy could cause us to miss intermediate boundaries, but AFAICT
>> the worst case scenario is that we hold off creating .ready files a
>> bit longer than necessary.
>
> I think this is a promising approach. We could also have a small
> fixed-size array, so that we only have to risk losing track of
> anything when we overflow the array. But I guess I'm still unconvinced
> that there's a real possibility of genuinely needing multiple
> elements. Suppose we are thinking of adding a second element to the
> array (or the hash table). I feel like it's got to be safe to just
> remove the first one. If not, then apparently the WAL record that
> caused us to make the first entry isn't totally flushed yet - which I
> still think is impossible.

I've attached a patch to demonstrate what I'm thinking.

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-20, Bossart, Nathan wrote:

> > On Fri, Aug 20, 2021 at 1:29 PM Bossart, Nathan <bossartn@amazon.com> wrote:

> >> This led me to revisit the two-element
> >> approach that was discussed upthread.  What if we only stored the
> >> earliest and latest segment boundaries at any given time?  Once the
> >> earliest boundary is added, it never changes until the segment is
> >> flushed and it is removed.  The latest boundary, however, will be
> >> updated any time we register another segment.  Once the earliest
> >> boundary is removed, we replace it with the latest boundary.  This
> >> strategy could cause us to miss intermediate boundaries, but AFAICT
> >> the worst case scenario is that we hold off creating .ready files a
> >> bit longer than necessary.

> I've attached a patch to demonstrate what I'm thinking.

There is only one thing I didn't like in this new version, which is that
we're holding info_lck too much.  I've seen info_lck contention be a
problem in some workloads and I'd rather not add more stuff to it.  I'd
rather we stick with using a new lock object to protect all the data we
need for this job.

Should this new lock object be a spinlock or an lwlock?  I think a
spinlock would generally be better because it's lower overhead, and we
wouldn't use the lock in shared mode anywhere, which would be the main
argument for an lwlock.  However, we generally avoid running anything
that isn't straight-line code while holding a spinlock, and we have
some function calls there.
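
The pattern would be to keep only straight-line code under the
spinlock, sketched like this (the lock and field names are assumed):

    XLogSegNo    earliest,
                 latest;

    SpinLockAcquire(&XLogCtl->segBoundaryLock);
    earliest = XLogCtl->earliestSegBoundary;
    latest = XLogCtl->latestSegBoundary;
    SpinLockRelease(&XLogCtl->segBoundaryLock);

    /* any function calls (e.g. the actual notification) happen here,
     * after the spinlock is released */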

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
Attached is v14 which uses a separate spinlock.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No me acuerdo, pero no es cierto.  No es cierto, y si fuera cierto,
 no me acuerdo."                 (Augusto Pinochet a una corte de justicia)

Attachment

Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/20/21, 4:00 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> Attached is v14 which uses a separate spinlock.

This looks good to me.

I was looking at moving the function calls out of the spinlock region.
I don't think the functions are doing anything too expensive, and they
help clean up NotifySegmentsReadyForArchive() quite a bit, but I
understand why it might be against project policy to do something like
that.  It would be easy enough to get rid of the helper functions if
that was a concern.

Nathan


Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-20, Bossart, Nathan wrote:

> I was looking at moving the function calls out of the spinlock region.
> I don't think the functions are doing anything too expensive, and they
> help clean up NotifySegmentsReadyForArchive() quite a bit, but I
> understand why it might be against project policy to do something like
> that.  It would be easy enough to get rid of the helper functions if
> that was a concern.

Well, the thing I realized is that these three helper functions have
exactly one caller each.  I think the compiler is going to inline them,
so there isn't going to be a function call in the assembly.  I haven't
verified this, though.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Los dioses no protegen a los insensatos.  Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/20/21, 4:52 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-20, Bossart, Nathan wrote:
>
>> I was looking at moving the function calls out of the spinlock region.
>> I don't think the functions are doing anything too expensive, and they
>> help clean up NotifySegmentsReadyForArchive() quite a bit, but I
>> understand why it might be against project policy to do something like
>> that.  It would be easy enough to get rid of the helper functions if
>> that was a concern.
>
> Well, the thing I realized is that these three helper functions have
> exactly one caller each.  I think the compiler is going to inline them,
> so there isn't going to be a function call in the assembly.  I haven't
> verified this, though.

Good point.  It looks like they're getting inlined for me.

Nathan


Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-21, Bossart, Nathan wrote:

> > Well, the thing I realized is that these three helper functions have
> > exactly one caller each.  I think the compiler is going to inline them,
> > so there isn't going to be a function call in the assembly.  I haven't
> > verified this, though.
> 
> Good point.  It looks like they're getting inlined for me.

I still didn't like it, because it looks like we're creating an API for
which there can be only one caller.  So I expanded the functions in the
caller.  It doesn't look too bad.  However ...

... while reading the resulting code after backpatching to all branches,
I realized that if there are no registrations whatsoever, then archiving
won't do anything, which surely is the wrong thing to do.  The correct
behavior should be "if there are no registrations, then *all* flushed
segments can be notified".

I'll fix that ...

Another thing I didn't like is that you used a name ending in RecPtr for
the LSN, which gives no indication that it really is the *end* LSN, not
the start pointer.  And it won't play nice with the need to add the
*start* LSN which we'll need to implement solving the equivalent problem
for streaming replication.  I'll rename those to
earliestSegBoundaryEndPtr and latestSegBoundaryEndPtr.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.

Attachment

Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/23/21, 8:50 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> ... while reading the resulting code after backpatching to all branches,
> I realized that if there are no registrations whatsoever, then archiving
> won't do anything, which surely is the wrong thing to do.  The correct
> behavior should be "if there are no registrations, then *all* flushed
> segments can be notified".

Hm.  My expectation would be that if there are no registrations, we
cannot create .ready files for the flushed segments.  The scenario
where I can see that happening is when a record gets flushed to disk
prior to registration.  In that case, we'll still eventually register
the record and wake up the WAL writer process, which will take care of
creating the .ready files that were missed earlier.  Is there another
case you are thinking of where we could miss registration for a cross-
segment record altogether?

Nathan


Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-23, Bossart, Nathan wrote:

> Hm.  My expectation would be that if there are no registrations, we
> cannot create .ready files for the flushed segments.  The scenario
> where I can see that happening is when a record gets flushed to disk
> prior to registration.  In that case, we'll still eventually register
> the record and wake up the WAL writer process, which will take care of
> creating the .ready files that were missed earlier.  Is there another
> case you are thinking of where we could miss registration for a cross-
> segment record altogether?

I'm thinking of the case where no record cross segment boundaries ever.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/23/21, 9:33 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
>
>> Hm.  My expectation would be that if there are no registrations, we
>> cannot create .ready files for the flushed segments.  The scenario
>> where I can see that happening is when a record gets flushed to disk
>> prior to registration.  In that case, we'll still eventually register
>> the record and wake up the WAL writer process, which will take care of
>> creating the .ready files that were missed earlier.  Is there another
>> case you are thinking of where we could miss registration for a cross-
>> segment record altogether?
>
> I'm thinking of the case where no record cross segment boundaries ever.

Sorry, I'm still not following this one.  If we skipped creating
.ready files due to a crash, we rely on RemoveOldXlogFiles() to
create them as needed in the end-of-recovery checkpoint.  If a record
fits perfectly in the end of a segment, we'll still register it as a
boundary for the next segment (hence why we use XLByteToSeg() instead
of XLByteToPrevSeg()).  If database activity stops completely, there
shouldn't be anything to mark ready.
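
Concretely, for an end pointer sitting exactly on a segment start the
two macros disagree (a sketch using the standard macro definitions):

    XLogRecPtr    endpos = 3 * (XLogRecPtr) wal_segment_size;
    XLogSegNo     seg;

    XLByteToSeg(endpos, seg, wal_segment_size);         /* seg == 3 */
    XLByteToPrevSeg(endpos, seg, wal_segment_size);     /* seg == 2 */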

Nathan


Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-23, Bossart, Nathan wrote:

> Sorry, I'm still not following this one.  If we skipped creating
> .ready files due to a crash, we rely on RemoveOldXlogFiles() to
> create them as needed in the end-of-recovery checkpoint.  If a record
> fits perfectly in the end of a segment, we'll still register it as a
> boundary for the next segment (hence why we use XLByteToSeg() instead
> of XLByteToPrevSeg()).  If database activity stops completely, there
> shouldn't be anything to mark ready.

The only way .ready files are created is that XLogNotifyWrite() is
called.  For regular WAL files during regular operation, that only
happens in XLogNotifyWriteSeg().  That, in turn, only happens in
NotifySegmentsReadyForArchive().  But if the system runs and never
writes WAL records that cross segment boundaries, that function will see
that both earliestSegBoundary and latestSegBoundary are MaxXLogSegNo,
and return without doing anything.  So no segments will be notified.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-23, alvherre@alvh.no-ip.org wrote:

> The only way .ready files are created is that XLogNotifyWrite() is
> called.  For regular WAL files during regular operation, that only
> happens in XLogNotifyWriteSeg().  That, in turn, only happens in
> NotifySegmentsReadyForArchive().  But if the system runs and never
> writes WAL records that cross segment boundaries, that function will see
> that both earliestSegBoundary and latestSegBoundary are MaxXLogSegNo,
> and return without doing anything.  So no segments will be notified.

Nevermind -- I realized that all segments get registered, not just those
for which we generate continuation records.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/23/21, 10:31 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-23, alvherre@alvh.no-ip.org wrote:
>
>> The only way .ready files are created is that XLogNotifyWrite() is
>> called.  For regular WAL files during regular operation, that only
>> happens in XLogNotifyWriteSeg().  That, in turn, only happens in
>> NotifySegmentsReadyForArchive().  But if the system runs and never
>> writes WAL records that cross segment boundaries, that function will see
>> that both earliestSegBoundary and latestSegBoundary are MaxXLogSegNo,
>> and return without doing anything.  So no segments will be notified.
>
> Nevermind -- I realized that all segments get registered, not just those
> for which we generate continuation records.

Ah, okay.  BTW the other changes you mentioned made sense to me.

Nathan


Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-23, Bossart, Nathan wrote:

> Ah, okay.  BTW the other changes you mentioned made sense to me.

Thanks.  I've pushed this now to all live branches.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
<Schwern> It does it in a really, really complicated way
<crab> why does it need to be complicated?
<Schwern> Because it's MakeMaker.



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/23/21, 12:55 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> Thanks.  I've pushed this now to all live branches.

Thank you!  I appreciate the thorough reviews.  Should we make a new
thread for the streaming replication fix?

Nathan


Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-23, Bossart, Nathan wrote:

> On 8/23/21, 12:55 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> > Thanks.  I've pushed this now to all live branches.
> 
> Thank you!  I appreciate the thorough reviews.  Should we make a new
> thread for the streaming replication fix?

Yeah, this one is long enough :-)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
Fujii Masao
Date:

On 2021/08/24 4:55, alvherre@alvh.no-ip.org wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
> 
>> Ah, okay.  BTW the other changes you mentioned made sense to me.
> 
> Thanks.  I've pushed this now to all live branches.

Thanks a lot!

+        /*
+         * There's a chance that the record was already flushed to disk and we
+         * missed marking segments as ready for archive.  If this happens, we
+         * nudge the WALWriter, which will take care of notifying segments as
+         * needed.
+         */
+        if (StartSeg != EndSeg && XLogArchivingActive() &&
+            LogwrtResult.Flush >= EndPos && ProcGlobal->walwriterLatch)
+            SetLatch(ProcGlobal->walwriterLatch);

Is this really necessary?

If LogwrtResult.Flush >= EndPos, that means another process has already
flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush.
This situation also means that the other process called
NotifySegmentsReadyForArchive(LogwrtResult.Flush).  Right?

If this understanding is right, there seems no need to wake walwriter up here
so that it can call NotifySegmentsReadyForArchive(LogwrtResult.Flush) again.
Thought?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/25/21, 11:01 AM, "Fujii Masao" <masao.fujii@oss.nttdata.com> wrote:
> If LogwrtResult.Flush >= EndPos, that means another process has already
> flushed the record concurrently and updated XLogCtl->LogwrtResult.Flush.
> This situation also means that the other process called
> NotifySegmentsReadyForArchive(LogwrtResult.Flush).  Right?

If the segment boundary wasn't registered before the other process
called NotifySegmentsReadyForArchive(), then it couldn't have used the
boundary for deciding which .ready files to create.

> If this understanding is right, there seems no need to wake walwriter up here
> so that it can call NotifySegmentsReadyForArchive(LogwrtResult.Flush) again.
> Thought?

We're actually discussing this right now in another thread [0].  I
think we might be able to get rid of that part if we move the boundary
registration to before we release the WAL insert lock(s).

Nathan

[0] https://postgr.es/m/DE60B9AA-9670-47DA-9678-6C79BCD884E3%40amazon.com


Re: archive status ".ready" files may be created too early

From
Andres Freund
Date:
Hi,

On 2021-08-23 15:55:03 -0400, alvherre@alvh.no-ip.org wrote:
> On 2021-Aug-23, Bossart, Nathan wrote:
> 
> > Ah, okay.  BTW the other changes you mentioned made sense to me.
> 
> Thanks.  I've pushed this now to all live branches.

While rebasing the aio patchset on top of HEAD I noticed that this commit added
another atomic operation to XLogWrite() with archiving enabled. The WAL write
path is really quite hot, and at least some of the
NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.

I think we should at least try to make the fast-path where no segment
boundaries were crossed use no atomic operations.

Greetings,

Andres Freund



Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-28, Andres Freund wrote:

> While rebasing the aio patchset on top of HEAD I noticed that this commit added
> another atomic operation to XLogWrite() with archiving enabled. The WAL write
> path is really quite hot, and at least some of the
> NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
> 
> I think we should at least try to make the fast-path where no segment
> boundaries were crossed use no atomic operations.

I think the best way to achieve this is to rely completely on
walwriter doing the segment notification, so that the WAL write done by
a backend would only do a latch set.
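
That is, the backend side would reduce to something like (a sketch):

    if (XLogArchivingActive() && ProcGlobal->walwriterLatch)
        SetLatch(ProcGlobal->walwriterLatch);   /* walwriter does the
                                                 * notification work */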

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/30/21, 12:52 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-28, Andres Freund wrote:
>
>> While rebasing the aio patchset on top of HEAD I noticed that this commit added
>> another atomic operation to XLogWrite() with archiving enabled. The WAL write
>> path is really quite hot, and at least some of the
>> NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
>>
>> I think we should at least try to make the fast-path where no segment
>> boundaries were crossed use no atomic operations.
>
> I think the best way to achieve this is to rely completely on
> walwriter doing the segment notification, so that the WAL write done by
> a backend would only do a latch set.

+1.  If we do that, we may also want to move
NotifySegmentsReadyForArchive() to after the call to
XLogBackgroundFlush() in the WAL writer.

Nathan


Re: archive status ".ready" files may be created too early

From
Andres Freund
Date:
Hi,

On 2021-08-30 15:51:54 -0400, alvherre@alvh.no-ip.org wrote:
> On 2021-Aug-28, Andres Freund wrote:
>
> > While rebasing the aio patchset on top of HEAD I noticed that this commit added
> > another atomic operation to XLogWrite() with archiving enabled. The WAL write
> > path is really quite hot, and at least some of the
> > NotifySegmentsReadyForArchive() calls are done while WALWriteLock is held.
> >
> > I think we should at least try to make the fast-path where no segment
> > boundaries were crossed use no atomic operations.
>
> I think the best way to achieve this is to rely completely on
> walwriter doing the segment notification, so that the WAL write done by
> a backend would only do a latch set.

When were you thinking of doing the latch sets? Adding a latch set for every
XLogWrite() wouldn't be cheap either. Both because syscalls under a lock
aren't free and because waking up the walwriter even more often isn't free (we
already have a few threads about reducing the signalling frequency).

There's also the question of what to do with single user mode. We shouldn't
just skip creating .ready files there...


Although, the more I think about, the more I am confused about the trailing
    if (XLogArchivingActive())
        NotifySegmentsReadyForArchive(LogwrtResult.Flush);

in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
flush, do so" branch? Outside that and the finishing_seg branch
LogwrtResult.Flush won't have moved, right? So the call to
NotifySegmentsReadyForArchive() can't do anything, no?

Nor does it seem like we'd ever need to call NotifySegmentsReadyForArchive()
if we started writing on the current page - flushRecPtr can't move across a
segment boundary in that case.


I hadn't yet realized that this commit doesn't just make XLogWrite() more
expensive, it also makes XLogInsertRecord() more expensive :(. Adding two
divisions to XLogInsertRecord() isn't nice, especially as it happens
even if !XLogArchivingActive().


I can't really convince myself this deals correctly with multiple
segment-spanning records, or with records spanning more than one
segment.  It'd be easier to understand if the new XLogCtlData variables
were documented...

If there's one record from segment s0 to s1 and one from s1 to s4, and
wal_buffers is big enough to contain them all, the first record will set
earliestSegBoundary = s1
the second
latestSegBoundary = s4.

When s1 is fully written out, NotifySegmentsReadyForArchive() will set
earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
ok. But when s2 is flushed, we'll afaict happily create .ready files
for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
now s4.

I think there's other issues as well.


The more I look at this commit, the less I believe it's right.


The whole approach here of delaying .ready creation for these types of
segments seems wrong to me. Doesn't the exact same problem also exist for
streaming rep - which one can also use to maintain a PITR archive? walsender
sends up to the flush location, and pg_receivewal's FindStreamingStart() will
afaict just continue receiving from after that point.


Greetings,

Andres Freund



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/30/21, 2:06 PM, "Andres Freund" <andres@anarazel.de> wrote:
> When were you thinking of doing the latch sets? Adding a latch set for every
> XLogWrite() wouldn't be cheap either. Both because syscalls under a lock
> aren't free and because waking up the walwriter even more often isn't free (we
> already have a few threads about reducing the signalling frequency).
>
> There's also the question of what to do with single user mode. We shouldn't
> just skip creating .ready files there...

Good point.

> Although, the more I think about, the more I am confused about the trailing
>         if (XLogArchivingActive())
>                 NotifySegmentsReadyForArchive(LogwrtResult.Flush);
>
> in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
> flush, do so" branch? Outside that and the finishing_seg branch
> LogwrtResult.Flush won't have moved, right? So the call to
> NotifySegmentsReadyForArchive() can't do anything, no?

The registration logic looks like this:
        1. Register boundary
        2. Get flush location from shared memory
        3. If flush location >= our just-registered boundary, nudge
           the WAL writer to create .ready files if needed

If we called NotifySegmentsReadyForArchive() before we updated the
flush location in shared memory, we might skip nudging the WAL writer
even though we should.
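
As code, the ordering looks roughly like this (a sketch using the names
from the committed hunk):

    RegisterSegmentBoundary(EndSeg, EndPos);        /* 1. register */

    if (StartSeg != EndSeg && XLogArchivingActive() &&
        LogwrtResult.Flush >= EndPos &&             /* 2. check flush */
        ProcGlobal->walwriterLatch)
        SetLatch(ProcGlobal->walwriterLatch);       /* 3. nudge */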

> Nor does it seem like we'd ever need to call NotifySegmentsReadyForArchive()
> if we started writing on the current page - flushRecPtr can't move across a
> segment boundary in that case.

I think there is a chance that we've crossed one of our recorded
segment boundaries anytime the flush pointer moves.

> When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> ok. But when s2 is flushed, we'll afaict happily create .ready files
> for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
> now s4.

In this case, the .ready files for s2 and s3 wouldn't be created until
s4 is flushed to disk.

> The whole approach here of delaying .ready creation for these types of
> segments seems wrong to me. Doesn't the exact same problem also exist for
> streaming rep - which one can also use to maintain a PITR archive? walsender
> sends up to the flush location, and pg_receivewal's FindStreamingStart() will
> afaict just continue receiving from after that point.

The problem with streaming replication is being discussed in a new
thread [0].

Nathan

[0] https://postgr.es/m/202108232252.dh7uxf6oxwcy%40alvherre.pgsql


Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/30/21, 3:40 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:
> On 8/30/21, 2:06 PM, "Andres Freund" <andres@anarazel.de> wrote:
>> Although, the more I think about, the more I am confused about the trailing
>>         if (XLogArchivingActive())
>>                 NotifySegmentsReadyForArchive(LogwrtResult.Flush);
>>
>> in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
>> flush, do so" branch? Outside that and the finishing_seg branch
>> LogwrtResult.Flush won't have moved, right? So the call to
>> NotifySegmentsReadyForArchive() can't do anything, no?
>
> The registration logic looks like this:
>         1. Register boundary
>         2. Get flush location from shared memory
>         3. If flush location >= our just-registered boundary, nudge
>            the WAL writer to create .ready files if needed
>
> If we called NotifySegmentsReadyForArchive() before we updated the
> flush location in shared memory, we might skip nudging the WAL writer
> even though we should.

In the other thread [0], we're considering moving boundary
registration to before WALInsertLockRelease().  If I'm right that this
removes the race condition in question, we should be able to move the
call to NotifySegmentsReadyForArchive() at the end of XLogWrite() to
the if-asked-to-flush branch.

Nathan

[0] https://postgr.es/m/DE60B9AA-9670-47DA-9678-6C79BCD884E3%40amazon.com


Re: archive status ".ready" files may be created too early

From
Andres Freund
Date:
Hi,

On 2021-08-30 22:39:04 +0000, Bossart, Nathan wrote:
> On 8/30/21, 2:06 PM, "Andres Freund" <andres@anarazel.de> wrote:
> > When were you thinking of doing the latch sets? Adding a latch set for every
> > XLogWrite() wouldn't be cheap either. Both because syscalls under a lock
> > aren't free and because waking up the walwriter even more often isn't free (we
> > already have a few threads about reducing the signalling frequency).
> >
> > There's also the question of what to do with single user mode. We shouldn't
> > just skip creating .ready files there...
> 
> Good point.
> 
> > Although, the more I think about, the more I am confused about the trailing
> >         if (XLogArchivingActive())
> >                 NotifySegmentsReadyForArchive(LogwrtResult.Flush);
> >
> > in XLogWrite(). Shouldn't that at the very least be inside the "If asked to
> > flush, do so" branch? Outside that and the finishing_seg branch
> > LogwrtResult.Flush won't have moved, right? So the call to
> > NotifySegmentsReadyForArchive() can't do anything, no?
> 
> The registration logic looks like this:
>         1. Register boundary
>         2. Get flush location from shared memory
>         3. If flush location >= our just-registered boundary, nudge
>            the WAL writer to create .ready files if needed
> 
> If we called NotifySegmentsReadyForArchive() before we updated the
> flush location in shared memory, we might skip nudging the WAL writer
> even though we should.

That's trivial to address - just have a local variable saying whether we need
to call NotifySegmentsReadyForArchive().

Note that the finishing_seg path currently calls
NotifySegmentsReadyForArchive() before the shared memory flush location is
updated.
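
Something like this, sketched against XLogWrite()'s existing
finishing_seg variable:

    bool        need_notify = false;

    /* in the finishing_seg path, just remember that there is work */
    if (finishing_seg)
        need_notify = true;

    /* ... and notify only after the shared flush location is updated */
    if (need_notify && XLogArchivingActive())
        NotifySegmentsReadyForArchive(LogwrtResult.Flush);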


> > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> > earliestSegBoundary = latestSegBoundary = s4 and create .reaady for .s1 -
> > ok. But when when s2 is flushed, we'll afaict happily create .ready files
> > for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
> > now s4.
> 
> In this case, the .ready files for s2 and s3 wouldn't be created until
> s4 is flushed to disk.

I don't think that's true as the code stands today? The
NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4,
because latestSegBoundary = s4 and earliestSegBoundary = s1, triggering the
keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a
segment < s4 will then be able to flush s2 and s3?


> > The whole approach here of delaying .ready creation for these types of
> > segments seems wrong to me. Doesn't the exact same problem also exist for
> > streaming rep - which one can also use to maintain a PITR archive? walsender
> > sends up to the flush location, and pg_receivewal's FindStreamingStart() will
> > afaict just continue receiving from after that point.
> 
> The problem with streaming replication is being discussed in a new
> thread [0].

I don't think it's sensible to fix these separately. It'd be one thing to do
that for HEAD, but on the back branches? And that this patch hasn't gotten any
performance testing is scary.

Greetings,

Andres Freund



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/30/21, 7:39 PM, "Andres Freund" <andres@anarazel.de> wrote:
> On 2021-08-30 22:39:04 +0000, Bossart, Nathan wrote:
>> If we called NotifySegmentsReadyForArchive() before we updated the
>> flush location in shared memory, we might skip nudging the WAL writer
>> even though we should.
>
> That's trivial to address - just have a local variable saying whether we need
> to call NotifySegmentsReadyForArchive().

I think we can remove the race condition entirely by moving boundary
registration to before WALInsertLockRelease().  I attached a patch for
discussion.

>> > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
>> > earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
>> > ok. But when s2 is flushed, we'll afaict happily create .ready files
>> > for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
>> > now s4.
>>
>> In this case, the .ready files for s2 and s3 wouldn't be created until
>> s4 is flushed to disk.
>
> I don't think that's true as the code stands today? The
> NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4,
> because latestSegBoundary = s4 and earliestSegBoundary = s1, triggering the
> keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a
> segment < s4 will then be able to flush s2 and s3?

When flushRecPtr is less than both of the segment boundaries,
NotifySegmentsReadyForArchive() will return without doing anything.
At least, that was the intent.  If there is some reason it's not
actually working that way, I can work on fixing it.

> I don't think it's sensible to fix these separately. It'd be one thing to do
> that for HEAD, but on the back branches? And that this patch hasn't gotten any
> performance testing is scary.

Are there any specific performance tests you'd like to see?  I don't
mind running a couple.

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
Andres Freund
Date:
Hi

On 2021-08-31 06:45:06 +0000, Bossart, Nathan wrote:
> On 8/30/21, 7:39 PM, "Andres Freund" <andres@anarazel.de> wrote:
> > On 2021-08-30 22:39:04 +0000, Bossart, Nathan wrote:
> >> If we called NotifySegmentsReadyForArchive() before we updated the
> >> flush location in shared memory, we might skip nudging the WAL writer
> >> even though we should.
> >
> > That's trivial to address - just have a local variable saying whether we need
> > to call NotifySegmentsReadyForArchive().
>
> I think we can remove the race condition entirely by moving boundary
> registration to before WALInsertLockRelease().  I attached a patch for
> discussion.

I think it's a bad idea to move more code to before
WALInsertLockRelease. There's a very limited number of xlog insert slots, and
WAL flushes (e.g. commits) need to wait for insertions to finish.


> >> > When s1 is fully written out, NotifySegmentsReadyForArchive() will set
> >> > earliestSegBoundary = latestSegBoundary = s4 and create .ready for s1 -
> >> > ok. But when s2 is flushed, we'll afaict happily create .ready files
> >> > for s2, s3 despite s4 not yet being written, because earliestSegBoundary is
> >> > now s4.
> >>
> >> In this case, the .ready files for s2 and s3 wouldn't be created until
> >> s4 is flushed to disk.
> >
> > I don't think that's true as the code stands today? The
> > NotifySegmentsReadyForArchive() for s2 will update earliestSegBoundary to s4,
> > because latestSegBoundary = s4 and earliestSegBoundary = s1, triggering the
> > keep_latest branch. Any subsequent NotifySegmentsReadyForArchive() with a
> > segment < s4 will then be able to flush s2 and s3?
>
> When flushRecPtr is less than both of the segment boundaries,
> NotifySegmentsReadyForArchive() will return without doing anything.
> At least, that was the intent.  If there is some reason it's not
> actually working that way, I can work on fixing it.

But that's not OK either! Consider a scenario where there are small
records, each spanning just a bit into the next segment, and initially
all the data is in wal_buffers.

RegisterSegmentBoundary(s1, s1+10)
earliestSegBoundary = s1
earliestSegBoundaryEndPtr = s1+10

RegisterSegmentBoundary(s2, s2+10)
earliestSegBoundary = s1
earliestSegBoundaryEndPtr = s1+10
latestSegBoundary = s2
latestSegBoundaryEndPtr = s2 + 10

RegisterSegmentBoundary(s3, s3+10)
earliestSegBoundary = s1
earliestSegBoundaryEndPtr = s1+10
latestSegBoundary = s2
latestSegBoundaryEndPtr = s2 + 10

RegisterSegmentBoundary(s4, s4+10)
earliestSegBoundary = s1
earliestSegBoundaryEndPtr = s1+10
latestSegBoundary = s4
latestSegBoundaryEndPtr = s4 + 10

If there's now a flush request including all of s3, we'll have the following
sequence of notifies:

NotifySegmentsReadyForArchive(s1)
nothing happens, smaller than s1+10

NotifySegmentsReadyForArchive(s2)
earliestSegBoundary = s4
earliestSegBoundaryEndPtr = s4+10
latestSegBoundary = s4
latestSegBoundaryEndPtr = s4 + 10
latest_boundary_seg = s1

NotifySegmentsReadyForArchive(s3)
nothing happens, flush is smaller than s4

If the record ending at s4 + 10 isn't an async commit (and thus
XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
end up waiting effectively forever for s2 (and s3) to be archived. If all
other connections (and autovac etc) are idle, there will be no XLogFlush()
calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If
already known flushed" path, because the first page in s4 is only
partially filled.

Am I missing something?


> > I don't think it's sensible to fix these separately. It'd be one thing to do
> > that for HEAD, but on the back branches? And that this patch hasn't gotten any
> > performance testing is scary.
>
> Are there any specific performance tests you'd like to see?  I don't
> mind running a couple.

- Parallel copy with > 8 processes

- Parallel non-transactional insertion of small- to medium-sized records
  Simulates the WAL pattern of inserting rows within a transaction

- Parallel transactional insertion of small- to medium-sized records, with fsync=on
  Plain oltp writes

- Parallel transactional insertion of small- to medium-sized records, with fsync=off
  fsync=off to simulate a fast server-class SSD (where fsync is
  instantaneous). Of course, if you have one of those, you can also use that.

For the oltp ones I've had good experience simulating workloads with
pg_logical_emit_message(). That just hits the WAL path, *drastically* reducing
the variance / shortening the required test duration.
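
For example (illustrative, not from this thread): with a script emit.sql
containing

        \set len random(64, 2048)
        SELECT pg_logical_emit_message(true, 'bench', repeat('x', :len));

something like

        pgbench -n -f emit.sql -c 16 -j 16 -T 300 postgres

drives only the WAL insert/flush path.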

Greetings,

Andres Freund



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/31/21, 12:44 AM, "Andres Freund" <andres@anarazel.de> wrote:
> If there's now a flush request including all of s3, we'll have the following
> sequence of notifies:
>
> NotifySegmentsReadyForArchive(s1)
> nothing happens, smaller than s1+10
>
> NotifySegmentsReadyForArchive(s2)
> earliestSegBoundary = s4
> earliestSegBoundaryEndPtr = s4+10
> latestSegBoundary = s4
> latestSegBoundaryEndPtr = s4+10
> latest_boundary_seg = s1
>
> NotifySegmentsReadyForArchive(s3)
> nothing happens, flush is smaller than s4

When earliestSegBoundary is set to s4, latestSegBoundary will be set
to MaxXLogSegNo.
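
That is, after notifying for s1, the keep_latest branch advances the
slots roughly like this (a sketch consistent with the statement above;
the EndPtr line is inferred):

    earliestSegBoundary       = latestSegBoundary;        /* s4 */
    earliestSegBoundaryEndPtr = latestSegBoundaryEndPtr;  /* s4+10 */
    latestSegBoundary         = MaxXLogSegNo;             /* slot empty */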

> If the record ending at s4 + 10 isn't an async commit (and thus
> XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
> end up waiting effectively forever for s2 (and s3) to be archived. If all
> other connections (and autovac etc) are idle, there will be no XLogFlush()
> calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If
> already known flushed" path, because the the first page in s4 is only
> partially filled.

I'm not following why s4 wouldn't be flushed in this example.  Even if
the first page in s4 is only partially filled, that portion of the
record should still get flushed, and we'll create the .ready files for
s2 and s3 at that time.  I tested this by adding some debug logging
and creating a small record that crossed segment boundaries but didn't
fill the first page on the next segment, and the .ready file was
created as expected.  Is there a case where we wouldn't flush the end
of the record to disk?
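
A crude way to manufacture such a record (illustrative only, not
necessarily the method used above) is to emit a logical message too
large to fit in the remainder of the current segment and then inspect
pg_wal/archive_status/:

        psql postgres -c "SELECT pg_logical_emit_message(false, 'fill', repeat('x', 20*1024*1024));"
        ls pg_wal/archive_status/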

During my testing, I did find an obvious bug.  We probably shouldn't
be calling NotifySegmentsReadyForArchive() when archiving isn't
enabled.

diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 6a1e16edc2..8ca0d8e616 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -253,7 +253,8 @@ WalWriterMain(void)
          * here to handle a race condition where WAL is flushed to disk prior
          * to registering the segment boundary.
          */
-        NotifySegmentsReadyForArchive(GetFlushRecPtr());
+        if (XLogArchivingActive())
+            NotifySegmentsReadyForArchive(GetFlushRecPtr());

         /*
          * Do what we're here for; then, if XLogBackgroundFlush() found useful

Nathan


Re: archive status ".ready" files may be created too early

From
Andres Freund
Date:
Hi,

On 2021-08-31 17:01:31 +0000, Bossart, Nathan wrote:
> > If the record ending at s4 + 10 isn't an async commit (and thus
> > XLogCtl->asyncXactLSN is smaller), and there are no further records, we can
> > end up waiting effectively forever for s2 (and s3) to be archived. If all
> > other connections (and autovac etc) are idle, there will be no XLogFlush()
> > calls, nor will XLogBackgroundFlush() do anything, because it'll hit the "If
> > already known flushed" path, because the the first page in s4 is only
> > partially filled.
> 
> I'm not following why s4 wouldn't be flushed in this example.  Even if
> the first page in s4 is only partially filled, that portion of the
> record should still get flushed, and we'll create the .ready files for
> s2 and s3 at that time.

What would trigger the flushing? We don't write out partially filled pages
unless
a) we're explicitly flushing an LSN on the partial page (e.g. because a
   synchronous commit record resides on it)
b) there's an async commit (i.e. commit with synchronous_commit=off) on the
   page

Greetings,

Andres Freund



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/31/21, 10:21 AM, "Andres Freund" <andres@anarazel.de> wrote:
> What would trigger the flushing? We don't write out partially filled pages
> unless
> a) we're explicitly flushing an LSN on the partial page (e.g. because a
>    synchronous commit record resides on it)
> b) there's an async commit (i.e. commit with synchronous_commit=off) on the
>    page

Ah, so your point is that an open transaction that has written a
partial page on the next segment wouldn't trigger a flush.  What
appears to happen in this case is that bgwriter eventually creates an
xl_running_xacts record and nudges walwriter to flush it to disk, at
which point the .ready file(s) will be created.  That's admittedly a
bit fragile.
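
One way to see that window while it is open (illustrative, not
something discussed above) is to compare the insert and flush
positions:

        psql postgres -c "SELECT pg_current_wal_insert_lsn(), pg_current_wal_flush_lsn();"

While the insert position sits on the new segment's partially filled
first page and the flush position trails behind it, the previous
segment's .ready file has not yet been created.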

Nathan


Re: archive status ".ready" files may be created too early

From
Andres Freund
Date:
Hi,

On 2021-08-31 18:09:36 +0000, Bossart, Nathan wrote:
> On 8/31/21, 10:21 AM, "Andres Freund" <andres@anarazel.de> wrote:
> > What would trigger the flushing? We don't write out partially filled pages
> > unless
> > a) we're explicitly flushing an LSN on the partial page (e.g. because a
> >    synchronous commit record resides on it)
> > b) there's an async commit (i.e. commit with synchronous_commit=off) on the
> >    page
> 
> Ah, so your point is that an open transaction that has written a
> partial page on the next segment wouldn't trigger a flush.

Doesn't have to be a transaction, can be a checkpoint or xl_running_xacts, or
... as well.


> What appears to happen in this case is that bgwriter eventually creates an
> xl_running_xacts record and nudges walwriter to flush it to disk, at which
> point the .ready file(s) will be created.  That's admittedly a bit fragile.

That's not guaranteed to happen. If e.g. the partial record is a checkpoint or
an xl_running_xacts, we'll not trigger further WAL writes in the background,
unless autovacuum ends up doing something.

Regards,

Andres



Re: archive status ".ready" files may be created too early

From
"Bossart, Nathan"
Date:
On 8/31/21, 1:30 PM, "Andres Freund" <andres@anarazel.de> wrote:
> On 2021-08-31 18:09:36 +0000, Bossart, Nathan wrote:
>> What appears to happen in this case is that bgwriter eventually creates an
>> xl_running_xacts record and nudges walwriter to flush it to disk, at which
>> point the .ready file(s) will be created.  That's admittedly a bit fragile.
>
> That's not guaranteed to happen. If e.g. the partial record is a checkpoint or
> an xl_running_xacts, we'll not trigger further WAL writes in the background,
> unless autovacuum ends up doing something.

Right.  Per the attached patch, a simple way to handle that could be
to teach XLogBackgroundFlush() to flush to the "earliest" segment
boundary if it doesn't find anything else to do.  I think you could
still miss creating a .ready file for the previous segment in single-
user mode, though.
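
The shape of that change would be roughly the following - a sketch of
the idea only, not the attached patch, and GetEarliestSegBoundaryEndPtr()
is a hypothetical accessor for the registered boundary:

    /* in XLogBackgroundFlush(), where it would otherwise decide there
     * is nothing to do: stretch the request to the earliest registered
     * segment boundary so the pending .ready file eventually appears */
    if (WriteRqst.Write <= LogwrtResult.Flush)
    {
        XLogRecPtr  boundary = GetEarliestSegBoundaryEndPtr(); /* hypothetical */

        if (boundary != InvalidXLogRecPtr && boundary > LogwrtResult.Flush)
            WriteRqst.Write = WriteRqst.Flush = boundary;
    }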

Nathan


Attachment

Re: archive status ".ready" files may be created too early

From
Andres Freund
Date:
On 2021-08-31 23:31:15 +0000, Bossart, Nathan wrote:
> On 8/31/21, 1:30 PM, "Andres Freund" <andres@anarazel.de> wrote:
> > On 2021-08-31 18:09:36 +0000, Bossart, Nathan wrote:
> >> What appears to happen in this case is that bgwriter eventually creates an
> >> xl_running_xacts record and nudges walwriter to flush it to disk, at which
> >> point the .ready file(s) will be created.  That's admittedly a bit fragile.
> >
> > That's not guaranteed to happen. If e.g. the partial record is a checkpoint or
> > an xl_running_xacts, we'll not trigger further WAL writes in the background,
> > unless autovacuum ends up doing something.
> 
> Right.  Per the attached patch, a simple way to handle that could be
> to teach XLogBackgroundFlush() to flush to the "earliest" segment
> boundary if it doesn't find anything else to do.  I think you could
> still miss creating a .ready file for the previous segment in single-
> user mode, though.

Maybe, but this is getting uglier and uglier.

I think the patch should be reverted. It's not in a state that's appropriate for
the backbranches.



Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Aug-31, Andres Freund wrote:

> Maybe, but this is getting uglier and uglier.
> 
> I think the patch should be reverted. It's not in a state that's appropriate for
> the backbranches.

Yeah, that's becoming my conclusion too -- undo that, and start from
scratch using the other idea.

-- 
Álvaro Herrera              Valdivia, Chile  —  https://www.EnterpriseDB.com/



Re: archive status ".ready" files may be created too early

From
Michael Paquier
Date:
On Tue, Aug 31, 2021 at 08:52:05PM -0400, alvherre@alvh.no-ip.org wrote:
> Yeah, that's becoming my conclusion too -- undo that, and start from
> scratch using the other idea.

That's about 515e3d8, right?  I have not looked in detail at what you
have here, but this produces a compilation warning on Windows for me
with this part of the patch:
+RegisterSegmentBoundary(XLogSegNo seg, XLogRecPtr endpos)
+{
+   XLogSegNo   segno PG_USED_FOR_ASSERTS_ONLY;

segno ends up as an unreferenced local variable.
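
If someone wants to silence it in the meantime, the usual pattern (a
sketch only - the function body isn't quoted here, so this assumes
segno's sole use is the Assert) is to confine the variable to
assert-enabled builds:

    #ifdef USE_ASSERT_CHECKING
        XLogSegNo   segno;

        /* ... the existing Assert() that uses segno ... */
    #endif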
--
Michael

Attachment

Re: archive status ".ready" files may be created too early

From
"alvherre@alvh.no-ip.org"
Date:
On 2021-Sep-01, Michael Paquier wrote:

> That's about 515e3d8, right?

Yes.

> I have not looked in detail at what you have here, but this produces
> a compilation warning on Windows for me with this part of the patch:

This seems a tiny speck in a sea of bogosity.  If you want to silence
the warning, be my guest, but in the long run I am inclined to revert
the whole commit once I have a better picture of a way forward.

-- 
Álvaro Herrera           39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/