Thread: prevent immature WAL streaming
Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
variable").

As mentioned in the course of thread [1], we're missing a fix for
streaming replication to avoid sending records that the primary hasn't
fully flushed yet.  This patch is a first attempt at fixing that problem
by retreating the LSN reported as FlushPtr whenever a segment is
registered, based on the understanding that if no registration exists
then the LogwrtResult.Flush pointer can be taken at face value; but if a
registration exists, then we have to stream only till the start LSN of
that registered entry.

This patch is probably incomplete.  First, I'm not sure that logical
replication is affected by this problem.  I think it isn't, because
logical replication will halt until the record can be read completely --
maybe I'm wrong and there is a way for things to go wrong with logical
replication as well.  But also, I need to look at the other uses of
GetFlushRecPtr() and see if those need to change to the new function too
or they can remain what they are now.

I'd also like to have tests.  That seems moderately hard, but if we had
WAL-molasses that could be used in walreceiver, it could be done.  (It
sounds easier to write tests with a molasses-archive_command.)

[1] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
[2] https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Attachment
At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in 
> Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
> this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
> variable").
> 
> As mentioned in the course of thread [1], we're missing a fix for
> streaming replication to avoid sending records that the primary hasn't
> fully flushed yet.  This patch is a first attempt at fixing that problem
> by retreating the LSN reported as FlushPtr whenever a segment is
> registered, based on the understanding that if no registration exists
> then the LogwrtResult.Flush pointer can be taken at face value; but if a
> registration exists, then we have to stream only till the start LSN of
> that registered entry.
> 
> This patch is probably incomplete.  First, I'm not sure that logical
> replication is affected by this problem.  I think it isn't, because
> logical replication will halt until the record can be read completely --
> maybe I'm wrong and there is a way for things to go wrong with logical
> replication as well.  But also, I need to look at the other uses of
> GetFlushRecPtr() and see if those need to change to the new function too
> or they can remain what they are now.
> 
> I'd also like to have tests.  That seems moderately hard, but if we had
> WAL-molasses that could be used in walreceiver, it could be done.  (It
> sounds easier to write tests with a molasses-archive_command.)
> 
> [1] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
> [2] https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

(I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)

For our information, this issue is related to commit 0668719801, which
makes XLogPageRead restart reading a (continued or segment-spanning)
record when switching sources.  In that thread, I modified the code to
cause a server crash under the desired situation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 8/23/21, 3:53 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> As mentioned in the course of thread [1], we're missing a fix for
> streaming replication to avoid sending records that the primary hasn't
> fully flushed yet.  This patch is a first attempt at fixing that problem
> by retreating the LSN reported as FlushPtr whenever a segment is
> registered, based on the understanding that if no registration exists
> then the LogwrtResult.Flush pointer can be taken at face value; but if a
> registration exists, then we have to stream only till the start LSN of
> that registered entry.

I wonder if we need to move the call to RegisterSegmentBoundary() to
somewhere before WALInsertLockRelease() for this to work as intended.
Right now, boundary registration could take place after the flush
pointer has been advanced, which means GetSafeFlushRecPtr() could
still return an unsafe position.

Nathan
On 2021-Aug-24, Bossart, Nathan wrote:

> I wonder if we need to move the call to RegisterSegmentBoundary() to
> somewhere before WALInsertLockRelease() for this to work as intended.
> Right now, boundary registration could take place after the flush
> pointer has been advanced, which means GetSafeFlushRecPtr() could
> still return an unsafe position.

Yeah, you're right -- that's a definite risk.  I didn't try to reproduce
a problem with that, but it seems pretty obvious that it can happen.

I didn't have a lot of luck with a reliable reproducer script.  I was
able to reproduce the problem starting with Ryo Matsumura's script and
attaching a replica; most of the time the replica would recover by
restarting from a streaming position earlier than where the problem
occurred, but a few times it would just get stuck with a WAL segment
containing a bogus record.  Then, after the patch, the problem no longer
occurs.

I attach the patch with the change you suggested.

-- 
Álvaro Herrera           Valdivia, Chile  —  https://www.EnterpriseDB.com/
Tom: There seems to be something broken here.
Teodor: I'm in sackcloth and ashes...  Fixed.
               http://archives.postgresql.org/message-id/482D1632.8010507@sigaev.ru
Attachment
On 8/24/21, 4:03 PM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-24, Bossart, Nathan wrote:
>
>> I wonder if we need to move the call to RegisterSegmentBoundary() to
>> somewhere before WALInsertLockRelease() for this to work as intended.
>> Right now, boundary registration could take place after the flush
>> pointer has been advanced, which means GetSafeFlushRecPtr() could
>> still return an unsafe position.
>
> Yeah, you're right -- that's a definite risk.  I didn't try to reproduce
> a problem with that, but it seems pretty obvious that it can happen.
>
> I didn't have a lot of luck with a reliable reproducer script.  I was
> able to reproduce the problem starting with Ryo Matsumura's script and
> attaching a replica; most of the time the replica would recover by
> restarting from a streaming position earlier than where the problem
> occurred; but a few times it would just get stuck with a WAL segment
> containing a bogus record.  Then, after the patch, the problem no longer
> occurs.

If moving RegisterSegmentBoundary() is sufficient to prevent the flush
pointer from advancing before we register the boundary, I bet we could
also remove the WAL writer nudge.

Another interesting thing I see is that the boundary stored in
earliestSegBoundary is not necessarily the earliest one.  It's just
the first one that has been registered.  I did this for simplicity for
the .ready file fix, but I can see it causing problems here.  I think
we can move earliestSegBoundary backwards as long as it is greater
than lastNotifiedSeg + 1.  However, it's still not necessarily the
earliest one if we copied latestSegBoundary to earliestSegBoundary in
NotifySegmentsReadyForArchive().  To handle this, we could track
several boundaries in an array, but then we'd have to hold the safe
LSN back whenever the array overflowed and we started forgetting
boundaries.  Perhaps there's a simpler solution.  I'll keep thinking...

Nathan
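[To make the array idea above concrete, here is a hypothetical sketch of
tracking several boundaries at once.  None of these names
(SegBoundaryEntry, GetSafeStreamPtr, MAX_TRACKED_BOUNDARIES) exist in the
tree; locking and eviction are omitted, and only the conservative
"hold the safe LSN back after overflow" fallback is shown.]

#include "access/xlogdefs.h"	/* XLogRecPtr, InvalidXLogRecPtr */

#define MAX_TRACKED_BOUNDARIES 8

typedef struct SegBoundaryEntry
{
	XLogRecPtr	startPtr;		/* start of a record that crosses a segment */
	XLogRecPtr	endPtr;			/* end+1 of that record */
} SegBoundaryEntry;

typedef struct SegBoundaryState
{
	SegBoundaryEntry entries[MAX_TRACKED_BOUNDARIES];
	int			nentries;
	XLogRecPtr	oldestForgotten;	/* oldest start we had to forget, or invalid */
} SegBoundaryState;

/*
 * Return the LSN up to which streaming/archiving is safe: the flush
 * pointer, held back to the start of any registered cross-segment record
 * that the flush pointer currently falls inside.  If a boundary was ever
 * forgotten due to overflow, never report past its start.
 */
static XLogRecPtr
GetSafeStreamPtr(SegBoundaryState *state, XLogRecPtr flushPtr)
{
	XLogRecPtr	safe = flushPtr;

	for (int i = 0; i < state->nentries; i++)
	{
		SegBoundaryEntry *e = &state->entries[i];

		/* a record straddling the flush point must not be sent out yet */
		if (e->startPtr < flushPtr && e->endPtr > flushPtr &&
			e->startPtr < safe)
			safe = e->startPtr;
	}

	if (!XLogRecPtrIsInvalid(state->oldestForgotten) &&
		state->oldestForgotten < safe)
		safe = state->oldestForgotten;

	return safe;
}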
Hi Álvaro, -hackers,

> I attach the patch with the change you suggested.

I've given the v02 patch a shot on top of REL_12_STABLE (already
including 5065aeafb0b7593c04d3bc5bc2a86037f32143fc).  Previously
(yesterday), without the v02 patch, I was getting standby corruption
every time, via simulation by having a separate /pg_xlog dedicated fs,
archive_mode=on, wal_keep_segments=120, archive_command set to rsync to
a different dir on the same fs, and wal_init_zero at default (true).

Today (with v02) I've got corruption in only the initial 2 runs out of
more than ~30 tries on the standby.  Probably the 2 failures were
somehow my fault (?) or some rare condition (and in 1 of those 2 cases
simply restarting the standby did help).  To be honest I've tried to
force this error, but with v02 I simply cannot force this error anymore,
so that's good! :)

> I didn't have a lot of luck with a reliable reproducer script.  I was able to
> reproduce the problem starting with Ryo Matsumura's script and attaching
> a replica; most of the time the replica would recover by restarting from a
> streaming position earlier than where the problem occurred; but a few
> times it would just get stuck with a WAL segment containing a bogus
> record.

In order to get a reliable reproducer and proper fault injection instead
of playing with really filling up the fs, apparently one could
substitute the fd with an fd for /dev/full using e.g. dup2(), so that
every write is going to throw this error too:

root@hive:~# ./t &   # simple while(1) { fprintf() flush() } testcase
root@hive:~# ls -l /proc/27296/fd/3
lrwx------ 1 root root 64 Aug 25 06:22 /proc/27296/fd/3 -> /tmp/testwrite
root@hive:~# gdb -q -p 27296
-- 1089 is bitmask O_WRONLY|..
(gdb) p dup2(open("/dev/full", 1089, 0777), 3)
$1 = 3
(gdb) c
Continuing.
==> fflush/write(): : No space left on device

So I've also tried to be malicious while writing to the DB and inject
ENOSPC near places like:

a) XLogWrite()->XLogFileInit() near line 3322 // assuming wal_init_zero
   is true, one gets the classic "PANIC: could not write to file
   "pg_wal/xlogtemp.90670": No space left on device"
b) XLogWrite() near line 2547 just after pg_pwrite // one can get
   "PANIC: could not write to log file 000000010000003B000000A8 at
   offset 0, length 15466496: No space left on device" (that would be
   possible with wal_init_zero=false?)
c) XLogWrite() near line 2592 // just before issue_xlog_fsync, to get
   "PANIC: could not fdatasync file "000000010000004300000004": Invalid
   argument"; that would pretty much mean the same as above but with the
   last possible offset near the end of WAL?

This was done with gdb voodoo:

handle SIGUSR1 noprint nostop
break xlog.c:<LINE> // https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/transam/xlog.c#L3311
c
print fd or openLogFile -- to verify it is 3
p dup2(open("/dev/full", 1089, 0777), 3) -- during most of walwriter runtime it has the current log as fd=3

After restarting the master and inspecting the standby -- in all of
those above 3 cases -- the standby didn't exhibit the "invalid
contrecord length" corruption, at least here, while without this v02
patch it is notorious.  So if it passes the worst-case code review
assumptions, I would be wondering if it shouldn't even be committed as
it stands right now.

-J.
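[For what it's worth, the /dev/full trick above can also be demonstrated
outside gdb with a few lines of plain C -- a sketch; the scratch file
name is arbitrary.  After the dup2() every write on the original
descriptor fails with ENOSPC.]

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int			fd = open("/tmp/testwrite", O_WRONLY | O_CREAT | O_TRUNC, 0666);
	int			full = open("/dev/full", O_WRONLY);

	if (fd < 0 || full < 0)
		return 1;

	/* make fd point at /dev/full, as the gdb dup2() call above does */
	if (dup2(full, fd) < 0)
		return 1;

	if (write(fd, "x", 1) < 0)
		perror("write");		/* prints: No space left on device */
	return 0;
}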
On 2021-Aug-24, Bossart, Nathan wrote:

> If moving RegisterSegmentBoundary() is sufficient to prevent the flush
> pointer from advancing before we register the boundary, I bet we could
> also remove the WAL writer nudge.

Can you elaborate on this?  I'm not sure I see the connection.

> Another interesting thing I see is that the boundary stored in
> earliestSegBoundary is not necessarily the earliest one.  It's just
> the first one that has been registered.  I did this for simplicity for
> the .ready file fix, but I can see it causing problems here.

Hmm, is there really a problem here?  Surely the flush point cannot go
past whatever has been written.  If somebody is writing an earlier
section of WAL, then we cannot move the flush pointer to a later
position.  So it doesn't matter if the earliest point we have registered
is the true earliest -- we only care for it to be the earliest that is
past the flush point.

-- 
Álvaro Herrera           Valdivia, Chile  —  https://www.EnterpriseDB.com/
On Mon, Aug 23, 2021 at 11:04 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in
> > I'd also like to have tests.  That seems moderately hard, but if we had
> > WAL-molasses that could be used in walreceiver, it could be done.  (It
> > sounds easier to write tests with a molasses-archive_command.)
> >
> > [1] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
> > [2] https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com
>
> (I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)

I think, but am not 100% sure, that "molasses" here is being used to
refer to fault injection.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On 8/25/21, 5:33 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> On 2021-Aug-24, Bossart, Nathan wrote:
>
>> If moving RegisterSegmentBoundary() is sufficient to prevent the flush
>> pointer from advancing before we register the boundary, I bet we could
>> also remove the WAL writer nudge.
>
> Can you elaborate on this?  I'm not sure I see the connection.

The reason we are moving RegisterSegmentBoundary() to before
WALInsertLockRelease() is because we believe it will prevent boundary
registration from taking place after the flush pointer has already
advanced past the boundary in question.  We had added the WAL writer
nudge to make sure we called NotifySegmentsReadyForArchive() whenever
that happened.  If moving boundary registration to before we release
the lock(s) is enough to prevent the race condition with the flush
pointer, then ISTM we no longer have to worry about nudging the WAL
writer.

>> Another interesting thing I see is that the boundary stored in
>> earliestSegBoundary is not necessarily the earliest one.  It's just
>> the first one that has been registered.  I did this for simplicity for
>> the .ready file fix, but I can see it causing problems here.
>
> Hmm, is there really a problem here?  Surely the flush point cannot go
> past whatever has been written.  If somebody is writing an earlier
> section of WAL, then we cannot move the flush pointer to a later
> position.  So it doesn't matter if the earliest point we have registered
> is the true earliest -- we only care for it to be the earliest that is
> past the flush point.

Let's say we have the following situation (F = flush, E = earliest
registered boundary, and L = latest registered boundary), and let's
assume that each segment has a cross-segment record that ends in the
next segment.

      F           E           L
|-----|-----|-----|-----|-----|-----|-----|-----|
1     2     3     4     5     6     7     8

Then, we write out WAL to disk and create .ready files as needed.  If
we didn't flush beyond the latest registered boundary, the latest
registered boundary now becomes the earliest boundary.

                        F     E
|-----|-----|-----|-----|-----|-----|-----|-----|
1     2     3     4     5     6     7     8

At this point, the earliest segment boundary past the flush point is
before the "earliest" boundary we are tracking.

Nathan
On 2021-Aug-25, Jakub Wartak wrote:

> In order to get a reliable reproducer and proper fault injection instead
> of playing with really filling up the fs, apparently one could
> substitute the fd with an fd for /dev/full using e.g. dup2(), so that
> every write is going to throw this error too:

Oh, this is a neat trick that I didn't know about.  Thanks.

> After restarting the master and inspecting the standby -- in all of
> those above 3 cases -- the standby didn't exhibit the "invalid
> contrecord length" corruption, at least here, while without this v02
> patch it is notorious.  So if it passes the worst-case code review
> assumptions, I would be wondering if it shouldn't even be committed as
> it stands right now.

Well, Nathan is right that this patch isn't really closing the hole,
because we aren't tracking segment boundaries that are neither
"earliest" nor "latest" at the moment of registration.  The simplistic
approach seems okay for the archive case, but not for the replication
case.

I also noticed today (facepalm) that the patch doesn't work unless you
have archiving enabled, because the registration code is only invoked
when XLogArchivingActive().  Doh.  This part is easy to solve.  The
other doesn't look easy ...

-- 
Álvaro Herrera           Valdivia, Chile  —  https://www.EnterpriseDB.com/
BTW while going about testing this, I noticed that we forbid
pg_walfile_name() while in recovery.  That restriction was added by
commit 370f770c15a4 because ThisTimeLineID was not set correctly during
recovery.  That was supposed to be fixed by commit 1148e22a82ed, so I
thought that it should be possible to remove the restriction.  However,
I did that per the attached patch, but was quickly disappointed because
ThisTimeLineID seems to remain zero in a standby for reasons that I
didn't investigate.

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"The problem with the facetime model is not just that it's demoralizing, but
that the people pretending to work interrupt the ones actually working."
                                                           (Paul Graham)
Attachment
At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 8/25/21, 5:33 AM, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote:
> > On 2021-Aug-24, Bossart, Nathan wrote:
> >> Another interesting thing I see is that the boundary stored in
> >> earliestSegBoundary is not necessarily the earliest one.  It's just
> >> the first one that has been registered.  I did this for simplicity for
> >> the .ready file fix, but I can see it causing problems here.
> >
> > Hmm, is there really a problem here?  Surely the flush point cannot go
> > past whatever has been written.  If somebody is writing an earlier
> > section of WAL, then we cannot move the flush pointer to a later
> > position.  So it doesn't matter if the earliest point we have registered
> > is the true earliest -- we only care for it to be the earliest that is
> > past the flush point.
> 
> Let's say we have the following situation (F = flush, E = earliest
> registered boundary, and L = latest registered boundary), and let's
> assume that each segment has a cross-segment record that ends in the
> next segment.
> 
>       F           E           L
> |-----|-----|-----|-----|-----|-----|-----|-----|
> 1     2     3     4     5     6     7     8
> 
> Then, we write out WAL to disk and create .ready files as needed.  If
> we didn't flush beyond the latest registered boundary, the latest
> registered boundary now becomes the earliest boundary.
> 
>                         F     E
> |-----|-----|-----|-----|-----|-----|-----|-----|
> 1     2     3     4     5     6     7     8
> 
> At this point, the earliest segment boundary past the flush point is
> before the "earliest" boundary we are tracking.

We know we have some cross-segment records in the region [E L], so we
cannot add a .ready file if the flush is in that region.  I haven't
looked at the latest patch (or I may misunderstand the discussion here)
but I think we shouldn't move E before F exceeds the previous (i.e., in
the first picture above) L.  Things are done that way in my ancient
proposal in [1].

https://www.postgresql.org/message-id/attachment/117052/v4-0001-Avoid-archiving-a-WAL-segment-that-continues-to-t.patch

+  if (LogwrtResult.Write < firstSegContRecStart ||
+      lastSegContRecEnd <= LogwrtResult.Write)
+  {
     <notify the last segment>

By doing so, at the time of the second picture, the pointers are set as:

                  E     F     L
|-----|-----|-----|-----|-----|-----|-----|-----|
1     2     3     4     5     6     7     8

Then the pointers are cleared at the time F reaches L:

                              F
|-----|-----|-----|-----|-----|-----|-----|-----|
1     2     3     4     5     6     7     8

Doesn't this work correctly?  As I think I mentioned in the thread, I
don't think we have so many (more than several, specifically) segments
in the region [E L].

[1] https://www.postgresql.org/message-id/20201216.110120.887433782054853494.horikyota.ntt%40gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
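[For readers following along, the rule in the snippet above can be read
as a small predicate -- a sketch only, using the variable names from the
quoted patch; locking and how E/L get registered and cleared are
omitted.]

/*
 * Finished segments may be notified for archiving only while the flush
 * pointer is not inside the region [E, L) covered by known cross-segment
 * records, where E = firstSegContRecStart and L = lastSegContRecEnd.
 */
static bool
CanNotifyFinishedSegments(XLogRecPtr flush,
						  XLogRecPtr firstSegContRecStart,
						  XLogRecPtr lastSegContRecEnd)
{
	/* no cross-segment record is currently registered */
	if (XLogRecPtrIsInvalid(firstSegContRecStart))
		return true;

	return flush < firstSegContRecStart || lastSegContRecEnd <= flush;
}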
At Wed, 25 Aug 2021 20:20:04 -0400, "alvherre@alvh.no-ip.org" <alvherre@alvh.no-ip.org> wrote in 
> BTW while going about testing this, I noticed that we forbid
> pg_walfile_name() while in recovery.  That restriction was added by
> commit 370f770c15a4 because ThisTimeLineID was not set correctly during
> recovery.  That was supposed to be fixed by commit 1148e22a82ed, so I
> thought that it should be possible to remove the restriction.  However,
> I did that per the attached patch, but was quickly disappointed because
> ThisTimeLineID seems to remain zero in a standby for reasons that I
> didn't investigate.

On an intermediate node of a cascading replication set, the timeline ID
on the walsender and the walreceiver can differ, and ordinary backends
cannot decide which to use.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
(this is off-topic here)

At Wed, 25 Aug 2021 09:56:56 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Mon, Aug 23, 2021 at 11:04 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > At Mon, 23 Aug 2021 18:52:17 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in
> > > I'd also like to have tests.  That seems moderately hard, but if we had
> > > WAL-molasses that could be used in walreceiver, it could be done.  (It
> > > sounds easier to write tests with a molasses-archive_command.)
> > >
> > > [1] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505@amazon.com
> > > [2] https://postgr.es/m/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com
> >
> > (I'm not sure what "WAL-molasses" above expresses, same as "sugar"?)
>
> I think, but am not 100% sure, that "molasses" here is being used to
> refer to fault injection.

Oh.  That makes sense, thanks.  I sometimes inject artificial faults (a
server crash, in this case) to create specific on-disk states, but I
cannot imagine that that kind of machinery can be statically placed in
the source tree..

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 8/25/21, 5:40 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
>> Let's say we have the following situation (F = flush, E = earliest
>> registered boundary, and L = latest registered boundary), and let's
>> assume that each segment has a cross-segment record that ends in the
>> next segment.
>>
>>       F           E           L
>> |-----|-----|-----|-----|-----|-----|-----|-----|
>> 1     2     3     4     5     6     7     8
>>
>> Then, we write out WAL to disk and create .ready files as needed.  If
>> we didn't flush beyond the latest registered boundary, the latest
>> registered boundary now becomes the earliest boundary.
>>
>>                         F     E
>> |-----|-----|-----|-----|-----|-----|-----|-----|
>> 1     2     3     4     5     6     7     8
>>
>> At this point, the earliest segment boundary past the flush point is
>> before the "earliest" boundary we are tracking.
>
> We know we have some cross-segment records in the region [E L], so we
> cannot add a .ready file if the flush is in that region.  I haven't
> looked at the latest patch (or I may misunderstand the discussion here)
> but I think we shouldn't move E before F exceeds the previous (i.e., in
> the first picture above) L.  Things are done that way in my ancient
> proposal in [1].

The strategy in place ensures that we track a boundary that doesn't
change until the flush position passes it, as well as the latest
registered boundary.  I think it is important that any segment
boundary tracking mechanism does at least those two things.  I don't
see how we could do that if we didn't update E until F passed both E
and L.

Nathan
At Thu, 26 Aug 2021 03:24:48 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 8/25/21, 5:40 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > At Wed, 25 Aug 2021 18:18:59 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> >> Let's say we have the following situation (F = flush, E = earliest
> >> registered boundary, and L = latest registered boundary), and let's
> >> assume that each segment has a cross-segment record that ends in the
> >> next segment.
> >>
> >>       F           E           L
> >> |-----|-----|-----|-----|-----|-----|-----|-----|
> >> 1     2     3     4     5     6     7     8
> >>
> >> Then, we write out WAL to disk and create .ready files as needed.  If
> >> we didn't flush beyond the latest registered boundary, the latest
> >> registered boundary now becomes the earliest boundary.
> >>
> >>                         F     E
> >> |-----|-----|-----|-----|-----|-----|-----|-----|
> >> 1     2     3     4     5     6     7     8
> >>
> >> At this point, the earliest segment boundary past the flush point is
> >> before the "earliest" boundary we are tracking.
> >
> > We know we have some cross-segment records in the region [E L], so we
> > cannot add a .ready file if the flush is in that region.  I haven't
> > looked at the latest patch (or I may misunderstand the discussion here)
> > but I think we shouldn't move E before F exceeds the previous (i.e., in
> > the first picture above) L.  Things are done that way in my ancient
> > proposal in [1].
>
> The strategy in place ensures that we track a boundary that doesn't
> change until the flush position passes it, as well as the latest
> registered boundary.  I think it is important that any segment
> boundary tracking mechanism does at least those two things.  I don't
> see how we could do that if we didn't update E until F passed both E
> and L.

(Sorry, but I didn't get you clearly.  So the discussion below might be
pointless.)

The ancient patch did this: if a flush didn't reach E, we can archive
finished segments.  If a flush ends between E and L, we shouldn't
archive finished segments at all.  L can be moved further while in this
state, while E sits at the same location.  Once a flush passes L, we can
archive all finished segments and can erase both E and L.

A drawback of this strategy is that the region [E L] can contain gaps
(that is, segment boundaries that are not crossed by a continuation
record), and archiving can be excessively delayed.  Perhaps if the flush
falls behind the write head by more than two segments, the probability
of creating such gaps would be higher.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 2021-08-23 18:52:17 -0400, Alvaro Herrera wrote:
> Included 蔡梦娟 and Jakub Wartak because they've expressed interest on
> this topic -- notably [2] ("Bug on update timing of walrcv->flushedUpto
> variable").
>
> As mentioned in the course of thread [1], we're missing a fix for
> streaming replication to avoid sending records that the primary hasn't
> fully flushed yet.  This patch is a first attempt at fixing that problem
> by retreating the LSN reported as FlushPtr whenever a segment is
> registered, based on the understanding that if no registration exists
> then the LogwrtResult.Flush pointer can be taken at face value; but if a
> registration exists, then we have to stream only till the start LSN of
> that registered entry.

I'm doubtful that the approach of adding awareness of record boundaries
is a good path to go down:

- It adds nontrivial work to hot code paths to handle an edge case,
  rather than making rare code paths more expensive.

- There are very similar issues with promotions of replicas (consider
  what happens if we need to promote with the end of local WAL spanning
  a segment boundary, and what happens to cascading replicas).  We have
  some logic to try to deal with that, but it's pretty grotty and I
  think incomplete.

- It seems to make some future optimizations harder - we should work
  towards replicating data sooner, rather than the opposite.  Right now
  that's a major bottleneck around syncrep.

- Once XLogFlush() for some LSN returned, we can write that LSN to disk.
  The LSN doesn't necessarily have to correspond to a specific on-disk
  location (it could e.g. be the return value from GetFlushRecPtr()).
  But "rewinding" to before the last record makes that problematic.

- I suspect that schemes with heuristic knowledge of segment-boundary-
  spanning records have deadlock or at least latency spike issues.  What
  if synchronous commit needs to flush up to a certain record boundary,
  but streaming rep doesn't replicate it out because there are
  segment-spanning records both before and after?

I think a better approach might be to handle this on the WAL layout
level.  What if we never overwrite partial records but instead just
skipped over them during decoding?

Of course there's some difficulties with that - the checksum and the
length from the record header aren't going to be meaningful.  But we
could deal with that using a special flag in the
XLogPageHeaderData.xlp_info of the following page.  If that flag is set,
xlp_rem_len could contain the checksum of the partial record.

Greetings,

Andres Freund
On 2021-Aug-30, Andres Freund wrote:

> I'm doubtful that the approach of adding awareness of record boundaries
> is a good path to go down:

Honestly, I do not like it one bit, and if I can avoid relying on them
while making the whole thing work correctly, I am happy.  Clearly it
wasn't a problem for the ancient recovery-only WAL design, but as soon
as we added replication on top, the whole issue of continuation records
became a bug.  I do think that the code should be first correct and
second performant, though.

> - There are very similar issues with promotions of replicas (consider
>   what happens if we need to promote with the end of local WAL spanning
>   a segment boundary, and what happens to cascading replicas).  We have
>   some logic to try to deal with that, but it's pretty grotty and I
>   think incomplete.

Ouch, I hadn't thought of cascading replicas.

> - It seems to make some future optimizations harder - we should work
>   towards replicating data sooner, rather than the opposite.  Right now
>   that's a major bottleneck around syncrep.

Absolutely.

> I think a better approach might be to handle this on the WAL layout
> level.  What if we never overwrite partial records but instead just
> skipped over them during decoding?

Maybe this is a workable approach; let's work it out fully.

Let me see if I understand what you mean:

* We would remove the logic to inhibit archiving and streaming-
  replicating the tail end of a split WAL record; that logic deals with
  bytes only, so it doesn't have to be aware of record boundaries.

* On WAL replay, we ignore records that are split across a segment
  boundary and whose checksum does not match.

* On WAL write ... ?  How do we detect after recovery that a record that
  was being written, and potentially was sent to the archive, needs to
  be "skipped"?

-- 
Álvaro Herrera           Valdivia, Chile  —  https://www.EnterpriseDB.com/
Hi,

On 2021-08-31 09:56:30 -0400, Alvaro Herrera wrote:
> On 2021-Aug-30, Andres Freund wrote:
> > I think a better approach might be to handle this on the WAL layout
> > level.  What if we never overwrite partial records but instead just
> > skipped over them during decoding?
>
> Maybe this is a workable approach; let's work it out fully.
>
> Let me see if I understand what you mean:
> * We would remove the logic to inhibit archiving and streaming-
>   replicating the tail end of a split WAL record; that logic deals with
>   bytes only, so it doesn't have to be aware of record boundaries.
> * On WAL replay, we ignore records that are split across a segment
>   boundary and whose checksum does not match.
> * On WAL write ... ?

I was thinking that on a normal WAL write we'd do nothing.  Instead we
would have dedicated code at the end of recovery that, if the WAL ends
in a partial record, changes the page following the "valid" portion of
the WAL to indicate that an incomplete record is to be skipped.

Of course, we need to be careful to not weaken WAL validity checking too
much.  How about the following:

If we're "aborting" a continued record, we set
XLP_FIRST_IS_ABORTED_PARTIAL on the page at which we do so (i.e. the
page after the valid end of the WAL).

On a page with XLP_FIRST_IS_ABORTED_PARTIAL we expect a special type of
record to start just after the page header.  That record contains
sufficient information for us to verify the validity of the partial
record (since its checksum and length aren't valid, and may not even be
all readable if the record header itself was split).  I think it would
make sense to include the LSN of the aborted record, and a checksum of
the partial data.

> How do we detect after recovery that a record that was being written,
> and potentially was sent to the archive, needs to be "skipped"?

I think we can just read the WAL and see if it ends with a partial
record.  It'd add a bit of complication to the error checking in
xlogreader, because we'd likely want to treat verification from page
headers a bit differently from verification due to record data.  But
that seems doable.

Does this make sense?

Greetings,

Andres Freund
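[To visualize the proposal, a hypothetical shape for the new page flag
and the special record could look like the following.  Neither the flag
value nor the struct exists at this point in the thread; the names are
illustrative only.]

/*
 * New xlp_info bit: the first thing on this page announces that the
 * continuation record started on an earlier page was abandoned.
 */
#define XLP_FIRST_IS_ABORTED_PARTIAL	0x0008

/*
 * Body of the special record expected immediately after the page header
 * of a page carrying XLP_FIRST_IS_ABORTED_PARTIAL, per the description
 * above.
 */
typedef struct xl_aborted_contrecord
{
	XLogRecPtr	aborted_lsn;	/* start LSN of the record being abandoned */
	pg_crc32c	partial_crc;	/* checksum of the partial data actually written */
} xl_aborted_contrecord;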
On 2021/09/01 0:53, Andres Freund wrote:
> I was thinking that on a normal WAL write we'd do nothing.  Instead we would
> have dedicated code at the end of recovery that, if the WAL ends in a partial
> record, changes the page following the "valid" portion of the WAL to indicate
> that an incomplete record is to be skipped.

Agreed!

> Of course, we need to be careful to not weaken WAL validity checking too
> much.  How about the following:
>
> If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
> the page at which we do so (i.e. the page after the valid end of the WAL).

When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set?  It's set
when recovery finds a partially-flushed segment-spanning record?  But
maybe we cannot do that (i.e., cannot overwrite the page) because the
page that the flag is set in might have already been archived.  No?

> I think we can just read the WAL and see if it ends with a partial
> record.  It'd add a bit of complication to the error checking in xlogreader,
> because we'd likely want to treat verification from page headers a bit
> differently from verification due to record data.  But that seems doable.

Yes.

> Does this make sense?

Yes, I think!

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hi,

On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> On 2021/09/01 0:53, Andres Freund wrote:
> > Of course, we need to be careful to not weaken WAL validity checking too
> > much.  How about the following:
> >
> > If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
> > the page at which we do so (i.e. the page after the valid end of the WAL).
>
> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set?  It's set
> when recovery finds a partially-flushed segment-spanning record?
> But maybe we cannot do that (i.e., cannot overwrite the page) because
> the page that the flag is set in might have already been archived.  No?

I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the
"tail end" of a partial record.  I.e. if there's a partial record
starting in the successfully archived segment A, but the end of the
record, in B, has not been written to disk before a crash, we'd set
XLP_FIRST_IS_ABORTED_PARTIAL at the end of the valid data in B.  Which
could not have been archived yet, or we'd not have a partial record.  So
we should never need to set the flag on an already archived page.

Greetings,

Andres Freund
At Tue, 31 Aug 2021 20:15:24 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
>
> On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> > On 2021/09/01 0:53, Andres Freund wrote:
> > > Of course, we need to be careful to not weaken WAL validity checking too
> > > much.  How about the following:
> > >
> > > If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
> > > the page at which we do so (i.e. the page after the valid end of the WAL).
> >
> > When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set?  It's set
> > when recovery finds a partially-flushed segment-spanning record?
> > But maybe we cannot do that (i.e., cannot overwrite the page) because
> > the page that the flag is set in might have already been archived.  No?
>
> I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the "tail
> end" of a partial record.  I.e. if there's a partial record starting in the
> successfully archived segment A, but the end of the record, in B, has not been
> written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL at the
> end of the valid data in B.  Which could not have been archived yet, or we'd
> not have a partial record.  So we should never need to set the flag on an
> already archived page.

I agree that that makes sense.  Is that to say that crash recovery
remembers whether the last record was an immature record that looks like
it continues into the next segment, and if so, sets the flag when
inserting the first record, which would be CHECKPOINT_SHUTDOWN?  (And
the reader deals with it.)

I'll try to show how it looks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 2021/09/01 12:15, Andres Freund wrote:
> Hi,
>
> On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
>> On 2021/09/01 0:53, Andres Freund wrote:
>>> Of course, we need to be careful to not weaken WAL validity checking too
>>> much.  How about the following:
>>>
>>> If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
>>> the page at which we do so (i.e. the page after the valid end of the WAL).
>>
>> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set?  It's set
>> when recovery finds a partially-flushed segment-spanning record?
>> But maybe we cannot do that (i.e., cannot overwrite the page) because
>> the page that the flag is set in might have already been archived.  No?
>
> I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the "tail
> end" of a partial record.  I.e. if there's a partial record starting in the
> successfully archived segment A, but the end of the record, in B, has not been
> written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL at the
> end of the valid data in B.  Which could not have been archived yet, or we'd
> not have a partial record.  So we should never need to set the flag on an
> already archived page.

Thanks for clarifying that!  Unless I misunderstand it, when recovery
ends at a partially-flushed segment-spanning record, it sets
XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting to
write new WAL data there.  Therefore either XLP_FIRST_IS_CONTRECORD or
XLP_FIRST_IS_ABORTED_PARTIAL must be set in the segment following a
partially-flushed segment-spanning record.  If neither of them is set,
the validation code in recovery should report an error.

Yes, this design looks good to me.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
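[A sketch of that validation rule, as it might look in the xlogreader
page checks -- simplified, and XLP_FIRST_IS_ABORTED_PARTIAL is the flag
proposed above, not an existing one.]

static bool
ContinuationPageIsValid(XLogPageHeader pageHeader, bool expectingContrecord)
{
	bool		is_cont = (pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD) != 0;
	bool		is_abort = (pageHeader->xlp_info & XLP_FIRST_IS_ABORTED_PARTIAL) != 0;

	/* the two flags are mutually exclusive */
	if (is_cont && is_abort)
		return false;

	if (expectingContrecord)
		return is_cont || is_abort;	/* one of them must be set */

	/* not expecting a continuation: neither flag may appear */
	return !is_cont && !is_abort;
}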
Hi,

On 2021-09-01 15:01:43 +0900, Fujii Masao wrote:
> Thanks for clarifying that!  Unless I misunderstand it, when recovery ends
> at a partially-flushed segment-spanning record, it sets
> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting to write
> new WAL data there.  Therefore either XLP_FIRST_IS_CONTRECORD or
> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the segment following
> a partially-flushed segment-spanning record.  If neither of them is set,
> the validation code in recovery should report an error.

Right.  With the small addition that I think we shouldn't just do this
for segment-spanning records, but for all records spanning pages.

Greetings,

Andres Freund
On 9/1/21, 10:06 AM, "Andres Freund" <andres@anarazel.de> wrote:
> On 2021-09-01 15:01:43 +0900, Fujii Masao wrote:
>> Thanks for clarifying that!  Unless I misunderstand it, when recovery ends
>> at a partially-flushed segment-spanning record, it sets
>> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting to write
>> new WAL data there.  Therefore either XLP_FIRST_IS_CONTRECORD or
>> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the segment following
>> a partially-flushed segment-spanning record.  If neither of them is set,
>> the validation code in recovery should report an error.
>
> Right.  With the small addition that I think we shouldn't just do this for
> segment-spanning records, but for all records spanning pages.

This approach seems promising.  I like that it avoids adding extra work
in the hot path for writing WAL.  I'm assuming it won't be
back-patchable, though.

Nathan
At Wed, 1 Sep 2021 15:01:43 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> On 2021/09/01 12:15, Andres Freund wrote:
> > On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> >> On 2021/09/01 0:53, Andres Freund wrote:
> >>> Of course, we need to be careful to not weaken WAL validity checking too
> >>> much.  How about the following:
> >>>
> >>> If we're "aborting" a continued record, we set XLP_FIRST_IS_ABORTED_PARTIAL on
> >>> the page at which we do so (i.e. the page after the valid end of the WAL).
> >>
> >> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set?  It's set
> >> when recovery finds a partially-flushed segment-spanning record?
> >> But maybe we cannot do that (i.e., cannot overwrite the page) because
> >> the page that the flag is set in might have already been archived.  No?
> >
> > I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the "tail
> > end" of a partial record.  I.e. if there's a partial record starting in the
> > successfully archived segment A, but the end of the record, in B, has not been
> > written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL at the
> > end of the valid data in B.  Which could not have been archived yet, or we'd
> > not have a partial record.  So we should never need to set the flag on an
> > already archived page.
>
> Thanks for clarifying that!  Unless I misunderstand it, when recovery ends
> at a partially-flushed segment-spanning record, it sets
> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting to write
> new WAL data there.  Therefore either XLP_FIRST_IS_CONTRECORD or
> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the segment following
> a partially-flushed segment-spanning record.  If neither of them is set,
> the validation code in recovery should report an error.
>
> Yes, this design looks good to me.

So, this is a crude PoC of that.

At the end of recovery:

- When XLogReadRecord misses a page where the next part of the current
  continuation record should be seen, xlogreader->ContRecAbortPtr is set
  to the beginning of the missing page.

- When StartupXLOG receives a valid ContRecAbortPtr, the value is used
  as the next WAL insertion location, and the same value is set to
  XLogCtl->contAbortedRecPtr.

- When XLogCtl->contAbortedRecPtr is set, AdvanceXLInsertBuffer() (called
  under XLogInsertRecord()) sets the XLP_FIRST_IS_ABORTED_PARTIAL flag on
  the page.

While in recovery:

- When XLogReadRecord meets an XLP_FIRST_IS_ABORT_PARTIAL page, it
  rereads a record from there.

In this PoC:

1. This patch is written on the current master, but it doesn't interfere
   with the seg-boundary-memorize patch since it removes the calls to
   RegisterSegmentBoundary.

2. Since xlogreader cannot emit a log message immediately, we don't have
   a means to leave a log message to inform that recovery met an aborted
   partial continuation record.  (In this PoC, it is done by fprintf :p)

3. Maybe we need pg_waldump to show partial continuation records, but
   I'm not sure how to realize that.

4. By the way, is this (method) applicable at this stage?

The first attachment is the PoC, including a debug-crash aid.  The
second is the repro script.  It fails to reproduce the situation once
per several trials.  The following log messages are shown by a run of
the script.

> ...
> TRAP: FailedAssertion("c++ < 1", File: "xlog.c", Line: 2675, PID: 254644)
> ...
> LOG:  database system is shut down
> ...
>
> LOG:  redo starts at 0/2000028
> LOG:  redo done at 0/6FFFFA8 system usage:...
> LOG:  #### Recovery finished: ContRecAbort: 0/7000000 (EndRecPtr: 0/6FFFFE8)

The record from 6FFFFE8 is missing the trailing part after 7000000.

> LOG:  #### EndOfLog=0/7000000
> LOG:  #### set XLP_FIRST_IS_ABORT_PARTIAL@0/7000000

So, WAL insertion starts from 7000000 and the first page is set the flag.

> LOG:  database system is ready to accept connections
> ...
> LOG:  database system is shut down
> ...
> #########################
> ...
> LOG:  redo starts at 0/2000028
> LOG:  consistent recovery state reached at 0/2000100
> ...
> LOG:  restored log file "000000010000000000000007" from archive
> #### aborted partial continuation record found at 0/6FFFFE8, continue from 0/7000000

The record from 6FFFFE8 is immature, so skip it and continue reading
from 7000000.

> LOG:  last completed transaction was at log time 2021-09-01 20:40:21.775295+09
> LOG:  #### Recovery finished: ContRecAbort: 0/0 (EndRecPtr: 0/8000000)
> LOG:  restored log file "000000010000000000000007" from archive
> LOG:  selected new timeline ID: 2
> LOG:  archive recovery complete
> LOG:  #### EndOfLog=0/8000000

Recovery ends.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03e..b0f18e4e5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -111,6 +111,7 @@
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
 bool		track_wal_io_timing = false;
+bool		contrec_aborted = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -586,6 +587,7 @@ typedef struct XLogCtlData
 	XLogRecPtr	replicationSlotMinLSN;	/* oldest LSN needed by any slot */
 
 	XLogSegNo	lastRemovedSegNo;	/* latest removed/recycled XLOG segment */
+	XLogRecPtr	contAbortedRecPtr;
 
 	/* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */
 	XLogRecPtr	unloggedLSN;
@@ -735,6 +737,10 @@ typedef struct XLogCtlData
 	XLogSegNo	latestSegBoundary;
 	XLogRecPtr	latestSegBoundaryEndPtr;
 
+	/* BEGIN: FOR DEBUGGING-CRASH USE */
+	bool		crossseg;
+	/* END: DEBUGGING-CRASH USE */
+
 	slock_t		segtrack_lck;	/* locks shared variables shown above */
 } XLogCtlData;
 
@@ -860,6 +866,7 @@ static XLogSource XLogReceiptSource = XLOG_FROM_ANY;
 /* State information for XLOG reading */
 static XLogRecPtr ReadRecPtr;	/* start of last record read */
 static XLogRecPtr EndRecPtr;	/* end+1 of last record read */
+static XLogRecPtr ContRecAbortPtr;	/* end+1 of last aborted contrec */
 
 /*
  * Local copies of equivalent fields in the control file. When running
@@ -1178,16 +1185,10 @@ XLogInsertRecord(XLogRecData *rdata,
 		XLByteToSeg(StartPos, StartSeg, wal_segment_size);
 		XLByteToSeg(EndPos, EndSeg, wal_segment_size);
 
-		/*
-		 * Register our crossing the segment boundary if that occurred.
-		 *
-		 * Note that we did not use XLByteToPrevSeg() for determining the
-		 * ending segment.  This is so that a record that fits perfectly into
-		 * the end of the segment causes the latter to get marked ready for
-		 * archival immediately.
-		 */
-		if (StartSeg != EndSeg && XLogArchivingActive())
-			RegisterSegmentBoundary(EndSeg, EndPos);
+		/* BEGIN: FOR DEBUGGING-CRASH USE */
+		if (StartSeg != EndSeg)
+			XLogCtl->crossseg = true;
+		/* END: FOR DEBUGGING-CRASH USE */
 
 		/*
 		 * Advance LogwrtRqst.Write so that it includes new block(s).
@@ -2292,6 +2293,27 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		if (!Insert->forcePageWrites)
 			NewPage->xlp_info |= XLP_BKP_REMOVABLE;
 
+		/*
+		 * If the last page ended with an aborted partial continuation record,
+		 * mark it to tell the parital record is omittable. Snice this happens
+		 * only at the end of crash recovery, no rece condition here.
+		 */
+		if (XLogCtl->contAbortedRecPtr >= NewPageBeginPtr)
+		{
+			if (XLogCtl->contAbortedRecPtr == NewPageBeginPtr)
+			{
+				NewPage->xlp_info |= XLP_FIRST_IS_ABORT_PARTIAL;
+				elog(LOG, "#### set XLP_FIRST_IS_ABORT_PARTIAL@%X/%X",
+					 LSN_FORMAT_ARGS(NewPageBeginPtr));
+			}
+			else
+				elog(LOG, "### incosistent abort location %X/%X, expected %X/%X",
+					 LSN_FORMAT_ARGS(XLogCtl->contAbortedRecPtr),
+					 LSN_FORMAT_ARGS(NewPageBeginPtr));
+
+			XLogCtl->contAbortedRecPtr = InvalidXLogRecPtr;
+		}
+
 		/*
 		 * If first page of an XLOG segment file, make it a long header.
 		 */
@@ -2644,6 +2666,17 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		{
 			issue_xlog_fsync(openLogFile, openLogSegNo);
 
+			/* BEGIN: FOR DEBUGGING-CRASH USE */
+			if (XLogCtl->crossseg)
+			{
+				static int	c = 0;
+				struct stat b;
+
+				if (stat("/tmp/hoge", &b) == 0)
+					Assert(c++ < 1);
+			}
+			/* END: FOR DEBUGGING-CRASH USE */
+
 			/* signal that we need to wakeup walsenders later */
 			WalSndWakeupRequest();
 
@@ -4568,6 +4601,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		record = XLogReadRecord(xlogreader, &errormsg);
 		ReadRecPtr = xlogreader->ReadRecPtr;
 		EndRecPtr = xlogreader->EndRecPtr;
+		ContRecAbortPtr = xlogreader->ContRecAbortPtr;
 		if (record == NULL)
 		{
 			if (readFile >= 0)
@@ -7873,12 +7907,26 @@ StartupXLOG(void)
 	StandbyMode = false;
 
 	/*
-	 * Re-fetch the last valid or last applied record, so we can identify the
-	 * exact endpoint of what we consider the valid portion of WAL.
+	 * The last record may be an immature continuation record at the end of a
+	 * page. We continue writing from ContRecAbortPtr instead of EndRecPtr that
+	 * case.
 	 */
-	XLogBeginRead(xlogreader, LastRec);
-	record = ReadRecord(xlogreader, PANIC, false);
-	EndOfLog = EndRecPtr;
+	elog(LOG, "#### Recovery finished: ContRecAbort: %X/%X (EndRecPtr: %X/%X)", LSN_FORMAT_ARGS(ContRecAbortPtr), LSN_FORMAT_ARGS(EndRecPtr));
+	if (XLogRecPtrIsInvalid(ContRecAbortPtr))
+	{
+		/*
+		 * Re-fetch the last valid or last applied record, so we can identify
+		 * the exact endpoint of what we consider the valid portion of WAL.
+		 */
+		XLogBeginRead(xlogreader, LastRec);
+		record = ReadRecord(xlogreader, PANIC, false);
+		EndOfLog = EndRecPtr;
+	}
+	else
+	{
+		EndOfLog = ContRecAbortPtr;
+		XLogCtl->contAbortedRecPtr = ContRecAbortPtr;
+	}
 
 	/*
 	 * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
@@ -8013,7 +8061,8 @@ StartupXLOG(void)
 	Insert = &XLogCtl->Insert;
 	Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
-
+	elog(LOG, "#### EndOfLog=%X/%X", LSN_FORMAT_ARGS(EndOfLog));
+
 	/*
 	 * Tricky point here: readBuf contains the *last* block that the LastRec
 	 * record spans, not the one it starts in.
	 * The last block is indeed the

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5cf74e181a..404db7ce4d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -294,6 +294,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
 	ResetDecoder(state);
 
+	state->ContRecAbortPtr = InvalidXLogRecPtr;
 	RecPtr = state->EndRecPtr;
 
 	if (state->ReadRecPtr != InvalidXLogRecPtr)
@@ -319,6 +320,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 		randAccess = true;
 	}
 
+retry:
 	state->currRecPtr = RecPtr;
 
 	targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
@@ -444,12 +446,27 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 											   XLOG_BLCKSZ));
 
 			if (readOff < 0)
-				goto err;
+				goto err_partial_contrec;
 
 			Assert(SizeOfXLogShortPHD <= readOff);
 
 			/* Check that the continuation on next page looks valid */
 			pageHeader = (XLogPageHeader) state->readBuf;
+			if (pageHeader->xlp_info & XLP_FIRST_IS_ABORT_PARTIAL)
+			{
+				if (pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD)
+				{
+					report_invalid_record(state,
+										  "both XLP_FIRST_IS_CONTRECORD and XLP_FIRST_IS_ABORT_PARTIAL are set at %X/%X",
+										  LSN_FORMAT_ARGS(RecPtr));
+					goto err;
+				}
+
+				fprintf(stderr, "#### aborted partial continuation record found at %X/%X, continue from %X/%X\n", LSN_FORMAT_ARGS(RecPtr), LSN_FORMAT_ARGS(targetPagePtr));
+				ResetDecoder(state);
+				RecPtr = targetPagePtr;
+				goto retry;
+			}
 			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
 			{
 				report_invalid_record(state,
@@ -550,6 +567,10 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 	else
 		return NULL;
 
+err_partial_contrec:
+	state->ContRecAbortPtr = targetPagePtr;
+	fprintf(stderr, "contrec aborted@%X/%X\n", LSN_FORMAT_ARGS(state->ContRecAbortPtr));
+
 err:
 
 	/*
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 3b5eceff65..6390812a5a 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -76,8 +76,10 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define XLP_LONG_HEADER				0x0002
 /* This flag indicates backup blocks starting in this page are optional */
 #define XLP_BKP_REMOVABLE			0x0004
+/* This flag indicates the first record in this page breaks a contrecord */
+#define XLP_FIRST_IS_ABORT_PARTIAL	0x0008
 /* All defined flag bits in xlp_info (used for validity checking of header) */
-#define XLP_ALL_FLAGS				0x0007
+#define XLP_ALL_FLAGS				0x000F
 
 #define XLogPageHeaderSize(hdr)		\
 	(((hdr)->xlp_info & XLP_LONG_HEADER) ?
	 SizeOfXLogLongPHD : SizeOfXLogShortPHD)
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 21d200d3df..00a03a628c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -175,6 +175,8 @@ struct XLogReaderState
 
 	XLogRecPtr	ReadRecPtr;		/* start of last record read */
 	XLogRecPtr	EndRecPtr;		/* end+1 of last record read */
+	XLogRecPtr	ContRecAbortPtr;	/* end+1 of aborted partial contrecord if
+									 * any */
 
 	/* ----------------------------------------
 	 * Decoded representation of current record
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce..26fc123cdb 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -76,6 +76,7 @@ typedef struct CheckPoint
 #define XLOG_END_OF_RECOVERY		0x90
 #define XLOG_FPI_FOR_HINT			0xA0
 #define XLOG_FPI					0xB0
+#define XLOG_ABORT_CONTRECORD		x0C0
 
 /*

PWD=`pwd`
DATA=data
BKUP=bkup
ARCH=$PWD/arch

rm -rf arch
mkdir arch
rm -rf $DATA
initdb -D $DATA
echo "restart_after_crash = off" >> $DATA/postgresql.conf
echo "archive_mode=on" >> $DATA/postgresql.conf
echo "archive_command='cp %p ${ARCH}/%f'" >> $DATA/postgresql.conf
echo "restart_after_crash = off" >> $DATA/postgresql.conf
rm /tmp/hoge
pg_ctl -D $DATA start
rm -rf $BKUP
pg_basebackup -D $BKUP -h /tmp
echo "archive_mode=off" >> $BKUP/postgresql.conf
echo "restore_command='cp ${ARCH}/%f %p'" >> $BKUP/postgresql.conf
touch bkup/recovery.signal
psql -c 'create table t(a int); insert into t (select a from generate_series(0, 600000) a)'
touch /tmp/hoge
psql -c 'insert into t (select a from generate_series(0, 600000) a)'
rm /tmp/hoge
sleep 5
pg_ctl -D $DATA -w start
psql -c 'checkpoint'
pg_ctl -D $DATA -w stop
echo "#########################"
pg_ctl -D $BKUP -w start
sleep 10
pg_ctl -D $BKUP -w stop
At Thu, 2 Sep 2021 18:43:33 -0400, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in 
> On 2021-Sep-02, Kyotaro Horiguchi wrote:
>
> > So, this is a crude PoC of that.
>
> I had ended up with something very similar, except I was trying to cram
> the flag via the checkpoint record instead of hacking
> AdvanceXLInsertBuffer().  I removed that stuff and merged both, here's
> the result.
>
> > 1. This patch is written on the current master, but it doesn't
> >    interfere with the seg-boundary-memorize patch since it removes the
> >    calls to RegisterSegmentBoundary.
>
> I rebased on top of the revert patch.

Thanks!

> > 2. Since xlogreader cannot emit a log message immediately, we don't
> >    have a means to leave a log message to inform that recovery met an
> >    aborted partial continuation record.  (In this PoC, it is done by
> >    fprintf :p)
>
> Shrug.  We can just use an #ifndef FRONTEND / elog(LOG).  (I didn't keep
> this part, sorry.)

No problem, it was merely a development-time message for behavior
observation.

> > 3. Maybe we need pg_waldump to show partial continuation records,
> >    but I'm not sure how to realize that.
>
> Ah yes, we'll need to fix that.

I just believe 0001 does the right thing.

0002:

> +	XLogRecPtr	abortedContrecordPtr; /* LSN of incomplete record at end of
> +										 * WAL */

The name sounds like the start LSN.  Doesn't contrecordAbort(ed)Ptr
work?

>  		if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
>  		{
>  			report_invalid_record(state,
>  								  "there is no contrecord flag at %X/%X",
>  								  LSN_FORMAT_ARGS(RecPtr));
> -			goto err;
> +			goto aborted_contrecord;

This loses the exclusion check between XLP_FIRST_IS_CONTRECORD and
_IS_ABORTED_PARTIAL.  Is that okay?  (I don't object to removing the
check.)  I hadn't thought of this as an aborted contrecord, but on
second thought, when we see a record broken in any style, we stop
recovery at that point.  I agree with the change and all the similar
changes.

> +			/* XXX should we goto aborted_contrecord here? */

I think it should be aborted_contrecord.  When that happens, the loaded
bytes actually looked like the first fragment of a continuation record
to xlogreader, even if the cause was a broken total_len.  So if we abort
the record there, the next time xlogreader will meet
XLP_FIRST_IS_ABORTED_PARTIAL at the same page and correctly find a new
record there.  On the other hand, if we just errored out there, we would
step back to the beginning of the broken record in the previous page or
segment and then start writing a new record there, but that is exactly
what we want to avoid now.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 2021-09-03 12:55:23 -0400, Alvaro Herrera wrote:
> Oh, but of course we can't modify XLogReaderState in backbranches to add
> the new struct member abortedContrecordPtr (or whatever we end up naming
> that.)

Why is that?  Afaict it's always allocated via XLogReaderAllocate(), so
adding a new field at the end should be fine?

That said, I'm worried that this stuff is too complicated to get right
in the backbranches.  I suspect letting it stew in master for a while
before backpatching would be a good move.

Greetings,

Andres Freund
On 2021-Sep-03, Andres Freund wrote: > Hi, > > On 2021-09-03 12:55:23 -0400, Alvaro Herrera wrote: > > Oh, but of course we can't modify XLogReaderState in backbranches to add > > the new struct member abortedContrecordPtr (or whatever we end up naming > > that.) > > Why is that? Afaict it's always allocated via XLogReaderAllocate(), so adding > a new field at the end should be fine? Hmm, true, that works. > That said, I'm worried that this stuff is too complicated to get right in the > backbranches. I suspect letting it stew in master for a while before > backpatching would be a good move. Sure, we can put it in master now and backpatch before the November minors if everything goes well. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
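To illustrate the point about back branches, here is a minimal stand-alone sketch (the type and function names are invented for this example, not PostgreSQL's) of why appending a member to a struct that only the library ever allocates is binary compatible: existing members keep their offsets, and callers never compute sizeof() themselves.

#include <stdint.h>
#include <stdlib.h>

/* Pre-existing members keep their offsets; the new one is appended. */
typedef struct DemoReaderState
{
    uint64_t ReadRecPtr;            /* existing field */
    uint64_t EndRecPtr;             /* existing field */
    uint64_t abortedContrecordPtr;  /* hypothetical new field, added last */
} DemoReaderState;

/* Callers only ever obtain the struct from the library's allocator... */
DemoReaderState *
DemoReaderAllocate(void)
{
    return (DemoReaderState *) calloc(1, sizeof(DemoReaderState));
}

/*
 * ...so code compiled against the shorter struct still works: it only
 * dereferences the original members, whose offsets are unchanged.
 */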
Hi, On 2021-09-03 20:01:50 -0400, Alvaro Herrera wrote: > From 6abc5026f92b99d704bff527d1306eb8588635e9 Mon Sep 17 00:00:00 2001 > From: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: Tue, 31 Aug 2021 20:55:10 -0400 > Subject: [PATCH v3 1/5] Revert "Avoid creating archive status ".ready" files > too early" > This reverts commit 515e3d84a0b58b58eb30194209d2bc47ed349f5b. I'd prefer to see this pushed soon. I've a bunch of patches to xlog.c that conflict with the prior changes, and rebasing back and forth isn't that much fun... > From f767cdddb3120f1f6c079c8eb00eaff38ea98c79 Mon Sep 17 00:00:00 2001 > From: Alvaro Herrera <alvherre@alvh.no-ip.org> > Date: Thu, 2 Sep 2021 17:21:46 -0400 > Subject: [PATCH v3 2/5] Implement FIRST_IS_ABORTED_CONTRECORD > > --- > src/backend/access/transam/xlog.c | 53 +++++++++++++++++++++++-- > src/backend/access/transam/xlogreader.c | 39 +++++++++++++++++- > src/include/access/xlog_internal.h | 14 ++++++- > src/include/access/xlogreader.h | 3 ++ > 4 files changed, 103 insertions(+), 6 deletions(-) > > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c > index e51a7a749d..411f1618df 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c > @@ -586,6 +586,8 @@ typedef struct XLogCtlData > XLogRecPtr replicationSlotMinLSN; /* oldest LSN needed by any slot */ > > XLogSegNo lastRemovedSegNo; /* latest removed/recycled XLOG segment */ > + XLogRecPtr abortedContrecordPtr; /* LSN of incomplete record at end of > + * WAL */ > > /* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */ > XLogRecPtr unloggedLSN; > @@ -848,6 +850,7 @@ static XLogSource XLogReceiptSource = XLOG_FROM_ANY; > /* State information for XLOG reading */ > static XLogRecPtr ReadRecPtr; /* start of last record read */ > static XLogRecPtr EndRecPtr; /* end+1 of last record read */ > +static XLogRecPtr abortedContrecordPtr; /* end+1 of incomplete record */ > > /* > * Local copies of equivalent fields in the control file. When running > @@ -2246,6 +2249,30 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic) > if (!Insert->forcePageWrites) > NewPage->xlp_info |= XLP_BKP_REMOVABLE; > > + /* > + * If the last page ended with an aborted partial continuation record, > + * mark the new page to indicate that the partial record can be > + * omitted. > + * > + * This happens only once at the end of recovery, so there's no race > + * condition here. > + */ > + if (XLogCtl->abortedContrecordPtr >= NewPageBeginPtr) > + { Can we move this case out of AdvanceXLInsertBuffer()? As the comment says, this only happens at the end of recovery, so putting it into AdvanceXLInsertBuffer() doesn't really seem necessary? > +#ifdef WAL_DEBUG > + if (XLogCtl->abortedContrecordPtr != NewPageBeginPtr) > + elog(PANIC, "inconsistent aborted contrecord location %X/%X, expected %X/%X", > + LSN_FORMAT_ARGS(XLogCtl->abortedContrecordPtr), > + LSN_FORMAT_ARGS(NewPageBeginPtr)); > + ereport(LOG, > + (errmsg_internal("setting XLP_FIRST_IS_ABORTED_PARTIAL flag at %X/%X", > + LSN_FORMAT_ARGS(NewPageBeginPtr)))); > +#endif > + NewPage->xlp_info |= XLP_FIRST_IS_ABORTED_PARTIAL; > + > + XLogCtl->abortedContrecordPtr = InvalidXLogRecPtr; > + } > /* > * If first page of an XLOG segment file, make it a long header. 
> */ > @@ -4392,6 +4419,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, > record = XLogReadRecord(xlogreader, &errormsg); > ReadRecPtr = xlogreader->ReadRecPtr; > EndRecPtr = xlogreader->EndRecPtr; > + abortedContrecordPtr = xlogreader->abortedContrecordPtr; > if (record == NULL) > { > if (readFile >= 0) > @@ -7691,10 +7719,29 @@ StartupXLOG(void) > /* > * Re-fetch the last valid or last applied record, so we can identify the > * exact endpoint of what we consider the valid portion of WAL. > + * > + * When recovery ended in an incomplete record, continue writing from the > + * point where it went missing. This leaves behind an initial part of > + * broken record, which rescues downstream which have already received > + * that first part. > */ > - XLogBeginRead(xlogreader, LastRec); > - record = ReadRecord(xlogreader, PANIC, false); > - EndOfLog = EndRecPtr; > + if (XLogRecPtrIsInvalid(abortedContrecordPtr)) > + { > + XLogBeginRead(xlogreader, LastRec); > + record = ReadRecord(xlogreader, PANIC, false); > + EndOfLog = EndRecPtr; > + } > + else > + { > +#ifdef WAL_DEBUG > + ereport(LOG, > + (errmsg_internal("recovery overwriting broken contrecord at %X/%X (EndRecPtr: %X/%X)", > + LSN_FORMAT_ARGS(abortedContrecordPtr), > + LSN_FORMAT_ARGS(EndRecPtr)))); > +#endif "broken" sounds a bit off. But then, it's just WAL_DEBUG. Which made me realize, isn't this missing a if (XLOG_DEBUG)? > @@ -442,14 +448,28 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) > readOff = ReadPageInternal(state, targetPagePtr, > Min(total_len - gotlen + SizeOfXLogShortPHD, > XLOG_BLCKSZ)); > - > if (readOff < 0) > goto err; > > Assert(SizeOfXLogShortPHD <= readOff); > > - /* Check that the continuation on next page looks valid */ > pageHeader = (XLogPageHeader) state->readBuf; > + > + /* > + * If we were expecting a continuation record and got an "aborted > + * partial" flag, that means the continuation record was lost. > + * Ignore the record we were reading, since we now know it's broken > + * and lost forever, and restart the read by assuming the address > + * to read is the location where we found this flag. > + */ > + if (pageHeader->xlp_info & XLP_FIRST_IS_ABORTED_PARTIAL) > + { > + ResetDecoder(state); > + RecPtr = targetPagePtr; > + goto restart; > + } I think we need to add more validation to this path. What I was proposing earlier is that we add a new special type of record at the start of an XLP_FIRST_IS_ABORTED_PARTIAL page, which contains a) lsn of the record we're aborting, b) checksum of the data up to this point. Greetings, Andres Freund
On 2021-Sep-03, Andres Freund wrote: > Hi, > > On 2021-09-03 20:01:50 -0400, Alvaro Herrera wrote: > > From 6abc5026f92b99d704bff527d1306eb8588635e9 Mon Sep 17 00:00:00 2001 > > From: Alvaro Herrera <alvherre@alvh.no-ip.org> > > Date: Tue, 31 Aug 2021 20:55:10 -0400 > > Subject: [PATCH v3 1/5] Revert "Avoid creating archive status ".ready" files > > too early" > > > This reverts commit 515e3d84a0b58b58eb30194209d2bc47ed349f5b. > > I'd prefer to see this pushed soon. I've a bunch of patches to xlog.c that > conflict with the prior changes, and rebasing back and forth isn't that much > fun... Done. > > + /* > > + * If we were expecting a continuation record and got an "aborted > > + * partial" flag, that means the continuation record was lost. > > + * Ignore the record we were reading, since we now know it's broken > > + * and lost forever, and restart the read by assuming the address > > + * to read is the location where we found this flag. > > + */ > > + if (pageHeader->xlp_info & XLP_FIRST_IS_ABORTED_PARTIAL) > > + { > > + ResetDecoder(state); > > + RecPtr = targetPagePtr; > > + goto restart; > > + } > > I think we need to add more validation to this path. What I was proposing > earlier is that we add a new special type of record at the start of an > XLP_FIRST_IS_ABORTED_PARTIAL page, which contains a) lsn of the record we're > aborting, b) checksum of the data up to this point. Hmm, a new record type? Yeah, we can do that, and sounds like it would make things simpler too -- I wasn't too happy about adding clutter to AdvanceXLInsertBuffer either, but the alternative I had in mind was that we'd pass the flag to the checkpointer, which seemed quite annoying API wise. But if we add a new record, seems we can write it directly in StartupXLOG and avoid most nastiness. I'm not too sure about the checksum up to this point, though, but I'll spend some time with the idea to see how bad it is. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ Voy a acabar con todos los humanos / con los humanos yo acabaré voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)
On 2021-Sep-03, Andres Freund wrote: > > +#ifdef WAL_DEBUG > > + ereport(LOG, > > + (errmsg_internal("recovery overwriting broken contrecord at %X/%X (EndRecPtr: %X/%X)", > > + LSN_FORMAT_ARGS(abortedContrecordPtr), > > + LSN_FORMAT_ARGS(EndRecPtr)))); > > +#endif > > "broken" sounds a bit off. But then, it's just WAL_DEBUG. Which made me > realize, isn't this missing a > if (XLOG_DEBUG)? Attached are the same patches as last night, except I added a test for XLOG_DEBUG where pertinent. (The elog(PANIC) is not made conditional on that, since it's a cross-check rather than informative.) Also fixed the silly pg_rewind mistake I made. I'll work on the new xlog record early next week. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "I can see support will not be a problem. 10 out of 10." (Simon Wittber) (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
Attachment
On 9/4/21, 10:26 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > Attached are the same patches as last night, except I added a test for > XLOG_DEBUG where pertinent. (The elog(PANIC) is not made conditional on > that, since it's a cross-check rather than informative.) Also fixed the > silly pg_rewind mistake I made. > > I'll work on the new xlog record early next week. Are these patches in a good state for some preliminary testing? I'd like to try them out, but I'll hold off if they're not quite ready yet. Nathan
At Tue, 7 Sep 2021 18:41:57 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in
> On 9/4/21, 10:26 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote:
> > Attached are the same patches as last night, except I added a test for
> > XLOG_DEBUG where pertinent. (The elog(PANIC) is not made conditional on
> > that, since it's a cross-check rather than informative.) Also fixed the
> > silly pg_rewind mistake I made.
> >
> > I'll work on the new xlog record early next week.
>
> Are these patches in a good state for some preliminary testing? I'd
> like to try them out, but I'll hold off if they're not quite ready
> yet.

Thanks! As I understand it, the new record adds the ability to cross-check between a torn-off contrecord and the new record inserted after the torn-off record. I didn't test this version myself, but the previous version implemented the essential machinery and that won't change fundamentally with the new record.

So I think the current patch deserves testing to see whether the algorithm actually works against the problem.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 14 Sep 2021 22:32:04 -0300, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in
> On 2021-Sep-14, Alvaro Herrera wrote:
>
> > On 2021-Sep-08, Kyotaro Horiguchi wrote:
> >
> > > Thanks! As I understand it, the new record adds the ability to
> > > cross-check between a torn-off contrecord and the new record inserted
> > > after the torn-off record. I didn't test this version myself, but
> > > the previous version implemented the essential machinery and that
> > > won't change fundamentally with the new record.
> > >
> > > So I think the current patch deserves testing to see whether the
> > > algorithm actually works against the problem.
> >
> > Here's a version with the new record type. It passes check-world, and
> > it seems to work correctly to prevent overwrite of the tail end of a
> > segment containing a broken record. This is very much WIP still;
> > comments are missing and I haven't tried to implement any sort of
> > verification that the record being aborted is the right one.
>
> Here's the attachment I forgot earlier.

(I missed the chance to complain about that :p) Thanks for the patch!

- CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
-                     StartPos, EndPos);
+ CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch,
+                     flags & XLOG_SET_ABORTED_PARTIAL,
+                     rdata, StartPos, EndPos);

The new xlog flag XLOG_SET_ABORTED_PARTIAL is used only by RM_XLOG_ID/XLOG_OVERWRITE_CONTRECORD records, so the flag value is the equivalent of the record type. We might instead want a new flag XLOG_SPECIAL_TREATED_RECORD or something to quickly distinguish records that need special treatment, like XLOG_SWITCH:

if (flags & XLOG_SPECIAL_TREATED_RECORD)
{
    if (rechdr->xl_rmid == RM_XLOG_ID)
    {
        if (info == XLOG_SWITCH)
            isLogSwitch = true;
        if (info == XLOG_OVERWRITE_CONTRECORD)
            isOverwrite = true;
    }
}
..
CopyXLogRecordToWAL(.., isLogSwitch, isOverwrite, rdata, StartPos, EndPos);

+ /* XXX can only happen once in the loop. Verify? */
+ if (set_aborted_partial)
+     pagehdr->xlp_info |= XLP_FIRST_IS_ABORTED_PARTIAL;
+

I'm not sure about the reason for the change from the previous patch (I might be missing something): this sets the flag on the page *after* the one where the record starts, and in the first place we shouldn't set the flag there. The page header flags of the first page are set by AdvanceXLInsertBuffer. If we want to set the flag in this function, we need to find the page header for the beginning of the record and make sure that the record is placed at the beginning of the page. (That is the reason I did that in AdvanceXLInsertBuffer.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On 9/17/21, 9:37 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > OK, this version is much more palatable, because here we verify that the > OVERWRITE_CONTRECORD we replay matches the record that was lost. Also, > I wrote a test script that creates such a broken record (by the simple > expedient of deleting the WAL file containing the second half while the > server is down); we then create a standby and we can observe that it > replays the sequence correctly. > > If you have some time to try your reproducers with this new proposed > fix, I would appreciate it. I haven't had a chance to look at the patch yet, but it appears to fix things with my original reproduction steps for the archive_status stuff [0]. Nathan [0] https://postgr.es/m/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com
On 2021-Sep-17, Bossart, Nathan wrote: > On 9/17/21, 9:37 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > > If you have some time to try your reproducers with this new proposed > > fix, I would appreciate it. > > I haven't had a chance to look at the patch yet, but it appears to fix > things with my original reproduction steps for the archive_status > stuff [0]. Thank you, this is good to hear. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Ed is the standard text editor." http://groups.google.com/group/alt.religion.emacs/msg/8d94ddab6a9b0ad3
On 9/17/21, 10:35 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > On 2021-Sep-17, Bossart, Nathan wrote: > >> On 9/17/21, 9:37 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > >> > If you have some time to try your reproducers with this new proposed >> > fix, I would appreciate it. >> >> I haven't had a chance to look at the patch yet, but it appears to fix >> things with my original reproduction steps for the archive_status >> stuff [0]. > > Thank you, this is good to hear. I gave the patch a read-through. I'm wondering if the XLOG_OVERWRITE_CONTRECORD records are actually necessary. IIUC we will set XLP_FIRST_IS_ABORTED_PARTIAL on the next page, and xlp_pageaddr on that page will already be validated in XLogReaderValidatePageHeader(). Does adding this new record also help ensure the page header is filled in and flushed to disk? Nathan
On 2021-Sep-17, Bossart, Nathan wrote: > I gave the patch a read-through. I'm wondering if the > XLOG_OVERWRITE_CONTRECORD records are actually necessary. IIUC we > will set XLP_FIRST_IS_ABORTED_PARTIAL on the next page, and > xlp_pageaddr on that page will already be validated in > XLogReaderValidatePageHeader(). Does adding this new record also help > ensure the page header is filled in and flushed to disk? That was the first implementation, a few versions of the patch ago. An added benefit of a separate WAL record is that you can carry additional data for validation, such as -- as suggested by Andres -- the CRC of the partial data contained in the message that we're skipping. I didn't implement that, but it should be trivial to add it. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "How strange it is to find the words "Perl" and "saner" in such close proximity, with no apparent sense of irony. I doubt that Larry himself could have managed it." (ncm, http://lwn.net/Articles/174769/)
On 9/17/21, 1:32 PM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > On 2021-Sep-17, Bossart, Nathan wrote: > >> I gave the patch a read-through. I'm wondering if the >> XLOG_OVERWRITE_CONTRECORD records are actually necessary. IIUC we >> will set XLP_FIRST_IS_ABORTED_PARTIAL on the next page, and >> xlp_pageaddr on that page will already be validated in >> XLogReaderValidatePageHeader(). Does adding this new record also help >> ensure the page header is filled in and flushed to disk? > > That was the first implementation, a few versions of the patch ago. An > added benefit of a separate WAL record is that you can carry additional > data for validation, such as -- as suggested by Andres -- the CRC of the > partial data contained in the message that we're skipping. I didn't > implement that, but it should be trivial to add it. I see. IMO feels a bit counterintuitive to validate a partial record that you are ignoring anyway, but I suppose it's still valuable to know when the WAL is badly broken. It's not expensive, and it doesn't add a ton of complexity, either. Nathan
On 2021-Sep-17, Bossart, Nathan wrote:

> > That was the first implementation, a few versions of the patch ago. An
> > added benefit of a separate WAL record is that you can carry additional
> > data for validation, such as -- as suggested by Andres -- the CRC of the
> > partial data contained in the message that we're skipping. I didn't
> > implement that, but it should be trivial to add it.
>
> I see. IMO feels a bit counterintuitive to validate a partial record
> that you are ignoring anyway, but I suppose it's still valuable to
> know when the WAL is badly broken. It's not expensive, and it doesn't
> add a ton of complexity, either.

Yeah, we don't have any WAL record history validation other than verifying the LSN of the previous record; I suppose in this particular case you could argue that we shouldn't bother with any validation either. But it seems safer to do it. It doesn't hurt anything anyway.

-- 
Álvaro Herrera         Valdivia, Chile  —  https://www.EnterpriseDB.com/
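If the CRC idea is pursued, the accumulation itself would be straightforward with the existing CRC32C macros from src/include/port/pg_crc32c.h. The sketch below assumes the already-written fragments of the aborted record are available as an array, which is an assumption of the example rather than something xlogreader exposes in this form.

#include "postgres.h"
#include "port/pg_crc32c.h"

/*
 * Accumulate a CRC over the pieces of the partial record that were written
 * before the continuation went missing.
 */
static pg_crc32c
crc_of_partial_record(const char *fragments[], const Size lens[], int nfrags)
{
    pg_crc32c   crc;

    INIT_CRC32C(crc);
    for (int i = 0; i < nfrags; i++)
        COMP_CRC32C(crc, fragments[i], lens[i]);
    FIN_CRC32C(crc);

    return crc;
}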
I made one final pass over the whole thing to be sure it's all commented as thoroughly as possible, and changed the names of things to make them all consistent. So here's the last version which I intend to push to all branches soon. (The only difference in back-branches is that the xlogreader struct needs to be adjusted to have the new fields at the bottom.) One thing to note is that this is an un-downgradeable minor; and of course people should upgrade standbys before primaries. On 2021-Sep-17, Alvaro Herrera wrote: > On 2021-Sep-17, Bossart, Nathan wrote: > > I see. IMO feels a bit counterintuitive to validate a partial record > > that you are ignoring anyway, but I suppose it's still valuable to > > know when the WAL is badly broken. It's not expensive, and it doesn't > > add a ton of complexity, either. > > Yeah, we don't have any WAL record history validation other than the > verifying the LSN of the previous record; I suppose in this particular > case you could argue that we shouldn't bother with any validation > either. But it seems safer to do it. It doesn't hurt anything anyway. BTW we do validate the CRC for all records, but we don't have any means to validate the CRC of a partial record; so in theory if we don't add CRC validation of the ignored contents, there would be no validation at all for that record. I don't think we care, but it's true that there would be a big blob in WAL that we don't really know anything about. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ Thou shalt study thy libraries and strive not to reinvent them without cause, that thy code may be short and readable and thy days pleasant and productive. (7th Commandment for C Programmers)
Attachment
This took some time because backpatching the tests was more work than I anticipated -- function name changes, operators that don't exist, definition of the WAL segment size in pg_settings. I had to remove the second test in branches 13 and earlier due to lack of LSN+bytes operator. Fortunately, the first test (which is not as clean) still works all the way back. However, I notice now that the pg_rewind tests reproducibly fail in branch 14 for reasons I haven't yet understood. It's strange that no other branch fails, even when run quite a few times. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
On 2021-Sep-23, Alvaro Herrera wrote: > However, I notice now that the pg_rewind tests reproducibly fail in > branch 14 for reasons I haven't yet understood. It's strange that no > other branch fails, even when run quite a few times. Turns out that this is a real bug (setting EndOfLog seems insufficient). I'm looking into it. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "No necesitamos banderas No reconocemos fronteras" (Jorge González)
Hi Alvaro, I just started reading this thread, but maybe you can confirm or refute my understanding of what was done. In the first email you write > As mentioned in the course of thread [1], we're missing a fix for streaming replication to avoid sending records that the primary hasn't fully flushed yet. This patch is a first attempt at fixing that problem by retreating the LSN reported as FlushPtr whenever a segment is registered, based on the understanding that if no registration exists then the LogwrtResult.Flush pointer can be taken at face value; but if a registration exists, then we have to stream only till the start LSN of that registered entry. So did we end up holding back the wal_sender to not send anything that is not confirmed as flushed on master Are there measurements on how much this slows down replication compared to allowing sending the moment it is written in buffers but not necessarily flushed locally ? Did we investigate possibility of sending as fast as possible and controlling the flush synchronisation by sending separate flush pointers *both* ways ? And maybe there was even an alternative considered where we are looking at a more general Durability, for example 2-out-of-3 where primary is one of the 3 and not necessarily the most durable one? ----- Hannu Krosing Google Cloud - We have a long list of planned contributions and we are hiring. Contact me if interested. On Fri, Sep 24, 2021 at 4:33 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2021-Sep-23, Alvaro Herrera wrote: > > > However, I notice now that the pg_rewind tests reproducibly fail in > > branch 14 for reasons I haven't yet understood. It's strange that no > > other branch fails, even when run quite a few times. > > Turns out that this is a real bug (setting EndOfLog seems insufficient). > I'm looking into it. > > -- > Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ > "No necesitamos banderas > No reconocemos fronteras" (Jorge González) > >
On 2021-Sep-23, Alvaro Herrera wrote: > On 2021-Sep-23, Alvaro Herrera wrote: > > > However, I notice now that the pg_rewind tests reproducibly fail in > > branch 14 for reasons I haven't yet understood. It's strange that no > > other branch fails, even when run quite a few times. > > Turns out that this is a real bug (setting EndOfLog seems insufficient). > I'm looking into it. I had misdiagnosed it; the real problem is that this was taking action in standby mode and breaking things after promotion. I took quite a detour ... Here's the set for all branches, which I think are really final, in case somebody wants to play and reproduce their respective problem scenarios. Nathan already confirmed that his reproducer no longer shows a problem, and this version shouldn't affect that. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte" (Ijon Tichy en Viajes, Stanislaw Lem)
Attachment
- 0001-Fix-WAL-replay-in-presence-of-an-incomplete-record.patch
- v10-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
- v11-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
- v12-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
- v13-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
- v14-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
- v96-0001-Fix-WAL-replay-in-presence-of-an-incomplete-reco.patch
On 2021-Sep-24, Alvaro Herrera wrote: > Here's the set for all branches, which I think are really final, in case > somebody wants to play and reproduce their respective problem scenarios. I forgot to mention that I'll wait until 14.0 is tagged before getting anything pushed. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ Thou shalt check the array bounds of all strings (indeed, all arrays), for surely where thou typest "foo" someone someday shall type "supercalifragilisticexpialidocious" (5th Commandment for C programmers)
On 2021-Sep-24, Alvaro Herrera wrote: > Here's the set for all branches, which I think are really final, in case > somebody wants to play and reproduce their respective problem scenarios. > Nathan already confirmed that his reproducer no longer shows a problem, > and this version shouldn't affect that. Pushed. Watching for buildfarm fireworks now. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
On 2021-Sep-24, Hannu Krosing wrote: Hi Hannu > In the first email you write > > > As mentioned in the course of thread [1], we're missing a fix for > streaming replication to avoid sending records that the primary hasn't > fully flushed yet. This patch is a first attempt at fixing that problem > by retreating the LSN reported as FlushPtr whenever a segment is > registered, based on the understanding that if no registration exists > then the LogwrtResult.Flush pointer can be taken at face value; but if a > registration exists, then we have to stream only till the start LSN of > that registered entry. > > So did we end up holding back the wal_sender to not send anything that > is not confirmed as flushed on master No. We eventually realized that that approach was a dead end, so I abandoned the whole thing and attacked the problem differently. So your other questions don't apply. I tried to make the commit message explain both the problem and the solution in as much detail as possible; please have a look at that and let me know if something is unclear. Thanks -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El miedo atento y previsor es la madre de la seguridad" (E. Burke)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Pushed. Watching for buildfarm fireworks now.

jacana didn't like it:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2021-09-29%2018%3A25%3A53

The relevant info seems to be

# Running: pg_basebackup -D /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup -h127.0.0.1 -p 59502 --checkpoint fast --no-sync
pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "pgrunner", no encryption
Bail out! system pg_basebackup failed

which looks like a pretty straightforward bogus-connection-configuration problem, except why wouldn't other BF members show it?

			regards, tom lane
On 9/29/21 4:33 PM, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: >> Pushed. Watching for buildfarm fireworks now. > jacana didn't like it: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2021-09-29%2018%3A25%3A53 > > The relevant info seems to be > > # Running: pg_basebackup -D /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup -h127.0.0.1 -p 59502 --checkpoint fast --no-sync > pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: FATAL: no pg_hba.conf entry for replicationconnection from host "127.0.0.1", user "pgrunner", no encryption > Bail out! system pg_basebackup failed > > which looks like a pretty straightforward bogus-connection-configuration > problem, except why wouldn't other BF members show it? > > This: # Second test: a standby that receives WAL via archive/restore commands. $node = PostgresNode->new('primary2'); $node->init( has_archiving => 1, extra => ['--wal-segsize=1']); doesn't have "allows_streaming => 1". cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On 2021-Sep-29, Andrew Dunstan wrote:

> > The relevant info seems to be
> >
> > # Running: pg_basebackup -D /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup -h127.0.0.1 -p 59502 --checkpoint fast --no-sync
> > pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "pgrunner", no encryption
> > Bail out! system pg_basebackup failed
> >
> > which looks like a pretty straightforward bogus-connection-configuration
> > problem, except why wouldn't other BF members show it?
>
> This:
>
> # Second test: a standby that receives WAL via archive/restore commands.
> $node = PostgresNode->new('primary2');
> $node->init(
>     has_archiving => 1,
>     extra => ['--wal-segsize=1']);
>
> doesn't have "allows_streaming => 1".

Hmm, but I omitted allows_streaming on purpose -- I only wanted archiving, not streaming. I understand that your point is that set_replication_conf is not called unless allows_streaming is set.

So, do we take the stance that we have no right to expect pg_basebackup to work if we didn't pass allow_streaming => 1? If so, the fix is to add it. But my preferred fix would be to call set_replication_conf if either allows_streaming or has_archiving are given.

Another easy fix would be to call $primary2->set_replication_conf in the test file, but then you'd complain that that's supposed to be an internal method :-)

(This reminds me that I had to add something that seemed like it should have been unnecessary: wal_level=replica should become set if I request archiving, right? Otherwise the WAL archive is useless. I also had to add max_wal_senders=2 so that pg_basebackup would work, but I'm on the fence about setting that automatically if has_archiving is given.)

-- 
Álvaro Herrera         Valdivia, Chile  —  https://www.EnterpriseDB.com/
"At least to kernel hackers, who really are human, despite occasional rumors to the contrary" (LWN.net)
Andrew Dunstan <andrew@dunslane.net> writes: > On 9/29/21 4:33 PM, Tom Lane wrote: >> which looks like a pretty straightforward bogus-connection-configuration >> problem, except why wouldn't other BF members show it? > This: > ... > doesn't have "allows_streaming => 1". Oh, and that only breaks things on Windows, cf set_replication_conf. ... although I wonder why the fact that sub init otherwise sets wal_level = minimal doesn't cause a problem for this test. Maybe the test script is cowboy-ishly overriding that? regards, tom lane
On 2021-Sep-29, Tom Lane wrote: > ... although I wonder why the fact that sub init otherwise sets > wal_level = minimal doesn't cause a problem for this test. > Maybe the test script is cowboy-ishly overriding that? It is. (I claim that it should be set otherwise when has_archiving=>1). -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte" (Ijon Tichy en Viajes, Stanislaw Lem)
On 9/29/21 5:27 PM, Alvaro Herrera wrote:
> On 2021-Sep-29, Andrew Dunstan wrote:
>
>>> The relevant info seems to be
>>>
>>> # Running: pg_basebackup -D /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup -h127.0.0.1 -p 59502 --checkpoint fast --no-sync
>>> pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: FATAL: no pg_hba.conf entry for replication connection from host "127.0.0.1", user "pgrunner", no encryption
>>> Bail out! system pg_basebackup failed
>>>
>>> which looks like a pretty straightforward bogus-connection-configuration
>>> problem, except why wouldn't other BF members show it?
>> This:
>>
>> # Second test: a standby that receives WAL via archive/restore commands.
>> $node = PostgresNode->new('primary2');
>> $node->init(
>>     has_archiving => 1,
>>     extra => ['--wal-segsize=1']);
>>
>> doesn't have "allows_streaming => 1".
> Hmm, but I omitted allows_streaming on purpose -- I only wanted
> archiving, not streaming. I understand that your point is that
> set_replication_conf is not called unless allows_streaming is set.
>
> So, do we take the stance that we have no right to expect pg_basebackup
> to work if we didn't pass allow_streaming => 1? If so, the fix is to
> add it. But my preferred fix would be to call set_replication_conf if
> either allows_streaming or has_archiving are given.
>
> Another easy fix would be to call $primary2->set_replication_conf in the
> test file, but then you'd complain that that's supposed to be an
> internal method :-)

It claims that but it's also used here: src/bin/pg_basebackup/t/010_pg_basebackup.pl

(Also, good perl style would start purely internal method names with an underscore.)

cheers

andrew

-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 9/29/21 5:27 PM, Alvaro Herrera wrote: >> So, do we take the stance that we have no right to expect pg_basebackup >> to work if we didn't pass allow_streaming => 1? If so, the fix is to >> add it. But my preferred fix would be to call set_replication_conf if >> either allows_streaming or has_archiving are given. >> >> Another easy fix would be to call $primary2->set_replication_conf in the >> test file, but then you'd complain that that's supposed to be an >> internal method :-) > It claims that but it's also used here: > src/bin/pg_basebackup/t/010_pg_basebackup.pl Given that precedent, it seems like calling set_replication_conf is a good quick-fix for getting the buildfarm green again. But +1 for then refactoring this to get rid of these hacks (both with respect to pg_hba.conf and the postgresql.conf parameters) in both of these tests. regards, tom lane
Hi,

On 2021-09-29 17:04:48 -0400, Andrew Dunstan wrote:
> On 9/29/21 4:33 PM, Tom Lane wrote:
> # Second test: a standby that receives WAL via archive/restore commands.
> $node = PostgresNode->new('primary2');
> $node->init(
>     has_archiving => 1,
>     extra => ['--wal-segsize=1']);
>
> doesn't have "allows_streaming => 1".

FWIW, with that fixed I see the test hanging (e.g. [1]):

can't unlink c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/archives/000000010000000000000008.fail: Permission denied at t/026_overwrite_contrecord.pl line 189.
### Stopping node "primary2" using mode immediate
# Running: pg_ctl -D c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata -m immediate stop
waiting for server to shut down........................................................................................................................... failed
pg_ctl: server does not shut down
Bail out! command "pg_ctl -D c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata -m immediate stop" exited with value 1
Warning: unable to close filehandle GEN6 properly: Bad file descriptor during global destruction.

The hang seems to be fixed by uncommenting the $h->finish(). Was there a reason you commented that out?

The test still fails, but at least it doesn't hang anymore.

Greetings,

Andres Freund

[1] https://api.cirrus-ci.com/v1/artifact/task/6204050896060416/tap/src/test/recovery/tmp_check/log/regress_log_026_overwrite_contrecord
Hello On 2021-Sep-30, Andres Freund wrote: > FWIW, with that fixed I see the test hanging (e.g. [1]): > > can't unlink c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/archives/000000010000000000000008.fail: Permissiondenied at t/026_overwrite_contrecord.pl line 189. > ### Stopping node "primary2" using mode immediate > # Running: pg_ctl -D c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata -m immediatestop > waiting for server to shut down........................................................................................................................... failed > pg_ctl: server does not shut down > Bail out! command "pg_ctl -D c:/cirrus/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/pgdata -mimmediate stop" exited with value 1 > Warning: unable to close filehandle GEN6 properly: Bad file descriptor during global destruction. > > The hang seems to be fixed by uncommenting the $h->finish(). Was there a > reason you commented that out? Hmm, no -- I was experimenting to see what effect it had, and because I saw none, I left it like that. Let me fix both things and see what happens next. > THe test still fails, but at least it doesn't hang anymore. I'll try and get it fixed, but ultimately we can always "fix" this test by removing it. It is only in 14 and master; the earlier branches only have the first part of this test file. (That's because I ran out of patience trying to port it to versions lacking LSN operators and such.) I think the second half doesn't provide all that much additional coverage anyway. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Ellos andaban todos desnudos como su madre los parió, y también las mujeres, aunque no vi más que una, harto moza, y todos los que yo vi eran todos mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)
On 9/29/21 5:29 PM, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: >> On 9/29/21 4:33 PM, Tom Lane wrote: >>> which looks like a pretty straightforward bogus-connection-configuration >>> problem, except why wouldn't other BF members show it? >> This: >> ... >> doesn't have "allows_streaming => 1". > Oh, and that only breaks things on Windows, cf set_replication_conf. > > ... although I wonder why the fact that sub init otherwise sets > wal_level = minimal doesn't cause a problem for this test. > Maybe the test script is cowboy-ishly overriding that? > > Regardless of this problem, I think we should simply call set_replication_conf unconditionally in init(). Replication connections are now allowed by default on Unix, this would just bring Windows nodes into line with that. The function does have this: $self->host eq $test_pghost or die "set_replication_conf only works with the default host"; I'm not sure when that wouldn't be true. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > Regardless of this problem, I think we should simply call > set_replication_conf unconditionally in init(). Replication connections > are now allowed by default on Unix, this would just bring Windows nodes > into line with that. Yeah, I was thinking along the same lines yesterday. The fact that pre-commit testing failed to note the problem is exactly because we have this random difference between what works by default on Unix and what works by default on Windows. Let's close that gap before it bites us again. There's still the issue of these tests overriding postgresql.conf entries made by init(), but maybe we can live with that? regards, tom lane
On 2021-Sep-30, Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > Regardless of this problem, I think we should simply call > > set_replication_conf unconditionally in init(). Replication connections > > are now allowed by default on Unix, this would just bring Windows nodes > > into line with that. > > Yeah, I was thinking along the same lines yesterday. The fact that > pre-commit testing failed to note the problem is exactly because we > have this random difference between what works by default on Unix > and what works by default on Windows. Let's close that gap before > it bites us again. +1 > There's still the issue of these tests overriding postgresql.conf > entries made by init(), but maybe we can live with that? I vote to at least have has_archiving=>1 set wal_level=replica, and potentially max_wal_senders=2 too (allows pg_basebackup). -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > On 2021-Sep-30, Tom Lane wrote: >> There's still the issue of these tests overriding postgresql.conf >> entries made by init(), but maybe we can live with that? > I vote to at least have has_archiving=>1 set wal_level=replica, and > potentially max_wal_senders=2 too (allows pg_basebackup). I think this requires a bit more investigation. I looked quickly through the pre-existing tests that set has_archiving=>1, and every single one of them also sets allows_streaming=>1, which no doubt explains why this issue hasn't come up before. So now I'm wondering if any of those other tests is setting allows_streaming only because of this configuration issue. More to the point, since we've not previously used the combination of has_archiving without allows_streaming, I wonder exactly how we want to define it to work. I'm not really convinced that it should be defined as "allows basebackup even though replication is supposed to be off". Perhaps a compromise could be to invent a third option "allows_basebackup", so that init() actually knows what's going on? regards, tom lane
Just when you thought it was safe to go back in the water: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-09-29%2022%3A05%3A44 which is complaining that the (misspelled, BTW) log message 'sucessfully skipped missing contrecord at' doesn't show up. This machine is old, slow, and 32-bit bigendian. I first thought the problem might be "didn't wait long enough", but it seems like waiting for replay ought to be sufficient. What I'm now guessing is that the test case is making unwarranted assumptions about how much WAL will be generated, such that no page-crossing contrecord actually appears. Also, digging around, I see hornet showed the same problem: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2021-09-29%2018%3A19%3A55 hornet is 64-bit bigendian ... so maybe this actually reduces to an endianness question? regards, tom lane
On 2021-Sep-30, Tom Lane wrote:

> Just when you thought it was safe to go back in the water:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2021-09-29%2022%3A05%3A44
>
> which is complaining that the (misspelled, BTW)

Ah, the case of the missing juxtaposed consonants. Les Luthiers have something to say on that matter. https://www.youtube.com/watch?v=ptorPqV7D5s

> log message 'sucessfully skipped missing contrecord at' doesn't show
> up.

Hmm. Well, as I said, maybe this part of the test isn't worth much anyway. Rather than spending time trying to figure out why this isn't triggering the WAL overwriting, I compared the coverage report for running only the first test to the coverage report of running only the second test. It turns out that there's no relevant coverage increase in the second test. So I propose just removing that part.

(The reason I added that test in the first place was to try to reproduce the problem without having to physically unlink a WAL file from the primary's pg_wal subdir. But maybe it's just make-work.)

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Attachment
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Hmm. Well, as I said, maybe this part of the test isn't worth much > anyway. Rather than spending time trying to figure out why isn't this > triggering the WAL overwriting, I compared the coverage report for > running only the first test to the coverage report of running only the > second test. It turns out that there's no relevant coverage increase in > the second test. So I propose just removing that part. Seems reasonable. We don't need to spend buildfarm cycles forevermore on a test that's not adding useful coverage. regards, tom lane
Hi,

On Wed, Sep 29, 2021 at 8:14 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Sep-24, Alvaro Herrera wrote:
>
> > Here's the set for all branches, which I think are really final, in case
> > somebody wants to play and reproduce their respective problem scenarios.
> > Nathan already confirmed that his reproducer no longer shows a problem,
> > and this version shouldn't affect that.
>
> Pushed. Watching for buildfarm fireworks now.
>

While reading this commit (ff9f111bce24), I wondered: can't we skip the missingContrecPtr global variable declaration and calculate that from the abortedRecPtr value whenever it is needed? IIUC, missingContrecPtr is the page following the one that contains abortedRecPtr, and that can be calculated as "abortedRecPtr + (XLOG_BLCKSZ - (abortedRecPtr % XLOG_BLCKSZ))", thoughts? Please correct me if I'm missing something, thanks.

Regards,
Amul
On 2021-Oct-07, Amul Sul wrote: > While reading this commit (ff9f111bce24), wondered can't we skip > missingContrecPtr global variable declaration and calculate that from > abortedRecPtr value whenever it needed. IIUC, missingContrecPtr is the > next page to the page that abortedRecPtr contain and that can be > calculated as "abortedRecPtr + (XLOG_BLCKSZ - (abortedRecPtr % > XLOG_BLCKSZ))", thoughts? Please correct me if I'm missing something, > thanks. I don't think that works -- what if the missing record is not on the next page but on some future one? Imagine an enormously large record that starts on segment 1, covers all of segment 2 and ends in segment 3. We could have flushed segment 2 already, so with your idea we would skip ahead only to that position, but really we need to skip all the way to the first page of segment 3. This is easier to imagine if you set wal segment size to 1 MB, but it is possible with the default segment size too, since commit records can be arbitrarily large, and "logical message" records as well. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
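As a stand-alone illustration, the calculation Amul proposed (written here as a hypothetical helper, not code from the patch) simply rounds the aborted record's start LSN up to the next page boundary; per Alvaro's explanation it cannot replace missingContrecPtr, because the page where the continuation actually went missing may be many pages, or whole segments, later.

#include <stdint.h>

#define DEMO_XLOG_BLCKSZ 8192   /* stand-in for XLOG_BLCKSZ */

/* Round an LSN up to the start of the following WAL page. */
static uint64_t
next_page_boundary(uint64_t abortedRecPtr)
{
    return abortedRecPtr + (DEMO_XLOG_BLCKSZ - (abortedRecPtr % DEMO_XLOG_BLCKSZ));
}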
On Thu, 7 Oct 2021 at 6:41 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> On 2021-Oct-07, Amul Sul wrote:
>
> > While reading this commit (ff9f111bce24), wondered can't we skip
> > missingContrecPtr global variable declaration and calculate that from
> > abortedRecPtr value whenever it needed. IIUC, missingContrecPtr is the
> > next page to the page that abortedRecPtr contain and that can be
> > calculated as "abortedRecPtr + (XLOG_BLCKSZ - (abortedRecPtr %
> > XLOG_BLCKSZ))", thoughts? Please correct me if I'm missing something,
> > thanks.
>
> I don't think that works -- what if the missing record is not on the
> next page but on some future one? Imagine an enormously large record
> that starts on segment 1, covers all of segment 2 and ends in segment 3.
> We could have flushed segment 2 already, so with your idea we would skip
> ahead only to that position, but really we need to skip all the way to
> the first page of segment 3.
>
> This is easier to imagine if you set wal segment size to 1 MB, but it is
> possible with the default segment size too, since commit records can be
> arbitrarily large, and "logical message" records as well.
Makes sense, thanks for the explanation.
Regards,
Amul
On 2021-Oct-07, Amul Sul wrote:

> Makes sense, thanks for the explanation.

You're welcome. Also, I forgot: thank you for taking the time to review the code. Much appreciated.

-- 
Álvaro Herrera         39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
Hi,

On Thu, Oct 7, 2021 at 6:57 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Oct-07, Amul Sul wrote:
>
> > Makes sense, thanks for the explanation.
>
> You're welcome. Also, I forgot: thank you for taking the time to review
> the code. Much appreciated.

:)

I have one more question, regarding the need for the other global variable, i.e. abortedRecPtr. (Sorry for coming back after so long.)

Instead of the abortedRecPtr point, isn't it enough to write the overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both are pointing to the same location, so can't we use lastReplayedEndRecPtr instead of abortedRecPtr to write the overwrite-contrecord and remove the need for the extra global variable, like attached?

You might wonder why I am so concerned about the global variable. The reason is that I am working on another thread[1] where we are trying to club all the WAL write operations that happen at the end of StartupXLOG into a separate function. In the future, we might want to allow executing this function from other processes (e.g. Checkpointer). For that, we need to remove the dependency of those WAL write operations on global variables that are mostly valid only in the startup process. The easiest way to do that is to simply copy all the global variables into shared memory, but that would not be an optimal solution; the goal is to see if we could leverage the existing information already available in shared memory.

I would be grateful if you could share your thoughts on the same, thanks.

Regards,
Amul

[1] https://postgr.es/m/CAAJ_b97KZzdJsffwRK7w0XU5HnXkcgKgTR69t8cOZztsyXjkQw@mail.gmail.com
Attachment
On 2021-Sep-25, Alvaro Herrera wrote:

>> On 2021-Sep-24, Alvaro Herrera wrote:
>>
>> > Here's the set for all branches, which I think are really final, in
>> > case somebody wants to play and reproduce their respective problem
>> > scenarios.
>>
>> I forgot to mention that I'll wait until 14.0 is tagged before getting anything
>> pushed.

Hi Alvaro, sorry for being late to the party, but to add some reassurance that the committed v2 fix really solves the initial production problem, I've done a limited test on it (just like with the v1-patch idea earlier, using wal_keep_segments, wal_init_zero=on, archive_mode=on and archive_command='/bin/true'):

- On 12.8, I was able, like last time, to manually reproduce it on 3 out of 3 tries, and I've got: 2x "invalid contrecord length", 1x "there is no contrecord flag" on the standby.

- On soon-to-become-12.9 REL_12_STABLE (with commit 1df0a914d58f2bdb03c11dfcd2cb9cd01c286d59), on 4 out of 4 tries I've got beautiful insight into what happened:

LOG: started streaming WAL from primary at 1/EC000000 on timeline 1
LOG: sucessfully skipped missing contrecord at 1/EBFFFFF8, overwritten at 2021-10-13 11:22:37.48305+00
CONTEXT: WAL redo at 1/EC000028 for XLOG/OVERWRITE_CONTRECORD: lsn 1/EBFFFFF8; time 2021-10-13 11:22:37.48305+00

...and the slave was able to carry on automatically. In the 4th test, the cascade was tested too (m -> s1 -> s11) and both {s1, s11} did behave properly and log the above message. Also an additional check proved that after simulating an ENOSPC crash on the master the data contents were identical everywhere (m1=s1=s11).

Thank you Alvaro and also to everybody else who participated in solving this challenging and really edge-case nasty bug.

-J.
On Wed, Oct 13, 2021 at 2:01 AM Amul Sul <sulamul@gmail.com> wrote: > Instead of abortedRecPtr point, isn't enough to write > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both > are pointing to the same location then can't we use > lastReplayedEndRecPtr instead of abortedRecPtr to write > overwrite-contrecord and remove need of extra global variable, like > attached? I think you mean missingContrecPtr, not abortedRecPtr. If I understand correctly, abortedRecPtr is going to be the location in some WAL segment which we replayed where a long record began, but missingContrecPtr seems like it would have to point to the beginning of the first segment we were unable to find to continue replay; and thus it ought to be the same as lastReplayedEndRecPtr. But the committed code doesn't seem to check that these are the same or verify the relationship between them in any way, so I'm worried there is some other case here. The comments in XLogReadRecord also suggest this: * We get here when a record that spans multiple pages needs to be * assembled, but something went wrong -- perhaps a contrecord piece * was lost. If caller is WAL replay, it will know where the aborted Saying that "perhaps" a contrecord piece was lost seems to imply that other explanations are possible as well, but I'm not sure what. -- Robert Haas EDB: http://www.enterprisedb.com
On 2021-Oct-13, Amul Sul wrote:

> I have one more question, regarding the need for the other global
> variable, i.e. abortedRecPtr. (Sorry for coming back after so long.)
>
> Instead of the abortedRecPtr point, isn't it enough to write the
> overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both
> are pointing to the same location, so can't we use
> lastReplayedEndRecPtr instead of abortedRecPtr to write the
> overwrite-contrecord and remove the need for the extra global variable,
> like attached?

I'm a bit fuzzy on the difference between "the end+1" and "the start of the next record". Are they always the same? We do have XLogRecPtrToBytePos() and XLogBytePosToEndRecPtr() to convert unadorned XLogRecPtr values to "usable byte positions", which suggests to me that the proposed patch may fail if end+1 is a page or segment boundary.

The other difference is that abortedRecPtr is set if we fail to read a record, but XLogCtl->lastReplayedEndRecPtr is set even if we read the record successfully. So you'd need a bool flag indicating that the overwrite contrecord record needs to be written. Your patch is using the fact that missingContrecPtr is non-invalid as such a flag ... I can't see anything wrong with that. So maybe your patch is okay in this aspect.

> You might wonder why I am so concerned about the global variable. The
> reason is that I am working on another thread[1] where we are trying
> to club all the WAL write operations that happen at the end of
> StartupXLOG into a separate function. In the future, we might want to
> allow executing this function from other processes (e.g.
> Checkpointer). For that, we need to remove the dependency of those WAL
> write operations on global variables that are mostly valid only in
> the startup process.

Seems a fine goal.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
On 2021-Oct-13, Robert Haas wrote: > On Wed, Oct 13, 2021 at 2:01 AM Amul Sul <sulamul@gmail.com> wrote: > > Instead of abortedRecPtr point, isn't enough to write > > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both > > are pointing to the same location then can't we use > > lastReplayedEndRecPtr instead of abortedRecPtr to write > > overwrite-contrecord and remove need of extra global variable, like > > attached? > > I think you mean missingContrecPtr, not abortedRecPtr. If I understand > correctly, abortedRecPtr is going to be the location in some WAL > segment which we replayed where a long record began, but > missingContrecPtr seems like it would have to point to the beginning > of the first segment we were unable to find to continue replay; and > thus it ought to be the same as lastReplayedEndRecPtr. So abortedRecPtr and missingContrecPtr both point to the same long record: the former is the start of the record, and the latter is some intermediate position where we failed to find the contrecord. lastReplayedEndRecPtr is the end+1 of the record *prior* to the long record. > But the committed code doesn't seem to check that these are the same > or verify the relationship between them in any way, so I'm worried > there is some other case here. Yeah, the only reason they are the same is that xlogreader sets both to Invalid when reading a record, and then sets both when a read fails. > The comments in XLogReadRecord also suggest this: > > * We get here when a record that spans multiple pages needs to be > * assembled, but something went wrong -- perhaps a contrecord piece > * was lost. If caller is WAL replay, it will know where the aborted > > Saying that "perhaps" a contrecord piece was lost seems to imply that > other explanations are possible as well, but I'm not sure what. Other explanations are possible. Imagine cosmic rays alter one byte in the last contrecord. WAL replay will stop there, and the contrecord will have been found all right, but CRC check would have failed to pass, so we would set xlogreader->missingContrecPtr to the final contrecord of that record (I didn't actually verify this particular scenario.) In fact, anything that causes xlogreader.c:XLogReadRecord to return NULL after setting "assembled=true" would set both abortedRecPtr and missingContrecPtr -- except DecodeXLogRecord failure, which perhaps should be handled in the same way. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
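To keep the three similarly-named pointers straight, here is a compact restatement of the relationships described above, written as the kind of comment one might keep next to the variables (illustrative only; the variable names are the real ones, the summary is drawn from the explanation in this exchange).

/*
 * When xlogreader fails while assembling a multi-page record:
 *
 *   abortedRecPtr          start LSN of the incomplete (long) record
 *   missingContrecPtr      intermediate position -- the page where the
 *                          continuation was expected but could not be read;
 *                          new WAL is written starting there
 *   lastReplayedEndRecPtr  end+1 of the record *prior* to the long record
 */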
Hi,

It seems that 026_overwrite_contrecord.pl test often fails under valgrind.  I
first thought that related failures on skink were due to me migrating the
animal to a new host (and then running out of space due to a mistake in ccache
config).  But it happened again after I fixed those, and I just reproduced the
issue locally.

It's a bit odd that we didn't start to see these failures immediately, but
only in the last few days.  I'd blame skink being migrated to a new home,
except that I can see the issue locally.

FWIW, the way skink runs all postgres instances through valgrind is by
replacing the postgres binary.  Here's my local version of that:

andres@awork3:~/build/postgres/dev-assert/vpath$ cat tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres
#!/bin/bash

exec /usr/bin/valgrind \
	--quiet \
	--error-exitcode=128 \
	--suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp \
	--trace-children=yes --track-origins=yes --read-var-info=no \
	--leak-check=no \
	--run-libc-freeres=no \
	--vgdb=no \
	--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
	/home/andres/build/postgres/dev-assert/vpath/tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres.orig \
	"$@"

make -C src/test/recovery/ check PROVE_FLAGS='-v' PROVE_TESTS='t/026_overwrite_contrecord.pl' NO_TEMP_INSTALL=1
...

not ok 1 - 000000010000000000000002 differs from 000000010000000000000002

#   Failed test '000000010000000000000002 differs from 000000010000000000000002'

Regards,

Andres
Hi,

On 2021-10-13 11:03:39 -0700, Andres Freund wrote:
> It seems that 026_overwrite_contrecord.pl test often fails under valgrind. I
> first thought that related failures on skink were due to me migrating the
> animal to a new host (and then running out of space due to a mistake in ccache
> config). But it happened again after I fixed those, and I just reproduced the
> issue locally.
>
> It's a bit odd that we didn't start to see these failures immediately, but
> only in the last few days. I'd blame skink being migrated to a new home,
> except that I can see the issue locally.
>
> FWIW, the way skink runs all postgres instances through valgrind is by
> replacing the postgres binary. Here's my local version of that:
>
> andres@awork3:~/build/postgres/dev-assert/vpath$ cat tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres
> #!/bin/bash
>
> exec /usr/bin/valgrind \
> 	--quiet \
> 	--error-exitcode=128 \
> 	--suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp \
> 	--trace-children=yes --track-origins=yes --read-var-info=no \
> 	--leak-check=no \
> 	--run-libc-freeres=no \
> 	--vgdb=no \
> 	--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END \
> 	/home/andres/build/postgres/dev-assert/vpath/tmp_install/home/andres/build/postgres/dev-assert/install/bin/postgres.orig \
> 	"$@"
>
> make -C src/test/recovery/ check PROVE_FLAGS='-v' PROVE_TESTS='t/026_overwrite_contrecord.pl' NO_TEMP_INSTALL=1
> ...
>
> not ok 1 - 000000010000000000000002 differs from 000000010000000000000002
>
> #   Failed test '000000010000000000000002 differs from 000000010000000000000002'

I added LSNs to the error message:

not ok 1 - 000000010000000000000002 (0/2002350) differs from 000000010000000000000002 (0/2099600)

It appears that the problem is that in between the determination of
rows_walsize and the insertion, the insert LSN moved to the next
segment, presumably due to autovacuum/analyze or such.

<retries, with log_autovacuum_min_duration_statement=0, log_min_duration_statement=0>

2021-10-13 11:23:25.659 PDT [1491455] 026_overwrite_contrecord.pl LOG: statement: insert into filler select * from generate_series(1,1000)
2021-10-13 11:23:26.467 PDT [1491455] 026_overwrite_contrecord.pl LOG: duration: 861.112 ms
2021-10-13 11:23:27.055 PDT [1491458] 026_overwrite_contrecord.pl LOG: statement: select pg_current_wal_insert_lsn() - '0/0'
2021-10-13 11:23:27.357 PDT [1491458] 026_overwrite_contrecord.pl LOG: duration: 347.888 ms
2021-10-13 11:23:27.980 PDT [1491461] 026_overwrite_contrecord.pl LOG: statement: WITH setting AS ( SELECT setting::int AS wal_segsize FROM pg_settings WHERE name = 'wal_segment_size' ) INSERT INTO filler SELECT g FROM setting, generate_series(1, 1000 * (wal_segsize - ((pg_current_wal_insert_lsn() - '0/0') % wal_segsize)) / 64232) g
2021-10-13 11:24:25.173 PDT [1491550] LOG: automatic analyze of table "postgres.public.filler" avg read rate: 3.185 MB/s, avg write rate: 0.039 MB/s buffer usage: 96 hits, 731 misses, 9 dirtied system usage: CPU: user: 1.79 s, system: 0.00 s, elapsed: 1.81 s
2021-10-13 11:24:26.255 PDT [1491461] 026_overwrite_contrecord.pl LOG: duration: 58360.811 ms
2021-10-13 11:24:26.857 PDT [1491557] 026_overwrite_contrecord.pl LOG: statement: SELECT pg_current_wal_insert_lsn()
2021-10-13 11:24:27.120 PDT [1491557] 026_overwrite_contrecord.pl LOG: duration: 300.562 ms

Hm. I guess we can disable autovac. But that's not a great solution, there
might be WAL files due to catalog access etc too.
Seems like it might be worth doing the "filling" of the segment with a loop in
a DO block instead, where the end condition is to be within some distance of
the end of the segment?  With plenty of headroom?

Another thing: filling a segment by inserting lots of very tiny rows is pretty
expensive.  Can't we use something a bit wider?  Perhaps even emit_message?

Greetings,

Andres Freund
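For concreteness, a rough SQL sketch (not the committed test; the row count and payload width are made-up numbers) of what filling WAL with wider, variable-size records via pg_logical_emit_message() could look like -- the function writes a WAL record of the given size and does not require a logical replication setup:

	-- Hedged sketch only: emit a few hundred non-transactional logical
	-- messages with random payload widths to advance the insert LSN quickly.
	SELECT pg_logical_emit_message(false, 'filler',
	                               repeat('x', 1 + (random() * 1024)::int))
	FROM generate_series(1, 500);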
On 2021-Oct-13, Andres Freund wrote:

> I added LSNs to the error message:
>
> not ok 1 - 000000010000000000000002 (0/2002350) differs from 000000010000000000000002 (0/2099600)
>
> It appears that the problem is that in between the determination of
> rows_walsize and the insertion, the insert LSN moved to the next
> segment, presumably due to autovacuum/analyze or such.
>
> <retries, with log_autovacuum_min_duration_statement=0, log_min_duration_statement=0>

Oh, of course.

> Hm. I guess we can disable autovac. But that's not a great solution, there
> might be WAL files due to catalog access etc too.

Well, we don't expect anything else to happen -- the cluster is
otherwise idle.  I think we should do it regardless of any other
changes, just to keep things steadier.

> Seems like it might be worth doing the "filling" of the segment with a loop in
> a DO block instead, where the end condition is to be within some distance of
> the end of the segment?  With plenty of headroom?

Eh, good idea ... didn't think of that, but it should keep things more
stable under strange conditions.

> Another thing: filling a segment by inserting lots of very tiny rows is pretty
> expensive. Can't we use something a bit wider? Perhaps even emit_message?

I think I realized partway through writing the test that I could use
emit_message instead of using a batched row insert ... so, yeah, we
can use it here also.

-- 
Álvaro Herrera         Valdivia, Chile  —  https://www.EnterpriseDB.com/
"La verdad no siempre es bonita, pero el hambre de ella sí"
Hi, On 2021-10-13 15:52:46 -0300, Alvaro Herrera wrote: > > Hm. I guess we can disable autovac. But that's not a great solution, there > > might be WAL files due to catalog access etc too. > > Well, we don't expect anything else to happen -- the cluster is > otherwise idle. I think we should do it regardless of any other > changes, just to keep things steadier. IDK, it seems good to have a bit of variance as well. But I don't have a strong opinion on it. > > Another thing: filling a segment by inserting lots of very tiny rows is pretty > > expensive. Can't we use something a bit wider? Perhaps even emit_message? FWIW, the count of inserted rows is something like 171985 ;) > I think I realized partway through writing the test that I could use > emit_message instead of using a batched row insert ... so, yeah, we > can use it here also. Cool. Even if we want to use inserts, lets at least make the rows wide... I think it'd be good to have a bit of variance in record width. So perhaps adding a bit of random() in to influence record width would be a good idea? Greetings, Andres Freund
On 2021-Oct-13, Andres Freund wrote:

> > > Another thing: filling a segment by inserting lots of very tiny rows is pretty
> > > expensive. Can't we use something a bit wider? Perhaps even emit_message?
>
> FWIW, the count of inserted rows is something like 171985 ;)

This does ~1600 iterations to fill one segment, 10 rows per iteration, row
size is variable; exits when two BLCKSZ remain to complete the WAL segment:

create table filler (a int, b text);

do $$
declare
  wal_segsize int := setting::int from pg_settings where name = 'wal_segment_size';
  remain int;
  iters int := 0;
begin
  loop
    insert into filler
      select g, repeat(md5(g::text), (random() * 60 + 1)::int)
        from generate_series(1, 10) g;
    remain := wal_segsize - (pg_current_wal_insert_lsn() - '0/0') % wal_segsize;
    raise notice '(%) remain: %', iters, remain;
    if remain < 2 * setting::int from pg_settings where name = 'block_size' then
      exit;
    end if;
    iters := iters + 1;
  end loop;
end $$ ;

(Of course, I'm not proposing that the 'raise notice' be there in the
committed form.)

If I enlarge the 'repeat' count, it gets worse (more iterations required)
because a lot of the rows become toasted and thus subject to compression.
If I do 20 rows per iteration rather than 10, the risk is that we'll do too
many near the end of the segment and we'll have to continue running until
completing the next one.  So, this seems good enough.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Once again, thank you and all of the developers for your hard work on
PostgreSQL.  This is by far the most pleasant management experience of any
database I've worked on." (Dan Harris)
http://archives.postgresql.org/pgsql-performance/2006-04/msg00247.php
Hi, On 2021-10-13 12:13:45 -0700, Andres Freund wrote: > On 2021-10-13 15:52:46 -0300, Alvaro Herrera wrote: > > I think I realized partway through writing the test that I could use > > emit_message instead of using a batched row insert ... so, yeah, we > > can use it here also. > > Cool. Even if we want to use inserts, lets at least make the rows wide... > > I think it'd be good to have a bit of variance in record width. So perhaps > adding a bit of random() in to influence record width would be a good idea? Something very roughly like the attached. Perhaps that's going a bit overboard though. But it seems like it might be something we could use in a few tests?
Attachment
This works nicely with the TAP test: -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Postgres is bloatware by design: it was built to house PhD theses." (Joey Hellerstein, SIGMOD annual conference 2002)
Attachment
Hi On 2021-Oct-13, Andres Freund wrote: > Something very roughly like the attached. Perhaps that's going a bit overboard > though. But it seems like it might be something we could use in a few tests? Hah, our emails crossed. If you want to turn this into a patch to the 026 test file, please go ahead. Failing that, I'd just push the patch I just sent in my other reply. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "You're _really_ hosed if the person doing the hiring doesn't understand relational systems: you end up with a whole raft of programmers, none of whom has had a Date with the clue stick." (Andrew Sullivan)
Hi, On 2021-10-13 17:46:53 -0300, Alvaro Herrera wrote: > On 2021-Oct-13, Andres Freund wrote: > > > Something very roughly like the attached. Perhaps that's going a bit overboard > > though. But it seems like it might be something we could use in a few tests? > > Hah, our emails crossed. If you want to turn this into a patch to the > 026 test file, please go ahead. Failing that, I'd just push the patch I > just sent in my other reply. Yea, let's go for your patch then. I've verified that at least locally it passes under valgrind. Greetings, Andres Freund
On 2021-Oct-13, Andres Freund wrote: > Hi, > > On 2021-10-13 17:46:53 -0300, Alvaro Herrera wrote: > > On 2021-Oct-13, Andres Freund wrote: > > > > > Something very roughly like the attached. Perhaps that's going a bit overboard > > > though. But it seems like it might be something we could use in a few tests? > > > > Hah, our emails crossed. If you want to turn this into a patch to the > > 026 test file, please go ahead. Failing that, I'd just push the patch I > > just sent in my other reply. > > Yea, let's go for your patch then. I've verified that at least locally it > passes under valgrind. Ah great, thanks. Pushed then. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "But static content is just dynamic content that isn't moving!" http://smylers.hates-software.com/2007/08/15/fe244d0c.html
On Wed, Oct 13, 2021 at 10:58 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2021-Oct-13, Amul Sul wrote: > > > I have one more question, regarding the need for other global > > variables i.e. abortedRecPtr. (Sorry for coming back after so long.) > > > > Instead of abortedRecPtr point, isn't enough to write > > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both > > are pointing to the same location then can't we use > > lastReplayedEndRecPtr instead of abortedRecPtr to write > > overwrite-contrecord and remove need of extra global variable, like > > attached? > > I'm a bit fuzzy on the difference "the end+1" and "the start of the next > record". Are they always the same? We do have XLogRecPtrToBytePos() > and XLogBytePosToEndRecPtr() to convert unadorned XLogRecPtr values to > "usable byte positions", which suggests to me that the proposed patch > may fail if end+1 is a page or segment boundary. > Yes, you are correct, that could be a possible failure. How about calculating that from the lastReplayedEndRecPtr by converting it first to "usable byte positions" and then recalculating the record pointer from that, like attached? > The other difference is that abortedRecPtr is set if we fail to read a > record, but XLogCtl->lastReplayedEndRecPtr is set even if we read the > record successfully. So you'd have need a bool flag that the overwrite > contrecord record needs to be written. Your patch is using the fact > that missingContrecPtr is non-invalid as such a flag ... I can't see > anything wrong with that. So maybe your patch is okay in this aspect. > > > You might wonder why I am so concerned about the global variable. The > > reason is that I am working on another thread[1] where we are trying > > to club all the WAL write operations that happen at the end of > > StartupXLOG into a separate function. In the future, we might want to > > allow executing this function from other processes (e.g. > > Checkpointer). For that, we need to remove the dependency of those WAL > > write operations having on the global variables which are mostly valid > > in the startup process. > > Seems a fine goal. Thanks for looking at the patch. Regards, Amul
Attachment
On Thu, Oct 14, 2021 at 6:14 PM Amul Sul <sulamul@gmail.com> wrote: > > On Wed, Oct 13, 2021 at 10:58 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > On 2021-Oct-13, Amul Sul wrote: > > > > > I have one more question, regarding the need for other global > > > variables i.e. abortedRecPtr. (Sorry for coming back after so long.) > > > > > > Instead of abortedRecPtr point, isn't enough to write > > > overwrite-contrecord at XLogCtl->lastReplayedEndRecPtr? I think both > > > are pointing to the same location then can't we use > > > lastReplayedEndRecPtr instead of abortedRecPtr to write > > > overwrite-contrecord and remove need of extra global variable, like > > > attached? > > > > I'm a bit fuzzy on the difference "the end+1" and "the start of the next > > record". Are they always the same? We do have XLogRecPtrToBytePos() > > and XLogBytePosToEndRecPtr() to convert unadorned XLogRecPtr values to > > "usable byte positions", which suggests to me that the proposed patch > > may fail if end+1 is a page or segment boundary. > > > > Yes, you are correct, that could be a possible failure. > > How about calculating that from the lastReplayedEndRecPtr by > converting it first to "usable byte positions" and then recalculating > the record pointer from that, like attached? > Any thoughts about the patch posted previously? Regards, Amul
At Fri, 22 Oct 2021 18:43:52 +0530, Amul Sul <sulamul@gmail.com> wrote in > Any thoughts about the patch posted previously? Honestly, xlogreader looks fine with the current shape. The reason is that it seems cleaner as an interface boundary since the caller of xlogreader doesn't need to know about the details of xlogreader. The current code nicely hides the end+1 confusion. Even if we want to get rid of global variables in xlog.c, I don't understand why we remove only abortedRecPtr. That change makes things more complex as a whole by letting xlog.c be more conscious of xlogreader's internals. I'm not sure I like that aspect of the patch. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Oct 25, 2021 at 7:02 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Fri, 22 Oct 2021 18:43:52 +0530, Amul Sul <sulamul@gmail.com> wrote in > > Any thoughts about the patch posted previously? > > Honestly, xlogreader looks fine with the current shape. The reason is > that it seems cleaner as an interface boundary since the caller of > xlogreader doesn't need to know about the details of xlogreader. The > current code nicely hides the end+1 confusion. > > Even if we want to get rid of global variables in xlog.c, I don't > understand why we remove only abortedRecPtr. That change makes things > more complex as a whole by letting xlog.c be more conscious of > xlogreader's internals. I'm not sure I like that aspect of the patch. > Because we have other ways to get abortedRecPtr without having a global variable, but we don't have such a way for missingContrecPtr, AFAICU. I agree using global variables makes things a bit easier, but those are inefficient when you want to share those with other processes -- that would add extra burden to shared memory. Regards, Amul
At Mon, 25 Oct 2021 10:34:27 +0530, Amul Sul <sulamul@gmail.com> wrote in
> On Mon, Oct 25, 2021 at 7:02 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Fri, 22 Oct 2021 18:43:52 +0530, Amul Sul <sulamul@gmail.com> wrote in
> > > Any thoughts about the patch posted previously?
> >
> > Honestly, xlogreader looks fine with the current shape. The reason is
> > that it seems cleaner as an interface boundary since the caller of
> > xlogreader doesn't need to know about the details of xlogreader. The
> > current code nicely hides the end+1 confusion.
> >
> > Even if we want to get rid of global variables in xlog.c, I don't
> > understand why we remove only abortedRecPtr. That change makes things
> > more complex as a whole by letting xlog.c be more conscious of
> > xlogreader's internals. I'm not sure I like that aspect of the patch.
> >
>
> Because we have other ways to get abortedRecPtr without having a
> global variable, but we don't have such a way for missingContrecPtr,
> AFAICU.

That depends on the reason why you want to get rid of the global
variables.  Since we restart WAL reading before reading the two
variables, we cannot rely on the xlogreader's corresponding members; we
need another set of variables to preserve the values across the restart.

> I agree using global variables makes things a bit easier, but those
> are inefficient when you want to share those with other processes --
> that would add extra burden to shared memory.

We could simply add a new member in XLogCtlData.  Or we can create
another struct for ReadRecord's (not XLogReader's) state and then
allocate shared memory for it.  I don't think it is the right solution
to infer it from another variable using knowledge of xlogreader's
internals.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
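As an illustration of the shared-memory route mentioned here, a hypothetical sketch (these fields do not correspond to any posted patch; the names are made up):

	/*
	 * Hedged sketch only: if the aborted-record state must be usable
	 * outside the startup process, it could live in shared memory rather
	 * than in process-local globals.  Field names are hypothetical.
	 */
	typedef struct XLogCtlData
	{
		/* ... existing members ... */

		/* state needed to write the OVERWRITE_CONTRECORD record later */
		XLogRecPtr	abortedRecPtr;		/* start LSN of the record we gave up on */
		XLogRecPtr	missingContrecPtr;	/* LSN where its continuation was expected */

		/* ... existing members ... */
	} XLogCtlData;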
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-Oct-13, Andres Freund wrote:
>> Yea, let's go for your patch then. I've verified that at least locally it
>> passes under valgrind.

> Ah great, thanks.  Pushed then.

Seems like this hasn't fixed the problem: skink still fails on
this test occasionally.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2021-10-22%2013%3A52%3A00
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2021-11-02%2000%3A28%3A30

Both of these look like

#   Failed test '000000010000000000000002 differs from 000000010000000000000002'
#   at t/026_overwrite_contrecord.pl line 61.
# Looks like you failed 1 test of 3.
t/026_overwrite_contrecord.pl ........
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/3 subtests

which looks like the same thing we were seeing before.  010e52337
seems to have just decreased the probability of failure.

			regards, tom lane
Hello Alvaro, 14.10.2021 01:09, Alvaro Herrera wrote: >> Yea, let's go for your patch then. I've verified that at least locally it >> passes under valgrind. > Ah great, thanks. Pushed then. > While translating messages I've noticed that the version of the patch ported to REL9_6_STABLE..REL_13_STABLE contains a typo "sucessfully". Please consider committing the fix. Best regards, Alexander
Attachment
On 2021-Nov-08, Alexander Lakhin wrote: > Hello Alvaro, > 14.10.2021 01:09, Alvaro Herrera wrote: > >> Yea, let's go for your patch then. I've verified that at least locally it > >> passes under valgrind. > > Ah great, thanks. Pushed then. > > > While translating messages I've noticed that the version of the patch > ported to REL9_6_STABLE..REL_13_STABLE contains a typo "sucessfully". > Please consider committing the fix. Thanks, pushed. I also modified the .po files that contained the typo so as not to waste the translators' efforts. I blamed the wrong commit in the 9.6 commit message. Sigh. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ Syntax error: function hell() needs an argument. Please choose what hell you want to involve.
I wrote:
> Seems like this hasn't fixed the problem: skink still fails on
> this test occasionally.

> #   Failed test '000000010000000000000002 differs from 000000010000000000000002'
> #   at t/026_overwrite_contrecord.pl line 61.

This is still happening off and on, which makes it look like a
timing-sensitive problem.  Confirming that, I can make it fail
every time by adding a long sleep just ahead of where
026_overwrite_contrecord.pl captures $initfile.  On reflection
I think the problem is obvious: if autovacuum does anything
concurrently with the test's startup portion, it will cause the
carefully-established WAL insertion point to move into the
next segment.  I propose to add "autovacuum = off" to the
test's postmaster configuration.

Also, I think we want

-ok($initfile != $endfile, "$initfile differs from $endfile");
+ok($initfile ne $endfile, "$initfile differs from $endfile");

The existing coding works as long as all characters of these
WAL segment names happen to be decimal digits, but ...

			regards, tom lane
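For reference, a minimal sketch of the proposed configuration change, assuming the test sets up its primary through the usual TAP node object (the variable name here is an assumption, not a quote from the test):

	# Hedged sketch only: keep autovacuum from emitting WAL of its own
	# while the test is carefully filling a segment up to its last page.
	$node->append_conf('postgresql.conf', 'autovacuum = off');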
On 2021-Nov-09, Tom Lane wrote: > This is still happening off and on, which makes it look like a > timing-sensitive problem. Confirming that, I can make it fail > every time by adding a long sleep just ahead of where > 026_overwrite_contrecord.pl captures $initfile. On reflection > I think the problem is obvious: if autovacuum does anything > concurrently with the test's startup portion, it will cause the > carefully-established WAL insertion point to move into the > next segment. I propose to add "autovacuum = off" to the > test's postmaster configuration. Ooh, of course. > Also, I think we want > > -ok($initfile != $endfile, "$initfile differs from $endfile"); > +ok($initfile ne $endfile, "$initfile differs from $endfile"); > > The existing coding works as long as all characters of these > WAL segment names happen to be decimal digits, but ... Argh! Thanks for taking care of these issues. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "La conclusión que podemos sacar de esos estudios es que no podemos sacar ninguna conclusión de ellos" (Tanenbaum)
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Also, I think we want
>
> -ok($initfile != $endfile, "$initfile differs from $endfile");
> +ok($initfile ne $endfile, "$initfile differs from $endfile");
>
> The existing coding works as long as all characters of these
> WAL segment names happen to be decimal digits, but ...

Even better style (IMO) would be:

    isnt($initfile, $endfile, "WAL file name has changed");

Or some other more descriptive message of _why_ it should have changed.

- ilmari
We're *still* not out of the woods with 026_overwrite_contrecord.pl,
as we are continuing to see occasional "mismatching overwritten LSN"
failures, further down in the test where it tries to start up the
standby:

  sysname   |    branch     |      snapshot       |     stage     | l
------------+---------------+---------------------+---------------+------------------------------------------------------------------------------------------------------------
 spurfowl   | REL_13_STABLE | 2021-10-18 03:56:26 | recoveryCheck | 2021-10-18 00:08:09.324 EDT [2455:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 sidewinder | HEAD          | 2021-10-19 04:32:36 | recoveryCheck | 2021-10-19 06:46:23.168 CEST [26393:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 francolin  | REL9_6_STABLE | 2021-10-26 01:41:39 | recoveryCheck | 2021-10-26 01:48:05.646 UTC [3417202][][1/0:0] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-05 00:20:03 | recoveryCheck | 2021-11-05 02:58:12.146 CET [61848fb3.28d157:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 lapwing    | REL_11_STABLE | 2021-11-05 17:24:49 | recoveryCheck | 2021-11-05 17:39:29.741 UTC [9831:6] FATAL: mismatching overwritten LSN 0/1FFE014 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-10 02:51:12 | recoveryCheck | 2021-11-10 04:03:33.576 CET [73561:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-16 15:20:03 | recoveryCheck | 2021-11-16 18:16:47.875 CET [6193e77f.35b87f:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-17 03:45:36 | recoveryCheck | 2021-11-17 04:57:04.359 CET [32089:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 spurfowl   | REL_10_STABLE | 2021-11-22 22:21:03 | recoveryCheck | 2021-11-22 17:29:35.520 EST [16011:6] FATAL: mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
(9 rows)

Looking at adjacent successful runs, it seems that the exact point
where the "missing contrecord" starts varies substantially, even after
our previous fix to disable autovacuum in this test.  How could that be?

It's probably for the best though, because I think this is exposing
an actual bug that we would not have seen if the start point were
completely consistent.  I have not dug into the code, but it looks to
me like if the "consistent recovery state" is reached exactly at a
page boundary (0/1FFE000 in all these cases), then the standby expects
that to be what the OVERWRITE_CONTRECORD record will point at.  But
actually it points to the first WAL record on that page, resulting
in a bogus failure.

			regards, tom lane
On 2021-Nov-23, Tom Lane wrote: > We're *still* not out of the woods with 026_overwrite_contrecord.pl, > as we are continuing to see occasional "mismatching overwritten LSN" > failures, further down in the test where it tries to start up the > standby: Augh. > Looking at adjacent successful runs, it seems that the exact point > where the "missing contrecord" starts varies substantially, even after > our previous fix to disable autovacuum in this test. How could that be? Well, there is intentionally some variability. Maybe not as much as one would wish, but I expect that that should explain why that point is not always the same. > It's probably for the best though, because I think this is exposing > an actual bug that we would not have seen if the start point were > completely consistent. I have not dug into the code, but it looks to > me like if the "consistent recovery state" is reached exactly at a > page boundary (0/1FFE000 in all these cases), then the standby expects > that to be what the OVERWRITE_CONTRECORD record will point at. But > actually it points to the first WAL record on that page, resulting > in a bogus failure. So what is happening is that we set state->overwrittenRecPtr to the LSN of page start, ignoring the page header. Is that the LSN of the first record in a page? I'll see if I can reproduce the problem. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "La persona que no quería pecar / estaba obligada a sentarse en duras y empinadas sillas / desprovistas, por cierto de blandos atenuantes" (Patricio Vogel)
On Wed, Nov 24, 2021 at 2:10 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Nov-23, Tom Lane wrote:
>
> > We're *still* not out of the woods with 026_overwrite_contrecord.pl,
> > as we are continuing to see occasional "mismatching overwritten LSN"
> > failures, further down in the test where it tries to start up the
> > standby:
>
> Augh.
>
> > Looking at adjacent successful runs, it seems that the exact point
> > where the "missing contrecord" starts varies substantially, even after
> > our previous fix to disable autovacuum in this test.  How could that be?
>
> Well, there is intentionally some variability.  Maybe not as much as one
> would wish, but I expect that that should explain why that point is not
> always the same.
>
> > It's probably for the best though, because I think this is exposing
> > an actual bug that we would not have seen if the start point were
> > completely consistent.  I have not dug into the code, but it looks to
> > me like if the "consistent recovery state" is reached exactly at a
> > page boundary (0/1FFE000 in all these cases), then the standby expects
> > that to be what the OVERWRITE_CONTRECORD record will point at.  But
> > actually it points to the first WAL record on that page, resulting
> > in a bogus failure.
>
> So what is happening is that we set state->overwrittenRecPtr to the LSN
> of page start, ignoring the page header.  Is that the LSN of the first
> record in a page?  I'll see if I can reproduce the problem.
>

In XLogReadRecord(), the two variables being compared are assigned
inconsistently -- one gets assigned from state->currRecPtr while the
other gets RecPtr:

    .....
    state->overwrittenRecPtr = state->currRecPtr;
    .....
    state->abortedRecPtr = RecPtr;
    .....

Before the place where the assembled flag gets set, there is a bunch of
code that adjusts RecPtr.  I think instead of RecPtr, the latter
assignment should use state->currRecPtr as well.

Regards,
Amul
On 2021-Nov-25, Amul Sul wrote:

> In XLogReadRecord(), the two variables being compared are assigned
> inconsistently -- one gets assigned from state->currRecPtr while the
> other gets RecPtr:
>
>     .....
>     state->overwrittenRecPtr = state->currRecPtr;
>     .....
>     state->abortedRecPtr = RecPtr;
>     .....
>
> Before the place where the assembled flag gets set, there is a bunch of
> code that adjusts RecPtr.  I think instead of RecPtr, the latter
> assignment should use state->currRecPtr as well.

You're exactly right.  I managed to reproduce the problem shown by
buildfarm members, and indeed this fixes it.  And it makes sense: the
adjustment you refer to is precisely to skip the page header when the
LSN is the start of the page, which is exactly the problem we're seeing
in the buildfarm ... except that on lapwing, in branch REL_11_STABLE,
we're seeing that the LSN is off by 0x14 instead of 0x18.  That seems
very strange.  I think the reason for this is that lapwing has MAXALIGN
4, so MAXALIGN(sizeof(XLogPageHeaderData)) is 20, not 24 as is the case
in the other failing members.

... checks buildfarm ...  Yeah, all the others in Tom's list are x86-64.

I'm pushing the fix in a minute.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"¿Qué importan los años? Lo que realmente importa es comprobar que
a fin de cuentas la mejor edad de la vida es estar vivo" (Mafalda)
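Spelling out the arithmetic behind those two offsets (illustrative only, based on the struct layout in xlog_internal.h):

    XLogPageHeaderData's fields add up to 2 + 2 + 4 + 8 + 4 = 20 bytes.
    With MAXALIGN 8 (the x86-64 members): SizeOfXLogShortPHD = 24, so the
      first record on the page starts at 0/1FFE000 + 0x18 = 0/1FFE018.
    With MAXALIGN 4 (lapwing):            SizeOfXLogShortPHD = 20, so the
      first record on the page starts at 0/1FFE000 + 0x14 = 0/1FFE014.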
Oh, but also I think I should push a mitigation in case a production system hits this problem: maybe reduce the message from FATAL to WARNING if the registered LSN is at a page boundary. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Entristecido, Wutra (canción de Las Barreras) echa a Freyr a rodar y a nosotros al mar"
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Oh, but also I think I should push a mitigation in case a production > system hits this problem: maybe reduce the message from FATAL to WARNING > if the registered LSN is at a page boundary. Uh, why? The fix should remove the problem, and if it doesn't, we're still looking at inconsistent WAL aren't we? regards, tom lane
On 2021-Nov-25, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > Oh, but also I think I should push a mitigation in case a production > > system hits this problem: maybe reduce the message from FATAL to WARNING > > if the registered LSN is at a page boundary. > > Uh, why? The fix should remove the problem, and if it doesn't, we're > still looking at inconsistent WAL aren't we? The problem is that the bug occurs while writing the WAL record. Fixed servers won't produce such records, but if you run an unpatched server and it happens to write one, without a mitigation you cannot get away from FATAL during replay. Since this bug exists in released minors, we should allow people to upgrade to a newer version if they hit it. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Nadie está tan esclavizado como el que se cree libre no siéndolo" (Goethe)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > On 2021-Nov-25, Tom Lane wrote: >> Uh, why? The fix should remove the problem, and if it doesn't, we're >> still looking at inconsistent WAL aren't we? > The problem is that the bug occurs while writing the WAL record. Fixed > servers won't produce such records, but if you run an unpatched server > and it happens to write one, without a mitigation you cannot get away > from FATAL during replay. Really? AFAICS the WAL record contains the correct value, or at least we should define that one as being correct, for precisely this reason. regards, tom lane
On 2021-Nov-25, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > > The problem is that the bug occurs while writing the WAL record. Fixed > > servers won't produce such records, but if you run an unpatched server > > and it happens to write one, without a mitigation you cannot get away > > from FATAL during replay. > > Really? AFAICS the WAL record contains the correct value, or at least > we should define that one as being correct, for precisely this reason. I don't know what is the correct value for a record that comes exactly after the page header. But here's a patch that fixes the problem; and if a standby replays WAL written by an unpatched primary, it will be able to read past instead of dying of FATAL. I originally wrote this to have a WARNING in VerifyOverwriteContrecord (in the cases that are new), with the idea that it'd prompt people to upgrade, but that's probably a waste of time. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)
Attachment
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> On 2021-Nov-25, Tom Lane wrote:
>> Really?  AFAICS the WAL record contains the correct value, or at least
>> we should define that one as being correct, for precisely this reason.

> I don't know what is the correct value for a record that comes exactly
> after the page header.  But here's a patch that fixes the problem; and
> if a standby replays WAL written by an unpatched primary, it will be
> able to read past instead of dying of FATAL.

Meh ... but given the simplicity of the write-side fix, maybe changing
it is appropriate.

However, this seems too forgiving:

+	if (xlrec->overwritten_lsn != state->overwrittenRecPtr &&
+		xlrec->overwritten_lsn - SizeOfXLogShortPHD != state->overwrittenRecPtr &&
+		xlrec->overwritten_lsn - SizeOfXLogLongPHD != state->overwrittenRecPtr)

The latter two cases should only be accepted if overwrittenRecPtr is
exactly at a page boundary.

			regards, tom lane
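To make that concrete, a hedged sketch of the tightened condition (variable names follow the quoted patch; XLogSegmentOffset() and XLOG_BLCKSZ are the usual xlog macros; this is not the committed code):

	/*
	 * Hedged sketch only: accept a header-sized discrepancy only when the
	 * standby's computed LSN really sits on a page (or segment) boundary,
	 * as an unpatched primary would have recorded it.
	 */
	bool		lsn_ok;

	lsn_ok = xlrec->overwritten_lsn == state->overwrittenRecPtr ||
		/* record began right after a short page header */
		(state->overwrittenRecPtr % XLOG_BLCKSZ == 0 &&
		 XLogSegmentOffset(state->overwrittenRecPtr, wal_segment_size) != 0 &&
		 xlrec->overwritten_lsn - SizeOfXLogShortPHD == state->overwrittenRecPtr) ||
		/* record began right after a long (segment-start) page header */
		(XLogSegmentOffset(state->overwrittenRecPtr, wal_segment_size) == 0 &&
		 xlrec->overwritten_lsn - SizeOfXLogLongPHD == state->overwrittenRecPtr);

	if (!lsn_ok)
	{
		/* the existing "mismatching overwritten LSN" FATAL report goes here */
	}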
I wrote: > However, this seems too forgiving: ... also, I don't know if you intended this already, but the VerifyOverwriteContrecord change should only be applied in back branches. There's no need for it in HEAD. regards, tom lane
On Fri, Nov 26, 2021 at 1:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2021-Nov-25, Tom Lane wrote:
> >> Really?  AFAICS the WAL record contains the correct value, or at least
> >> we should define that one as being correct, for precisely this reason.
>
> > I don't know what is the correct value for a record that comes exactly
> > after the page header.  But here's a patch that fixes the problem; and
> > if a standby replays WAL written by an unpatched primary, it will be
> > able to read past instead of dying of FATAL.
>
> Meh ... but given the simplicity of the write-side fix, maybe changing
> it is appropriate.
>
> However, this seems too forgiving:
>
> +	if (xlrec->overwritten_lsn != state->overwrittenRecPtr &&
> +		xlrec->overwritten_lsn - SizeOfXLogShortPHD != state->overwrittenRecPtr &&
> +		xlrec->overwritten_lsn - SizeOfXLogLongPHD != state->overwrittenRecPtr)
>

Unless I am missing something, I am not sure why we need this adjustment
if we are going to use the state->currRecPtr value, which doesn't seem
to be changing at all.  AFAICU, state->currRecPtr will have the same
value whether we are going to set overwrittenRecPtr or abortedRecPtr.
Do the primary and standby ever see state->currRecPtr differently?  I
guess not.

Regards,
Amul
On 2021-Nov-26, Amul Sul wrote:

> On Fri, Nov 26, 2021 at 1:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Meh ... but given the simplicity of the write-side fix, maybe changing
> > it is appropriate.

Actually, fixing the other side is equally simple, and it is also more
correct.  What changed my mind is that upon completing a successful read
of a record, what we set as state->ReadRecPtr is the local variable
RecPtr -- so that is what becomes the true LSN of the record.  Using
state->currRecPtr is inconsistent with that definition.

> Unless I am missing something, I am not sure why we need this adjustment
> if we are going to use the state->currRecPtr value, which doesn't seem
> to be changing at all.  AFAICU, state->currRecPtr will have the same
> value whether we are going to set overwrittenRecPtr or abortedRecPtr.
> Do the primary and standby ever see state->currRecPtr differently?  I
> guess not.

You're right for the wrong reason.  We don't need the adjustment in the
verify routine.  The reason we don't is that we're not going to use
state->currRecPtr anymore, but rather RecPtr in both places.

You're thinking that primary and standby would never "see
state->currRecPtr differently", but that's only if they are both running
the same code.  If you had a primary running 14.1 and a standby running
14.2, with the previously proposed fix (using state->currRecPtr), you
would be in trouble.  With this fix (using RecPtr) it works fine.

-- 
Álvaro Herrera         39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
On 2021-Sep-03, Alvaro Herrera wrote: > The last commit is something I noticed in pg_rewind ... I had missed this one; it's pushed now. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "I can see support will not be a problem. 10 out of 10." (Simon Wittber) (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)