Thread: standby promotion can create unreadable WAL

standby promotion can create unreadable WAL

From
Robert Haas
Date:
My colleague Dilip Kumar and I have discovered what I believe to be a
bug in the recently-added "overwrite contrecord" stuff. I'm not sure
whether or not this bug has any serious consequences. I think that
there may be a scenario where it does, but I'm not sure about that.

Suppose you have a primary and a standby, and the standby is promoted
after reading a partial WAL record. The attached script, which was
written by Dilip and slightly modified by me, creates this scenario by
setting up an archiving-only standby, writing a record that crosses a
segment boundary, and then promoting the standby. If you then try to
run pg_waldump on the WAL on timeline 2, it goes boom:

[rhaas pg_wal]$ pg_waldump 000000020000000000000004
000000020000000000000005 2>&1 | tail -n4
rmgr: Heap        len (rec/tot):   1959/  1959, tx:        728, lsn:
0/04FFE7B0, prev 0/04FFDFF0, desc: INSERT off 4 flags 0x00, blkref #0:
rel 1663/5/16384 blk 2132
rmgr: Heap        len (rec/tot):   1959/  1959, tx:        728, lsn:
0/04FFEF58, prev 0/04FFE7B0, desc: INSERT+INIT off 1 flags 0x00,
blkref #0: rel 1663/5/16384 blk 2133
rmgr: Heap        len (rec/tot):   1959/  1959, tx:        728, lsn:
0/04FFF700, prev 0/04FFEF58, desc: INSERT off 2 flags 0x00, blkref #0:
rel 1663/5/16384 blk 2133
pg_waldump: error: error in WAL record at 0/4FFF700: invalid record
length at 0/4FFFEA8: wanted 24, got 0

What's happening here is that the last WAL segment from timeline 1,
which is 000000010000000000000004, gets copied over to the new
timeline up to the point where the last complete record on that
timeline ends, namely, 0/4FFFEA8. I think that the first record on the
new timeline should be written starting at that LSN, but that's not
what happens. Instead, the rest of that WAL segment remains zeroed,
and the first WAL record on the new timeline is written at the
beginning of the next segment:

[rhaas pg_wal]$ pg_waldump 000000020000000000000005 2>&1 | head -n4
rmgr: XLOG        len (rec/tot):     42/    42, tx:          0, lsn:
0/05000028, prev 0/04FFF700, desc: OVERWRITE_CONTRECORD lsn 0/4FFFEA8;
time 2022-08-22 13:49:22.874435 EDT
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn:
0/05000058, prev 0/05000028, desc: CHECKPOINT_SHUTDOWN redo 0/5000058;
tli 2; prev tli 1; fpw true; xid 0:729; oid 24576; multi 1; offset 0;
oldest xid 719 in DB 1; oldest multi 1 in DB 1; oldest/newest commit
timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: XLOG        len (rec/tot):     30/    30, tx:          0, lsn:
0/050000D0, prev 0/05000058, desc: NEXTOID 32768
rmgr: Storage     len (rec/tot):     42/    42, tx:          0, lsn:
0/050000F0, prev 0/050000D0, desc: CREATE base/5/24576

Nothing that uses xlogreader is going to be able to bridge the gap
between file #4 and file #5. In this case it doesn't matter very much,
because we immediately write a checkpoint record into file #5, so if
we crash we won't try to replay file #4 anyway. However, if anything
did try to look at file #4 it would get confused. Maybe that can
happen if this is a streaming standby, where we only write an
end-of-recovery record upon promotion, rather than a checkpoint, or
maybe if there are cascading standbys someone could try to actually
use the 000000020000000000000004 file for something. I'm not sure. But
unless I'm missing something, that file is bogus, and our only hope of
not having problems is that perhaps no one will ever look at it.

I think that the cause of this problem is this code right here:

    /*
     * Actually, if WAL ended in an incomplete record, skip the parts that
     * made it through and start writing after the portion that persisted.
     * (It's critical to first write an OVERWRITE_CONTRECORD message, which
     * we'll do as soon as we're open for writing new WAL.)
     */
    if (!XLogRecPtrIsInvalid(missingContrecPtr))
    {
        Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
        EndOfLog = missingContrecPtr;
    }

It seems to me that this if-statement should also test that the TLI
has not changed i.e. if (newTLI != endOfRecoveryInfo->lastRecTLI &&
!XLogRecPtrIsInvalid(missingContrecPtr)). If the TLI hasn't changed,
then everything the comment says is correct and I think that what the
code does is also correct. However, if the TLI *has* changed, then I
think we must not advance EndOfLog here, because the WAL that was
copied from the old timeline to the new timeline ends at the point in
the file corresponding to the value of EndOfLog just before executing
this code. When this code then moves EndOfLog forward to the beginning
of the next segment, it leaves the unused portion of the previous
segment as all zeroes, which creates the problem described above.

(Incidentally, there's also a bug in pg_waldump here: it's reporting
the wrong LSN as the source of the error. 0/4FFF700 is not the record
that's busted, as shown by the fact that it was successfully decoded
and shown in the output. The relevant code in pg_waldump should be
using EndRecPtr instead of ReadRecPtr to report the error. If it did,
it would complain about 0/4FFFEA8, which is where the problem really
is. This is of the same vintage as the bug fixed by
d9fbb8862959912c5266364059c0abeda0c93bbf, though in that case the
issue was reporting all errors using the start LSN of the first of
several records read no matter where the error actually happened,
whereas in this case the error is using the start LSN of the previous
record instead of the current one.)

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: standby promotion can create unreadable WAL

From
Nathan Bossart
Date:
On Mon, Aug 22, 2022 at 02:36:36PM -0400, Robert Haas wrote:
> (Incidentally, there's also a bug in pg_waldump here: it's reporting
> the wrong LSN as the source of the error. 0/4FFF700 is not the record
> that's busted, as shown by the fact that it was successfully decoded
> and shown in the output. The relevant code in pg_waldump should be
> using EndRecPtr instead of ReadRecPtr to report the error. If it did,
> it would complain about 0/4FFFEA8, which is where the problem really
> is. This is of the same vintage as the bug fixed by
> d9fbb8862959912c5266364059c0abeda0c93bbf, though in that case the
> issue was reporting all errors using the start LSN of the first of
> several records read no matter where the error actually happened,
> whereas in this case the error is using the start LSN of the previous
> record instead of the current one.)

There was some previous discussion on this [0] [1].

[0] https://postgr.es/m/2B4510B2-3D70-4990-BFE3-0FE64041C08A%40amazon.com
[1] https://postgr.es/m/20220127.100738.1985658263632578184.horikyota.ntt%40gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: standby promotion can create unreadable WAL

From
Robert Haas
Date:
On Mon, Aug 22, 2022 at 10:38 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> There was some previous discussion on this [0] [1].
>
> [0] https://postgr.es/m/2B4510B2-3D70-4990-BFE3-0FE64041C08A%40amazon.com
> [1] https://postgr.es/m/20220127.100738.1985658263632578184.horikyota.ntt%40gmail.com

Thanks. It seems like there are various doubts on those threads about
whether EndRecPtr is really the right thing, but I'm not able to
understand what the problem is exactly. Certainly, in the common case,
it is, and as far as I can tell from looking at the code, it's what
we're intended to use. So I would like to see some concrete evidence
of it being wrong before we conclude that we need to do anything other
than a trivial change.

But the main issue for this thread is that we seem to be generating
invalid WAL. That seems like something we'd better get fixed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: standby promotion can create unreadable WAL

From
Dilip Kumar
Date:
On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:

> Nothing that uses xlogreader is going to be able to bridge the gap
> between file #4 and file #5. In this case it doesn't matter very much,
> because we immediately write a checkpoint record into file #5, so if
> we crash we won't try to replay file #4 anyway. However, if anything
> did try to look at file #4 it would get confused. Maybe that can
> happen if this is a streaming standby, where we only write an
> end-of-recovery record upon promotion, rather than a checkpoint, or
> maybe if there are cascading standbys someone could try to actually
> use the 000000020000000000000004 file for something. I'm not sure. But
> unless I'm missing something, that file is bogus, and our only hope of
> not having problems is that perhaps no one will ever look at it.

Yeah, this analysis looks correct to me.

> I think that the cause of this problem is this code right here:
>
>     /*
>      * Actually, if WAL ended in an incomplete record, skip the parts that
>      * made it through and start writing after the portion that persisted.
>      * (It's critical to first write an OVERWRITE_CONTRECORD message, which
>      * we'll do as soon as we're open for writing new WAL.)
>      */
>     if (!XLogRecPtrIsInvalid(missingContrecPtr))
>     {
>         Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
>         EndOfLog = missingContrecPtr;
>     }

Yeah, this statement as well as another statement that creates the
overwrite contrecord.  After changing these two lines the problem is
fixed for me.  Although I haven't yet thought of all the scenarios
that whether it is safe in all the cases.  I agree that after timeline
changes we are pointing to the end of the last valid record we can
start writing the next record from that point onward.  But I think we
should need to think hard that whether it will break any case for
which the overwrite contrecord was actually introduced.

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 7602fc8..3d38613 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5491,7 +5491,7 @@ StartupXLOG(void)
         * (It's critical to first write an OVERWRITE_CONTRECORD message, which
         * we'll do as soon as we're open for writing new WAL.)
         */
-       if (!XLogRecPtrIsInvalid(missingContrecPtr))
+       if (newTLI == endOfRecoveryInfo->lastRecTLI &&
!XLogRecPtrIsInvalid(missingContrecPtr))
        {
                Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
                EndOfLog = missingContrecPtr;
@@ -5589,7 +5589,7 @@ StartupXLOG(void)
        LocalSetXLogInsertAllowed();

        /* If necessary, write overwrite-contrecord before doing
anything else */
-       if (!XLogRecPtrIsInvalid(abortedRecPtr))
+       if (newTLI == endOfRecoveryInfo->lastRecTLI &&
!XLogRecPtrIsInvalid(abortedRecPtr))
        {
                Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
                CreateOverwriteContrecordRecord(abortedRecPtr,
missingContrecPtr, newTLI);

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: standby promotion can create unreadable WAL

From
Kyotaro Horiguchi
Date:
Nice find!

At Wed, 24 Aug 2022 11:09:44 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> 
> > Nothing that uses xlogreader is going to be able to bridge the gap
> > between file #4 and file #5. In this case it doesn't matter very much,
> > because we immediately write a checkpoint record into file #5, so if
> > we crash we won't try to replay file #4 anyway. However, if anything
> > did try to look at file #4 it would get confused. Maybe that can
> > happen if this is a streaming standby, where we only write an
> > end-of-recovery record upon promotion, rather than a checkpoint, or
> > maybe if there are cascading standbys someone could try to actually
> > use the 000000020000000000000004 file for something. I'm not sure. But
> > unless I'm missing something, that file is bogus, and our only hope of
> > not having problems is that perhaps no one will ever look at it.
> 
> Yeah, this analysis looks correct to me.

(I didn't reproduce the case but understand what is happening.)

Me, too.  There are two ways to deal with this, I think. One is start
writing new records from abortedContRecPtr as if it were not
exist. Another is copying WAL file up to missingContRecPtr. Since the
first segment of the new timeline doesn't need to be identcal to the
last one of the previous timeline, so I think the former way is
cleaner.  XLogInitNewTimeline or near seems to be be the place for fix
to me. Clearing abortedRecPtr and missingContrecPtr just before the
call to findNewestTimeLine will work?

====
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87b243e0d4..27e01153e7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5396,6 +5396,13 @@ StartupXLOG(void)
          */
         XLogInitNewTimeline(EndOfLogTLI, EndOfLog, newTLI);
 
+        /*
+         * EndOfLog doesn't cover aborted contrecord even if the last record
+         * was that, then the next timeline starts writing from there. Forget
+         * about aborted and missing contrecords even if any.
+         */
+        abortedRecPtr = missingContrecPtr = InvalidXLogRecPtr;
+
         /*
          * Remove the signal files out of the way, so that we don't
          * accidentally re-enter archive recovery mode in a subsequent crash.
====

> > I think that the cause of this problem is this code right here:
> >
> >     /*
> >      * Actually, if WAL ended in an incomplete record, skip the parts that
> >      * made it through and start writing after the portion that persisted.
> >      * (It's critical to first write an OVERWRITE_CONTRECORD message, which
> >      * we'll do as soon as we're open for writing new WAL.)
> >      */
> >     if (!XLogRecPtrIsInvalid(missingContrecPtr))
> >     {
> >         Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
> >         EndOfLog = missingContrecPtr;
> >     }
> 
> Yeah, this statement as well as another statement that creates the
> overwrite contrecord.  After changing these two lines the problem is
> fixed for me.  Although I haven't yet thought of all the scenarios
> that whether it is safe in all the cases.  I agree that after timeline
> changes we are pointing to the end of the last valid record we can
> start writing the next record from that point onward.  But I think we
> should need to think hard that whether it will break any case for
> which the overwrite contrecord was actually introduced.
> 
> diff --git a/src/backend/access/transam/xlog.c
> b/src/backend/access/transam/xlog.c
> index 7602fc8..3d38613 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -5491,7 +5491,7 @@ StartupXLOG(void)
>          * (It's critical to first write an OVERWRITE_CONTRECORD message, which
>          * we'll do as soon as we're open for writing new WAL.)
>          */
> -       if (!XLogRecPtrIsInvalid(missingContrecPtr))
> +       if (newTLI == endOfRecoveryInfo->lastRecTLI &&
> !XLogRecPtrIsInvalid(missingContrecPtr))
>         {
>                 Assert(!XLogRecPtrIsInvalid(abortedRecPtr));
>                 EndOfLog = missingContrecPtr;
> @@ -5589,7 +5589,7 @@ StartupXLOG(void)
>         LocalSetXLogInsertAllowed();
> 
>         /* If necessary, write overwrite-contrecord before doing
> anything else */
> -       if (!XLogRecPtrIsInvalid(abortedRecPtr))
> +       if (newTLI == endOfRecoveryInfo->lastRecTLI &&
> !XLogRecPtrIsInvalid(abortedRecPtr))
>         {
>                 Assert(!XLogRecPtrIsInvalid(missingContrecPtr));
>                 CreateOverwriteContrecordRecord(abortedRecPtr,
> missingContrecPtr, newTLI);

This also seems to work, of course.

# However, I haven't managed to reproduce that, yet...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: standby promotion can create unreadable WAL

From
Robert Haas
Date:
On Wed, Aug 24, 2022 at 4:40 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> Me, too.  There are two ways to deal with this, I think. One is start
> writing new records from abortedContRecPtr as if it were not
> exist. Another is copying WAL file up to missingContRecPtr. Since the
> first segment of the new timeline doesn't need to be identcal to the
> last one of the previous timeline, so I think the former way is
> cleaner.

I agree, mostly because that gets us back to the way all of this
worked before the contrecord stuff went in. This case wasn't broken
then, because the breakage had to do with it being unsafe to back up
and rewrite WAL that might have already been shipped someplace, and
that's not an issue when we're first creating a totally new timeline.
It seems safer to me to go back to the way this worked before the fix
went in than to change over to a new system.

Honestly, in a vacuum, I might prefer to get rid of this thing where
the WAL segment gets copied over from the old timeline to the new, and
just always switch TLIs at segment boundaries. And while we're at it,
I'd also like TLIs to be 64-bit random numbers instead of integers
assigned in ascending order. But those kinds of design changes seem
best left for a future master-only development effort. Here, we need
to back-patch the fix, and should try to just unbreak what's currently
broken.

> XLogInitNewTimeline or near seems to be be the place for fix
> to me. Clearing abortedRecPtr and missingContrecPtr just before the
> call to findNewestTimeLine will work?

Hmm, yeah, that seems like a good approach.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: standby promotion can create unreadable WAL

From
Dilip Kumar
Date:
On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
However, if anything
> did try to look at file #4 it would get confused. Maybe that can
> happen if this is a streaming standby, where we only write an
> end-of-recovery record upon promotion, rather than a checkpoint, or
> maybe if there are cascading standbys someone could try to actually
> use the 000000020000000000000004 file for something. I'm not sure. But
> unless I'm missing something, that file is bogus, and our only hope of
> not having problems is that perhaps no one will ever look at it.

I tried in streaming mode, but it seems in the streaming mode we will
never create this bogus file because of this check [1].  So if the
StandbyMode is true then we are never setting "abortedRecPtr" and
"missingContrecPtr" which means we will never create that 0-filled gap
in the WAL file that we are discussing in this thread.

Do we need to set it?  I feel we don't.  Why? because on this thread
we are also discussing that if the timeline switches then we don’t
need to create that 0-filled gap and that is the actual problem we are
discussing here.  And we know that if we are coming out of the
StandbyMode then we will always switch the timeline so we don’t create
that 0-filled gap.  OTOH if we are coming out of the archive recovery
then also we will switch the timeline so in that case also we do not
need that.  So practically we need to 0 fill that partial record only
when we are doing the crash recovery is that understanding correct?
If so then we can simply avoid setting these variables if
ArchiveRecoveryRequested is true.  So in the below check[1] instead of
(!StandbyMode), we can just put (! ArchiveRecoveryRequested), and then
we don't need any other fix.  Am I missing anything?

[1]
ReadRecord{
..record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
        if (record == NULL)
        {
            /*
             * When not in standby mode we find that WAL ends in an incomplete
             * record, keep track of that record.  After recovery is done,
             * we’ll write a record to indicate to downstream WAL readers that
             * that portion is to be ignored.
             */
            if (!StandbyMode &&
                !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
            {
                abortedRecPtr = xlogreader->abortedRecPtr;
                missingContrecPtr = xlogreader->missingContrecPtr;
            }

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: standby promotion can create unreadable WAL

From
Robert Haas
Date:
On Fri, Aug 26, 2022 at 8:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> ArchiveRecoveryRequested is true.  So in the below check[1] instead of
> (!StandbyMode), we can just put (! ArchiveRecoveryRequested), and then
> we don't need any other fix.  Am I missing anything?
>
> [1]
> ReadRecord{
> ..record = XLogPrefetcherReadRecord(xlogprefetcher, &errormsg);
>         if (record == NULL)
>         {
>             /*
>              * When not in standby mode we find that WAL ends in an incomplete
>              * record, keep track of that record.  After recovery is done,
>              * we’ll write a record to indicate to downstream WAL readers that
>              * that portion is to be ignored.
>              */
>             if (!StandbyMode &&
>                 !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
>             {
>                 abortedRecPtr = xlogreader->abortedRecPtr;
>                 missingContrecPtr = xlogreader->missingContrecPtr;
>             }

I agree. Testing StandbyMode here seems bogus. I thought initially
that the test should perhaps be for InArchiveRecovery rather than
ArchiveRecoveryRequested, but I see that the code which switches to a
new timeline cares about ArchiveRecoveryRequested, so I think that is
the correct thing to test here as well.

Concretely, I propose the following patch.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: standby promotion can create unreadable WAL

From
Alvaro Herrera
Date:
On 2022-Aug-26, Robert Haas wrote:

> I agree. Testing StandbyMode here seems bogus. I thought initially
> that the test should perhaps be for InArchiveRecovery rather than
> ArchiveRecoveryRequested, but I see that the code which switches to a
> new timeline cares about ArchiveRecoveryRequested, so I think that is
> the correct thing to test here as well.

Yeah, I think you had already established elsewhere that testing
StandbyMode was the wrong thing to do.  Testing ArchiveRecoveryRequested
here seems quite odd at first, but given the copying behavior, I agree
that it seems a correct thing to do.

There's a small typo in the comment: "When find that".  I suppose that
was meant to be "When we find that".  You end that para with "and thus
we should not do this", but that sounds like it wouldn't matter if we
did.  Maybe "and thus doing this would be wrong, so skip it." or
something like that.  (Perhaps be even more specific and say "if we did
this, we would later create an overwrite record in the wrong place,
breaking everything")

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/



Re: standby promotion can create unreadable WAL

From
Robert Haas
Date:
On Fri, Aug 26, 2022 at 10:06 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> There's a small typo in the comment: "When find that".  I suppose that
> was meant to be "When we find that".  You end that para with "and thus
> we should not do this", but that sounds like it wouldn't matter if we
> did.  Maybe "and thus doing this would be wrong, so skip it." or
> something like that.  (Perhaps be even more specific and say "if we did
> this, we would later create an overwrite record in the wrong place,
> breaking everything")

I think that saying that someone should not do something implies
pretty clearly that it would be bad if they did. But I have no problem
with your more specific language, and as a general rule, it's good to
be specific, so let's use that.

v2 attached.

Thanks for chiming in.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: standby promotion can create unreadable WAL

From
"Imseih (AWS), Sami"
Date:
>    I agree. Testing StandbyMode here seems bogus. I thought initially
>    that the test should perhaps be for InArchiveRecovery rather than
>    ArchiveRecoveryRequested, but I see that the code which switches to a
>    new timeline cares about ArchiveRecoveryRequested, so I think that is
>    the correct thing to test here as well.

>    Concretely, I propose the following patch.

This  patch looks similar to the change suggested in 
https://www.postgresql.org/message-id/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com
to deal with panics after promoting a standby.

The difference is the patch tests !ArchiveRecoveryRequested instead
of !StandbyModeRequested as proposed in the mentioned thread.


Thanks
--
Sami Imseih
Amazon Web Services


Re: standby promotion can create unreadable WAL

From
Robert Haas
Date:
On Fri, Aug 26, 2022 at 11:59 AM Imseih (AWS), Sami <simseih@amazon.com> wrote:
> >    I agree. Testing StandbyMode here seems bogus. I thought initially
> >    that the test should perhaps be for InArchiveRecovery rather than
> >    ArchiveRecoveryRequested, but I see that the code which switches to a
> >    new timeline cares about ArchiveRecoveryRequested, so I think that is
> >    the correct thing to test here as well.
>
> >    Concretely, I propose the following patch.
>
> This  patch looks similar to the change suggested in
> https://www.postgresql.org/message-id/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com
> to deal with panics after promoting a standby.
>
> The difference is the patch tests !ArchiveRecoveryRequested instead
> of !StandbyModeRequested as proposed in the mentioned thread.

OK, I didn't realize this bug had been independently discovered and it
looks like I was even involved in the previous discussion. I just
totally forgot about it.

I think, however, that your fix is wrong and this one is right.
Fundamentally, the server is either in normal running, or crash
recovery, or archive recovery. Standby mode is just an optional
behavior of archive recovery, controlling whether or not we keep
retrying once the end of WAL is reached. But there's no reason why the
server should put the contrecord at a different location when recovery
ends depending on that retry behavior. The only thing that matters is
whether we're going to switch timelines.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: standby promotion can create unreadable WAL

From
"Imseih (AWS), Sami"
Date:
>    I think, however, that your fix is wrong and this one is right.
>    Fundamentally, the server is either in normal running, or crash
>    recovery, or archive recovery. Standby mode is just an optional
>    behavior of archive recovery

Good point. Thanks for clearing my understanding.

Thanks
--
Sami Imseih
Amazon Web Services


Re: standby promotion can create unreadable WAL

From
Dilip Kumar
Date:
On Fri, Aug 26, 2022 at 7:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Aug 26, 2022 at 10:06 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > There's a small typo in the comment: "When find that".  I suppose that
> > was meant to be "When we find that".  You end that para with "and thus
> > we should not do this", but that sounds like it wouldn't matter if we
> > did.  Maybe "and thus doing this would be wrong, so skip it." or
> > something like that.  (Perhaps be even more specific and say "if we did
> > this, we would later create an overwrite record in the wrong place,
> > breaking everything")
>
> I think that saying that someone should not do something implies
> pretty clearly that it would be bad if they did. But I have no problem
> with your more specific language, and as a general rule, it's good to
> be specific, so let's use that.
>
> v2 attached.

The patch LGTM, this patch will apply on master and v15.  PFA patch
for back branches.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: standby promotion can create unreadable WAL

From
Kyotaro Horiguchi
Date:
At Sun, 28 Aug 2022 10:16:21 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Fri, Aug 26, 2022 at 7:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > v2 attached.
> 
> The patch LGTM, this patch will apply on master and v15.  PFA patch
> for back branches.

StandbyMode is obviously wrong.  On the other hand I thought that
!ArchiveRecoveryRequested is somewhat wrong, too (, as I stated in the
pointed thread).  On second thought, I changed my mind that it is
right. After aborted contrec is found, The cause of the confusion is
that I somehow thought that archive recovery continues from the
aborted-contrec record.  However, that assumption is wrong. The next
redo starts from the beginning of the aborted contrecord so we should
forget abouat the old missing/aborted contrec info when archive
recovery is requested.

In the end, the point is that we need to set the global variables only
when XLogPrefetcherReadRecord() (or XLogReadRecord()) returns NULL and
we return it to the caller.  Is it worth to do a small refactoring
like the attached?  If no, I'm fine with the proposed patch including
the added assertion.

# I havent reproduce the issue of the OP in the other thread yet, and
# also not found how to reproduce this issue, though..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: standby promotion can create unreadable WAL

From
Kyotaro Horiguchi
Date:
At Mon, 29 Aug 2022 13:13:52 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> we return it to the caller.  Is it worth to do a small refactoring
> like the attached?  If no, I'm fine with the proposed patch including
> the added assertion.

Mmm. That seems wrong. So forget about that.  The proposed patch looks
fine to me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: standby promotion can create unreadable WAL

From
Dilip Kumar
Date:
On Fri, Aug 26, 2022 at 6:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Aug 23, 2022 at 12:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> However, if anything
> > did try to look at file #4 it would get confused. Maybe that can
> > happen if this is a streaming standby, where we only write an
> > end-of-recovery record upon promotion, rather than a checkpoint, or
> > maybe if there are cascading standbys someone could try to actually
> > use the 000000020000000000000004 file for something. I'm not sure. But
> > unless I'm missing something, that file is bogus, and our only hope of
> > not having problems is that perhaps no one will ever look at it.

I tried to see the problem with the cascading standby, basically the
setup is like below
pgprimary->pgstandby(archive only)->pgcascade(streaming + archive).

The second node has to be archive only because this 0 filled gap is
created in archive only mode.  With that I have noticed that the when
cascading standby is getting that 0 filled gap it report same error
what we seen with pg_waldump and that it keep waiting forever on that
file.  I have attached a test case, but I think timing is not done
perfectly in this test so before the cascading standby setup some of
the WAL file get removed by the pgstandby so I just put direct return
in RemoveOldXlogFiles() to test this[2].  And this problem is getting
resolved with the patch given by Robert upthread.

[1]
2022-08-25 16:21:26.413 IST [18235] LOG:  invalid record length at
0/FFFFEA8: wanted 24, got 0

[2]
diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index eb5115f..990a879 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3558,6 +3558,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr
lastredoptr, XLogRecPtr endptr,
        XLogSegNo       endlogSegNo;
        XLogSegNo       recycleSegNo;

+       return;
        /* Initialize info about where to try to recycle to */
        XLByteToSeg(endptr, endlogSegNo, wal_segment_size);
        recycleSegNo = XLOGfileslop(lastredoptr);

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: standby promotion can create unreadable WAL

From
Robert Haas
Date:
On Mon, Aug 29, 2022 at 12:21 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> Mmm. That seems wrong. So forget about that.  The proposed patch looks
> fine to me.

Thanks for thinking it over. Committed and back-patched as far as v10,
since that's the oldest supported release.

-- 
Robert Haas
EDB: http://www.enterprisedb.com