Re: Crash by targetted recovery - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: Crash by targetted recovery
Date
Msg-id 0db8f83f-5b66-75e9-6d99-93f19fc33165@oss.nttdata.com
Whole thread Raw
In response to Re: Crash by targetted recovery  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Crash by targetted recovery  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers

On 2020/03/05 12:08, Kyotaro Horiguchi wrote:
> At Mon, 2 Mar 2020 20:54:04 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
>>> And random access during StandbyMode ususally (always?) lets RecPtr go
>>> back.  I'm not sure WaitForWALToBecomeAvailable works correctly if we
>>> don't have a file in pg_wal and the REDO point is far back by more
>>> than a segment from the initial checkpoint record.
>>
>> It works correctly. This is why WaitForWALToBecomeAvailable() uses
>> fetching_ckpt argument.
> 
> Hmm. You're right. We start streaming from RedoStartLSN when
> fetching_ckpt.  So that doesn't happen.
> 
>>> If we go back to XLOG_FROM_ARCHIVE by random access, it correctly
>>> re-connects to the primary for the past segment.
>>
>> But this can lead to unnecessary restart of walreceiver. Since
>> fetching_ckpt ensures that the WAL file containing the REDO
>> starting record exists in pg_wal, there is no need to reconnect
>> to the primary when reading the REDO starting record.
>>
>> Is there other case where we need to go back to XLOG_FROM_ARCHIVE
>> by random access?
> 
> I understand that the reconnection for REDO record is useless. Ok I
> take the !StandbyMode way.
> 
> The attached is the test script that is changed to count the added
> test, and the slight revised main patch.

Thanks for the patch!

+        /* Wal receiver should not active when entring XLOG_FROM_ARCHIVE */
+        Assert(!WalRcvStreaming());

+1 to add this assertion check.

Isn't it better to always check this while trying to read WAL from
archive or pg_wal? So, what about the following change?

                 {
                         case XLOG_FROM_ARCHIVE:
                         case XLOG_FROM_PG_WAL:
+                               /*
+                                * WAL receiver should not be running while trying to
+                                * read WAL from archive or pg_wal.
+                                */
+                               Assert(!WalRcvStreaming());
+
                                 /* Close any old file we might have open. */
                                 if (readFile >= 0)


+        lastSourceFailed = false; /* We haven't failed on the new source */

Is this really necessary? Since ReadRecord() always reset
lastSourceFailed to false, it seems not necessary.


-    else if (currentSource == 0)
+    else if (currentSource == 0 ||

Though this is a *separate topic*, 0 should be XLOG_FROM_ANY?
There are some places where 0 is used as the value of currentSource.
IMO they should be updated so that XLOG_FROM_ANY is used instead of 0.

Regards,

-- 
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters



pgsql-hackers by date:

Previous
From: Darafei "Komяpa" Praliaskouski
Date:
Subject: Re: Berserk Autovacuum (let's save next Mandrill)
Next
From: Fujii Masao
Date:
Subject: Re: Some problems of recovery conflict wait events