Race condition in recovery?

From
Dilip Kumar
Date:
While analyzing a customer issue, the logs suggested that there is a
race condition in the recovery process.

The summary of the issue is that one of the standbys is promoted as the
new primary (Node2), and another standby (Node3) is restarted with
primary_conninfo and restore_command set so that it can stream/restore
from Node2 (the new primary).  The problem is that during the promotion
the timeline switch happened in the middle of a segment on Node2, and
Node3 is able to restore the newTLI.history file from the archive, but
the WAL file with the new TLI is not yet archived.  Node3 then tries to
stream the WAL from the primary, but while fetching the checkpoint
record it does not use the latest timeline; it requests the checkpoint
timeline instead, and the walsender sends the WAL from the new
timeline's file because the requested WAL recptr is before the TLI
switch.  Once that has happened, expectedTLEs is set based on the
oldTLI.history file.  From then on, whenever we try to restore the
required WAL file and recoveryTargetTimeLineGoal is set to latest, we
rescan for the latest timeline (rescanLatestTimeLine), but
recoveryTargetTLI was already set to the latest timeline.  So now
expectedTLEs is based on the oldTLI while recoveryTargetTLI is set to
the newTLI, and rescanLatestTimeLine will never update expectedTLEs.
From that point, streaming keeps failing because the walsender
terminates at the end of the WAL on the old TLI, and restoring from the
archive also fails because that file does not exist anymore (it was
renamed to .partial).  Basically, we will never try the newTLI.

Below is the sequence of events based on my analysis.

Sequence of events
-----------------------------------------
Node1 primary, Node2 standby1, Node3 standby2

1. Node2 got promoted to the new primary, and picked the new TLI 13 in
the middle of a segment.
2. Node3 was restarted with primary_conninfo pointing to Node2 and a
restore_command set.
3. Node3 found the newest TLI 13 in validateRecoveryParameters(),
because the latest TLI was requested in recovery.conf (the history file
was restored from TLI 13), and set recoveryTargetTLI to 13.
The point to note is that recoveryTargetTLI is set to 13 but
expectedTLEs is not yet set.
4. Node3 entered standby mode.
5. Node3 tries to read the checkpoint record; on Node3 the checkpoint
TLI (ControlFile->checkPointCopy.ThisTimeLineID) is still 12.
6. Node3 tries to get the checkpoint record's segment with the new TLI
13 from the archive, which it should ideally get, but it may not if
Node2 has not archived it yet.
7. Node3 tries to stream from the primary, but using TLI 12, because
ControlFile->checkPointCopy.ThisTimeLineID is 12.
8. Node3 gets the record, because the walsender on Node2 reads it from
the TLI 13 segment and sends it, and the walreceiver on Node3 writes it
into a new WAL file named with TLI 12.

WalSndSegmentOpen()
{
    /*-------
     * When reading from a historic timeline, and there is a timeline switch
     * within this segment, read from the WAL segment belonging to the new
     * timeline.
     */
    ...
}

9. Node3 now sets expectedTLEs based on TLI 12, because that is the
timeline the walreceiver streamed the WAL with.

10. Node3 now has recoveryTargetTLI = 13 and expectedTLEs based on TLI
12.  So whenever it tries to find the latest TLI (rescanLatestTimeLine)
it finds 13, which is the same as recoveryTargetTLI, so nothing is
changed; but expectedTLEs is still based on 12, which it keeps using to
restore/stream further WAL, failing every time.

rescanLatestTimeLine(void)
{
    ....
    newtarget = findNewestTimeLine(recoveryTargetTLI);
    if (newtarget == recoveryTargetTLI)
    {
        /* No new timelines found */
        return false;
    }

11. So the situation is that ideally rescanLatestTimeLine() should pick
up the latest TLI, but it assumes we are already on the latest TLI
because recoveryTargetTLI already points to it.

Other points to note:
- The actual issue happened on 9.6.11, but based on a code comparison
it appears that the same can occur on the latest code as well.
- After Node3 was shut down, the WAL in its pg_wal/ directory was
removed so that it could follow the new primary.

Based on the sequence of events it is quite clear that something is
wrong in rescanLatestTimeLine.  Maybe after selecting the latest TLI we
should compare it with the head of expectedTLEs as well, instead of
only comparing it to recoveryTargetTLI?
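
To illustrate the idea (this is only a rough, untested sketch, not a
proposed patch; it assumes the usual list API from pg_list.h and
TimeLineHistoryEntry from timeline.h), rescanLatestTimeLine() could
treat the state as up to date only when the head of expectedTLEs also
matches the newest timeline:

    /* Hypothetical sketch of the idea above, not a tested fix */
    newtarget = findNewestTimeLine(recoveryTargetTLI);
    if (newtarget == recoveryTargetTLI &&
        expectedTLEs != NIL &&
        ((TimeLineHistoryEntry *) linitial(expectedTLEs))->tli == newtarget)
    {
        /* No new timelines found, and expectedTLEs already follows it */
        return false;
    }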


Log from Node2:
2020-12-22 04:49:02 UTC [1019]: [9-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  received promote request
2020-12-22 04:49:02 UTC [1019]: [10-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  redo done at 0/F8000028
2020-12-22 04:49:02 UTC [1019]: [11-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  last completed transaction was at log time
2020-12-22 04:48:01.833476+00
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000D.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:02 UTC [1019]: [12-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  selected new timeline ID: 13
2020-12-22 04:49:02 UTC [1019]: [13-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  archive recovery complete


Log from Node3 (with pointwise analysis):

1. Node3 restarted, restored 0000000D.history from the archive, and
recoveryTargetTLI is set to 13.
2020-12-22 04:49:40 UTC [2896]: [2-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  database system is shut down
2020-12-22 04:49:40 UTC [2872]: [6-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  database system is shut down
2020-12-22 04:49:41 UTC [9082]: [1-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  database system was shut down in recovery at
2020-12-22 04:49:40 UTC
2020-12-22 04:49:41 UTC [9082]: [2-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  creating missing WAL directory
"pg_xlog/archive_status"
2020-12-22 04:49:41 UTC [9082]: [3-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restored log file "0000000D.history" from archive
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:41 UTC [9082]: [4-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  entering standby mode
2020-12-22 04:49:41 UTC [9082]: [5-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restored log file "0000000D.history" from archive


2. Node3 tries to get the WAL file for various timelines from the
archive but does not find it, so expectedTLEs is not yet set.

rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000D00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000B00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000100000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]

3. Since it is fetching the checkpoint record, it uses the checkpoint
TLI, which is 12, although the primary no longer has the file
0000000C00000000000000F8, as it renamed it to
0000000C00000000000000F8.partial.

However, the walsender has logic to send the requested WAL from the
current timeline's segment if it is present there, so it streams from
the file 0000000D00000000000000F8.

2020-12-22 04:49:42 UTC [9105]: [1-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  fetching timeline history file for timeline 12
from primary server
2020-12-22 04:49:42 UTC [9105]: [2-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  started streaming WAL from primary at 0/F8000000
on timeline 12
2020-12-22 04:49:42 UTC [9105]: [3-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  replication terminated by primary server
2020-12-22 04:49:42 UTC [9105]: [4-1] user=; db=; app=; client=;
SQLcode=00000 DETAIL:  End of WAL reached on timeline 12 at
0/F8000098.



4. Now, since the walreceiver streamed the WAL on TLI 12, receiveTLI is
12 and expectedTLEs is set based on 0000000C.history.
See the logic below in WaitForWALToBecomeAvailable:
                        if (readFile < 0)
                        {
                            if (!expectedTLEs)
                                expectedTLEs = readTimeLineHistory(receiveTLI);


2020-12-22 04:49:42 UTC [9082]: [6-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restored log file "0000000C.history" from archive
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:42 UTC [9082]: [7-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restored log file "0000000C.history" from archive
2020-12-22 04:49:42 UTC [9082]: [8-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  consistent recovery state reached at 0/F8000098
2020-12-22 04:49:42 UTC [9082]: [9-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  invalid record length at 0/F8000098: wanted 24,
got 0
2020-12-22 04:49:42 UTC [9074]: [3-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  database system is ready to accept read only
connections
2020-12-22 04:49:42 UTC [9105]: [5-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restarted WAL streaming at 0/F8000000 on timeline
12
2020-12-22 04:49:42 UTC [9105]: [6-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  replication terminated by primary server
2020-12-22 04:49:42 UTC [9105]: [7-1] user=; db=; app=; client=;
SQLcode=00000 DETAIL:  End of WAL reached on timeline 12 at
0/F8000098.


5. Now the head of expectedTLEs is 12 and recoveryTargetTLI is 13, so
rescanLatestTimeLine always assumes we are already on the latest TLI,
but we keep trying to restore/stream using expectedTLEs, which holds
the old TLI.

rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
2020-12-22 04:49:47 UTC [9105]: [8-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  restarted WAL streaming at 0/F8000000 on timeline
12
2020-12-22 04:49:47 UTC [9105]: [9-1] user=; db=; app=; client=;
SQLcode=00000 LOG:  replication terminated by primary server
2020-12-22 04:49:47 UTC [9105]: [10-1] user=; db=; app=; client=;
SQLcode=00000 DETAIL:  End of WAL reached on timeline 12 at
0/F8000098.
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed:
No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8"
failed: No such file or directory (2)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Thu, Jan 21, 2021 at 4:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 8. Node3, get it because walsender of Node2 read it from TL13 and send
> it and Node2 write in the new WAL file but with TL12.
>
> WalSndSegmentOpen()
> {
> /*-------
> * When reading from a historic timeline, and there is a timeline switch
> * within this segment, read from the WAL segment belonging to the new
> * timeline.
> }
>
> 9. Node3, now set the expectedTLEs to 12 because that is what
> walreceiver has streamed the WAL using.

This seems to be incorrect, because the comment for expectedTLEs says:

 * expectedTLEs: a list of TimeLineHistoryEntries for recoveryTargetTLI
 * and the timelines of its known parents, newest first (so
 * recoveryTargetTLI is always the first list member).  Only these TLIs
 * are expected to be seen in the WAL segments we read, and indeed only
 * these TLIs will be considered as candidate WAL files to open at all.

But in your scenario apparently we end up with a situation that
contradicts that, because you go on to say:

> 10. Node3, now recoveryTargetTLI is 13 and expectedTLEs is 12.  So

As I understand, expectedTLEs should end up being a list where the
first element is the timeline we want to end up on, and the last
element is the timeline where we are now, and every timeline in the
list branches off of the next timeline in the list. So here if 13
branches off of 12, then the list should be 13,12, not just 12; and if 13
does not branch off of 12 OR if 13 branches off of 12 at an earlier
point in the WAL stream than where we are now, then that should be an
error that shuts down the standby, because then there is no way for
replay to get from where it is now to the correct timeline.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Fri, Jan 22, 2021 at 2:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jan 21, 2021 at 4:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > 8. Node3, get it because walsender of Node2 read it from TL13 and send
> > it and Node2 write in the new WAL file but with TL12.
> >
> > WalSndSegmentOpen()
> > {
> > /*-------
> > * When reading from a historic timeline, and there is a timeline switch
> > * within this segment, read from the WAL segment belonging to the new
> > * timeline.
> > }
> >
> > 9. Node3, now set the expectedTLEs to 12 because that is what
> > walreceiver has streamed the WAL using.
>
> This seems to be incorrect, because the comment for expectedTLEs says:
>
>  * expectedTLEs: a list of TimeLineHistoryEntries for
> recoveryTargetTLI and the timelines of
>  * its known parents, newest first (so recoveryTargetTLI is always the
>  * first list member).  Only these TLIs are expected to be seen in the WAL
>  * segments we read, and indeed only these TLIs will be considered as
>  * candidate WAL files to open at all.
>
> But in your scenario apparently we end up with a situation that
> contradicts that, because you go on to say:
>
> > 10. Node3, now recoveryTargetTLI is 13 and expectedTLEs is 12.  So
>
> As I understand, expectedTLEs should end up being a list where the
> first element is the timeline we want to end up on, and the last
> element is the timeline where we are now, and every timeline in the
> list branches off of the next timeline in the list. So here if 13
> branches of 12 then the list should be 13,12 not just 12; and if 13
> does not branch off of 12 OR if 13 branches off of 12 at an earlier
> point in the WAL stream than where we are now, then that should be an
> error that shuts down the standby, because then there is no way for
> replay to get from where it is now to the correct timeline.

Yeah, I agree with this.  So IMHO expectedTLEs should be set based on
recoveryTargetTLI instead of receiveTLI.  Given the definition of
expectedTLEs, it can never be correct to set it based on receiveTLI.

Basically, the simple fix could be this.

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b18257c198..465bc7c929 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12533,7 +12533,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                         if (readFile < 0)
                         {
                             if (!expectedTLEs)
-                                expectedTLEs = readTimeLineHistory(receiveTLI);
+                                expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
+
                             readFile = XLogFileRead(readSegNo, PANIC,
                                                     receiveTLI,
                                                     XLOG_FROM_STREAM, false);

But I am not sure whether this use of receiveTLI was chosen based on
some specific analysis or was part of some bug fix.  I will try to find
the history of this code, and maybe based on that we can make a better
decision.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Tue, Mar 2, 2021 at 3:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> =====
> ee994272ca50f70b53074f0febaec97e28f83c4e
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>  2013-01-03 14:11:58
> Committer: Heikki Linnakangas <heikki.linnakangas@iki.fi>  2013-01-03 14:11:58
>
>     Delay reading timeline history file until it's fetched from master.
>
>     Streaming replication can fetch any missing timeline history files from the
>     master, but recovery would read the timeline history file for the target
>     timeline before reading the checkpoint record, and before walreceiver has
>     had a chance to fetch it from the master. Delay reading it, and the sanity
>     checks involving timeline history, until after reading the checkpoint
>     record.
>
>     There is at least one scenario where this makes a difference: if you take
>     a base backup from a standby server right after a timeline switch, the
>     WAL segment containing the initial checkpoint record will begin with an
>     older timeline ID. Without the timeline history file, recovering that file
>     will fail as the older timeline ID is not recognized to be an ancestor of
>     the target timeline. If you try to recover from such a backup, using only
>     streaming replication to fetch the WAL, this patch is required for that to
>     work.
> =====

The above commit avoids initializing expectedTLEs from
recoveryTargetTLI, as shown in the hunk below from that commit.

@@ -5279,49 +5299,6 @@ StartupXLOG(void)
      */
     readRecoveryCommandFile();

-    /* Now we can determine the list of expected TLIs */
-    expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
-

I think the fix for the problem could be that, after reading/validating
the checkpoint record, we free the current value of expectedTLEs and
reinitialize it based on recoveryTargetTLI, as shown in the attached
patch.
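
The core of that idea would be something like the following sketch
(this is only an illustration and may differ from the attached patch;
list_free_deep and the other list helpers are from pg_list.h):

    /* Hypothetical sketch: after the checkpoint record has been read and
     * validated, make expectedTLEs reflect the recovery target timeline
     * rather than the timeline the checkpoint record was read from. */
    if (expectedTLEs &&
        ((TimeLineHistoryEntry *) linitial(expectedTLEs))->tli != recoveryTargetTLI)
    {
        list_free_deep(expectedTLEs);
        expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
    }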

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> I think the fix for the problem will be that, after reading/validating
> the checkpoint record, we can free the current value of expectedTLEs
> and reinitialize it based on the recoveryTargetTLI as shown in the
> attached patch?

I'm not sure I understand the issue here.  I think the attached script
should reproduce the issue mentioned here, but it didn't for me.

The result of running the attached test script is shown below.  TLIs
are adjusted in your descriptions cited below. The lines prefixed by
NodeN> are the server log lines written while running the attached
test script.

> 1. Node2 got promoted to new primary, and node 2 picked new TL 2 in
>   the middle of the segment 3.

Node2>  LOG:  selected new timeline ID: 2

> 2. Node3, restarted with new primary info of Node2 and restore command

Node2> node_3 LOG:  received replication command: IDENTIFY_SYSTEM

> 3. Node3, found the newest TL2 in validateRecoveryParameters() Because
>   the latest TL was requested in recovery.conf (history file restored
>   from TL2) and set recoveryTargetTLI to 2 So point to note is
>   recoveryTargetTLI is set to 2 but expectedTLEs is not yet set.

This means you specified recovery_target_timeline?  Either way,
expectedTLEs is not relevant to the behavior here.  Even if
recovery_target_timeline is set to latest, findNewestTimeLine doesn't
look at it.

Node3> LOG:  restored log file "00000002.history" from archive
  
> 4. Node3, entered into the standby mode.

Node3> LOG:  entering standby mode

> 5. Node3, tries to read the checkpoint Record, on Node3 still the
>   checkpoint TL (ControlFile->checkPointCopy.ThisTimeLineID) is 1.

expectedTLEs is loaded just before fetching the last checkpoint.

ReadCheckpointRecord doesn't consider checkPointCopy.ThisTimeLineID.

The checkpoint is read with that TLI because the segment file found in
the pg_wal directory was that of the newest TLI in expectedTLEs.  If
the segment for TLI=2 containing the last checkpoint had been archived,
the checkpoint record would be read as TLI=2, and replication would
start at TLI=2 in that case because archive recovery has reached that
timeline.  (Turn on the optional section in the attached test script to
see this behavior.)  This is the expected behavior, since we assume
that the segment files for TLI=n and n+1 are identical in the TLI=n
part.

Anyway the checkpoint that is read is on TLI=1 in this case and
replication starts at TLI=1.

Node3> LOG:  Checkpoint record: TLI=1, 0/3014F78
  
> 6. Node3, tries to get the checkpoint record file using new TL2 from
>   the archive which it should get ideally but it may not if the Node2
>   haven't yet archived it.

This doesn't happen for me.  Instead, node3 runs recovery from the
checkpoint up to the end of the archived WAL. In this case the end
point is 3014FF0@TLI=1.

Node3> LOG:  invalid record length at 0/3014FF0: wanted 24, got 0

Then, node3 connects to node2 requesting TLI=1 because the history
file (or expectedTLEs) told that the LSN belongs to TLI=1.

Node3> LOG:  0/3014FF0 is on TLI 1
Node3> LOG:  started streaming WAL from primary at 0/3000000 on timeline 1

After a while node2 finds a timeline switch and disconnects the
replication.

Node3> LOG:  replication terminated by primary server
Node3> DETAIL:  End of WAL reached on timeline 1 at 0/3029A68.

After scanning the archive and pg_wal ends in failure, node3 correctly
requests node2 for TLI=2 because expectedTLEs told that the current
LSN belongs to TLI=2.

Node3> LOG:  0/3029A68 is on TLI 2
Node3> LOG:  restarted WAL streaming at 0/3000000 on timeline 2

Finally, the items below don't happen for me, because node3 no longer
needs to go back to the last checkpoint.  Perhaps the script fails to
reproduce your issue correctly.

> 7. Node3, tries to stream from primary but using TL1 because
>   ControlFile->checkPointCopy.ThisTimeLineID is 1.

As mentioned above, checkPointCopy.ThisTimeLineID on either the primary
or the standby is irrelevant to the timeline the primary sends.  The
primary streams the timeline requested by the standby.

> 8. Node3, get it because walsender of Node2 read it from TL2 and send
> it and Node2 write in the new WAL file but with TL1.

The walsender streams the TLI requested by the walreceiver, then
disconnects at the end of that TLI, notifying node3 of the next TLI.
Node3 re-establishes replication with the new TLI.  Looking into the
pg_wal of node3, segment 3 for both TLI=1 and TLI=2 is filled with the
correct content.

So... I don't understand what you are saying is the race condition.

An issue that may be slightly relevant to this case has been raised
[1], but it is about writing the end-of-recovery checkpoint into the
older timeline.

Could you please fix the test script so that it causes your issue
correctly? And/or elaborate a bit more?

The first attached file adds debugging-aid logging.  The second is the
test script, to be placed in src/test/recovery/t.


1: https://www.postgresql.org/message-id/CAE-ML%2B_EjH_fzfq1F3RJ1%3DXaaNG%3D-Jz-i3JqkNhXiLAsM3z-Ew%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..9483fd055c 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -55,6 +55,7 @@ restoreTimeLineHistoryFiles(TimeLineID begin, TimeLineID end)
 
     for (tli = begin; tli < end; tli++)
     {
+        elog(LOG, "Trying restoring history file for TLI=%d", tli);
         if (tli == 1)
             continue;
 
@@ -95,6 +96,7 @@ readTimeLineHistory(TimeLineID targetTLI)
 
     if (ArchiveRecoveryRequested)
     {
+        elog(LOG, "Trying reading history file for TLI=%d", targetTLI);
         TLHistoryFileName(histfname, targetTLI);
         fromArchive =
             RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
@@ -231,6 +233,7 @@ existsTimeLineHistory(TimeLineID probeTLI)
 
     if (ArchiveRecoveryRequested)
     {
+        elog(LOG, "Probing history file for TLI=%d", probeTLI);
         TLHistoryFileName(histfname, probeTLI);
         RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
     }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adfc6f67e2..e31e1f0ce3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3732,6 +3732,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
             snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
                      xlogfname);
             set_ps_display(activitymsg);
+            elog(LOG, "Trying fetching history file for TLI=%d", tli);
             restoredFromArchive = RestoreArchivedFile(path, xlogfname,
                                                       "RECOVERYXLOG",
                                                       wal_segment_size,
@@ -3825,7 +3826,10 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
     if (expectedTLEs)
         tles = expectedTLEs;
     else
+    {
+        elog(LOG, "Loading history file for TLI=%d (2)", recoveryTargetTLI);
         tles = readTimeLineHistory(recoveryTargetTLI);
+    }
 
     foreach(cell, tles)
     {
@@ -3839,6 +3843,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
          * Skip scanning the timeline ID that the logfile segment to read
          * doesn't belong to
          */
+        elog(LOG, "scanning segment %lX TLI %d, source %d", segno, tli, source);
+
         if (hent->begin != InvalidXLogRecPtr)
         {
             XLogSegNo    beginseg = 0;
@@ -3865,6 +3871,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
                               XLOG_FROM_ARCHIVE, true);
             if (fd != -1)
             {
+                elog(LOG, "found segment %lX TLI %d, from archive", segno, tli);
                 elog(DEBUG1, "got WAL segment from archive");
                 if (!expectedTLEs)
                     expectedTLEs = tles;
@@ -3878,6 +3885,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
                               XLOG_FROM_PG_WAL, true);
             if (fd != -1)
             {
+                elog(LOG, "found segment %lX TLI %d, from PG_WAL", segno, tli);
                 if (!expectedTLEs)
                     expectedTLEs = tles;
                 return fd;
@@ -8421,7 +8429,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 
     XLogBeginRead(xlogreader, RecPtr);
     record = ReadRecord(xlogreader, LOG, true);
-
+    elog(LOG, "Checkpoint record: TLI=%d, %X/%X, rectargetTLI=%d, exptles=%p", xlogreader->seg.ws_tli, LSN_FORMAT_ARGS(xlogreader->ReadRecPtr), recoveryTargetTLI, expectedTLEs);
 
     if (record == NULL)
     {
         if (!report)
@@ -12628,7 +12636,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                              * TLI, rather than the position we're reading.
                              */
                             tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
-
+                            elog(LOG, "%X/%X is on TLI %X", LSN_FORMAT_ARGS(tliRecPtr), tli);
                             if (curFileTLI > 0 && tli < curFileTLI)
                                 elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
                                      LSN_FORMAT_ARGS(tliRecPtr),
@@ -12697,7 +12705,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                         if (readFile < 0)
                         {
                             if (!expectedTLEs)
+                            {
+                                elog(LOG, "Loading expectedTLEs for %d", receiveTLI);
                                 expectedTLEs = readTimeLineHistory(receiveTLI);
+                            }
                             readFile = XLogFileRead(readSegNo, PANIC,
                                                     receiveTLI,
                                                     XLOG_FROM_STREAM, false);
# Minimal test testing streaming replication
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 2;

my $primary = get_new_node('primary');
$primary->init(allows_streaming => 1);
$primary->start;
my $backup_name = 'my_backup';

$primary->backup($backup_name);

my $node_2 = get_new_node('node_2');
$node_2->init_from_backup($primary, $backup_name,
      has_streaming => 1,
      allows_streaming => 1);
$node_2->append_conf('postgresql.conf', "archive_mode = always");
my $archdir = $node_2->data_dir . "/../archive";
$node_2->append_conf('postgresql.conf', "archive_command = 'cp %p $archdir/%f'");

mkdir($archdir);

$node_2->start;

# Create streaming standby linking to primary
my $node_3 = get_new_node('node_3');
$node_3->init_from_backup($primary, $backup_name,
    has_streaming => 1);
$node_3->append_conf('postgresql.conf', "restore_command = 'cp $archdir/%f %p'");
$node_3->start;

$primary->psql('postgres', 'SELECT pg_switch_wal(); CREATE TABLE t(); DROP TABLE t; CHECKPOINT;');
$primary->wait_for_catchup($node_2, 'write',
                                $primary->lsn('insert'));
$primary->wait_for_catchup($node_3, 'write',
                                $primary->lsn('insert'));
$node_3->stop;

# put node3 a bit behind to cause streaming on the old timeline
$primary->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;');
$primary->wait_for_catchup($node_2, 'write',
                           $primary->lsn('insert'));

$primary->stop;

# promote node2
$node_2->psql('postgres', 'SELECT pg_promote()');

# optional: archive segment 3 of TLI=2 on node2 and advance one more segment
if (0)
{
    my $lastwal = $node_2->safe_psql('postgres', 'select last_archived_wal from pg_stat_archiver');
    $node_2->psql('postgres', 'SELECT pg_switch_wal();');
    $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;');
    $node_2->poll_query_until('postgres', "SELECT '$lastwal' <> last_archived_wal from pg_stat_archiver");

    $lastwal = $node_2->safe_psql('postgres', 'select last_archived_wal from pg_stat_archiver');
    $node_2->psql('postgres', 'SELECT pg_switch_wal();');
    $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;');
    $node_2->poll_query_until('postgres', "SELECT '$lastwal' <> last_archived_wal from pg_stat_archiver");
}

# attach node3 as a standby of node2
$node_3->enable_streaming($node_2);
$node_3->append_conf('postgresql.conf', "recovery_target_timeline = 2");

# *restart* node3, not just reload it, to trigger archive recovery
$node_3->start;
$node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;');
$node_2->psql('postgres', 'SELECT pg_switch_wal();');

# XXX: another defect comes out without this X(
$node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;');

$node_2->wait_for_catchup($node_3, 'write',
                                $node_2->lsn('insert'));
my $result = 
  $node_2->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn() = write_lsn FROM pg_stat_replication');

ok($result eq 't', 'check');

# set 0 to leave data directories after a successful run
ok(1, 'break');


Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> Could you please fix the test script so that it causes your issue
> correctly? And/or elaborate a bit more?
>
> The attached first file is the debugging aid logging. The second is
> the test script, to be placed in src/test/recovery/t.

I will look into your test case and try to see whether we can
reproduce the issue.  But let me summarise what the exact issue is.
Basically, first, in validateRecoveryParameters(), if the recovery
target is the latest then we fetch the latest history file and set
recoveryTargetTLI to the latest available timeline, say 2, but we delay
updating expectedTLEs (as per commit
ee994272ca50f70b53074f0febaec97e28f83c4e).  Now, while reading the
checkpoint record, if we don't get the required WAL from the archive
then we try to get it from the primary, and while getting the
checkpoint from the primary we use
"ControlFile->checkPointCopy.ThisTimeLineID"; suppose that is the older
timeline 1.  After reading the checkpoint we will then set expectedTLEs
based on the timeline from which we got the checkpoint record.

See the logic below in WaitForWALToBecomeAvailable:
                        if (readFile < 0)
                        {
                            if (!expectedTLEs)
                                expectedTLEs = readTimeLineHistory(receiveTLI);

Now, the first problem is that we are breaking the sanity of
expectedTLEs, because as per its definition it should start with
recoveryTargetTLI, but here it starts with the older TLI.  Next, in
rescanLatestTimeLine we try to fetch the latest TLI, which is still 2,
so that logic returns without reinitializing expectedTLEs, because it
assumes that if recoveryTargetTLI already points to 2 then expectedTLEs
must be correct and need not be changed.

See the logic below:

rescanLatestTimeLine(void)
{
    ....
    newtarget = findNewestTimeLine(recoveryTargetTLI);
    if (newtarget == recoveryTargetTLI)
    {
        /* No new timelines found */
        return false;
    }
    ...
    newExpectedTLEs = readTimeLineHistory(newtarget);
    ...
    expectedTLEs = newExpectedTLEs;


Possible solutions:
1. Find a better way to fix the problem addressed by commit
ee994272ca50f70b53074f0febaec97e28f83c4e, which is what breaks the
sanity of expectedTLEs.
2. Assume we have to live with 1, i.e. we have to initialize
expectedTLEs with an older timeline for validating the checkpoint in
the absence of the TLI.history file (as that commit claims).  Then, as
soon as we have read and validated the checkpoint, fix expectedTLEs and
set it based on the history file of recoveryTargetTLI.

Does this explanation make sense?  If not, please let me know which
part is not clear so I can point to the relevant code.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
Thanks.


At Fri, 7 May 2021 11:04:53 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> > Could you please fix the test script so that it causes your issue
> > correctly? And/or elaborate a bit more?
> >
> > The attached first file is the debugging aid logging. The second is
> > the test script, to be placed in src/test/recovery/t.
> 
> I will look into your test case and try to see whether we can
> reproduce the issue.  But let me summarise what is the exact issue.
> Basically, the issue is that first in validateRecoveryParameters if
> the recovery target is the latest then we fetch the latest history
> file and set the recoveryTargetTLI timeline to the latest available
> timeline assume it's 2 but we delay updating the expectedTLEs (as per
> commit ee994272ca50f70b53074f0febaec97e28f83c4e).  Now, while reading

I think it is right up to here.

> the checkpoint record if we don't get the required WAL from the
> archive then we try to get from primary, and while getting checkpoint
> from primary we use "ControlFile->checkPointCopy.ThisTimeLineID"
> suppose that is older timeline 1.  Now after reading the checkpoint we
> will set the expectedTLEs based on the timeline from which we got the
> checkpoint record.

I doubt this point.  ReadCheckpointRecord finally calls
XLogFileReadAnyTLI, which uses the content of 00000002.history as the
local timeline entry list, since expectedTLEs is NIL and
recoveryTargetTLI has been updated to 2 by
validateRecoveryParameters().  But node 3 had only the segment on
TLI=1, so ReadCheckpointRecord() finds the wanted checkpoint record on
TLI=1.  XLogFileReadAnyTLI() copies the local TLE list based on TLI=2
to expectedTLEs just after that, because the wanted checkpoint record
was available based on that list.

So I don't think checkPointCopy.ThisTimeLineID can affect this logic,
and I don't think expectedTLEs is left NIL.  It would be helpful if you
could show the specific code path that causes that.

> See below Logic in WaitForWalToBecomeAvailable
>                         if (readFile < 0)
>                         {
>                             if (!expectedTLEs)
>                                 expectedTLEs = readTimeLineHistory(receiveTLI);
> 
> Now, the first problem is we are breaking the sanity of expectedTLEs
> because as per the definition it should already start with
> recoveryTargetTLI but it is starting with the older TLI.  Now, in

If my description above is correct, expectedTLEs has already been
filled from TLI=2's history, so readTimeLineHistory is not called
there.

After that, things work as described in my previous mail.  So the
following is not an issue, if I'm not missing something.


> rescanLatestTimeLine we are trying to fetch the latest TLI which is
> still 2, so this logic returns without reinitializing the expectedTLEs
> because it assumes that if recoveryTargetTLI is pointing to 2 then
> expectedTLEs must be correct and need not be changed.
> 
> See below logic:
> rescanLatestTimeLine(void)
> {
> ....
> newtarget = findNewestTimeLine(recoveryTargetTLI);
> if (newtarget == recoveryTargetTLI)
> {
> /* No new timelines found */
> return false;
> }
> ...
> newExpectedTLEs = readTimeLineHistory(newtarget);
> ...
> expectedTLEs = newExpectedTLEs;
> 
> 
> Solution:
> 1. Find better way to fix the problem of commit
> (ee994272ca50f70b53074f0febaec97e28f83c4e) which is breaking the
> sanity of expectedTLEs.
> 2. Assume, we have to live with fix 1 and we have to initialize
> expectedTLEs with an older timeline for validating the checkpoint in
> absence of tl.hostory file (as this commit claims).  Then as soon as
> we read and validate the checkpoint, fix the expectedTLEs and set it
> based on the history file of recoveryTargetTLI.
> 
> Does this explanation make sense?  If not please let me know what part
> is not clear in the explanation so I can point to that code.

So, unfortunately not.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Dilip Kumar
Date:
 On Fri, May 7, 2021 at 2:33 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Thanks.
>
>
> At Fri, 7 May 2021 11:04:53 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> > On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > >
> > > At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> > > Could you please fix the test script so that it causes your issue
> > > correctly? And/or elaborate a bit more?
> > >
> > > The attached first file is the debugging aid logging. The second is
> > > the test script, to be placed in src/test/recovery/t.
> >
> > I will look into your test case and try to see whether we can
> > reproduce the issue.  But let me summarise what is the exact issue.
> > Basically, the issue is that first in validateRecoveryParameters if
> > the recovery target is the latest then we fetch the latest history
> > file and set the recoveryTargetTLI timeline to the latest available
> > timeline assume it's 2 but we delay updating the expectedTLEs (as per
> > commit ee994272ca50f70b53074f0febaec97e28f83c4e).  Now, while reading
>
> I think it is right up to here.
>
> > the checkpoint record if we don't get the required WAL from the
> > archive then we try to get from primary, and while getting checkpoint
> > from primary we use "ControlFile->checkPointCopy.ThisTimeLineID"
> > suppose that is older timeline 1.  Now after reading the checkpoint we
> > will set the expectedTLEs based on the timeline from which we got the
> > checkpoint record.
>
> I doubt this point.  ReadCheckpointRecord finally calls
> XLogFileReadAnyTLI and it uses the content of the 00000002.history as
> the local timeline entry list, since expectedTLEs is NIL and
> recoveryTargetTLI has been updated to 2 by
> validateRecoveryParameters(). But node 3 was having only the segment
> on TLI=1 so ReadCheckPointRecord() finds the wanted checkpoint recrod
> on TLI=1.  XLogFileReadAnyTLI() copies the local TLE list based on
> TLI=2 to expectedTLEs just after that because the wanted checkpoint
> record was available based on the list.

Okay, I got your point.  Now consider the scenario where we are trying
to get the checkpoint record in XLogFileReadAnyTLI: you are right that
it returns the history file 00000002.history.  I think I did not
mention one point: the tool, while restarting node 3 after promoting
node 2, deletes all the local WAL of node 3 (so that node 3 can follow
node 2).  So node 3 doesn't have the checkpoint in a local segment.
Suppose the checkpoint record was in segment 000000010000000000000001;
after the TLI switch, 000000010000000000000001 was renamed to
000000010000000000000001.partial on node 2, so now practically nobody
has the 000000010000000000000001 file anywhere.  However, since the TLI
switch happened mid-segment, that segment was copied with the new TLI,
so we have 000000020000000000000001, which contains the checkpoint
record, but node 2 hasn't archived it yet.

So now we come out of XLogFileReadAnyTLI without reading the checkpoint
record and without setting expectedTLEs, because expectedTLEs is only
set if we are able to read the checkpoint record.  Makes sense?

> So I don't think checkPointCopy.ThisTimeLineID cannot affect this
> logic, and don't think expectedTLEs is left with NIL.  It's helpful
> that you could show the specific code path to cause that.

So now expectedTLEs is still NULL, and we go to get the checkpoint
record from the primary using checkPointCopy.ThisTimeLineID.
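
To be specific, while fetching the checkpoint the streaming request is
made with the checkpoint's TLI; the relevant part of
WaitForWALToBecomeAvailable looks roughly like this (paraphrased
excerpt from xlog.c, trimmed for brevity):

                        if (fetching_ckpt)
                        {
                            ptr = RedoStartLSN;
                            tli = ControlFile->checkPointCopy.ThisTimeLineID;
                        }
                        else
                        {
                            ptr = RecPtr;
                            tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
                            ...
                        }
                        ...
                        RequestXLogStreaming(tli, ptr, PrimaryConnInfo, ...);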

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Fri, 7 May 2021 15:16:03 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> Okay, I got your point, now, consider the scenario that we are trying
> to get the checkpoint record in XLogFileReadAnyTLI, you are right that
> it returns history file 00000002.history.  I think I did not mention
> one point, basically, the tool while restarting node 3 after promoting
> node 2 is deleting all the local WAL of node3 (so that node 3 can
> follow node2).  So now node3 doesn't have the checkpoint in the local
> segment.  Suppose checkpoint record was in segment
...
> So now you come out of XLogFileReadAnyTLI, without reading checkpoint
> record and without setting expectedTLEs.  Because expectedTLEs is only
> set if we are able to read the checkpoint record.  Make sense?

Thanks.  I understood the case and reproduced it.  Although I don't
think removing WAL files from a non-backup cluster is legitimate, I
also think we can safely start archive recovery from a replicated
segment.

> So now expectedTLEs is still NULL and you go to get the checkpoint
> record from primary and use checkPointCopy.ThisTimeLineID.

I don't think erasing expectedTLEs after it has been set is the right
fix, because expectedTLEs is supposed to be set just once, if and only
if we are sure that we are going to follow that history, with rescan
being the only exception that changes it.

It seems to me the issue here is not a race condition but
WaitForWALToBecomeAvailable initializing expectedTLEs with the history
of an improper timeline.  So using recoveryTargetTLI instead of
receiveTLI for that case fixes this issue.

-                            if (!expectedTLEs)
-                                expectedTLEs = readTimeLineHistory(receiveTLI);

I thought that the reason for using receiveTLI instead of
recoveryTargetTLI here is that there might be a case where receiveTLI
is the future of recoveryTargetTLI, but I haven't managed to produce
such a situation.  If I set recoveryTargetTLI to a TLI that the standby
doesn't know but the primary knows, validateRecoveryParameters
immediately complains about that before reaching there.  Anyway, the
attached assumes receiveTLI may be the future of recoveryTargetTLI.

Just inserting an if() into the existing code makes the added lines
stick out past the 80-column right edge, so I refactored there a bit to
lower the indentation.


I believe the 004_timeline_switch.pl detects your issue.  And the
attached change fixes it.

Any suggestions are welcome.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 52eba6b6ef8d8fda078d3acd27b6ce7406078df8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Mon, 10 May 2021 16:57:24 +0900
Subject: [PATCH] Choose correct timeline when received historic timelines

When a standby starts streaming before determining expectedTLEs, it
determines the expected timeline list based on the timeline of the WAL
segment just streamed from the primary. If we fetched the last
checkpoint in the older timeline, this behavior prevents recovery from
proceeding. When primary streamed over a WAL file in a historical
timeline, use recoveryTargetTLI, which must be a history of the
primary.  There's one scenario where this makes a difference: suppose
two standbys are connected to a primary both by archive and
streaming. In the case where one of the standby promotes, then another
reconnects to the promoted standby before archiving the first segment
of the new timeline, and the content of pg_wal of the new standby is
removed before reconnection, the standby fetches the history file for
the new timeline but the first segment for the timeline is available
only via streaming.  In this case, the new standby thought that the
primary always sends newer timeline than the current recovery target
but that is not the case of this scenario.
---
 src/backend/access/transam/xlog.c          | 45 ++++++++++---
 src/test/recovery/t/004_timeline_switch.pl | 74 +++++++++++++++++++++-
 2 files changed, 108 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1d4415a43..1ca55b7ec0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12656,22 +12656,47 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                          * info is set correctly and XLogReceiptTime isn't
                          * changed.
                          */
-                        if (readFile < 0)
-                        {
-                            if (!expectedTLEs)
-                                expectedTLEs = readTimeLineHistory(receiveTLI);
-                            readFile = XLogFileRead(readSegNo, PANIC,
-                                                    receiveTLI,
-                                                    XLOG_FROM_STREAM, false);
-                            Assert(readFile >= 0);
-                        }
-                        else
+
+                        if (readFile >= 0)
                         {
                             /* just make sure source info is correct... */
                             readSource = XLOG_FROM_STREAM;
                             XLogReceiptSource = XLOG_FROM_STREAM;
                             return true;
                         }
+
+                        readFile = XLogFileRead(readSegNo, PANIC,
+                                                receiveTLI,
+                                                XLOG_FROM_STREAM, false);
+                        Assert(readFile >= 0);
+
+                        if (expectedTLEs)
+                            break;
+
+                        /* initialize expectedTLEs */
+                        expectedTLEs = readTimeLineHistory(receiveTLI);
+
+                        if (!tliInHistory(recoveryTargetTLI, expectedTLEs))
+                        {
+                            /*
+                             * Received timeline is not the future of the
+                             * recovery target timeline. The upstream must have
+                             * instead sent a historic timeline of ours. This
+                             * can happen when this standby's pg_wal has been
+                             * cleaned up before restart but archive offered
+                             * the history file recovery target. Pick up the
+                             * history of the recovery target timeline.
+                             */
+                            expectedTLEs =
+                                readTimeLineHistory(recoveryTargetTLI);
+                            if (!tliInHistory(receiveTLI, expectedTLEs))
+                                ereport(FATAL,
+                                        (errcode(ERRCODE_DATA_CORRUPTED),
+                                         errmsg("the primary server has sent incompatible timeline %d with recovery target timeline %d",
+                                                 receiveTLI,
+                                                 recoveryTargetTLI)));
+                        }
+
                         break;
                     }
 
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index c101980e9e..bca7ad165e 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -7,7 +7,7 @@ use warnings;
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 
 $ENV{PGDATABASE} = 'postgres';
 
@@ -109,3 +109,75 @@ $node_primary_2->wait_for_catchup($node_standby_3, 'replay',
 my $result_2 =
   $node_standby_3->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 is($result_2, qq(1), 'check content of standby 3');
+
+#####
+# Check if standby can follow primary when the segment for the new
+# timeline is not available for the standby in the archive.
+
+# setup two standby nodes, one of them archives WAL files
+$node_primary_2->psql('postgres', 'CREATE TABLE t (a int)');
+my $node_standby_4 = get_new_node('standby_3_1');
+$node_standby_4->init_from_backup($node_primary_2, $backup_name,
+    has_streaming => 1,    allows_streaming => 1, has_archiving => 1);
+my $archdir = $node_standby_4->archive_dir;
+
+# keep 3 segments
+my $keepsize = $node_primary_2->safe_psql('postgres',
+                                          q[
+select setting from pg_settings where name = 'wal_segment_size';
+                                          ]) * 3;
+$node_standby_4->append_conf('postgresql.conf', qq[
+wal_keep_size = $keepsize
+archive_mode = always
+archive_command = 'cp %p $archdir/%f'
+]);
+
+$node_standby_4->start;
+
+my $node_standby_5 = get_new_node('node_standby_5');
+$node_standby_5->init_from_backup($node_primary_2, $backup_name,
+    has_streaming => 1);
+$node_standby_5->start;
+
+$node_primary_2->psql('postgres', qq[
+                      SELECT pg_switch_wal();
+                      INSERT INTO t VALUES (0);
+                      CHECKPOINT;
+]);
+$node_primary_2->wait_for_catchup($node_standby_4, 'write',
+                                $node_primary_2->lsn('insert'));
+$node_primary_2->wait_for_catchup($node_standby_5, 'write',
+                                $node_primary_2->lsn('insert'));
+
+# disconnect the second standby then connect to the promoted first standby
+$node_standby_5->stop;
+$node_standby_4->psql('postgres', 'SELECT pg_promote()');
+$node_standby_5->enable_streaming($node_standby_4);
+$node_standby_5->append_conf('postgresql.conf',
+                             "restore_command = 'cp $archdir/%f %p'");
+
+# flush pg_wal on the second standby (new standby)
+my $pgwaldir = $node_standby_5->data_dir. "/pg_wal";
+opendir my $dh, $pgwaldir or die "failed to open $pgwaldir of node_standby_5";
+while (my $f = readdir($dh))
+{
+    unlink("$pgwaldir/$f") if (-f "$pgwaldir/$f");
+}
+closedir($dh);
+
+# restart the second standby, not just reloading to trigger archive recovery
+$node_standby_5->start;
+$node_standby_4->psql('postgres', qq[
+                        INSERT INTO t VALUES(0);
+                        SELECT pg_switch_wal();
+                        INSERT INTO t VALUES(0); -- avoid segment border
+]);
+
+$node_standby_4->wait_for_catchup($node_standby_5, 'write',
+                                    $node_standby_4->lsn('insert'));
+
+# check if the change is correctly replicated
+my $result_3 =
+  $node_standby_4->safe_psql('postgres',
+                               'SELECT pg_current_wal_insert_lsn() = write_lsn FROM pg_stat_replication');
+is($result_3, 't', 'check receiving historic timeline from primary');
-- 
2.27.0


Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Mon, May 10, 2021 at 2:05 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

> I thought that the reason using receiveTLI instead of
> recoveryTargetTLI here is that there's a case where receiveTLI is the
> future of recoveryTarrgetTLI but I haven't successfully had such a
> situation.  If I set recovoryTargetTLI to a TLI that standby doesn't
> know but primary knows, validateRecoveryParameters immediately
> complains about that before reaching there.  Anyway the attached
> assumes receiveTLI may be the future of recoveryTargetTLI.

If you see the note in this commit, it says "without the timeline
history file".  So is it trying to say that although receiveTLI is the
ancestor of recoveryTargetTLI, it cannot detect that because of the
absence of the TLI.history file?

ee994272ca50f70b53074f0febaec97e28f83c4e
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>  2013-01-03 14:11:58
Committer: Heikki Linnakangas <heikki.linnakangas@iki.fi>  2013-01-03 14:11:58
.....
  Without the timeline history file, recovering that file
    will fail as the older timeline ID is not recognized to be an ancestor of
    the target timeline. If you try to recover from such a backup, using only
    streaming replication to fetch the WAL, this patch is required for that to
    work.
=====

>
> I believe the 004_timeline_switch.pl detects your issue.  And the
> attached change fixes it.

This fix looks better to me, but I will think more about it and give my
feedback.  Thanks for quickly coming up with a reproducible test case.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Mon, 10 May 2021 14:27:21 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Mon, May 10, 2021 at 2:05 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> 
> > I thought that the reason using receiveTLI instead of
> > recoveryTargetTLI here is that there's a case where receiveTLI is the
> > future of recoveryTarrgetTLI but I haven't successfully had such a
> > situation.  If I set recovoryTargetTLI to a TLI that standby doesn't
> > know but primary knows, validateRecoveryParameters immediately
> > complains about that before reaching there.  Anyway the attached
> > assumes receiveTLI may be the future of recoveryTargetTLI.
> 
> If you see the note in this commit. It says without the timeline
> history file, so does it trying to say that although receiveTLI is the
> ancestor of recovoryTargetTLI,  it can not detect that because of the
> absence of the TL.history file ?

Yeah, it reads so for me and it works as described.  What I don't
understand is why the patch uses receiveTLI, not recoveryTargetTLI, to
load the timeline history in WaitForWALToBecomeAvailable.  The only
possible reason is that there could be a case where receiveTLI is the
future of recoveryTargetTLI.  However, AFAICS it's impossible for that
case to happen.  At replication start, the requested TLI is that of
the last checkpoint, which is the same as recoveryTargetTLI, or
somewhere in the existing expectedTLEs, which must be in the past of
recoveryTargetTLI.  That seems to have already been true at the time
replication was made able to follow a timeline switch (abfd192b1b).

So I was tempted to just load the history for recoveryTargetTLI and
then confirm that receiveTLI is in that history.  Actually that change
doesn't harm any of the recovery TAP tests.  It is way simpler than
the last patch. However, I'm not confident that it is right.. ;(
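
To make the shape of that alternative concrete, here is a minimal
standalone sketch (plain C, not PostgreSQL source; the list layout and
the helper tli_in_history() are invented for illustration): load the
TLE list for the recovery target timeline and merely verify that the
timeline we are streaming on appears in it.

/* Minimal standalone sketch of the alternative: build the history list
 * from recoveryTargetTLI and check that receiveTLI is one of its entries. */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TimeLineID;

typedef struct
{
    TimeLineID  tli;
} TimeLineHistoryEntry;

/* Invented helper: does "tli" appear in the history list "tles"? */
static bool
tli_in_history(TimeLineID tli, const TimeLineHistoryEntry *tles, int ntles)
{
    for (int i = 0; i < ntles; i++)
    {
        if (tles[i].tli == tli)
            return true;
    }
    return false;
}

int
main(void)
{
    /* History of recoveryTargetTLI = 13, newest first: 13 <- 12 <- 11 */
    TimeLineHistoryEntry expectedTLEs[] = {{13}, {12}, {11}};
    TimeLineID  recoveryTargetTLI = 13;
    TimeLineID  receiveTLI = 12;    /* TLI the WAL was actually streamed on */

    if (tli_in_history(receiveTLI, expectedTLEs, 3))
        printf("receiveTLI %u is in the history of target %u; using the target's history is safe\n",
               receiveTLI, recoveryTargetTLI);
    else
        printf("receiveTLI %u is not in the history of target %u; something is wrong\n",
               receiveTLI, recoveryTargetTLI);
    return 0;
}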

> ee994272ca50f70b53074f0febaec97e28f83c4e
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>  2013-01-03 14:11:58
> Committer: Heikki Linnakangas <heikki.linnakangas@iki.fi>  2013-01-03 14:11:58
> .....
>   Without the timeline history file, recovering that file
>     will fail as the older timeline ID is not recognized to be an ancestor of
>     the target timeline. If you try to recover from such a backup, using only
>     streaming replication to fetch the WAL, this patch is required for that to
>     work.
> =====
> 
> >
> > I believe the 004_timeline_switch.pl detects your issue.  And the
> > attached change fixes it.
> 
> I think this fix looks better to me, but I will think more about it
> and give my feedback.  Thanks for quickly coming up with the
> reproducible test case.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Tue, May 11, 2021 at 1:42 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Mon, 10 May 2021 14:27:21 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> > On Mon, May 10, 2021 at 2:05 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> >
> > > I thought that the reason using receiveTLI instead of
> > > recoveryTargetTLI here is that there's a case where receiveTLI is the
> > > future of recoveryTarrgetTLI but I haven't successfully had such a
> > > situation.  If I set recovoryTargetTLI to a TLI that standby doesn't
> > > know but primary knows, validateRecoveryParameters immediately
> > > complains about that before reaching there.  Anyway the attached
> > > assumes receiveTLI may be the future of recoveryTargetTLI.
> >
> > If you see the note in this commit. It says without the timeline
> > history file, so does it trying to say that although receiveTLI is the
> > ancestor of recovoryTargetTLI,  it can not detect that because of the
> > absence of the TL.history file ?
>
> Yeah, it reads so for me and it works as described.  What I don't
> understand is that why the patch uses receiveTLI, not
> recovoryTargetTLI to load timeline hisotry in
> WaitForWALToBecomeAvailable.  The only possible reason is that there
> could be a case where receivedTLI is the future of recoveryTargetTLI.
> However, AFAICS it's impossible for that case to happen.  At
> replication start, requsting TLI is that of the last checkpoint, which
> is the same to recoveryTargetTLI, or anywhere in exising expectedTLEs
> which must be the past of recoveryTargetTLI. That seems to be already
> true at the time replication was made possible to follow a timeline
> switch (abfd192b1b).
>
> So I was tempted to just load history for recoveryTargetTLI then
> confirm that receiveTLI is in the history.  Actually that change
> doesn't harm any of the recovery TAP tests.  It is way simpler than
> the last patch. However, I'm not confident that it is right.. ;(

I first thought of fixing it as you describe: instead of loading the
history of receiveTLI, load the history for recoveryTargetTLI.  But
then, this commit (ee994272ca50f70b53074f0febaec97e28f83c4e)
specifically used the history file of receiveTLI to solve a particular
issue which I did not clearly understand.  I am not sure whether it is
a good idea to directly use recoveryTargetTLI without exactly
understanding why this commit was using receiveTLI.  It doesn't seem
like an oversight to me; it seems intentional.  Maybe Heikki can
comment on this?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Mon, May 10, 2021 at 4:35 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> It seems to me the issue here is not a race condition but
> WaitForWALToBecomeAvailable initializing expectedTLEs with the history
> of a improper timeline. So using recoveryTargetTLI instead of
> receiveTLI for the case fixes this issue.

I agree.

> I believe the 004_timeline_switch.pl detects your issue.  And the
> attached change fixes it.

So why does this use recoveryTargetTLI instead of receiveTLI only
conditionally? Why not do it all the time?

The hard thing about this code is that the assumptions are not very
clear. If we don't know why something is a certain way, then we might
break things if we change it. Worse yet, if nobody else knows why it's
like that either, then who knows what assumptions they might be
making? It's hard to be sure that any change is safe.

But that being said, we have a clear definition from the comments for
what expectedTLEs is supposed to contain, and it's only going to end
up with those contents if we initialize it from recoveryTargetTLI. So
I am inclined to think that we ought to do that always, and if it
breaks something, then that's a sign that some other part of the code
also needs fixing, because apparently that hypothetical other part of
the code doesn't work if expectedTLEs contains what the comments say
that it should.

Now maybe that's the wrong idea. But if so, then we're saying that the
definition of expectedTLEs needs to be changed, and we should update
the comments with the new definition, whatever it is. A lot of the
confusion here results from the fact that the code and comments are
inconsistent and we can't tell whether that's intentional or
inadvertent. Let's not leave the next person who looks at this code
wondering the same thing about whatever changes we make.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Fri, May 14, 2021 at 2:37 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> So why does this use recoveryTargetTLI instead of receiveTLI only
> conditionally? Why not do it all the time?
>
> The hard thing about this code is that the assumptions are not very
> clear. If we don't know why something is a certain way, then we might
> break things if we change it. Worse yet, if nobody else knows why it's
> like that either, then who knows what assumptions they might be
> making? It's hard to be sure that any change is safe.
>
> But that being said, we have a clear definition from the comments for
> what expectedTLEs is supposed to contain, and it's only going to end
> up with those contents if we initialize it from recoveryTargetTLI. So
> I am inclined to think that we ought to do that always, and if it
> breaks something, then that's a sign that some other part of the code
> also needs fixing, because apparently that hypothetical other part of
> the code doesn't work if expctedTLEs contains what the comments say
> that it should.
>
> Now maybe that's the wrong idea. But if so, then we're saying that the
> definition of expectedTLEs needs to be changed, and we should update
> the comments with the new definition, whatever it is. A lot of the
> confusion here results from the fact that the code and comments are
> inconsistent and we can't tell whether that's intentional or
> inadvertent. Let's not leave the next person who looks at this code
> wondering the same thing about whatever changes we make.

I am not sure whether you have noticed the commit id which changed
the definition of expectedTLEs.  Heikki committed that change, so I am
adding him to the list to get his opinion.

=====
ee994272ca50f70b53074f0febaec97e28f83c4e
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>  2013-01-03 14:11:58
Committer: Heikki Linnakangas <heikki.linnakangas@iki.fi>  2013-01-03 14:11:58

    Delay reading timeline history file until it's fetched from master.
    .....
     Without the timeline history file, recovering that file
    will fail as the older timeline ID is not recognized to be an ancestor of
    the target timeline. If you try to recover from such a backup, using only
    streaming replication to fetch the WAL, this patch is required for that to
    work.
=====

Part of this commit message says that, without the history file, the
older (checkpoint) timeline will not be recognized as an ancestor of
the target timeline.  I did not understand what it is trying to say.
Is it trying to say that even though the checkpoint timeline is an
ancestor of recoveryTargetTLI, we cannot track that because we don't
have a history file?  So, to handle this problem, the definition of
expectedTLEs was changed to directly point to the checkpoint timeline?

Because before this commit, we were directly initializing expectedTLEs
with the history file of recoveryTargetTLI; we were not even waiting
to read the checkpoint.  That changed with this commit.

I am referring to the below code which was deleted by this commit:

========
@@ -5279,49 +5299,6 @@ StartupXLOG(void)
  */
  readRecoveryCommandFile();

- /* Now we can determine the list of expected TLIs */
- expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
-
- /*
- * If the location of the checkpoint record is not on the expected
- * timeline in the history of the requested timeline, we cannot proceed:
- * the backup is not part of the history of the requested timeline.
- */
- if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) !=
- ControlFile->checkPointCopy.ThisTimeLineID)
- {
=========

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Thu, 13 May 2021 17:07:31 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Mon, May 10, 2021 at 4:35 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > It seems to me the issue here is not a race condition but
> > WaitForWALToBecomeAvailable initializing expectedTLEs with the history
> > of a improper timeline. So using recoveryTargetTLI instead of
> > receiveTLI for the case fixes this issue.
> 
> I agree.
> 
> > I believe the 004_timeline_switch.pl detects your issue.  And the
> > attached change fixes it.
> 
> So why does this use recoveryTargetTLI instead of receiveTLI only
> conditionally? Why not do it all the time?

The commit ee994272ca apparently says that there's a case where primary 

> The hard thing about this code is that the assumptions are not very
> clear. If we don't know why something is a certain way, then we might
> break things if we change it. Worse yet, if nobody else knows why it's
> like that either, then who knows what assumptions they might be
> making? It's hard to be sure that any change is safe.

Thanks for the comment.

> But that being said, we have a clear definition from the comments for
> what expectedTLEs is supposed to contain, and it's only going to end
> up with those contents if we initialize it from recoveryTargetTLI. So
> I am inclined to think that we ought to do that always, and if it

Yes, I also found that afterward, and agreed.  Desynchronization
between recoveryTargetTLI and expectedTLEs prevents
rescanLatestTimeLine from working.
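
As an illustration of that freeze, here is a standalone sketch
(simplified; find_newest_timeline() and the variables are invented
stand-ins, not the real xlog.c code) of the early-return shape that
leaves the stale list in place forever:

/* Standalone sketch: once recoveryTargetTLI already names the newest
 * timeline, the rescan returns early and never rebuilds the (stale)
 * expectedTLEs that was derived from the older streamed timeline. */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TimeLineID;

static TimeLineID recoveryTargetTLI = 13;   /* already the latest */
static TimeLineID expectedTLEsNewest = 12;  /* stale: built from receiveTLI */

/* Invented stand-in for scanning the archive for newer timelines. */
static TimeLineID
find_newest_timeline(void)
{
    return 13;
}

/* Sketch of the early-return shape of the rescan. */
static bool
rescan_latest_timeline(void)
{
    TimeLineID  newtarget = find_newest_timeline();

    if (newtarget == recoveryTargetTLI)
        return false;           /* nothing to do: expectedTLEs untouched */

    /* ... a real rescan would reload the history and expectedTLEs here ... */
    expectedTLEsNewest = newtarget;
    recoveryTargetTLI = newtarget;
    return true;
}

int
main(void)
{
    if (!rescan_latest_timeline())
        printf("target is %u, but WAL keeps being requested on TLI %u forever\n",
               recoveryTargetTLI, expectedTLEsNewest);
    return 0;
}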

> breaks something, then that's a sign that some other part of the code
> also needs fixing, because apparently that hypothetical other part of
> the code doesn't work if expctedTLEs contains what the comments say
> that it should.

After some more inspection, I'm now also sure that it is a
typo/thinko.  Other than while fetching the first checkpoint,
receiveTLI is always in the history of recoveryTargetTLI; otherwise
recoveryTargetTLI is equal to receiveTLI.

I had read the commit message as "wait for fetching possible future
history files to know whether there's a future for the current
timeline".  However, now I read it as "don't bother waiting for
possibly-unavailable history files when we're reading the first
checkpoint, whose timeline is already known to us".  If that is
correct, we don't bother considering future history files coming from
the primary there.

> Now maybe that's the wrong idea. But if so, then we're saying that the
> definition of expectedTLEs needs to be changed, and we should update
> the comments with the new definition, whatever it is. A lot of the
> confusion here results from the fact that the code and comments are
> inconsistent and we can't tell whether that's intentional or
> inadvertent. Let's not leave the next person who looks at this code
> wondering the same thing about whatever changes we make.

Ok.  The reason why we haven't had a complaint about that would be
that it is rare for pg_wal to be wiped out before a standby connects
to a just-promoted primary. I'm not sure about the tool Dilip is
using, though..

So the result is the attached.  This would be back-patchable to 9.3
(or 9.6?), but I doubt that we should do so, as we don't seem to have
had a complaint on this issue and we're not fully confident in this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 94e695031e7aa78670b1d0fd63f6cfed0d501611 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 14 May 2021 13:51:01 +0900
Subject: [PATCH v2] Set expectedTLEs correctly based on recoveryTargetTLI

When a standby starts streaming before it determines expectedTLEs, the
list is set based on the timeline of the WAL segment just streamed from
the primary.  If we fetch the last checkpoint from the older timeline,
this behavior leads to setting expectedTLEs based on a timeline older
than recoveryTargetTLI; later calls to rescanLatestTimeLine then no
longer update recoveryTargetTLI and expectedTLEs, and the standby stays
frozen at that point.

This behavior was introduced by commit ee994272ca, but there is no
explanation for breaking the defined relationship between the two
variables, and there seems to be no case where the behavior is useful.

Make things consistent by fixing WaitForWALToBecomeAvailable to set
expectedTLEs using recoveryTargetTLI instead of receiveTLI.
---
 src/backend/access/transam/xlog.c          |  3 +-
 src/test/recovery/t/004_timeline_switch.pl | 72 +++++++++++++++++++++-
 2 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d163f190f..ef8b6d2c8d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12662,7 +12662,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                         if (readFile < 0)
                         {
                             if (!expectedTLEs)
-                                expectedTLEs = readTimeLineHistory(receiveTLI);
+                                expectedTLEs =
+                                    readTimeLineHistory(recoveryTargetTLI);
                             readFile = XLogFileRead(readSegNo, PANIC,
                                                     receiveTLI,
                                                     XLOG_FROM_STREAM, false);
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index c101980e9e..2c8b73f5f0 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -7,7 +7,7 @@ use warnings;
 use File::Path qw(rmtree);
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 
 $ENV{PGDATABASE} = 'postgres';
 
@@ -109,3 +109,73 @@ $node_primary_2->wait_for_catchup($node_standby_3, 'replay',
 my $result_2 =
   $node_standby_3->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 is($result_2, qq(1), 'check content of standby 3');
+
+######
+# Check if recoveryTargetTLI is correctly updated when we start
+# replication from a checkpoint on the previous timeline of a
+# just-promoted primary, with pg_wal wiped out.
+
+# setup two standby nodes, one of them archives WAL files
+$node_primary_2->psql('postgres', 'CREATE TABLE t (a int)');
+my $node_standby_4 = get_new_node('standby_4');
+$node_standby_4->init_from_backup($node_primary_2, $backup_name,
+    has_streaming => 1,    allows_streaming => 1, has_archiving => 1);
+
+$node_standby_4->enable_archiving;
+my $archdir = $node_standby_4->archive_dir;
+
+# keep segments, and enable archive on the standby
+$node_standby_4->append_conf('postgresql.conf', qq[
+wal_keep_size = 512MB
+archive_mode = always
+]);
+
+$node_standby_4->start;
+
+my $node_standby_5 = get_new_node('node_standby_5');
+$node_standby_5->init_from_backup($node_primary_2, $backup_name,
+    has_streaming => 1);
+$node_standby_5->start;
+
+$node_primary_2->psql('postgres', qq[
+                      SELECT pg_switch_wal();
+                      INSERT INTO t VALUES (0);
+                      CHECKPOINT;
+]);
+$node_primary_2->wait_for_catchup($node_standby_4, 'write',
+                                $node_primary_2->lsn('insert'));
+$node_primary_2->wait_for_catchup($node_standby_5, 'write',
+                                $node_primary_2->lsn('insert'));
+
+# disconnect the second standby then connect to the promoted first standby
+$node_standby_5->stop;
+$node_standby_4->psql('postgres', 'SELECT pg_promote()');
+$node_standby_5->enable_streaming($node_standby_4);
+$node_standby_5->append_conf('postgresql.conf',
+                             "restore_command = 'cp $archdir/%f %p'");
+
+# wipe-out pg_wal on the second standby (new standby)
+my $pgwaldir = $node_standby_5->data_dir. "/pg_wal";
+opendir my $dh, $pgwaldir or die "failed to open $pgwaldir of node_standby_5";
+while (my $f = readdir($dh))
+{
+    unlink("$pgwaldir/$f") if (-f "$pgwaldir/$f");
+}
+closedir($dh);
+
+# restart the second standby, not just reloading to trigger archive recovery
+$node_standby_5->start;
+$node_standby_4->psql('postgres', qq[
+                        INSERT INTO t VALUES(0);
+                        SELECT pg_switch_wal();
+                        INSERT INTO t VALUES(0); -- avoid segment border
+]);
+
+$node_standby_4->wait_for_catchup($node_standby_5, 'write',
+                                    $node_standby_4->lsn('insert'));
+
+# check if the change is correctly replicated
+my $result_3 =
+  $node_standby_4->safe_psql('postgres',
+                               'SELECT pg_current_wal_insert_lsn() = write_lsn FROM pg_stat_replication');
+is($result_3, 't', 'check receiving historic timeline from primary');
-- 
2.27.0


Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Fri, 14 May 2021 14:12:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 13 May 2021 17:07:31 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> > On Mon, May 10, 2021 at 4:35 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > It seems to me the issue here is not a race condition but
> > > WaitForWALToBecomeAvailable initializing expectedTLEs with the history
> > > of a improper timeline. So using recoveryTargetTLI instead of
> > > receiveTLI for the case fixes this issue.
> > 
> > I agree.
> > 
> > > I believe the 004_timeline_switch.pl detects your issue.  And the
> > > attached change fixes it.
> > 
> > So why does this use recoveryTargetTLI instead of receiveTLI only
> > conditionally? Why not do it all the time?
> 
> The commit ee994272ca apparently says that there's a case where primary 

This is not an incomplete line, just garbage.

> > The hard thing about this code is that the assumptions are not very
> > clear. If we don't know why something is a certain way, then we might
> > break things if we change it. Worse yet, if nobody else knows why it's
> > like that either, then who knows what assumptions they might be
> > making? It's hard to be sure that any change is safe.
> 
> Thanks for the comment.
> 
> > But that being said, we have a clear definition from the comments for
> > what expectedTLEs is supposed to contain, and it's only going to end
> > up with those contents if we initialize it from recoveryTargetTLI. So
> > I am inclined to think that we ought to do that always, and if it
> 
> Yes, I also found it after that, and agreed.  Desynchronization
> between recoveryTargetTLI and expectedTLEs prevents
> rescanLatestTimeline from working.
> 
> > breaks something, then that's a sign that some other part of the code
> > also needs fixing, because apparently that hypothetical other part of
> > the code doesn't work if expctedTLEs contains what the comments say
> > that it should.
> 
> After some more inspection, I'm now also sure that it is a
> typo/thinko.  Other than while fetching the first checkpoint,
> receivedTLI is always in the history of recoveryTargetTLI, otherwise
> recoveryTargetTLI is equal to receiveTLI.
> 
> I read that the commit message as "waiting for fetching possible
> future history files to know if there's the future for the current
> timeline.  However now I read it as "don't bother expecting for
> possiblly-unavailable history files when we're reading the first
> checkpoint the timeline for which is already known to us.".  If it is
> correct we don't bother considering future history files coming from
> primary there.
> 
> > Now maybe that's the wrong idea. But if so, then we're saying that the
> > definition of expectedTLEs needs to be changed, and we should update
> > the comments with the new definition, whatever it is. A lot of the
> > confusion here results from the fact that the code and comments are
> > inconsistent and we can't tell whether that's intentional or
> > inadvertent. Let's not leave the next person who looks at this code
> > wondering the same thing about whatever changes we make.
> 
> Ok.  The reason why we haven't have a complain about that would be
> that it is rare that pg_wal is wiped out before a standby connects to
> a just-promoted primary. I'm not sure about the tool Dilip is using,
> though..
> 
> So the result is the attached.  This would be back-patcheable to 9.3
> (or 9.6?) but I doubt that we should do as we don't seem to have had a
> complaint on this issue and we're not full faith on this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Robert Haas
Date:
On Fri, May 14, 2021 at 12:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I am not sure that have you noticed the commit id which changed the
> definition of expectedTLEs, Heikki has committed that change so adding
> him in the list to know his opinion.

I did notice, but keep in mind that this was more than 8 years ago.
Even if Heikki is reading this thread, he may not remember why he
changed 1 line of code one way rather than another in 2013. I mean if
he does that's great, but it's asking a lot.

> =====
> ee994272ca50f70b53074f0febaec97e28f83c4e
> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>  2013-01-03 14:11:58
> Committer: Heikki Linnakangas <heikki.linnakangas@iki.fi>  2013-01-03 14:11:58
>
>     Delay reading timeline history file until it's fetched from master.
>     .....
>      Without the timeline history file, recovering that file
>     will fail as the older timeline ID is not recognized to be an ancestor of
>     the target timeline. If you try to recover from such a backup, using only
>     streaming replication to fetch the WAL, this patch is required for that to
>     work.
> =====
>
> Part of this commit message says that it will not identify the
> recoveryTargetTLI as the ancestor of the checkpoint timeline (without
> history file).  I did not understand what it is trying to say.  Does
> it is trying to say that even though the recoveryTargetTLI is the
> ancestor of the checkpoint timeline but we can not track that because
> we don't have a history file?  So to handle this problem change the
> definition of expectedTLEs to directly point to the checkpoint
> timeline?
>
> Because before this commit, we were directly initializing expectedTLEs
> with the history file of recoveryTargetTLI, we were not even waiting
> for reading the checkpoint,  but under this commit, it is changed.

Well, I think that is talking about what the commit did in general,
not specifically the one line of code that we think may be incorrect.
As I understand it, the general issue here was that if
XLogFileReadAnyTLI() was called before expectedTLEs got set, then
prior to this commit it would have to fail, because the foreach() loop
in that function would be iterating over an empty list. Heikki tried
to make it not fail in that case, by setting tles =
readTimeLineHistory(recoveryTargetTLI), so that the foreach loop
*wouldn't* get an empty list.

Thinking about this a bit more, I think the idea behind the logic this
commit added to XLogFileReadAnyTLI() is that
readTimeLineHistory(recoveryTargetTLI) may or may not produce the
correct answer. If the timeline history file exists, it will contain
all the information that we need and will return a complete list of
TLEs. But if the file does not exist yet, then it will return a
1-entry list saying that the TLI in question has no parents. If
readTimeLineHistory() actually reads the file, then it's safe to save
the return value in expectedTLEs, but if it doesn't, then it may or
may not be safe. If XLogFileReadAnyTLI calls XLogFileRead and it
works, then the WAL segment we need exists on our target timeline and
we don't actually need the timeline history for anything because we
can just directly begin replay from the target timeline. But if
XLogFileRead fails with the 1-entry dummy list, then we need the
timeline history and don't have it yet, so we have to retry later,
when the history file will hopefully be present, and then at that
point readTimeLineHistory will return a different and better answer
and hopefully it will all work.

I think this is what the commit message is talking about when it says
that "Without the timeline history file, recovering that file will
fail as the older timeline ID is not recognized to be an ancestor of
the target timeline." Without the timeline history file, we can't know
whether some other timeline is an ancestor or not. But the specific
way that manifests is that XLogFileReadAnyTLI() returns a 1-entry
dummy list instead of the real contents of the timeline history file.
This commit doesn't prevent that from happening, but it does prevent
the 1-entry dummy list from getting stored in the global variable
expectedTLEs, except in the case where no timeline switch is occurring
and the lack of history therefore doesn't matter. Without this commit,
if the call to readTimeLineHistory(recoveryTargetTLI) happens at a
time when the timeline history file is not yet available, the 1-entry
dummy list ends up in the global variable and there's no way for it to
ever be replaced with a real history even if the timeline history file
shows up in the archive later.
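
As a toy illustration of that fallback (standalone C with invented
helpers, not the real readTimeLineHistory()): when the history file
cannot be found, the "history" collapses to a single self-entry, and
only a later retry, once the file has been fetched, yields the real
ancestry.

/* Standalone sketch: read a timeline's history if its .history file is
 * available, otherwise fall back to a dummy one-entry list that claims
 * the timeline has no parents. */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TimeLineID;

#define MAX_TLES 8

/* Invented stand-in for checking pg_wal/archive for <TLI>.history. */
static bool
history_file_exists(TimeLineID tli)
{
    (void) tli;
    return false;               /* not fetched from the primary yet */
}

/* Sketch of the fallback: full ancestry if the file exists, else one entry. */
static int
read_timeline_history(TimeLineID targetTLI, TimeLineID *tles)
{
    if (history_file_exists(targetTLI))
    {
        /* a real implementation would parse the file here */
        tles[0] = targetTLI;
        tles[1] = targetTLI - 1;
        return 2;
    }
    tles[0] = targetTLI;        /* dummy 1-entry list: "no known ancestors" */
    return 1;
}

int
main(void)
{
    TimeLineID  tles[MAX_TLES];
    int         n = read_timeline_history(2, tles);

    printf("history entries: %d (1 means the dummy list)\n", n);
    return 0;
}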

As I see it, the question on the table here is whether there's any
justification for the fact that when the second switch in
WaitForWALToBecomeAvailable takes the
XLOG_FROM_ARCHIVE/XLOG_FROM_PG_WAL branch, it calls XLogFileReadAnyTLI
which tries to read the history of recoveryTargetTLI, while when that
same switch takes the XLOG_FROM_STREAM branch, it tries to read the
history of receiveTLI. I tend to think it doesn't make sense. On
general principle, archiving and streaming are supposed to work the
same way, so the idea that they are getting the timeline from
different places is inherently suspect. But also and more
specifically, AFAICS receiveTLI always has to be the same TLI that we
requested from the server, so we're always looking up our own current
TLI here rather than the target TLI, which seems wrong to me, at least
at this moment. :-)

But that having been said, I still don't quite understand the
conditions required to tickle this problem. I spent a long time poking
at it today. It seems to me that it ought somehow to be possible to
recreate the scenario without trying to reuse the old master as a
standby, and also without even needing a WAL archive, but I couldn't
figure out how to do it. I tried setting up a primary and a standby,
and then making a backup from the standby, promoting it, and then
starting what would have been a cascading standby from the backup. But
that doesn't do it. The first mistake I made was creating the standbys
with something like 'pg_basebackup -R', but that's not good enough
because then they have WAL, so I added '-Xnone'. But then I realized
that when a base backup ends, the primary writes an XLOG_SWITCH
record, which means that when the standby is promoted, the promotion
is not in the same WAL segment as the checkpoint from which the
machine that would have been a cascading standby is trying to start. I
worked around that by setting recovery_target='immediate' on the
standby. With that change, I get a WAL file on the new timeline - 2 in
this case - that looks like this:

rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/19000060, prev 0/19000028, desc: CHECKPOINT_ONLINE redo 0/19000028; tli 1; prev tli 1; fpw true; xid 0:587; oid 16385; multi 1; offset 0; oldest xid 579 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 587; online
rmgr: XLOG        len (rec/tot):     34/    34, tx:          0, lsn: 0/190000D8, prev 0/19000060, desc: BACKUP_END 0/19000028
rmgr: XLOG        len (rec/tot):    114/   114, tx:          0, lsn: 0/19000100, prev 0/190000D8, desc: CHECKPOINT_SHUTDOWN redo 0/19000100; tli 2; prev tli 1; fpw true; xid 0:587; oid 16385; multi 1; offset 0; oldest xid 579 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown

That sure looks like the right thing to recreate the problem, because
the first checkpoint is from the backup, and the
woulda-been-a-cascading-standby should be starting there, and the
second checkpoint is in the same segment and shows a timeline switch.
But everything worked great:

2021-05-14 17:44:58.684 EDT [5697] DETAIL:  End of WAL reached on timeline 1 at 0/19000100.
2021-05-14 17:44:58.728 EDT [5694] LOG:  new target timeline is 2
2021-05-14 17:44:58.746 EDT [5694] LOG:  redo starts at 0/19000028
2021-05-14 17:44:58.749 EDT [5694] LOG:  consistent recovery state reached at 0/19000100

I don't understand why that works. It feels to me like it ought to run
smack into the same problem you saw, but it doesn't.

> I am referring to the below code which was deleted by this commit:
>
> ========
> @@ -5279,49 +5299,6 @@ StartupXLOG(void)
>   */
>   readRecoveryCommandFile();
>
> - /* Now we can determine the list of expected TLIs */
> - expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
> -
> - /*
> - * If the location of the checkpoint record is not on the expected
> - * timeline in the history of the requested timeline, we cannot proceed:
> - * the backup is not part of the history of the requested timeline.
> - */
> - if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) !=
> - ControlFile->checkPointCopy.ThisTimeLineID)
> - {
> =========

I don't think this code is really deleted. The tliOfPointInHistory
check was just moved later in the function. And expectedTLEs is still
supposed to be getting initialized, because just before the new
location of the tliOfPointInHistory check, Heikki added
Assert(expectedTLEs).

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Sat, May 15, 2021 at 3:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I did notice, but keep in mind that this was more than 8 years ago.
> Even if Heikki is reading this thread, he may not remember why he
> changed 1 line of code one way rather than another in 2013. I mean if
> he does that's great, but it's asking a lot.

I agree with your point.  But I think that one line is related to the
purpose of this commit, and I have explained (in the 3rd paragraph
below) why I think so.

> As I understand it, the general issue here was that if
> XLogFileReadAnyTLI() was called before expectedTLEs got set, then
> prior to this commit it would have to fail, because the foreach() loop
> in that function would be iterating over an empty list. Heikki tried
> to make it not fail in that case, by setting tles =
> readTimeLineHistory(recoveryTargetTLI), so that the foreach loop
> *wouldn't* get an empty list.

I might be missing something, but I don't agree with this logic.  If
you look at the code flow prior to this commit, it was like below [1].
My point is that if we are calling XLogFileReadAnyTLI() somewhere
while reading the checkpoint record, then note that expectedTLEs was
initialized unconditionally before we even tried to read that
checkpoint record.  So how could expectedTLEs be uninitialized in
XLogFileReadAnyTLI?

[1]
StartupXLOG(void)
{
....

recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
...
readRecoveryCommandFile();
...
expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
...
..
record = ReadCheckpointRecord(checkPointLoc, 0);


Another point which I am not sure about, but I still think that one
line (expectedTLEs = readTimeLineHistory(receiveTLI);) is somehow
related to the purpose of this commit.  Let me explain why I think so.
Basically, before this commit, we were initializing "expectedTLEs"
based on the history file of "recoveryTargetTLI", right after calling
"readRecoveryCommandFile()" (this function initializes
recoveryTargetTLI based on the recovery target, and it ensures it has
read the respective history file).  Right after this point, there was
a check as shown below [2], which verifies whether the checkpoint TLI
exists in "expectedTLEs", which is initialized based on
recoveryTargetTLI.  It appears that this check was failing in some
cases, which this commit tried to fix, and all the other code is there
to support that.  Because we are now not initializing "expectedTLEs"
before going to read the checkpoint, after moving this line it became
possible that "expectedTLEs" is not initialized in
XLogFileReadAnyTLI(), and the remaining code in XLogFileReadAnyTLI()
is there to handle that part.

Now, coming to my point about why this one line is related: in this
one line (expectedTLEs = readTimeLineHistory(receiveTLI);), we
completely avoid recoveryTargetTLI and initialize "expectedTLEs" based
on the history file of the TLI from which we read the checkpoint.  So
now there is no scope for the below [2] check to fail, because we are
not initializing "expectedTLEs" based on "recoveryTargetTLI" but from
the history of the timeline from which we read the checkpoint.

So I feel that if we directly fix this one line to initialize
"expectedTLEs" from "recoveryTargetTLI", it will expose us to the same
problem this commit tried to fix.

[2]
if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) !=
ControlFile->checkPointCopy.ThisTimeLineID)
{
error()
}


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Sat, 15 May 2021 10:55:05 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Sat, May 15, 2021 at 3:58 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > I did notice, but keep in mind that this was more than 8 years ago.
> > Even if Heikki is reading this thread, he may not remember why he
> > changed 1 line of code one way rather than another in 2013. I mean if
> > he does that's great, but it's asking a lot.
> 
> I agree with your point.  But I think that one line is related to the
> purpose of this commit and I have explained (in 3rd paragraph below)
> why do I think so.
> 
>  As I understand it, the general issue here was that if
> > XLogFileReadAnyTLI() was called before expectedTLEs got set, then
> > prior to this commit it would have to fail, because the foreach() loop
> > in that function would be iterating over an empty list. Heikki tried
> > to make it not fail in that case, by setting tles =
> > readTimeLineHistory(recoveryTargetTLI), so that the foreach loop
> > *wouldn't* get an empty list.
> 
> I might be missing something but I don't agree with this logic.  If
> you see prior to this commit the code flow was like below[1].  So my
> point is if we are calling XLogFileReadAnyTLI() somewhere while
> reading the checkpoint record then note that expectedTLEs were
> initialized unconditionally before even try to read that checkpoint
> record.  So how expectedTLEs could be uninitialized in
> LogFileReadAnyTLI?

Mmm. I think both of you are right.  Before the commit,
XLogFileReadAnyTLI expected that expectedTLEs was initialized. After
the commit it can no longer expect that, so it was changed to try
reading the timeline history by itself.  *If* an appropriate history
file is found, it *initializes* expectedTLEs with the content.

> [1]
> StartupXLOG(void)
> {
> ....
> 
> recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
> ...
> readRecoveryCommandFile();
> ...
> expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
> ...
> ..
> record = ReadCheckpointRecord(checkPointLoc, 0);
> 
> 
> Another point which I am not sure about but still I think that one
> line (expectedTLEs = readTimeLineHistory(receiveTLI);), somewhere
> related to the purpose of this commit.  Let me explain why do I think
> so.  Basically, before this commit, we were initializing
> "expectedTLEs" based on the history file of "recoveryTargetTLI", right
> after calling "readRecoveryCommandFile()" (this function will
> initialize recoveryTargetTLI based on the recovery target, and it
> ensures it read the respective history file).  Now, right after this
> point, there was a check as shown below[2], which is checking whether
> the checkpoint TLI exists in the  "expectedTLEs" which is initialized
> based on recoveryTargetTLI.  And it appeared that this check was
> failing in some cases which this commit tried to fix and all other
> code is there to support that.  Because now before going for reading
> the checkpoint we are not initializing "expectedTLEs" so now after
> moving this line from here it was possible that "expectedTLEs" is not
> initialized in XLogFileReadAnyTLI() and the remaining code in
> XLogFileReadAnyTLI() is to handle that part.

Before the commit expectedTLEs is always initialized with just one
entry for the TLI of the last checkpoint record.

(1) If XLogFileReadAnyTLI() finds the segment but no history file is
found, that is, while using the dummy TLE list, expectedTLEs is
initialized with the dummy one-entry list. So there's no behavioral
change in this aspect.

(2) If we don't find the segment for the checkpoint record, the
startup process starts replication, fetches history files and WAL
records, then revisits XLogFileReadAnyTLI.  Now that we have both the
history file and the segments, it successfully reads the record.  The
difference in expectedTLEs made by the patch is having just one entry
versus all the entries for the past.

Assuming that we keep expectedTLEs synced with recoveryTargetTLI,
rescanLatestTimeLine updates the list properly, so there is no need to
worry about the future.  So the issue would be in the past timelines.
After reading the checkpoint record, if we need to rewind to the
previous timeline for the REDO point, the dummy list is inconvenient.
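
To illustrate why the dummy list is inconvenient in that case, here is
a standalone sketch (invented file layout and helper names, not the
real XLogFileReadAnyTLI()): the reader tries the same segment number
under each TLI in the expected-TLE list, newest first, so a one-entry
list never even tries the old timeline's file name.

/* Standalone sketch: look for a WAL segment under every timeline in the
 * expected-TLE list; a dummy one-entry list cannot find segments that only
 * exist under the previous timeline's file name. */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TimeLineID;

/* Invented pg_wal contents: segment 3 exists only under the old TLI 1. */
static bool
segment_exists(TimeLineID tli, unsigned int segno)
{
    return (tli == 1 && segno == 3);
}

/* Sketch of the "try each expected timeline, newest first" loop. */
static bool
read_any_tli(const TimeLineID *tles, int ntles, unsigned int segno)
{
    for (int i = 0; i < ntles; i++)
    {
        if (segment_exists(tles[i], segno))
        {
            printf("found segment %u on TLI %u\n", segno, tles[i]);
            return true;
        }
    }
    printf("segment %u not found on any expected TLI\n", segno);
    return false;
}

int
main(void)
{
    TimeLineID  full_history[] = {2, 1};    /* newest first */
    TimeLineID  dummy[] = {2};              /* one-entry list */

    read_any_tli(full_history, 2, 3);       /* succeeds via TLI 1 */
    read_any_tli(dummy, 1, 3);              /* fails: TLI 1 never tried */
    return 0;
}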

So there is a possibility that the patch fixed the case (2), where the
standby has neither the segment for the checkpoint record nor the
history file for the checkpoint, and the REDO point is on the last
TLI.  If that is correct, the patch still fails for the case (1), that
is, the issue raised here. Anyway, it would be useless (and rather
harmful) to initialize expectedTLEs based on receiveTLI there.

So my result here is:

The commit fixed the case (2).
The fix caused the issue for the case (1).
The proposed fix fixes the case (1), caused by the commit.

There's another issue in the case (1) where the REDO point goes back
to the previous timeline, though it is doubtful whether we need to fix
it..

> Now, coming to my point that why this one line is related,  In this
> one line (expectedTLEs = readTimeLineHistory(receiveTLI);), we
> completely avoiding recoveryTargetTLI and initializing "expectedTLEs"
> based on the history file of the TL from which we read the checkpoint,
> so now, there is no scope of below[2] check to fail because note that
> we are not initializing "expectedTLEs" based on the
> "recoveryTargetTLI" but we are initializing from the history from
> where we read checkpoint.
> 
> So I feel if we directly fix this one line to initialize
> "expectedTLEs" from "recoveryTargetTLI" then it will expose to the
> same problem this commit tried to fix.
> 
> [2]
> if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) !=
> ControlFile->checkPointCopy.ThisTimeLineID)
> {
> error()
> }

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Mon, 17 May 2021 12:20:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Assuming that we keep expectedTLEs synced with recoveryTargetTLI,
> rescanLatestTimeLine updates the list properly so no need to worry
> about the future.  So the issue would be in the past timelines.  After
> reading the checkpoint record, if we need to rewind to the previous
> timeline for the REDO point, the dummy list is inconvenient.

By the way, I tried reproducing this situation, but ended up finding
it kind of impossible because pg_basebackup (or pg_stop_backup())
waits for the promotion checkpoint to end.

If we made a backup using somewhat broken steps, it could be done, but
I didn't try.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Mon, 17 May 2021 13:01:04 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Mon, 17 May 2021 12:20:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > Assuming that we keep expectedTLEs synced with recoveryTargetTLI,
> > rescanLatestTimeLine updates the list properly so no need to worry
> > about the future.  So the issue would be in the past timelines.  After
> > reading the checkpoint record, if we need to rewind to the previous
> > timeline for the REDO point, the dummy list is inconvenient.
> 
> By the way, I tried reproducing this situation, but ended in finding
> it a kind of impossible because pg_basebackup (or pg_stop_backup())
> waits for the promotion checkpoint to end.

Mmm. That's wrong.  What the tool waits for is not a promotion
checkpoint, but a backup checkpoint, maybe.  (I don't remember that
clearly, sorry.)

> If we make a backup in a somewhat broken steps, that could be done but
> I didn't try.

So there might still be a way to reproduce that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Mon, May 17, 2021 at 8:50 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Before the commit expectedTLEs is always initialized with just one
> entry for the TLI of the last checkpoint record.

Right

> (1) If XLogFileReadAnyTLI() found the segment but no history file
> found, that is, using the dummy TLE-list, expectedTLEs is initialized
> with the dummy one-entry list. So there's no behavioral change in this
> aspect.

Yeah, you are right.

> (2) If we didn't find the segment for the checkpoint record, it starts
> replication and fetches history files and WAL records then revisits
> XLogFileReadAnyTLI.  Now we have both the history file and segments,
> it successfully reads the recood.  The difference of expectedTLEs made
> by the patch is having just one entry or the all entries for the past.

Correct.

> Assuming that we keep expectedTLEs synced with recoveryTargetTLI,
> rescanLatestTimeLine updates the list properly so no need to worry
> about the future.  So the issue would be in the past timelines.  After
> reading the checkpoint record, if we need to rewind to the previous
> timeline for the REDO point, the dummy list is inconvenient.
>
> So there is a possibility that the patch fixed the case (2), where the
> standby doesn't have both the segment for the checkpoint record and
> the history file for the checkpoint, and the REDO point is on the last
> TLI.  If it is correct, the patch still fails for the case (1), that
> is, the issue raised here. Anyway it would be useless (and rahter
> harmful) to initialize expectedTLEs based on receiveTLI there.
>
> So my resul there is:
>
> The commit fixed the case (2)

Yes, by maintaining the entire history instead of one entry if history
was missing.

> The fix caused the issue for the case (1).

Basically, before this commit, expectedTLEs and recoveryTargetTLI were
always in sync, which this commit broke.

> The proposed fix fixes the case (1), caused by the commit.

Right, I agree with the fix. So the fix should be just to change that
one line and initialize expectedTLEs from recoveryTargetTLI.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Mon, May 17, 2021 at 10:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, May 17, 2021 at 8:50 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > Before the commit expectedTLEs is always initialized with just one
> > entry for the TLI of the last checkpoint record.
>
> Right
>
> > (1) If XLogFileReadAnyTLI() found the segment but no history file
> > found, that is, using the dummy TLE-list, expectedTLEs is initialized
> > with the dummy one-entry list. So there's no behavioral change in this
> > aspect.
>
> Yeah, you are right.

But do you agree that the one-entry list will always be the checkpoint
timeline entry?  If you look at the code [1] below in
"readRecoveryCommandFile()", you will realize that once we come out of
this function, either "recoveryTargetTLI" is the checkpoint TLI,
whatever it was before calling this function, or we must have the
history file.  That means that if, after exiting this function, we
execute this line (expectedTLEs = readTimeLineHistory(recoveryTargetTLI);),
then either "expectedTLEs" points to one dummy entry which is nothing
but the checkpoint TLI entry, or it holds the complete history.

The commit is trying to say that without the history file the
checkpoint TLI will not be found in "expectedTLEs", because the older
TLI (the checkpoint TLI) is not recognized as an ancestor of the
target timeline (recoveryTargetTLI).  But ideally, either the target
timeline should be the same as the checkpoint timeline, or we must
have the history file, as I stated in the above paragraph.  Am I
missing something?

[1]
if (rtli)
{
  /* Timeline 1 does not have a history file, all else should */
  if (rtli != 1 && !existsTimeLineHistory(rtli))
   ereport(FATAL,
  (errmsg("recovery target timeline %u does not exist",
  rtli)));
  recoveryTargetTLI = rtli;
  recoveryTargetIsLatest = false;
}
else
{
  /* We start the "latest" search from pg_control's timeline */
  recoveryTargetTLI = findNewestTimeLine(recoveryTargetTLI);
  recoveryTargetIsLatest = true;
}


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Sat, May 15, 2021 at 1:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >  As I understand it, the general issue here was that if
> > XLogFileReadAnyTLI() was called before expectedTLEs got set, then
> > prior to this commit it would have to fail, because the foreach() loop
> > in that function would be iterating over an empty list. Heikki tried
> > to make it not fail in that case, by setting tles =
> > readTimeLineHistory(recoveryTargetTLI), so that the foreach loop
> > *wouldn't* get an empty list.
>
> I might be missing something but I don't agree with this logic.  If
> you see prior to this commit the code flow was like below[1].  So my
> point is if we are calling XLogFileReadAnyTLI() somewhere while
> reading the checkpoint record then note that expectedTLEs were
> initialized unconditionally before even try to read that checkpoint
> record.  So how expectedTLEs could be uninitialized in
> LogFileReadAnyTLI?

Sorry, you're right. It couldn't be uninitialized, but it could be a
fake 1-element list saying there are no ancestors rather than the real
value. So I think the point was to avoid that.

> Another point which I am not sure about but still I think that one
> line (expectedTLEs = readTimeLineHistory(receiveTLI);), somewhere
> related to the purpose of this commit.  Let me explain why do I think
> so.  Basically, before this commit, we were initializing
> "expectedTLEs" based on the history file of "recoveryTargetTLI", right
> after calling "readRecoveryCommandFile()" (this function will
> initialize recoveryTargetTLI based on the recovery target, and it
> ensures it read the respective history file).  Now, right after this
> point, there was a check as shown below[2], which is checking whether
> the checkpoint TLI exists in the  "expectedTLEs" which is initialized
> based on recoveryTargetTLI.  And it appeared that this check was
> failing in some cases which this commit tried to fix and all other
> code is there to support that.  Because now before going for reading
> the checkpoint we are not initializing "expectedTLEs" so now after
> moving this line from here it was possible that "expectedTLEs" is not
> initialized in XLogFileReadAnyTLI() and the remaining code in
> XLogFileReadAnyTLI() is to handle that part.

I think the issue here is: If expectedTLEs was initialized before the
history file was available, and contained a dummy 1-element list, then
tliOfPointInHistory() is going to say that every LSN is on that
timeline rather than any previous timeline. And if we are supposed to
be switching timelines then that will lead to this sanity check
failing.
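
For illustration, here is a standalone sketch (simplified types and an
invented tli_of_point() helper, not the PostgreSQL function) of how
the dummy one-entry list breaks that check: every LSN maps to the
list's only TLI, so a checkpoint that really lies on the older
timeline no longer matches the control file's checkpoint TLI.

/* Standalone sketch: with a dummy one-entry list every LSN is attributed
 * to that single timeline, so the "is the checkpoint on the TLI the control
 * file says" comparison fails whenever a timeline switch is involved. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
typedef unsigned int TimeLineID;

typedef struct
{
    TimeLineID  tli;
    XLogRecPtr  begin;          /* first LSN belonging to this timeline */
} TLE;

/* Invented helper: newest-first list, return the TLI owning "ptr". */
static TimeLineID
tli_of_point(XLogRecPtr ptr, const TLE *tles, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (ptr >= tles[i].begin)
            return tles[i].tli;
    }
    return tles[n - 1].tli;
}

int
main(void)
{
    TLE         dummy[] = {{2, 0}};     /* dummy list: "TLI 2 since forever" */
    TLE         real[] = {{2, 0x19000100}, {1, 0}}; /* switch to 2 at 0/19000100 */
    XLogRecPtr  checkPoint = 0x19000060;    /* checkpoint record, really on TLI 1 */
    TimeLineID  checkPointTLI = 1;  /* ControlFile->checkPointCopy.ThisTimeLineID */

    printf("dummy list attributes the checkpoint to TLI %u -> check %s\n",
           tli_of_point(checkPoint, dummy, 1),
           tli_of_point(checkPoint, dummy, 1) == checkPointTLI ? "passes" : "fails");
    printf("real history attributes it to TLI %u -> check %s\n",
           tli_of_point(checkPoint, real, 2),
           tli_of_point(checkPoint, real, 2) == checkPointTLI ? "passes" : "fails");
    return 0;
}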

> Now, coming to my point that why this one line is related,  In this
> one line (expectedTLEs = readTimeLineHistory(receiveTLI);), we
> completely avoiding recoveryTargetTLI and initializing "expectedTLEs"
> based on the history file of the TL from which we read the checkpoint,
> so now, there is no scope of below[2] check to fail because note that
> we are not initializing "expectedTLEs" based on the
> "recoveryTargetTLI" but we are initializing from the history from
> where we read checkpoint.

I agree, but that's actually bad, isn't it? I mean if we want the
sanity check to never fail we can just take it out. What we want to
happen is that the sanity check should pass if the TLI of the startup
checkpoint is in the history of the recovery target timeline, but fail
if it isn't. The only way to achieve that behavior is if expectedTLEs
is initialized from the recovery target timeline.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Tue, May 18, 2021 at 1:28 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Sorry, you're right. It couldn't be uninitialized, but it could be a
> fake 1-element list saying there are no ancestors rather than the real
> value. So I think the point was to avoid that.

Yeah, it will be a fake 1-element list.  But just to be clear, that
one element can only be "ControlFile->checkPointCopy.ThisTimeLineID"
and nothing else; do you agree with this?  We initialize
recoveryTargetTLI to this value, and we might change it in
readRecoveryCommandFile(), but for that we must have the history file.
So if we are talking about the case where we don't have the history
file, then "recoveryTargetTLI" will still be
"ControlFile->checkPointCopy.ThisTimeLineID".

>
> I think the issue here is: If expectedTLEs was initialized before the
> history file was available, and contained a dummy 1-element list, then
> tliOfPointInHistory() is going to say that every LSN is on that
> timeline rather than any previous timeline. And if we are supposed to
> be switching timelines then that will lead to this sanity check
> failing.

You are talking about the sanity check that validates the timeline of
the checkpoint record, right?  But as I mentioned earlier, the only
entry in expectedTLEs will be the TLE of the checkpoint record, so how
would the sanity check fail?

>
> I agree, but that's actually bad, isn't it?

Yes, it is bad.

> I mean if we want the
> sanity check to never fail we can just take it out. What we want to
> happen is that the sanity check should pass if the startup timeline if
> the TLI of the startup checkpoint is in the history of the recovery
> target timeline, but fail if it isn't. The only way to achieve that
> behavior is if expectedTLEs is initialized from the recovery target
> timeline.

Yes, I agree with this.  So initializing expectedTLEs with the
recovery target timeline is the right fix.

Conclusion:
- I think we now agree that initializing expectedTLEs with the
recovery target timeline is the right fix.
- We still have some differences of opinion about what the original
problem in the base code was that commit
ee994272ca50f70b53074f0febaec97e28f83c4e fixed.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Mon, 17 May 2021 10:46:24 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Mon, May 17, 2021 at 10:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, May 17, 2021 at 8:50 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > >
> > > Before the commit expectedTLEs is always initialized with just one
> > > entry for the TLI of the last checkpoint record.
> >
> > Right
> >
> > > (1) If XLogFileReadAnyTLI() found the segment but no history file
> > > found, that is, using the dummy TLE-list, expectedTLEs is initialized
> > > with the dummy one-entry list. So there's no behavioral change in this
> > > aspect.
> >
> > Yeah, you are right.
> 
> But do you agree that one line entry will always be a checkpoint
> timeline entry?  Because if you notice below code[1] in function
> "readRecoveryCommandFile();", then you will realize that once we come
> out of this function either the "recoveryTargetTLI" is checkpoint TL
> wherever it was before calling this function or we must have the
> history file.  That means after exiting this function if we execute
> this line (expectedTLEs = readTimeLineHistory(recoveryTargetTLI);)
> that means either "expectedTLEs" could point to one dummy entry which
> will be nothing but the checkpoint TL entry or it will be holding
> complete history.

Right.

> The patch is trying to say that without the history file the
> checkpoint TL will not be found in "expectedTLEs" because the older TL
> (checkpoint TL) is not the ancestor of the target
> timeline(recoveryTargetTLI).  But ideally, either the target timeline
> should be the same as the checkpoint timeline or we must have the
> history file as I stated in the above paragraph.  Am I missing
> something?

Yeah, that has been the most mysterious point here.  So I searched for
a situation where the one-entry expectedTLEs does not work.

I vaguely believed that there's a case where the REDO point of a
checkpoint is on the timeline previous to that of the checkpoint
record.  The previous discussion is based on this case, but that
doesn't seem to happen.  The last replayed checkpoint record (the one
that causes a restartpoint) is found before promotion, and the first
checkpoint starts after promotion.

A little while ago I tried to create a situation where a checkpoint
record is placed on the timeline previous to the TLI written in the
control file. But the control file is always written after the
checkpoint record is flushed.


I restarted my investigation from this:

ee994272ca:
>    There is at least one scenario where this makes a difference: if you take
>    a base backup from a standby server right after a timeline switch, the
>    WAL segment containing the initial checkpoint record will begin with an
>    older timeline ID. Without the timeline history file, recovering that file

And finally I think I could reach the situation the commit wanted to fix.

I took a base backup from a standby just before it replayed the first
checkpoint of the new timeline (by using a debugger), without copying
pg_wal.  In this backup, the control file contains a checkPointCopy
from the previous timeline.

I modified StartupXLOG so that expectedTLEs is set just after
recoveryTargetTLI is first determined, then started the grandchild
node.  I got the following error and the server fails to continue
replication.

[postmaster] LOG:  starting PostgreSQL 14beta1 on x86_64-pc-linux-gnu...
[startup] LOG:  database system was interrupted while in recovery at log...
[startup] LOG:  set expectedtles tli=6, length=1
[startup] LOG:  Probing history file for TLI=7
[startup] LOG:  entering standby mode
[startup] LOG:  scanning segment 3 TLI 6, source 0
[startup] LOG:  Trying fetching history file for TLI=6
[walreceiver] LOG:  fetching timeline history file for timeline 5 from pri...
[walreceiver] LOG:  fetching timeline history file for timeline 6 from pri...
[walreceiver] LOG:  started streaming ... primary at 0/3000000 on timeline 5
[walreceiver] DETAIL:  End of WAL reached on timeline 5 at 0/30006E0.
[startup] LOG:  unexpected timeline ID 1 in log segment 000000050000000000000003, offset 0
[startup] LOG:  Probing history file for TLI=7
[startup] LOG:  scanning segment 3 TLI 6, source 0
(repeats forever)

This seems like the behavior the patch wanted to fix. (I'm not sure
precisely what happened at the time of the "unexpected timeline ID
1.." message, though.  The line is seen only just after the first connection.)

>    will fail as the older timeline ID is not recognized to be an ancestor of
>    the target timeline. If you try to recover from such a backup, using only
>    streaming replication to fetch the WAL, this patch is required for that to
>    work.

After I reverted the modification, I got the following behavior
instead from the same backup.

[postmaster] LOG:  starting PostgreSQL 14beta1 on x86_64-...
[startup] JST LOG:  database system was interrupted while in recovery at log time 2021-05-18 13:45:59 JST
[startup] JST LOG:  Probing history file for TLI=7
[startup] JST LOG:  entering standby mode
[startup] JST LOG:  Loading history file for TLI=6 (2)
[startup] JST LOG:  Trying reading history file for TLI=6
[startup] JST LOG:  scanning segment 3 TLI 6, source 0
[startup] JST LOG:  Trying fetching history file for TLI=6
[walreceiver] JST LOG:  fetching timeline history file for timeline 5 fro...
[walreceiver] JST LOG:  fetching timeline history file for timeline 6 fro...
[walreceiver] JST LOG:  started streaming ... primary at 0/3000000 on timeline 5
[walreceiver] JST LOG:  replication terminated by primary server
[walreceiver] JST DETAIL:  End of WAL reached on timeline 5 at 0/30006E0.
[startup] LOG:  Loading expectedTLEs for 5
[startup] LOG:  Trying reading history file for TLI=5
[startup] LOG:  Checkpoint record: TLI=5, 0/3000668, rectargetTLI=6, exptles=0x3322a60
[startup] FATAL:  requested timeline 6 does not contain minimum recovery point 0/30007C0 on timeline 6
[postmaster] LOG:  startup process (PID 76526) exited with exit code 1
[postmaster] LOG:  aborting startup due to startup process failure
[postmaster] LOG:  database system is shut down

It aborts.  Yeah, this is the same issue as the one raised here.  So
I'm still not sure I confirmed the case exactly (since the problem is
still seen... but I don't want to bother building that version)...
Anyway, reading the history file for recoveryTargetTLI instead of
receiveTLI fixes it.

FWIW, you can get a problematic base backup by the following steps.

0. (make sure /tmp/hoge is removed)
1. apply the attached patch
2. create a primary then start
3. create a standby then start
4. place standby.signal to the primary, then restart it.
5. place the file /tmp/hoge.
6. promote the "primary".
7. You will see a log line like this
   LOG:  WAIT START: CHECKPOINT_ONLINE: TLI=2
8. Take a base backup (without copying WAL files)


> [1]
> if (rtli)
> {
>   /* Timeline 1 does not have a history file, all else should */
>   if (rtli != 1 && !existsTimeLineHistory(rtli))
>    ereport(FATAL,
>   (errmsg("recovery target timeline %u does not exist",
>   rtli)));
>   recoveryTargetTLI = rtli;
>   recoveryTargetIsLatest = false;
> }
> else
> {
>   /* We start the "latest" search from pg_control's timeline */
>   recoveryTargetTLI = findNewestTimeLine(recoveryTargetTLI);
>   recoveryTargetIsLatest = true;
> }

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..9483fd055c 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -55,6 +55,7 @@ restoreTimeLineHistoryFiles(TimeLineID begin, TimeLineID end)
 
     for (tli = begin; tli < end; tli++)
     {
+        elog(LOG, "Trying restoring history file for TLI=%d", tli);
         if (tli == 1)
             continue;
 
@@ -95,6 +96,7 @@ readTimeLineHistory(TimeLineID targetTLI)
 
     if (ArchiveRecoveryRequested)
     {
+        elog(LOG, "Trying reading history file for TLI=%d", targetTLI);
         TLHistoryFileName(histfname, targetTLI);
         fromArchive =
             RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
@@ -231,6 +233,7 @@ existsTimeLineHistory(TimeLineID probeTLI)
 
     if (ArchiveRecoveryRequested)
     {
+        elog(LOG, "Probing history file for TLI=%d", probeTLI);
         TLHistoryFileName(histfname, probeTLI);
         RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
     }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8d163f190f..afd6a0ce0a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3726,7 +3726,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
             snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
                      xlogfname);
             set_ps_display(activitymsg);
-
+            elog(LOG, "Trying fetching history file for TLI=%d", tli);
             restoredFromArchive = RestoreArchivedFile(path, xlogfname,
                                                       "RECOVERYXLOG",
                                                       wal_segment_size,
@@ -3820,7 +3820,10 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
     if (expectedTLEs)
         tles = expectedTLEs;
     else
+    {
+        elog(LOG, "Loading history file for TLI=%d (2)", recoveryTargetTLI);
         tles = readTimeLineHistory(recoveryTargetTLI);
+    }
 
     foreach(cell, tles)
     {
@@ -3834,6 +3837,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
          * Skip scanning the timeline ID that the logfile segment to read
          * doesn't belong to
          */
+        elog(LOG, "scanning segment %lX TLI %d, source %d", segno, tli, source);
+
         if (hent->begin != InvalidXLogRecPtr)
         {
             XLogSegNo    beginseg = 0;
@@ -3860,6 +3865,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
                               XLOG_FROM_ARCHIVE, true);
             if (fd != -1)
             {
+                elog(LOG, "found segment %lX TLI %d, from archive", segno, tli);
                 elog(DEBUG1, "got WAL segment from archive");
                 if (!expectedTLEs)
                     expectedTLEs = tles;
@@ -3873,6 +3879,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
                               XLOG_FROM_PG_WAL, true);
             if (fd != -1)
             {
+                elog(LOG, "found segment %lX TLI %d, from PG_WAL", segno, tli);
                 if (!expectedTLEs)
                     expectedTLEs = tles;
                 return fd;
@@ -6577,6 +6584,8 @@ StartupXLOG(void)
     else
         recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
 
+    expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
+    elog(LOG, "set expectedtles %d, %d", recoveryTargetTLI, list_length(expectedTLEs));
     /*
      * Check for signal files, and if so set up state for offline recovery
      */
@@ -6866,11 +6875,19 @@ StartupXLOG(void)
     if (!XLogRecPtrIsInvalid(ControlFile->minRecoveryPoint) &&
         tliOfPointInHistory(ControlFile->minRecoveryPoint - 1, expectedTLEs) !=
         ControlFile->minRecoveryPointTLI)
+    {
+        ListCell *lc;
+        foreach (lc, expectedTLEs)
+        {
+            TimeLineHistoryEntry *tle = (TimeLineHistoryEntry *) lfirst(lc);
+            elog(LOG, "TLE %d {%X/%X - %X/%X}", tle->tli, LSN_FORMAT_ARGS(tle->begin), LSN_FORMAT_ARGS(tle->end));
+        }
         ereport(FATAL,
                 (errmsg("requested timeline %u does not contain minimum recovery point %X/%X on timeline %u",
                         recoveryTargetTLI,
                         LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint),
                         ControlFile->minRecoveryPointTLI)));
+    }
 
     LastRec = RecPtr = checkPointLoc;
 
@@ -8396,7 +8413,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 
     XLogBeginRead(xlogreader, RecPtr);
     record = ReadRecord(xlogreader, LOG, true);
-
+    elog(LOG, "Checkpoint record: TLI=%d, %X/%X, rectargetTLI=%d, exptles=%p", xlogreader->seg.ws_tli, LSN_FORMAT_ARGS(xlogreader->ReadRecPtr), recoveryTargetTLI, expectedTLEs);
 
     if (record == NULL)
     {
         if (!report)
@@ -10211,6 +10228,19 @@ xlog_redo(XLogReaderState *record)
         CheckPoint    checkPoint;
 
         memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
+        {
+            struct stat b;
+            bool f = true;
+            while (stat("/tmp/hoge", &b) == 0)
+            {
+                if (f)
+                    elog(LOG, "WAIT START: CHECKPOINT_ONLINE: TLI=%d", checkPoint.ThisTimeLineID);
+                f = false;
+                sleep(1);
+            }
+            if (!f)
+                elog(LOG, "WAIT END: CHECKPOINT_ONLINE");
+        }
         /* In an ONLINE checkpoint, treat the XID counter as a minimum */
         LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
         if (FullTransactionIdPrecedes(ShmemVariableCache->nextXid,
@@ -12595,7 +12625,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                              * TLI, rather than the position we're reading.
                              */
                             tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
-
+                            elog(LOG, "%X/%X is on TLI %X", LSN_FORMAT_ARGS(tliRecPtr), tli);
                             if (curFileTLI > 0 && tli < curFileTLI)
                                 elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but
previousrecovered WAL file came from timeline %u",
 
                                      LSN_FORMAT_ARGS(tliRecPtr),
@@ -12662,7 +12692,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                         if (readFile < 0)
                         {
                             if (!expectedTLEs)
-                                expectedTLEs = readTimeLineHistory(receiveTLI);
+                            {
+                                
+                                elog(LOG, "Loading expectedTLEs for %d (%d)", recoveryTargetTLI, receiveTLI);
+                                expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
+                            }
                             readFile = XLogFileRead(readSegNo, PANIC,
                                                     receiveTLI,
                                                     XLOG_FROM_STREAM, false);

Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Tue, 18 May 2021 15:52:07 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> FWIW, you can get a problematic base backup by the following steps.
> 
> 0. (make sure /tmp/hoge is removed)
> 1. apply the attached patch
> 2. create a primary then start
> 3. create a standby then start
> 4. place standby.signal to the primary, then restart it.
> 5. place the file /tmp/hoge.
> 6. promote the "primary".
> 7. You will see a log line like this
>    LOG:  WAIT START: CHECKPOINT_ONLINE: TLI=2
> 8. Take a base backup (without copying WAL files)

I carelessly left the "modification" uncommented in the diff file.

@@ -6577,6 +6584,8 @@ StartupXLOG(void)
     else
         recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
 
+    expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
+    elog(LOG, "set expectedtles %d, %d", recoveryTargetTLI, list_length(expectedTLEs));

Disabling those lines shows the result of the original fix.

regards.
-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Tue, May 18, 2021 at 12:22 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

> And finally I think I could reach the situation the commit wanted to fix.
>
> I took a basebackup from a standby just before replaying the first
> checkpoint of the new timeline (by using debugger), without copying
> pg_wal.  In this backup, the control file contains checkPointCopy of
> the previous timeline.
>
> I modified StartupXLOG so that expectedTLEs is set just after first
> determining recoveryTargetTLI, then started the grandchild node.  I
> got the following error and the server failed to continue replication.

> [postmaster] LOG:  starting PostgreSQL 14beta1 on x86_64-pc-linux-gnu...
> [startup] LOG:  database system was interrupted while in recovery at log...
> [startup] LOG:  set expectedtles tli=6, length=1
> [startup] LOG:  Probing history file for TLI=7
> [startup] LOG:  entering standby mode
> [startup] LOG:  scanning segment 3 TLI 6, source 0
> [startup] LOG:  Trying fetching history file for TLI=6
> [walreceiver] LOG:  fetching timeline history file for timeline 5 from pri...
> [walreceiver] LOG:  fetching timeline history file for timeline 6 from pri...
> [walreceiver] LOG:  started streaming ... primary at 0/3000000 on timeline 5
> [walreceiver] DETAIL:  End of WAL reached on timeline 5 at 0/30006E0.
> [startup] LOG:  unexpected timeline ID 1 in log segment 000000050000000000000003, offset 0
> [startup] LOG:  Probing history file for TLI=7
> [startup] LOG:  scanning segment 3 TLI 6, source 0
> (repeats forever)

So IIUC, these logs show that
"ControlFile->checkPointCopy.ThisTimeLineID" is 6 but the
"ControlFile->checkPoint" record is on TL 5?  I think if you had the
old version of the code (before the commit), or the code below [1]
right after initializing expectedTLEs, then you would have hit the
FATAL that the patch fixed.

While debugging, did you check the "ControlFile->checkPoint" LSN vs.
the first LSN of the first segment with TL 6?

expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
[1]
if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) !=
    ControlFile->checkPointCopy.ThisTimeLineID)
{
    ereport(FATAL, ...);
}

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Tue, May 18, 2021 at 1:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Yeah, it will be a fake 1-element list.  But just to be clear that
> 1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and
> nothing else, do you agree to this?  Because we initialize
> recoveryTargetTLI to this value and we might change it in
> readRecoveryCommandFile() but for that, we must get the history file,
> so if we are talking about the case where we don't have the history
> file then "recoveryTargetTLI" will still be
> "ControlFile->checkPointCopy.ThisTimeLineID".

I don't think your conclusion is correct. As I understand it, you're
talking about the state before
ee994272ca50f70b53074f0febaec97e28f83c4e, because as of now
readRecoveryCommandFile() no longer exists in master. Before that
commit, StartupXLOG did this:

        recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
        readRecoveryCommandFile();
        expectedTLEs = readTimeLineHistory(recoveryTargetTLI);

Now, readRecoveryCommandFile() can change recoveryTargetTLI. Before
doing so, it will call existsTimeLineHistory if
recovery_target_timeline was set to an integer, and findNewestTimeLine
if it was set to latest. In the first case, existsTimeLineHistory()
calls RestoreArchivedFile(), but that restores it using a temporary
name, and KeepFileRestoredFromArchive is not called, so we might have
the timeline history in RECOVERYHISTORY but that's not the filename
we're actually going to try to read from inside readTimeLineHistory().
In the second case, findNewestTimeLine() will call
existsTimeLineHistory() which results in the same situation. So I
think when readRecoveryCommandFile() returns expectedTLI can be
different but the history file can be absent since it was only ever
restored under a temporary name.

> Conclusion:
> - I think now we agree on the point that initializing expectedTLEs
> with the recovery target timeline is the right fix.
> - We still have some differences of opinion about what was the
> original problem in the base code which was fixed by the commit
> (ee994272ca50f70b53074f0febaec97e28f83c4e).

I am also still concerned about whether we understand in exactly what
cases the current logic doesn't work. We seem to more or less agree on
the fix, but I don't think we really understand precisely what case we
are fixing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Thu, 20 May 2021 13:49:10 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Tue, May 18, 2021 at 1:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Yeah, it will be a fake 1-element list.  But just to be clear that
> > 1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and
> > nothing else, do you agree to this?  Because we initialize
> > recoveryTargetTLI to this value and we might change it in
> > readRecoveryCommandFile() but for that, we must get the history file,
> > so if we are talking about the case where we don't have the history
> > file then "recoveryTargetTLI" will still be
> > "ControlFile->checkPointCopy.ThisTimeLineID".
> 
> I don't think your conclusion is correct. As I understand it, you're
> talking about the state before
> ee994272ca50f70b53074f0febaec97e28f83c4e, because as of now
> readRecoveryCommandFile() no longer exists in master. Before that
> commit, StartupXLOG did this:
> 
>         recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
>         readRecoveryCommandFile();
>         expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
> 
> Now, readRecoveryCommandFile() can change recoveryTargetTLI. Before
> doing so, it will call existsTimeLineHistory if
> recovery_target_timeline was set to an integer, and findNewestTimeLine
> if it was set to latest. In the first case, existsTimeLineHistory()
> calls RestoreArchivedFile(), but that restores it using a temporary
> name, and KeepFileRestoredFromArchive is not called, so we might have
> the timeline history in RECOVERYHISTORY but that's not the filename
> we're actually going to try to read from inside readTimeLineHistory().
> In the second case, findNewestTimeLine() will call
> existsTimeLineHistory() which results in the same situation. So I
> think when readRecoveryCommandFile() returns expectedTLI can be
> different but the history file can be absent since it was only ever
> restored under a temporary name.

Anyway it seems that the commit tried to fix an issue that happens
without using a WAL archive.

https://www.postgresql.org/message-id/50E43C57.5050101%40vmware.com

> That leaves one case not covered: If you take a backup with plain 
> "pg_basebackup" from a standby, without -X, and the first WAL segment 
> contains a timeline switch (ie. you take the backup right after a 
> failover), and you try to recover from it without a WAL archive, it 
> doesn't work. This is the original issue that started this thread, 
> except that I used "-x" in my original test case. The problem here is 
> that even though streaming replication will fetch the timeline history 
> file when it connects, at the very beginning of recovery, we expect that 
> we already have the timeline history file corresponding the initial 
> timeline available, either in pg_xlog or the WAL archive. By the time 
> streaming replication has connected and fetched the history file, we've 
> already initialized expectedTLEs to contain just the latest TLI. To fix 
> that, we should delay calling readTimeLineHistoryFile() until streaming 
> replication has connected and fetched the file.
> If the first segment read by recovery contains a timeline switch, the first
> pages have an older timeline than the segment timeline. However, if
> expectedTLEs contained only the segment timeline, we cannot know if
> we can use the record.  In that case the following error is issued.

If expectedTLEs is initialized with the pseudo list,
tliOfPointInHistory always returns recoveryTargetTLI regardless of
the given LSN, so the later timeline checks don't work. And later
ReadRecord is surprised to see a page of an unknown timeline.

"unexpected timeline ID 1 in log segment"

So the objective is to initialize expectedTLEs with the right content
of the history file for recoveryTargetTLI by the time ReadRecord
fetches the first record.  After the fix, things work as follows.

- recoveryTargetTLI is initialized with
  ControlFile->checkPointCopy.ThisTimeLineID

- readRecoveryCommandFile():

  Move recoveryTargetTLI forward to the specified target timeline if
  the history file for that timeline is found, or, in the case of
  latest, move it forward up to the newest timeline among the history
  files found in either pg_wal or the archive.

  !!! In any case, recoveryTargetTLI won't go back behind the checkpoint
  TLI.

- ReadRecord...XLogFileReadAnyTLI

  Tries to load the history file for recoveryTargetTLI either from
  pg_wal or the archive into a local TLE list; if the history file is
  not found, it uses a generated list with one entry for
  recoveryTargetTLI.

  (a) If the specified segment file for any timeline in the TLE list
    is found, expectedTLEs is initialized with the local list. No need
    to worry about expectedTLEs any longer.

  (b) If such a segment is *not* found, expectedTLEs is left
    NIL. Usually recoveryTargetTLI is equal to the last checkpoint
    TLI.

  (c) However, in the case where a timeline switch happened in the
    segment and recoveryTargetTLI has been increased, that is, the
    history file for recoveryTargetTLI is found in pg_wal or the
    archive (the issue raised here), recoveryTargetTLI becomes a
    future timeline of the checkpoint TLI.

  (d) If the history file for recoveryTargetTLI is *not* found but
    the segment file is found, expectedTLEs is initialized with the
    generated list, which doesn't contain past timelines. In this
    case, recoveryTargetTLI has not moved from its initial value, the
    checkpoint TLI. If the REDO point is before a timeline switch,
    the page causes a FATAL in ReadRecord later.  However, I think there
    cannot be a case where the segment file is found before the
    corresponding history file.  (Except for TLI=1, which is no problem.)

- WaitForWALToBecomeAvailable

  If we have no segments for the last checkpoint, initiate streaming
  from the REDO point of the last checkpoint. We should have all the
  history files by the time we receive segment data.

  After sufficient WAL data has been received, the only cases where
  expectedTLEs is still NIL are (b) and (c) above.

  In the case of (b) recoveryTargetTLI == checkpoint TLI.

  In the case of (c) recoveryTargetTLI > checkpoint TLI.  In this case
  we expect that the checkpoint TLI is in the history of
  recoveryTargetTLI; otherwise recovery fails.  This case is similar
  to case (a), but the relationship between recoveryTargetTLI and
  the checkpoint TLI is not confirmed yet. ReadRecord barks later if
  they are not compatible, so there is no serious problem, but it might
  be better to check the relationship there.  My first proposal
  performed a mutual check between the two, but we need to check only
  unidirectionally.

  if (readFile < 0)
  {
     if (!expectedTLEs)
     {
        expectedTLEs = readTimeLineHistory(receiveTLI);
+       if (!tliOfPointInHistory(receiveTLI, expectedTLEs))
+          ereport(ERROR, "the received timeline %d is not found in the history file for timeline %d");


> > Conclusion:
> > - I think now we agree on the point that initializing expectedTLEs
> > with the recovery target timeline is the right fix.
> > - We still have some differences of opinion about what was the
> > original problem in the base code which was fixed by the commit
> > (ee994272ca50f70b53074f0febaec97e28f83c4e).
> 
> I am also still concerned about whether we understand in exactly what
> cases the current logic doesn't work. We seem to more or less agree on
> the fix, but I don't think we really understand precisely what case we
> are fixing.

Does the discussion above make sense?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Fri, 21 May 2021 11:21:05 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 20 May 2021 13:49:10 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
>   In the case of (c) recoveryTargetTLI > checkpoint TLI.  In this case
>   we expect that the checkpoint TLI is in the history of
>   recoveryTargetTLI; otherwise recovery fails.  This case is similar
>   to case (a), but the relationship between recoveryTargetTLI and
>   the checkpoint TLI is not confirmed yet. ReadRecord barks later if
>   they are not compatible, so there is no serious problem, but it might
>   be better to check the relationship there.  My first proposal
>   performed a mutual check between the two, but we need to check only
>   unidirectionally.
> 
>   if (readFile < 0)
>   {
>      if (!expectedTLEs)
>      {
>         expectedTLEs = readTimeLineHistory(receiveTLI);
> +       if (!tliOfPointInHistory(receiveTLI, expectedTLEs))
> +          ereport(ERROR, "the received timeline %d is not found in the history file for timeline %d");
> 
> 
> > > Conclusion:
> > > - I think now we agree on the point that initializing expectedTLEs
> > > with the recovery target timeline is the right fix.
> > > - We still have some differences of opinion about what was the
> > > original problem in the base code which was fixed by the commit
> > > (ee994272ca50f70b53074f0febaec97e28f83c4e).
> > 
> > I am also still concerned about whether we understand in exactly what
> > cases the current logic doesn't work. We seem to more or less agree on
> > the fix, but I don't think we really understand precisely what case we
> > are fixing.
> 
> Does the discussion above make sense?

This is a revised version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 6664808340494dffc775e21cfa1029d6dfb8452e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 21 May 2021 16:24:14 +0900
Subject: [PATCH v3] Set expectedTLEs correctly based on recoveryTargetTLI

When a standby starts streaming before expectedTLEs is determined, it
is initialized based on the timeline of the WAL segment just streamed
from the primary.  If the standby fetches the starting checkpoint from
a timeline older than recoveryTargetTLI, this behavior leads to
setting expectedTLEs based on that older timeline.  That is
inconsistent with the definition of the variable and prevents later
calls to rescanLatestTimeLine from updating recoveryTargetTLI and
expectedTLEs correctly. If that happens, the standby stops following
the upstream.

This behavior was introduced by commit ee994272ca, but there seems to
be no reason for it and it looks like a typo. Since the relationship
between the two timeline IDs is not verified until then, also add a
check of that relationship for safety's sake.
---
 src/backend/access/transam/xlog.c          | 35 ++++++++++----
 src/test/recovery/t/004_timeline_switch.pl | 55 +++++++++++++++++++++-
 2 files changed, 79 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 441a9124cd..fda729c63f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -12659,22 +12659,37 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                          * info is set correctly and XLogReceiptTime isn't
                          * changed.
                          */
-                        if (readFile < 0)
-                        {
-                            if (!expectedTLEs)
-                                expectedTLEs = readTimeLineHistory(receiveTLI);
-                            readFile = XLogFileRead(readSegNo, PANIC,
-                                                    receiveTLI,
-                                                    XLOG_FROM_STREAM, false);
-                            Assert(readFile >= 0);
-                        }
-                        else
+                        if (readFile >= 0)
                         {
                             /* just make sure source info is correct... */
                             readSource = XLOG_FROM_STREAM;
                             XLogReceiptSource = XLOG_FROM_STREAM;
                             return true;
                         }
+
+                        readFile = XLogFileRead(readSegNo, PANIC,
+                                                receiveTLI,
+                                                XLOG_FROM_STREAM, false);
+                        Assert(readFile >= 0);
+
+                        if (expectedTLEs)
+                            break;
+
+                        expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
+
+                        /*
+                         * We haven't checked the relationship between the
+                         * receiveTLI and recoveryTargetTLI. Make sure that
+                         * receiveTLI is in the history for
+                         * recoveryTargetTLI. We don't need to do that
+                         * hereafter since recoveryTargetTLI and expectedTLEs
+                         * will be kept in sync.
+                         */
+                        if (!tliOfPointInHistory(receiveTLI, expectedTLEs))
+                            ereport(ERROR,
+                                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                                     errmsg("history file of timeline %d is inconsistent with the current timeline %d",
+                                            recoveryTargetTLI, receiveTLI)));
                         break;
                     }
 
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index c101980e9e..357f3bf8fa 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -90,7 +90,14 @@ $node_primary_2->backup($backup_name);
 # Create standby node
 my $node_standby_3 = get_new_node('standby_3');
 $node_standby_3->init_from_backup($node_primary_2, $backup_name,
-    has_streaming => 1);
+    has_streaming => 1, allows_streaming => 1, has_archiving => 1);
+my $archivedir_standby_3 = $node_standby_3->archive_dir;
+$node_standby_3->append_conf(
+    'postgresql.conf', qq(
+archive_mode = always
+archive_command = 'cp "%p" "$archivedir_standby_3/%f"'
+log_line_prefix = '%m [%p:%b] %q%a '
+));
 
 # Restart primary node in standby mode and promote it, switching it
 # to a new timeline.
@@ -109,3 +116,49 @@ $node_primary_2->wait_for_catchup($node_standby_3, 'replay',
 my $result_2 =
   $node_standby_3->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 is($result_2, qq(1), 'check content of standby 3');
+
+
+# Check that the server can start from a backup taken from a standby that
+# contains a timeline switch in the middle of the first segment.
+
+# set up a standby that connects to standby_3
+my $node_standby_4 = get_new_node('standby_4');
+$node_standby_4->init_from_backup($node_primary_2, $backup_name,
+                                  has_streaming => 1);
+$node_standby_4->start;
+
+$node_primary_2->psql('postgres', qq[
+                      CREATE TABLE t (a int);
+                      SELECT pg_switch_wal();
+                      INSERT INTO t VALUES (0);
+                      CHECKPOINT;
+]);
+$node_primary_2->wait_for_catchup($node_standby_3, 'write',
+                                $node_primary_2->lsn('insert'));
+$node_primary_2->wait_for_catchup($node_standby_4, 'write',
+                                $node_primary_2->lsn('insert'));
+
+# promote standby_3, then connect standby_4 to standby_3. To force
+# archive fetching, remove the pg_wal files of the new standby.
+$node_standby_4->stop;
+$node_standby_3->psql('postgres', 'SELECT pg_promote()');
+$node_standby_4->enable_streaming($node_standby_3);
+$node_standby_4->append_conf(
+    'postgresql.conf',
+    qq[restore_command = 'cp "$archivedir_standby_3/%f" "%p"']);
+
+# clean up pg_wal on the second standby
+my $pgwaldir = $node_standby_4->data_dir. "/pg_wal";
+opendir my $dh, $pgwaldir or die "failed to open $pgwaldir";
+while (my $f = readdir($dh))
+{
+    unlink("$pgwaldir/$f") if (-f "$pgwaldir/$f");
+}
+closedir($dh);
+
+$node_standby_4->start;
+
+# check if replication is working
+$node_standby_3->psql('postgres', 'INSERT INTO t VALUES(1)');
+$node_standby_3->wait_for_catchup($node_standby_4, 'write',
+                                  $node_standby_3->lsn('insert'));
-- 
2.27.0


Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Thu, May 20, 2021 at 11:19 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, May 18, 2021 at 1:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Yeah, it will be a fake 1-element list.  But just to be clear that
> > 1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and
> > nothing else, do you agree to this?  Because we initialize
> > recoveryTargetTLI to this value and we might change it in
> > readRecoveryCommandFile() but for that, we must get the history file,
> > so if we are talking about the case where we don't have the history
> > file then "recoveryTargetTLI" will still be
> > "ControlFile->checkPointCopy.ThisTimeLineID".
>
> I don't think your conclusion is correct. As I understand it, you're
> talking about the state before
> ee994272ca50f70b53074f0febaec97e28f83c4e,

Right, I am talking about before this commit.

> because as of now
> readRecoveryCommandFile() no longer exists in master. Before that
> commit, StartupXLOG did this:
>
>         recoveryTargetTLI = ControlFile->checkPointCopy.ThisTimeLineID;
>         readRecoveryCommandFile();
>         expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
>
> Now, readRecoveryCommandFile() can change recoveryTargetTLI. Before
> doing so, it will call existsTimeLineHistory if
> recovery_target_timeline was set to an integer, and findNewestTimeLine
> if it was set to latest. In the first case, existsTimeLineHistory()
> calls RestoreArchivedFile(), but that restores it using a temporary
> name, and KeepFileRestoredFromArchive is not called,

Yes, I agree with this.

> so we might have
> the timeline history in RECOVERYHISTORY but that's not the filename
> we're actually going to try to read from inside readTimeLineHistory().
> In the second case, findNewestTimeLine() will call
> existsTimeLineHistory() which results in the same situation. So I
> think when readRecoveryCommandFile() returns expectedTLI can be
> different but the history file can be absent since it was only ever
> restored under a temporary name.

I agree that readTimeLineHistory() will not look for that filename,
but it will also try to get the file using (RestoreArchivedFile(path,
histfname, "RECOVERYHISTORY", 0, false)).  So after we check the
history file existence in existsTimeLineHistory(), if the file got
removed from the archive (not sure how) then it is possible that now
readTimeLineHistory() will not find that history file again.  Am I
missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Fri, May 21, 2021 at 7:51 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> https://www.postgresql.org/message-id/50E43C57.5050101%40vmware.com
>
> > That leaves one case not covered: If you take a backup with plain
> > "pg_basebackup" from a standby, without -X, and the first WAL segment
> > contains a timeline switch (ie. you take the backup right after a
> > failover), and you try to recover from it without a WAL archive, it
> > doesn't work. This is the original issue that started this thread,
> > except that I used "-x" in my original test case. The problem here is
> > that even though streaming replication will fetch the timeline history
> > file when it connects, at the very beginning of recovery, we expect that
> > we already have the timeline history file corresponding the initial
> > timeline available, either in pg_xlog or the WAL archive. By the time
> > streaming replication has connected and fetched the history file, we've
> > already initialized expectedTLEs to contain just the latest TLI. To fix
> > that, we should delay calling readTimeLineHistoryFile() until streaming
> > replication has connected and fetched the file.
> > If the first segment read by recovery contains a timeline switch, the first
> > pages have an older timeline than the segment timeline. However, if
> > expectedTLEs contained only the segment timeline, we cannot know if
> > we can use the record.  In that case the following error is issued.
>
> If expectedTLEs is initialized with the pseudo list,
> tliOfPointInHistory always returns recoveryTargetTLI regardless of
> the given LSN, so the later timeline checks don't work. And later
> ReadRecord is surprised to see a page of an unknown timeline.

From this whole discussion (on the thread you linked), IIUC the issue
was that the checkpoint LSN does not exist on
"ControlFile->checkPointCopy.ThisTimeLineID".  If that is true then I
agree that we will just initialize expectedTLEs based on the one-entry
list (ControlFile->checkPointCopy.ThisTimeLineID) and later we will
fail to find the checkpoint record on this timeline because the
checkpoint LSN is smaller than the start LSN of this timeline.  Right?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Fri, May 21, 2021 at 10:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > so we might have
> > the timeline history in RECOVERYHISTORY but that's not the filename
> > we're actually going to try to read from inside readTimeLineHistory().
> > In the second case, findNewestTimeLine() will call
> > existsTimeLineHistory() which results in the same situation. So I
> > think when readRecoveryCommandFile() returns expectedTLI can be
> > different but the history file can be absent since it was only ever
> > restored under a temporary name.
>
> I agree that readTimeLineHistory() will not look for that filename,
> but it will also try to get the file using (RestoreArchivedFile(path,
> histfname, "RECOVERYHISTORY", 0, false)).  So after we check the
> history file existence in existsTimeLineHistory(), if the file got
> removed from the archive (not sure how) then it is possible that now
> readTimeLineHistory() will not find that history file again.  Am I
> missing something?

That sounds right.

I've lost the thread of what we're talking about here a bit. I think
what we've established is that, when running a commit prior to
ee994272ca50f70b53074f0febaec97e28f83c4e, if (a) recovery_target_tli
is set, (b) restore_command works, and (c) nothing's being removed
from the archive concurrently, then by the time StartupXLOG() does
expectedTLEs = readTimeLineHistory(recoveryTargetTLI), any timeline
history file that exists in the archive will have been restored, and
the scenario ee994272ca50f70b53074f0febaec97e28f83c4e was concerned
about won't occur. That's because it was concerned about a scenario
where we failed to restore the history file until after we set
expectedTLEs.

Consequently, if we want to try to reproduce the problem fixed by that
commit, we should look for a scenario that does not involve setting
recovery_target_tli.

Is that the conclusion you were driving towards?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Thu, May 20, 2021 at 10:21 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> > > Conclusion:
> > > - I think now we agree on the point that initializing expectedTLEs
> > > with the recovery target timeline is the right fix.
> > > - We still have some differences of opinion about what was the
> > > original problem in the base code which was fixed by the commit
> > > (ee994272ca50f70b53074f0febaec97e28f83c4e).
> >
> > I am also still concerned about whether we understand in exactly what
> > cases the current logic doesn't work. We seem to more or less agree on
> > the fix, but I don't think we really understand precisely what case we
> > are fixing.
>
> Does the discussion above make sense?

I had trouble following it completely, but I didn't really spot
anything that seemed definitely wrong. However, I don't understand
what it has to do with where we are now. What I want to understand is:
under exactly what circumstances does it matter that
WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
will stream from receiveTLI rather than recoveryTargetTLI?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Fri, May 21, 2021 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I had trouble following it completely, but I didn't really spot
> anything that seemed definitely wrong. However, I don't understand
> what it has to do with where we are now. What I want to understand is:
> under exactly what circumstances does it matter that
> WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> will stream from receiveTLI rather than recoveryTargetTLI?

Ah ha! I think I figured it out. To hit this bug, you need to meet the
following conditions:

1. Both streaming and archiving have to be configured.
2. You have to promote a new primary.
3. After promoting the new primary you have to start a new standby
that doesn't have local WAL and for which the backup was taken from
the previous timeline. In Dilip's original scenario, this new standby
is actually the old primary, but that's not required.
4. The new standby has to be able to find the history file it needs in
the archive but not the WAL files.
5. The new standby needs to have recovery_target_timeline='latest'
(which is the default)

When you start the new standby, it will fetch the current TLI from its
control file. Then, since recovery_target_timeline=latest, the system
will try to figure out the latest timeline, which only works because
archiving is configured. There seems to be no provision for detecting
the latest timeline via streaming. With archiving enabled, though,
findNewestTimeLine() will be able to restore the history file created
by the promotion of the new primary, which will cause
validateRecoveryParameters() to change recoveryTargetTLI. Then we'll
try to read the WAL segment containing the checkpoint record and fail
because, by stipulation, only history files are available from the
archive. Now, because streaming is also configured, we'll try
streaming. That will work, so we'll be able to read the checkpoint
record, but now, because WaitForWALToBecomeAvailable() initialized
expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't
switch to the correct timeline and it all goes wrong.
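
To see concretely what "goes wrong": once expectedTLEs is pinned to the
old timeline, every later attempt to locate WAL uses only that
timeline's file names.  The following is a small standalone sketch,
not PostgreSQL source; the helper mirrors the "%08X%08X%08X" naming of
XLogFileName(), the TLI and segment numbers come from the log lines
quoted earlier in the thread, and everything else is an illustrative
assumption.

#include <stdio.h>
#include <stdint.h>

#define WAL_SEGMENT_SIZE    (16 * 1024 * 1024)
#define SEGMENTS_PER_XLOGID (UINT64_C(0x100000000) / WAL_SEGMENT_SIZE)

static void
wal_file_name(char *buf, size_t len, uint32_t tli, uint64_t segno)
{
    snprintf(buf, len, "%08X%08X%08X", tli,
             (uint32_t) (segno / SEGMENTS_PER_XLOGID),
             (uint32_t) (segno % SEGMENTS_PER_XLOGID));
}

int
main(void)
{
    char        stuck[64];
    char        needed[64];

    /* expectedTLEs pinned to TLI 5, so only TLI-5 names are ever tried */
    wal_file_name(stuck, sizeof(stuck), 5, 3);

    /* the segment that actually continues recovery lives on TLI 6 */
    wal_file_name(needed, sizeof(needed), 6, 3);

    printf("standby keeps requesting: %s\n", stuck);
    printf("but would need to try   : %s\n", needed);
    return 0;
}

It prints 000000050000000000000003 versus 000000060000000000000003;
only the first name is ever requested from the archive or pg_wal while
expectedTLEs stays on the old timeline.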

The attached test script, test.sh seems to reliably reproduce this.
Put that file and the recalcitrant_cp script, also attached, into an
empty directory, cd to that directory, and run test.sh. Afterwards
examine pgcascade.log. Basically, these scripts just set up the
scenario described above. We set up primary and a standby that use
recalcitrant_cp as the archive command, and because it's recalcitrant,
it's only willing to copy history files, and always fails for WAL
files.Then we create a cascading standby by taking a base backup from
the standby, but before actually starting it, we promote the original
standby. So now it meets all the conditions described above. I tried a
couple variants of this test. If I switch the archive command from
recalcitrant_cp to just regular cp, then there's no problem. And if I
switch it to something that always fails, then there's also no
problem. That's because, with either of those changes, condition (4)
above is no longer met. In the first case, both files end up in the
archive, and in the second case, neither file.

What about hitting this in real life, with a real archive command?
Well, you'd probably need the archive command to be kind of slow and
get unlucky on the timing, but there's nothing to prevent it from
happening.

But, it will be WAY more likely if you have Dilip's original scenario,
where you try to repurpose an old primary as a standby. It would
normally be unlikely that the backup used to create a new standby
would have an older TLI, because you typically wouldn't switch masters
in between taking a base backup and using it to create a new standby.
But the old master always has an older TLI. So (3) is satisfied. For
(4) to be satisfied, you need the old master to fail to archive all of
its WAL when it shuts down.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Sat, May 22, 2021 at 1:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, May 21, 2021 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I had trouble following it completely, but I didn't really spot
> > anything that seemed definitely wrong. However, I don't understand
> > what it has to do with where we are now. What I want to understand is:
> > under exactly what circumstances does it matter that
> > WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> > will stream from receiveTLI rather than recoveryTargetTLI?
>
> Ah ha! I think I figured it out. To hit this bug, you need to meet the
> following conditions:
>
> 1. Both streaming and archiving have to be configured.
> 2. You have to promote a new primary.
> 3. After promoting the new primary you have to start a new standby
> that doesn't have local WAL and for which the backup was taken from
> the previous timeline. In Dilip's original scenario, this new standby
> is actually the old primary, but that's not required.

No, in my original scenario also the new standby was not the old primary.
I had 3 nodes:
node1 -> primary, node2 -> standby1, node3 -> standby2
node2 was promoted as the new primary and node3's local WAL was removed (so
that it has to stream the checkpoint record from the new primary, and then
everything else happens as you explained in the remaining steps).

> 4. The new standby has to be able to find the history file it needs in
> the archive but not the WAL files.
> 5. The new standby needs to have recovery_target_timeline='latest'
> (which is the default)
>
> When you start the new standby, it will fetch the current TLI from its
> control file. Then, since recovery_target_timeline=latest, the system
> will try to figure out the latest timeline, which only works because
> archiving is configured. There seems to be no provision for detecting
> the latest timeline via streaming. With archiving enabled, though,
> findNewestTimeLine() will be able to restore the history file created
> by the promotion of the new primary, which will cause
> validateRecoveryParameters() to change recoveryTargetTLI.

Right

> Then we'll
> try to read the WAL segment containing the checkpoint record and fail
> because, by stipulation, only history files are available from the
> archive. Now, because streaming is also configured, we'll try
> streaming. That will work, so we'll be able to read the checkpoint
> record, but now, because WaitForWALToBecomeAvailable() initialized
> expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't
> switch to the correct timeline and it all goes wrong.

exactly

> The attached test script, test.sh seems to reliably reproduce this.
> Put that file and the recalcitrant_cp script, also attached, into an
> empty directory, cd to that directory, and run test.sh. Afterwards
> examine pgcascade.log. Basically, these scripts just set up the
> scenario described above. We set up primary and a standby that use
> recalcitrant_cp as the archive command, and because it's recalcitrant,
> it's only willing to copy history files, and always fails for WAL
> files. Then we create a cascading standby by taking a base backup from
> the standby, but before actually starting it, we promote the original
> standby. So now it meets all the conditions described above. I tried a
> couple variants of this test. If I switch the archive command from
> recalcitrant_cp to just regular cp, then there's no problem. And if I
> switch it to something that always fails, then there's also no
> problem. That's because, with either of those changes, condition (4)
> above is no longer met. In the first case, both files end up in the
> archive, and in the second case, neither file.

I haven't tested this, but I will do that.  But now we are on the same
page about the cause of the actual problem I reported.

> What about hitting this in real life, with a real archive command?
> Well, you'd probably need the archive command to be kind of slow and
> get unlucky on the timing, but there's nothing to prevent it from
> happening.

Right

> But, it will be WAY more likely if you have Dilip's original scenario,
> where you try to repurpose an old primary as a standby. It would
> normally be unlikely that the backup used to create a new standby
> would have an older TLI, because you typically wouldn't switch masters
> in between taking a base backup and using it to create a new standby.
> But the old master always has an older TLI. So (3) is satisfied. For
> (4) to be satisfied, you need the old master to fail to archive all of
> its WAL when it shuts down.

For my original case, both standby1 and standby2 are connected to the
primary.  Now, standby1 is promoted and standby2 is shut down.  And,
before restarting, all the local WAL of standby2 is removed so that it
can follow the new primary.  The primary info and restore command for
standby2 are changed as per the new primary (standby1).

Now the scenario is that standby1 has switched the timeline in the
middle of the segment which contains the checkpoint record, so the
segment with the old TL is renamed to .partial and the same segment
with the new TL is not yet archived, but the history file for the new
TL has been archived.

Now, when standby2 restarts, the remaining things happen as you
explained: basically it restores the history file and changes
recoveryTargetTLI, but it doesn't get the WAL file from the archive.
So it tries to stream the checkpoint record from the primary using
"ControlFile->checkPointCopy.ThisTimeLineID", which is the old timeline.

Now, we may ask: if the WAL segment with the old TL on standby1 (the
new primary), which contains the checkpoint, is already renamed to
.partial, then how can it stream using the old TL?  The answer is the
code below [1] in the walsender.  Basically, the checkpoint record is
present in both the new and the old TL, since the TL switched in the
middle of the segment, so the walsender will send the record from the
new TL's file even if the walreceiver asks to stream with the old TL.
Now the walreceiver is under the impression that it has read from the
old TL.  And we know the rest of the story: we will set expectedTLEs
based on the old history file and never be able to go to the new TL.

Anyway, now we understand the issue and there are many ways we can
reproduce it.  Still, I thought of explaining the exact steps of how it
happened for me, because now that we understand it well I think it is
easy to explain :)

[1]
WalSndSegmentOpen()
{
/*-------
* When reading from a historic timeline, and there is a timeline switch
* within this segment, read from the WAL segment belonging to the new
* timeline.
}
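
For reference, here is a standalone toy version of that rule.  It is
not the walsender source; the segment arithmetic and the variable
names are simplified assumptions, and the LSNs are the ones discussed
above.

#include <stdio.h>
#include <stdint.h>

#define WAL_SEGMENT_SIZE  (16 * 1024 * 1024)

int
main(void)
{
    uint64_t    switch_lsn = 0x30006E0;    /* TLI 5 -> 6 switch point */
    uint64_t    requested  = 0x3000668;    /* checkpoint record, asked for on TLI 5 */
    uint32_t    asked_tli  = 5;
    uint32_t    next_tli   = 6;

    uint64_t    req_seg    = requested / WAL_SEGMENT_SIZE;
    uint64_t    switch_seg = switch_lsn / WAL_SEGMENT_SIZE;

    /* switch happens within the requested segment: open the new TLI's file */
    uint32_t    file_tli = (req_seg == switch_seg) ? next_tli : asked_tli;

    printf("walreceiver asked for TLI %u, walsender opens the TLI %u file\n",
           asked_tli, file_tli);
    return 0;
}

Because the switch at 0/30006E0 falls inside the same segment as the
requested record, the TLI 6 file is opened even though TLI 5 was asked
for, which is how the standby ends up believing it streamed the record
on the old timeline.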

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Sat, May 22, 2021 at 10:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, May 22, 2021 at 1:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > The attached test script, test.sh seems to reliably reproduce this.
> > Put that file and the recalcitrant_cp script, also attached, into an
>
> I haven't tested this, but I will do that.  But now we are on the same
> page about the cause of the actual problem I reported.

Now, I have tested. I am able to reproduce the actual problem with your script.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

On Sat, May 22, 2021 at 10:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, May 22, 2021 at 1:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, May 21, 2021 at 12:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I had trouble following it completely, but I didn't really spot
> > > anything that seemed definitely wrong. However, I don't understand
> > > what it has to do with where we are now. What I want to understand is:
> > > under exactly what circumstances does it matter that
> > > WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> > > will stream from receiveTLI rather than recoveryTargetTLI?
> >
> > Ah ha! I think I figured it out. To hit this bug, you need to meet the
> > following conditions:
> >
> > 1. Both streaming and archiving have to be configured.
> > 2. You have to promote a new primary.
> > 3. After promoting the new primary you have to start a new standby
> > that doesn't have local WAL and for which the backup was taken from
> > the previous timeline. In Dilip's original scenario, this new standby
> > is actually the old primary, but that's not required.
>
> No, in my original scenario also the new standby was not old primary,
> I had 3 nodes
> node1-> primary, node2 -> standby1, node3-> standby2
> node2 promoted as a new primary and node3's local WAL was removed (so
> that it has to stream checkpoint record from new primary and then
> remaining everything happens as you explain in remaining steps).
>
> > 4. The new standby has to be able to find the history file it needs in
> > the archive but not the WAL files.
> > 5. The new standby needs to have recovery_target_timeline='latest'
> > (which is the default)
> >
> > When you start the new standby, it will fetch the current TLI from its
> > control file. Then, since recovery_target_timeline=latest, the system
> > will try to figure out the latest timeline, which only works because
> > archiving is configured. There seems to be no provision for detecting
> > the latest timeline via streaming. With archiving enabled, though,
> > findNewestTimeLine() will be able to restore the history file created
> > by the promotion of the new primary, which will cause
> > validateRecoveryParameters() to change recoveryTargetTLI.
>
> Right
>
>  Then we'll
> > try to read the WAL segment containing the checkpoint record and fail
> > because, by stipulation, only history files are available from the
> > archive. Now, because streaming is also configured, we'll try
> > streaming. That will work, so we'll be able to read the checkpoint
> > record, but now, because WaitForWALToBecomeAvailable() initialized
> > expectedTLEs using receiveTLI instead of recoveryTargetTLI, we can't
> > switch to the correct timeline and it all goes wrong.
>
> exactly
>
> > The attached test script, test.sh seems to reliably reproduce this.
> > Put that file and the recalcitrant_cp script, also attached, into an
> > empty directory, cd to that directory, and run test.sh. Afterwards
> > examine pgcascade.log. Basically, these scripts just set up the
> > scenario described above. We set up primary and a standby that use
> > recalcitrant_cp as the archive command, and because it's recalcitrant,
> > it's only willing to copy history files, and always fails for WAL
> > files.Then we create a cascading standby by taking a base backup from
> > the standby, but before actually starting it, we promote the original
> > standby. So now it meets all the conditions described above. I tried a
> > couple variants of this test. If I switch the archive command from
> > recalcitrant_cp to just regular cp, then there's no problem. And if I
> > switch it to something that always fails, then there's also no
> > problem. That's because, with either of those changes, condition (4)
> > above is no longer met. In the first case, both files end up in the
> > archive, and in the second case, neither file.
>
> I haven't tested this, but I will do that.  But now we are on the same
> page about the cause of the actual problem I reported.
>
> > What about hitting this in real life, with a real archive command?
> > Well, you'd probably need the archive command to be kind of slow and
> > get unlucky on the timing, but there's nothing to prevent it from
> > happening.
>
> Right
>
> > But, it will be WAY more likely if you have Dilip's original scenario,
> > where you try to repurpose an old primary as a standby. It would
> > normally be unlikely that the backup used to create a new standby
> > would have an older TLI, because you typically wouldn't switch masters
> > in between taking a base backup and using it to create a new standby.
> > But the old master always has an older TLI. So (3) is satisfied. For
> > (4) to be satisfied, you need the old master to fail to archive all of
> > its WAL when it shuts down.
>
> For my original case, both standby1 and standby2 are connected to the
> primary.  Now, standby1 is promoted and standby2 is shut down. And,
> before restarting, all the local WAL of the standby2 is removed so
> that it can follow the new primary. The primary info and restore
> command for standby2 are changed as per the new primary (standby1).
>
> Now the scenario is that the standby1 has switched the timeline in the
> middle of the segment which contains the checkpoint record, so the
> segment with old TL is renamed to (.partial) and the same segment with
> new TL is not yet archived but the history file for the new TL has
> been archived.
>
> Now, when standby2 restarts, the remaining things happen as you
> explained: basically it restores the history file and changes the
> recoveryTargetTLI, but it doesn't get the WAL file from the archive.
> So it tries to stream the checkpoint record from the primary using
> "ControlFile->checkPointCopy.ThisTimeLineID", which is the old timeline.
>
> Now, one may ask: if the WAL segment with the old TL on standby1 (the new
> primary), which contains the checkpoint, is already renamed to
> ".partial", then how can it stream using the old TL?  The answer is the
> code[1] below in the walsender.  Basically, the checkpoint record is
> present in both the new and the old TL, as the TL switched in the middle
> of the segment, so it will send the record from the new TL even if the
> walreceiver asks to stream with the old TL.  Now the walreceiver is under
> the impression that it has read from the old TL.  And we know the rest of
> the story: we will set expectedTLEs based on the old history
> file and never be able to go to the new TL.
>
> Anyway, now we understand the issue and there are many ways we can
> reproduce it.  Still, I thought of explaining the exact steps of how it
> happened for me, because now that we understand it well I think it is easy
> to explain :)
>
> [1]
> WalSndSegmentOpen()
> {
> /*-------
> * When reading from a historic timeline, and there is a timeline switch
> * within this segment, read from the WAL segment belonging to the new
> * timeline.
> }
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Sat, May 22, 2021 at 12:45 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> No, in my original scenario also the new standby was not the old primary.
> I had 3 nodes:
> node1 -> primary, node2 -> standby1, node3 -> standby2
> node2 was promoted as the new primary and node3's local WAL was removed (so
> that it has to stream the checkpoint record from the new primary, and then
> everything else happens as you explained in the remaining steps).

Oh, OK. I misunderstood. I think it could happen that way, though.

> I haven't tested this, but I will do that.  But now we are on the same
> page about the cause of the actual problem I reported.

Yeah, sorry, I just didn't understand the exact chain of events before.

> For my original case, both standby1 and standby2 are connected to the
> primary.  Now, standby1 is promoted and standby2 is shut down. And,
> before restarting, all the local WAL of the standby2 is removed so
> that it can follow the new primary. The primary info and restore
> command for standby2 are changed as per the new primary (standby1).

One thing I don't understand is why the final WAL segment from the
original primary didn't end up in the archive in this scenario. If it
had, then we would not have seen the issue in that case.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Sat, May 22, 2021 at 8:33 PM Robert Haas <robertmhaas@gmail.com> wrote:

> > For my original case, both standby1 and standby2 are connected to the
> > primary.  Now, standby1 is promoted and standby2 is shut down. And,
> > before restarting, all the local WAL of the standby2 is removed so
> > that it can follow the new primary. The primary info and restore
> > command for standby2 are changed as per the new primary (standby1).
>
> One thing I don't understand is why the final WAL segment from the
> original primary didn't end up in the archive in this scenario. If it
> had, then we would not have seen the issue in that case.

I used different archive folders for the primary and the new
primary (standby1).  I have modified your test.sh slightly (modified
test2.sh attached) so that I can demonstrate my scenario where I was
seeing the issue, and it gets fixed after applying the fix we
discussed[1]:

[1]
- expectedTLEs = readTimeLineHistory(receiveTLI);
+ expectedTLEs = readTimeLineHistory(recoveryTargetTLI);


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Sun, May 23, 2021 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, May 22, 2021 at 8:33 PM Robert Haas <robertmhaas@gmail.com> wrote:

I have created a tap test based on Robert's test.sh script.  It
reproduces the issue.  I am new with perl so this still needs some
cleanup/improvement, but at least it shows the idea.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Fri, 21 May 2021 12:52:54 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> I had trouble following it completely, but I didn't really spot
> anything that seemed definitely wrong. However, I don't understand
> what it has to do with where we are now. What I want to understand is:
> under exactly what circumstances does it matter that
> WaitForWALToBecomeAvailable(), when currentSource == XLOG_FROM_STREAM,
> will stream from receiveTLI rather than recoveryTargetTLI?

Extracting the related descriptions from my previous mail:

- recoveryTargetTimeLine is initialized with
  ControlFile->checkPointCopy.ThisTimeLineID

- readRecoveryCommandFile():
  ...or in the case of
  latest, move it forward up to the maximum timeline among the history
  files found in either pg_wal or archive.

- ReadRecord...XLogFileReadAnyTLI

  Tries to load the history file for recoveryTargetTLI either from
  pg_wal or archive onto a local TLE list; if the history file is not
  found, use a generated list with one entry for the
  recoveryTargetTLI.

  (b) If such a segment is *not* found, expectedTLEs is left
    NIL. Usually recoveryTargetTLI is equal to the last checkpoint
    TLI.

  (c) However, in the case where a timeline switch happened within the
    segment and recoveryTargetTLI has been increased, that is, the
    history file for recoveryTargetTLI is found in pg_wal or the
    archive (the issue raised here), recoveryTargetTLI becomes
    a future timeline of the checkpoint TLI.

- WaitForWALToBecomeAvailable

In the case of (c), recoveryTargetTLI > checkpoint TLI.  In this case
  we expect that the checkpoint TLI is in the history of
  recoveryTargetTLI; otherwise recovery fails.  This case is similar
  to case (a), but the relationship between recoveryTargetTLI and
  the checkpoint TLI is not confirmed yet.  ReadRecord barks later if
  they are not compatible, so there's not a serious problem, but it
  might be better to check the relationship there.  My first proposal
  performed a mutual check between the two, but we need to check only
  unidirectionally.

===
So the condition for Dilip's case is, as you wrote in another mail:

- ControlFile->checkPointCopy.ThisTimeLineID is in the older timeline.
- Archive or pg_wal offers the history file for the newer timeline.
- The segment for the checkpoint is found neither in pg_wal nor in the archive.

That is,

- A grandchild (c) node is stopped.
- Then the child node (b) is promoted.

- Clear the pg_wal directory of (c), then connect it to (b) *before* (b)
  archives the segment for the newer timeline of the
  timeline-switching segment.  (If we have switched at segment 3,
  TLI=1, the segment file of the older timeline is renamed to
  .partial, then the same segment is created for TLI=2.  The former is
  archived while the promotion is performed, but the latter won't be
  archived until the segment ends.)


The original case, after the commit ee994272ca:

- recoveryTargetTimeLine is initialized with
  ControlFile->checkPointCopy.ThisTimeLineID

(X) (Before the commit, we created a one-entry expectedTLEs consisting
   only of ControlFile->checkPointCopy.ThisTimeLineID.)

- readRecoveryCommandFile():

  Move recoveryTargetTLI forward to the specified target timeline if
  the history file for the timeline is found, or in the case of
  latest, move it forward up to the maximum timeline among the history
  files found in either pg_wal or archive.

- ReadRecord...XLogFileReadAnyTLI

  Tries to load the history file for recoveryTargetTLI either from
  pg_wal or archive onto a local TLE list; if the history file is not
  found, use a generated list with one entry for the
  recoveryTargetTLI.

  (b) If such a segment is *not* found, expectedTLEs is left
    NIL. Usually recoveryTargetTLI is equal to the last checkpoint
    TLI.

- WaitForWALToBecomeAvailable

  If we have had no segments for the last checkpoint, initiate
  streaming from the REDO point of the last checkpoint. We should have
  all the history files by the time we receive segment data.

  After sufficient WAL data has been received, the only cases where
  expectedTLEs is still NIL are (b) and (c) above.

  In the case of (b) recoveryTargetTLI == checkpoint TLI.

So I thought that the commit fixed this scenario. Even in this case,
ReadRecord fails because the checkpoint segment contains pages for the
older timeline which is not in expectedTLEs if we did (X).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Sun, 23 May 2021 21:37:58 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Sun, May 23, 2021 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, May 22, 2021 at 8:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
> 
> I have created a tap test based on Robert's test.sh script.  It
> reproduces the issue.  I am new with perl so this still needs some
> cleanup/improvement, but at least it shows the idea.

I'm not sure I'm following the discussion here; however, if we were
trying to reproduce Dilip's case using a base backup, we would need such
a broken archive command if using pg_basebackup with -Xnone, because
the current version of pg_basebackup waits for all required WAL
segments to be archived when connecting to a standby with -Xnone.  I
haven't bothered reconfirming the version in which that fix took place, but just
by using -X stream instead of "none" we successfully miss the first
segment of the new timeline in the upstream archive, though we need to
erase pg_wal in the backup.  Either the broken archive command or
erasing pg_wal of the cascade is required for the behavior to occur.

The attached is how it looks.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
# Copyright (c) 2021, PostgreSQL Global Development Group

# Minimal test testing streaming replication
use Cwd;
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;

# Initialize primary node
my $node_primary = get_new_node('primary');
# A specific role is created to perform some tests related to replication,
# and it needs proper authentication configuration.
$node_primary->init(allows_streaming => 1);
$node_primary->append_conf(
    'postgresql.conf', qq(
wal_keep_size=128MB
));
$node_primary->start;

my $backup_name = 'my_backup';

# Take backup
$node_primary->backup($backup_name);

my $node_standby_1 = get_new_node('standby_1');
$node_standby_1->init_from_backup($node_primary, $backup_name,
                                  allows_streaming => 1, has_streaming => 1);
my $archivedir_standby_1 = $node_standby_1->archive_dir;
$node_standby_1->append_conf(
    'postgresql.conf', qq(
archive_mode=always
archive_command='cp "%p" "$archivedir_standby_1/%f"'
));
$node_standby_1->start;


# Take backup of standby 1
# NB: Use -Xnone so that pg_wal is empty.
#$node_standby_1->backup($backup_name, backup_options => ['-Xnone']);
$node_standby_1->backup($backup_name);

# Promote the standby.
$node_standby_1->psql('postgres', 'SELECT pg_promote()');

# clean up pg_wal from the backup
my $pgwaldir = $node_standby_1->backup_dir. "/" . $backup_name . "/pg_wal";
opendir my $dh, $pgwaldir or die "failed to open $pgwaldir";
while (my $f = readdir($dh))
{
    unlink("$pgwaldir/$f") if (-f "$pgwaldir/$f");
}
closedir($dh);

# Create cascading standby but don't start it yet.
# NB: Must set up both streaming and archiving.
my $node_cascade = get_new_node('cascade');
$node_cascade->init_from_backup($node_standby_1, $backup_name,
    has_streaming => 1);
$node_cascade->append_conf(
    'postgresql.conf', qq(
restore_command = 'cp "$archivedir_standby_1/%f" "%p"'
log_line_prefix = '%m [%p:%b] %q%a '
archive_mode=off
));


# Start cascade node
$node_cascade->start;

# Create some content on primary and check its presence in standby 1
$node_standby_1->safe_psql('postgres',
    "CREATE TABLE tab_int AS SELECT 1 AS a");

# Wait for standbys to catch up
$node_standby_1->wait_for_catchup($node_cascade, 'replay',
    $node_standby_1->lsn('replay'));

ok(1, 'test'); # it's a success if we get here.

Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Mon, May 24, 2021 at 10:17 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Sun, 23 May 2021 21:37:58 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> > On Sun, May 23, 2021 at 2:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Sat, May 22, 2021 at 8:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > I have created a tap test based on Robert's test.sh script.  It
> > reproduces the issue.  I am new with perl so this still needs some
> > cleanup/improvement, but at least it shows the idea.
>
> I'm not sure I'm following the discussion here; however, if we were
> trying to reproduce Dilip's case using a base backup, we would need such
> a broken archive command if using pg_basebackup with -Xnone, because
> the current version of pg_basebackup waits for all required WAL
> segments to be archived when connecting to a standby with -Xnone.

Right, that's the reason why, if you see my patch, I have dynamically
generated such an archive command which skips everything other than the
history file.  See the below snippet from my patch, where I am generating
a skip_cp script and then using that as the archive command.

==
+# Prepare a alternative archive command to skip WAL files
+my $script = "#!/usr/bin/perl \n
+use File::Copy; \n
+my (\$source, \$target) = \@ARGV; \n
+if (\$source =~ /history/) \n
+{ \n
+ copy(\$source, \$target); \n
+}";
+
+open my $fh, '>', "skip_cp";
+print {$fh} $script;
===

> I haven't bothered reconfirming the version in which that fix took place, but just
> by using -X stream instead of "none" we successfully miss the first
> segment of the new timeline in the upstream archive, though we need to
> erase pg_wal in the backup.  Either the broken archive command or
> erasing pg_wal of the cascade is required for the behavior to occur.
>
> The attached is how it looks.

I will test this and let you know.  Thanks!

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Sun, May 23, 2021 at 12:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have created a tap test based on Robert's test.sh script.  It
> reproduces the issue.  I am new with perl so this still needs some
> cleanup/improvement, but at least it shows the idea.

Thanks. I think this is the right idea but just needs a few adjustments.

I don't think that dynamically writing out a file into the current
working directory of the script is the right approach. Instead I think
we should be planning to check this file into the repository and then
have the test script find it. Now the trick is how to do that in a
portable way. I think we can probably use the same idea that the
pg_rewind tests use to find a perl module located in the test
directory.  That is:

use FindBin;

and then use $FindBin::RealBin to construct a path name to the executable, e.g.

$node_primary->append_conf(
       'postgresql.conf', qq(
archive_command = '"$FindBin::RealBin/skip_cp" "%p" "$archivedir_primary/%f"'
));

This avoids issues such as: leaving behind files if the script is
terminated, needing the current working directory to be writable,
possible permissions issues with the new file under Windows or
SE-Linux.

The restore_command needs to be "cp" on Linux but "copy" on Windows.
Maybe you can use PostgresNode.pm's enable_restoring? Or if that
doesn't work, then you need to mimic the logic, as
src/test/recovery/t/020_archive_status.pl does for archive_command.
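
For illustration, the kind of OS-aware command construction I mean is
roughly this (a sketch only; the node and variable names here are made
up and would need to match the actual test):

# Build a restore_command that works on both Unix ("cp") and Windows
# ("copy"), mimicking what PostgresNode::enable_restoring does internally.
my $archivedir_standby = $node_standby->archive_dir;
my $copy_command =
  $TestLib::windows_os
  ? qq{copy "$archivedir_standby\\\\%f" "%p"}
  : qq{cp "$archivedir_standby/%f" "%p"};
$node_cascade->append_conf('postgresql.conf',
    qq(restore_command = '$copy_command'));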

Why do you set log_line_prefix? Is that needed?

Why are the nodes called standby_1 and cascade? Either use standby and
cascade or standby_1 and standby_2.

There is a comment that says "Create some content on primary and check
its presence in standby 1" but it only creates the content, and does
not check anything. I think we don't really need to do any of this,
but at least the code and the comment have to match.

Let's not call the command skip_cp. It's not very descriptive. If you
don't like recalcitrant_cp, then maybe something like cp_history_files
or so.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Tue, May 25, 2021 at 9:16 PM Robert Haas <robertmhaas@gmail.com> wrote:

> use FindBin;
>
> and then use $FindBin::RealBin to construct a path name to the executable, e.g.
>
> $node_primary->append_conf(
>        'postgresql.conf', qq(
> archive_command = '"$FindBin::RealBin/skip_cp" "%p" "$archivedir_primary/%f"'
> ));
>
> This avoids issues such as: leaving behind files if the script is
> terminated, needing the current working directory to be writable,
> possible permissions issues with the new file under Windows or
> SE-Linux.

Done

> The restore_command needs to be "cp" on Linux but "copy" on Windows.
> Maybe you can use PostgresNode.pm's enable_restoring? Or if that
> doesn't work, then you need to mimic the logic, as
> src/test/recovery/t/020_archive_status.pl does for archive_command.

Done

> Why do you set log_line_prefix? Is that needed?

No, it was not, removed

> Why are the nodes called standby_1 and cascade? Either use standby and
> cascade or standby_1 and standby_2.

Fixed

> There is a comment that says "Create some content on primary and check
> its presence in standby 1" but it only creates the content, and does
> not check anything. I think we don't really need to do any of this,
> but at least the code and the comment have to match.

I think we need to create some content on the promoted standby and check
whether the cascade standby is able to get it or not; that will
guarantee that it is actually following the promoted standby.  I have
added a test for that so that it matches the comments.

> Let's not call the command skip_cp. It's not very descriptive. If you
> don't like recalcitrant_cp, then maybe something like cp_history_files
> or so.

Done
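
For reference, a standalone history-only copy script along those lines
looks roughly like this (just a sketch; the attached version may differ
in details):

#!/usr/bin/perl
# Copy only timeline history files into the archive; report success for
# everything else so that WAL segments never actually reach the archive.
use strict;
use warnings;
use File::Copy;

die "wrong number of arguments" unless @ARGV == 2;
my ($source, $target) = @ARGV;
exit 0 unless $source =~ /history/;
copy($source, $target) or die "couldn't copy $source to $target: $!";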

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Robert Haas
Date:
On Wed, May 26, 2021 at 2:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think we need to create some content on the promoted standby and check
> whether the cascade standby is able to get it or not; that will
> guarantee that it is actually following the promoted standby.  I have
> added a test for that so that it matches the comments.

OK, but I ran this test against an unpatched server and it passed.
That's not so great, because the test should fail without the bug fix.
It seems to be because there's only one actual test in this test case.
Looking at the log file,
src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the
only "ok" nor "not ok" line is:

ok 1 - check streamed content on cascade standby

So either that test is not right or some other test is needed. I think
there's something else going wrong here, because when I run my
original test.sh script, I see this:

2021-05-26 11:37:47.794 EDT [57961] LOG:  restored log file
"00000002.history" from archive
...
2021-05-26 11:37:47.916 EDT [57961] LOG:  redo starts at 0/2000028
...
2021-05-26 11:37:47.927 EDT [57966] LOG:  replication terminated by
primary server
2021-05-26 11:37:47.927 EDT [57966] DETAIL:  End of WAL reached on
timeline 1 at 0/3000000

But in the src/test/recovery/tmp_check/log/025_timeline_issue_cascade.log
file generated by this test case:

cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00000002.history:
No such file or directory
...
2021-05-26 11:41:08.149 EDT [63347] LOG:  fetching timeline history
file for timeline 2 from primary server
...
2021-05-26 11:41:08.288 EDT [63344] LOG:  new target timeline is 2
...
2021-05-26 11:41:08.303 EDT [63344] LOG:  redo starts at 0/2000028
...
2021-05-26 11:41:08.331 EDT [63347] LOG:  restarted WAL streaming at
0/3000000 on timeline 2

So it doesn't seem like the test is actually reproducing the problem
correctly. The timeline history file isn't available from the archive,
so it streams it, and then the problem doesn't occur. I guess that's
because there's nothing to guarantee that the history file reaches the
archive before 'cascade' is started. The code just does:

# Promote the standby.
$node_standby->psql('postgres', 'SELECT pg_promote()');

# Start cascade node
$node_cascade->start;

...which has a clear race condition.
src/test/recovery/t/023_pitr_prepared_xact.pl has logic to wait for a
WAL file to be archived, so maybe we can steal that logic and use it
here.
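
The logic there is roughly the following (a sketch; the node name and
the exact place in the test are illustrative):

# Ask the promoted node which WAL file the current insert position falls
# in, then poll pg_stat_archiver until that file is reported as archived.
my $walfile_to_be_archived = $node_standby->safe_psql('postgres',
    "SELECT pg_walfile_name(pg_current_wal_lsn());");
my $archive_wait_query =
  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
$node_standby->poll_query_until('postgres', $archive_wait_query)
  or die "Timed out while waiting for WAL segment to be archived";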

I suggest we rename the test to something a bit more descriptive. Like
instead of 025_timeline_issue.pl, perhaps
025_stuck_on_old_timeline.pl? Or I'm open to other suggestions, but
"timeline issue" is a bit too vague for my taste.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Wed, May 26, 2021 at 9:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 26, 2021 at 2:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I think we need to create some content on the promoted standby and check
> > whether the cascade standby is able to get it or not; that will
> > guarantee that it is actually following the promoted standby.  I have
> > added a test for that so that it matches the comments.
>
> OK, but I ran this test against an unpatched server and it passed.
> That's not so great, because the test should fail without the bug fix.
> It seems to be because there's only one actual test in this test case.
> Looking at the log file,
> src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the
> only "ok" nor "not ok" line is:
>
> ok 1 - check streamed content on cascade standby
>
> So either that test is not right or some other test is needed. I think
> there's something else going wrong here, because when I run my
> original test.sh script, I see this:

That's strange; when I ran the test I can see the below in the log of the
cascade node (which shows that cascade gets the history file but not the
WAL file, and then it selects the old timeline and never goes to the new
timeline)

...
2021-05-26 21:46:54.412 IST [84080] LOG:  restored log file
"00000002.history" from archive
cp: cannot stat

‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00000003.history’:
No such file or directory
2021-05-26 21:46:54.415 IST [84080] LOG:  entering standby mode
2021-05-26 21:46:54.419 IST [84080] LOG:  restored log file
"00000002.history" from archive
.....
2021-05-26 21:46:54.429 IST [84085] LOG:  started streaming WAL from
primary at 0/2000000 on timeline 1   -> stream using previous TL
2021-05-26 21:46:54.466 IST [84080] LOG:  redo starts at 0/2000028
2021-05-26 21:46:54.466 IST [84080] LOG:  consistent recovery state
reached at 0/3000000
2021-05-26 21:46:54.467 IST [84079] LOG:  database system is ready to
accept read only connections
2021-05-26 21:46:54.483 IST [84085] LOG:  replication terminated by
primary server
2021-05-26 21:46:54.483 IST [84085] DETAIL:  End of WAL reached on
timeline 1 at 0/3000000.
cp: cannot stat

‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00000003.history’:
No such file or directory
cp: cannot stat

‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/000000010000000000000003’:
No such file or directory
2021-05-26 21:46:54.498 IST [84085] LOG:  primary server contains no
more WAL on requested timeline 1

<failure continues as it never goes to timeline 2>

And finally the test case fails because the cascade can never get the changes.

I will check if there is any timing dependency in the test case.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Wed, May 26, 2021 at 12:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I will check if there is any timing dependency in the test case.

There is. I explained it in the second part of my email, which you may
have failed to notice.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Wed, 26 May 2021 at 10:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 26, 2021 at 12:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I will check if there is any timing dependency in the test case.

There is. I explained it in the second part of my email, which you may
have failed to notice.

Sorry, my bad.  I got your point now.  I will change the test.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Wed, 26 May 2021 22:08:32 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Wed, 26 May 2021 at 10:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 
> > On Wed, May 26, 2021 at 12:26 PM Dilip Kumar <dilipbalaut@gmail.com>
> > wrote:
> > > I will check if there is any timing dependency in the test case.
> >
> > There is. I explained it in the second part of my email, which you may
> > have failed to notice.
> 
> 
> Sorry, my bad.  I got your point now.  I will change the test.

I hadn't noticed that, but that is actually possible to happen.


By the way I'm having a hard time understanding what was happening on
this thread.

Very early in this thread I posted a test script that exactly
reproduces Dilip's case by starting from two standbys, based on his
explanation. But *we* didn't understand what the original commit
ee994272ca intended, and I understood that we wanted to know it.

So in the mails [1] and [2] I tried to describe what's going on around
the two issues.  Although I haven't had a response to [2], can I take
it that we clarified the intention of ee994272ca?  And may I take it
that we decided not to add a test for the commit?

Then it seems to me that Robert rediscovered how to reproduce Dilip's case
using a base backup instead of using two standbys.  It uses a broken
archive_command with pg_basebackup -Xnone, and I showed that the same
resulting state is reachable with pg_basebackup -Xstream/fetch plus clearing
the pg_wal directory of the resulting backup, including an explanation of
why.

*I* think that it is better to avoid having the archive_command, since
it seems to me that just unlinking some files is simpler than having
the broken archive_command.  However, since Robert ignored it, I guess
that Robert thinks that the broken archive_command is better than
that.

Is my understanding above about the current status of this thread
right?


FWIW, regarding the name of the test script, putting aside what it
actually does, I proposed placing it as a part of
004_timeline_switch.pl because this issue is related to timeline
switching.


[1] 20210521.112105.27166595366072396.horikyota.ntt@gmail.com
[2] https://www.postgresql.org/message-id/20210524.113402.1922481024406047229.horikyota.ntt@gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Thu, May 27, 2021 at 6:19 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Wed, 26 May 2021 22:08:32 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> > On Wed, 26 May 2021 at 10:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > > On Wed, May 26, 2021 at 12:26 PM Dilip Kumar <dilipbalaut@gmail.com>
> > > wrote:
> > > > I will check if there is any timing dependency in the test case.
> > >
> > > There is. I explained it in the second part of my email, which you may
> > > have failed to notice.
> >
> >
> > Sorry, my bad.  I got your point now.  I will change the test.
>
> I hadn't noticed that, but that is actually possible to happen.
>
>
> By the way I'm having a hard time understanding what was happening on
> this thread.
>
> Very early in this thread I posted a test script that exactly
> reproduces Dilip's case by starting from two standbys, based on his
> explanation. But *we* didn't understand what the original commit
> ee994272ca intended, and I understood that we wanted to know it.
>
> So in the mails [1] and [2] I tried to describe what's going on around
> the two issues.  Although I haven't had a response to [2], can I take
> it that we clarified the intention of ee994272ca?  And may I take it
> that we decided not to add a test for the commit?
>
> Then it seems to me that Robert rediscovered how to reproduce Dilip's case
> using a base backup instead of using two standbys.  It uses a broken
> archive_command with pg_basebackup -Xnone, and I showed that the same
> resulting state is reachable with pg_basebackup -Xstream/fetch plus clearing
> the pg_wal directory of the resulting backup, including an explanation of
> why.

Maybe we can somehow achieve that without a broken archive command,
but I am not sure how it is enough to just delete WAL from pg_wal?  I
mean my original case was that
1. Got the new history file from the archive but did not get the WAL
file yet which contains the checkpoint after TL switch
2. So standby2 tries to stream from the new primary using the old TL and
sets the wrong TL in expectedTLEs

But if you are not doing anything to stop the archiving of WAL files, or to
guarantee that the WAL has reached the archive, and you deleted those, then
I am not sure how we are reproducing the original problem.

BTW, I have also tested your script and I found below log, which shows
that standby2 is successfully able to select the timeline2 so it is
not reproducing the issue.  Am I missing something?

--standby-1--
2021-05-27 10:45:35.866 IST [5096] LOG:  last completed transaction
was at log time 2021-05-27 10:45:35.699316+05:30
2021-05-27 10:45:35.867 IST [5096] LOG:  selected new timeline ID: 2
2021-05-27 10:45:35.882 IST [5096] LOG:  archive recovery complete
2021-05-27 10:45:35.911 IST [5095] LOG:  database system is ready to
accept connections
2021-05-27 10:45:36.096 IST [5134] standby_2 LOG:  received
replication command: IDENTIFY_SYSTEM
2021-05-27 10:45:36.096 IST [5134] standby_2 STATEMENT:  IDENTIFY_SYSTEM

--standby-2--
2021-05-27 10:45:36.089 IST [5129] LOG:  entering standby mode
2021-05-27 10:45:36.090 IST [5129] LOG:  redo starts at 0/2000028
2021-05-27 10:45:36.092 IST [5129] LOG:  consistent recovery state
reached at 0/3030320
2021-05-27 10:45:36.092 IST [5129] LOG:  invalid record length at
0/3030320: wanted 24, got 0
2021-05-27 10:45:36.092 IST [5128] LOG:  database system is ready to
accept read only connections
2021-05-27 10:45:36.096 IST [5133] LOG:  fetching timeline history
file for timeline 2 from primary server
2021-05-27 10:45:36.097 IST [5133] LOG:  started streaming WAL from
primary at 0/3000000 on timeline 1
2021-05-27 10:45:36.098 IST [5133] LOG:  replication terminated by
primary server
2021-05-27 10:45:36.098 IST [5133] DETAIL:  End of WAL reached on
timeline 1 at 0/3030320.
2021-05-27 10:45:36.098 IST [5129] LOG:  new target timeline is 2
2021-05-27 10:45:36.098 IST [5133] LOG:  restarted WAL streaming at
0/3000000 on timeline 2


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Wed, May 26, 2021 at 9:40 PM Robert Haas <robertmhaas@gmail.com> wrote:

> ...which has a clear race condition.
> src/test/recovery/t/023_pitr_prepared_xact.pl has logic to wait for a
> WAL file to be archived, so maybe we can steal that logic and use it
> here.

Yeah, done that.  I think we can use the exact same logic for history files
as well, because if the WAL file is archived then the history file must be
too: a) the history file gets created during promotion, so before the
WAL file with the new TL is ready for archive, and b) the archiver archives
history files before archiving any WAL files.
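
Roughly, the sequence before starting the cascade node would look like
this (a sketch; variable names are illustrative and may differ from the
attached patch):

# On the promoted standby: remember the WAL file holding the current
# insert position, force a segment switch so it becomes eligible for
# archival, and wait until the archiver reports it as done.  Since the
# archiver handles the new timeline's history file before any WAL
# segment, this also guarantees the history file is in the archive.
my $walfile = $node_standby->safe_psql('postgres',
    "SELECT pg_walfile_name(pg_current_wal_lsn());");
$node_standby->safe_psql('postgres', 'SELECT pg_switch_wal();');
$node_standby->poll_query_until('postgres',
    "SELECT '$walfile' <= last_archived_wal FROM pg_stat_archiver;")
  or die "Timed out while waiting for WAL segment to be archived";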

src/test/recovery/t/025_stuck_on_old_timeline.pl

> I suggest we rename the test to something a bit more descriptive. Like
> instead of 025_timeline_issue.pl, perhaps
> 025_stuck_on_old_timeline.pl? Or I'm open to other suggestions, but
> "timeline issue" is a bit too vague for my taste.

Changed as suggested.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Thu, 27 May 2021 11:44:47 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> Maybe we can somehow achieve that without a broken archive command,
> but I am not sure how it is enough to just delete WAL from pg_wal?  I
> mean my original case was that
> 1. Got the new history file from the archive but did not get the WAL
> file yet which contains the checkpoint after TL switch
> 2. So standby2 tries to stream from the new primary using the old TL and
> sets the wrong TL in expectedTLEs
> 
> But if you are not doing anything to stop the archiving of WAL files, or to
> guarantee that the WAL has reached the archive, and you deleted those, then
> I am not sure how we are reproducing the original problem.

Thanks for the reply!

We're writing at the very beginning of the switching segment at
promotion time. So it is guaranteed that the first segment of the
newer timeline won't be archived until the remaining almost 16MB of the
segment is consumed or someone explicitly causes a segment switch
(including archive_timeout).

> BTW, I have also tested your script and I found below log, which shows
> that standby2 is successfully able to select the timeline2 so it is
> not reproducing the issue.  Am I missing something?

standby_2? My last one, 026_timeline_issue_2.pl, doesn't use that name
and uses "standby_1" and "cascade".  In the earlier ones, standby_4 and
5 (or 3 and 4 in the later versions) are used in the additional tests.

So I think it should be something different?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Thu, May 27, 2021 at 12:09 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 27 May 2021 11:44:47 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> > Maybe we can somehow achieve that without a broken archive command,
> > but I am not sure how it is enough to just delete WAL from pg_wal?  I
> > mean my original case was that
> > 1. Got the new history file from the archive but did not get the WAL
> > file yet which contains the checkpoint after TL switch
> > 2. So standby2 tries to stream from the new primary using the old TL and
> > sets the wrong TL in expectedTLEs
> >
> > But if you are not doing anything to stop the archiving of WAL files, or to
> > guarantee that the WAL has reached the archive, and you deleted those, then
> > I am not sure how we are reproducing the original problem.
>
> Thanks for the reply!
>
> We're writing at the very beginning of the switching segment at
> promotion time. So it is guaranteed that the first segment of the
> newer timeline won't be archived until the remaining almost 16MB of the
> segment is consumed or someone explicitly causes a segment switch
> (including archive_timeout).

I agree

> > BTW, I have also tested your script and I found below log, which shows
> > that standby2 is successfully able to select the timeline2 so it is
> > not reproducing the issue.  Am I missing something?
>
> standby_2? My last one, 026_timeline_issue_2.pl, doesn't use that name
> and uses "standby_1" and "cascade".  In the earlier ones, standby_4 and
> 5 (or 3 and 4 in the later versions) are used in the additional tests.
>
> So I think it should be something different?

Yeah, I tested with your patch where you had a different test case,
"026_timeline_issue_2.pl", and I am able to reproduce the issue.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Thu, 27 May 2021 12:47:30 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Thu, May 27, 2021 at 12:09 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Thu, 27 May 2021 11:44:47 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> > We're writing at the very beginning of the switching segment at
> > promotion time. So it is guaranteed that the first segment of the
> > newer timeline won't be archived until the remaining almost 16MB of the
> > segment is consumed or someone explicitly causes a segment switch
> > (including archive_timeout).
> 
> I agree
>
> > > BTW, I have also tested your script and I found below log, which shows
> > > that standby2 is successfully able to select the timeline2 so it is
> > > not reproducing the issue.  Am I missing something?
> >
> > standby_2? My last one, 026_timeline_issue_2.pl, doesn't use that name
> > and uses "standby_1" and "cascade".  In the earlier ones, standby_4 and
> > 5 (or 3 and 4 in the later versions) are used in the additional tests.
> >
> > So I think it should be something different?
> 
> Yeah, I tested with your patch where you had a different test case,
> "026_timeline_issue_2.pl", and I am able to reproduce the issue.

That said, I don't object if we decide to choose the crafted archive
command, as long as we consider the trade-offs between the two ways.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Robert Haas
Date:
On Wed, May 26, 2021 at 8:49 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> So in the mails [1] and [2] I tried to describe what's going on around
> the two issues.  Although I haven't had a response to [2], can I take
> it that we clarified the intention of ee994272ca?  And may I take it
> that we decided not to add a test for the commit?

Regarding the first question, I feel that the intention of ee994272ca
is fairly clear at this point. Someone else might feel differently so
I won't presume to speak for anyone but me.

Regarding the second question, I am not opposed to adding a test for
that commit, but I think it is a lot more important to fix the bug we
have now than to add a test for a bug that was fixed a long time ago.

> Then it seems to me that Robert rediscovered how to reproduce Dilip's case
> using a base backup instead of using two standbys.  It uses a broken
> archive_command with pg_basebackup -Xnone, and I showed that the same
> resulting state is reachable with pg_basebackup -Xstream/fetch plus clearing
> the pg_wal directory of the resulting backup, including an explanation of
> why.

Yes, it makes sense that we could get to the same state either by not
fetching the WAL in the first place, or alternatively by fetching it
and then removing it.

> *I* think that it is better to avoid having the archive_command, since
> it seems to me that just unlinking some files is simpler than having
> the broken archive_command.  However, since Robert ignored it, I guess
> that Robert thinks that the broken archive_command is better than
> that.

Well ... I don't see those things as quite related. As far as I can
see, unlinking files from pg_wal is an alternative to using -Xnone. On
the other hand, the broken archive_command is there to make sure the
new primary doesn't archive its WAL segment too soon.

Regarding the first point, I think using -Xnone is better than using
-Xfetch/stream and then removing the WAL, because (1) it doesn't seem
efficient to fetch WAL only to turn around and remove it and (2)
someone might question whether removing the WAL afterward is a
supported procedure, whereas using an option built into the tool must
surely be supported.

Regarding the second point, I think using the broken archive command
is superior because we can be sure of the behavior. If we just rely on
not having crossed a segment boundary, then anything that causes more
WAL to be generated than we are expecting could break the test. I
don't think it's particularly likely in a case like this that
autovacuum or any other thing would kick in and generate extra WAL,
but the broken archive command ensures that even if it does happen,
the test will still work as intended. That, to me, seems like a good
enough reason to do it that way.

> FWIW, regarding the name of the test script, putting aside what it
> actually does, I proposed placing it as a part of
> 004_timeline_switch.pl because this issue is related to timeline
> switching.

I think it is better to keep it separate. Long test scripts that test
multiple things with completely separate tests are hard to read.

Kyotaro-san, I hope I have not given any offense. I am doing my best,
and certainly did not mean to be rude.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Tatsuro Yamada
Date:
Hi Horiguchi-san,

In a project I helped with, I encountered an issue where
the archive command kept failing. I thought this issue was
related to the problem in this thread, so I'm sharing it here.
If I should create a new thread, please let me know.

* Problem
   - The archive_command always fails.

* Conditions under which the problem occurs (parameters)
   - archive_mode=always
   - Using the test command in archive_command

* Probable cause
   - I guess that is because the .history file already exists,
     and so the test command fails.
     (But if we use archive_mode=on, archive_command succeeds.)

* How to reproduce
   - Attached is a script to reproduce the problem.
     Note: the script will remove $PGDATA when it starts

The test command is listed as an example of the use of archive_command
in postgresql.conf, and the project faced this problem because it used
the example as is. If this behavior is intended, it would be better not
to list the test command as a usage example, or maybe there should be a
note that the test command should not be used when archive_mode=always.
Maybe I'm missing something; sorry.

Regards,
Tatsuro Yamada

Attachment

Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
Thanks!

At Thu, 27 May 2021 15:05:44 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Wed, May 26, 2021 at 8:49 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > So in the mails [1] and [2] I tried to describe what's going on around
> > the two issues.  Although I haven't had a response to [2], can I take
> > it that we clarified the intention of ee994272ca?  And may I take it
> > that we decided not to add a test for the commit?
> 
> Regarding the first question, I feel that the intention of ee994272ca
> is fairly clear at this point. Someone else might feel differently so
> I won't presume to speak for anyone but me.

I completely misunderstood your intention here.

> Regarding the second question, I am not opposed to adding a test for
> that commit, but I think it is a lot more important to fix the bug we
> have now than to add a test for a bug that was fixed a long time ago.

Yes. I agree to that. Glad to see that.

> > Then it seems to me that Robert rediscovered how to reproduce Dilip's case
> > using a base backup instead of using two standbys.  It uses a broken
> > archive_command with pg_basebackup -Xnone, and I showed that the same
> > resulting state is reachable with pg_basebackup -Xstream/fetch plus clearing
> > the pg_wal directory of the resulting backup, including an explanation of
> > why.
> 
> Yes, it makes sense that we could get to the same state either by not
> fetching the WAL in the first place, or alternatively by fetching it
> and then removing it.

Sure. That is an opinion and I can agree to that.

> > *I* think that it is better to avoid having the archive_command, since
> > it seems to me that just unlinking some files is simpler than having
> > the broken archive_command.  However, since Robert ignored it, I guess
> > that Robert thinks that the broken archive_command is better than
> > that.
> 
> Well ... I don't see those things as quite related. As far as I can
> see, unlinking files from pg_wal is an alternative to using -Xnone. On
> the other hand, the broken archive_command is there to make sure the
> new primary doesn't archive its WAL segment too soon.

I agree to use the archive_command just to create the desired state.

> Regarding the first point, I think using -Xnone is better than using
> -Xfetch/stream and then removing the WAL, because (1) it doesn't seem
> efficient to fetch WAL only to turn around and remove it and (2)
> someone might question whether removing the WAL afterward is a
> supported procedure, whereas using an option built into the tool must
> surely be supported.

Mmmm.  That sounds like it means we don't intend to support Dilip's
case, and that we support the use of an archive command that copies
only files other than WAL segments?

> Regarding the second point, I think using the broken archive command
> is superior because we can be sure of the behavior. If we just rely on
> not having crossed a segment boundary, then anything that causes more
> WAL to be generated than we are expecting could break the test. I
> don't think it's particularly likely in a case like this that
> autovacuum or any other thing would kick in and generate extra WAL,
> but the broken archive command ensures that even if it does happen,
> the test will still work as intended. That, to me, seems like a good
> enough reason to do it that way.

Yeah. That is the most convincing reason.

> > FWIW, regarding the name of the test script, putting aside what it
> > actually does, I proposed placing it as a part of
> > 004_timeline_switch.pl because this issue is related to timeline
> > switching.
> 
> I think it is better to keep it separate. Long test scripts that test
> multiple things with completely separate tests are hard to read.

Agreed. I am often annoyed by a long-running TAP script when I want to
run just one of the test items in it. However, I was not sure what our
policy is here: consolidating all related tests into one script, having
separate scripts containing tests up to a "certain" number or a set of
tests that would take a certain time, or limiting the number of
lines.  I thought that we were on the first way, as I have been told several
times to put new tests into an existing script.

> Kyotaro-san, I hope I have not given any offense. I am doing my best,
> and certainly did not mean to be rude.

No. Thanks for the words, Robert. I might be a bit too naive, but I
had an anxious feeling that I might have been totally pointless, or that
my words might have been too cryptic/broken (my fingers are quite fat),
or that I might have done something wrong, or something else.  Anyway, I
thought I might have done something wrong here.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
(Sorry for being a bit off-topic)

At Fri, 28 May 2021 12:18:35 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in 
> Hi Horiguchi-san,

(Why me?)

> In a project I helped with, I encountered an issue where
> the archive command kept failing. I thought this issue was
> related to the problem in this thread, so I'm sharing it here.
> If I should create a new thread, please let me know.
> 
> * Problem
>   - The archive_command is failed always.

Although I think the configuration is kind of broken, it can be seen
as mimicking the shared-archive case, where the primary and
standby share the same archive directory.

Basically we need to use an archive command like the following for
that case to avoid this kind of failure. The script returns "success"
when the target file is found but is identical to the source file. I
don't find such a description in the documentation, and haven't
bothered digging into the mailing-list archive.

==
#! /bin/bash

if [ -f $2 ]; then
    cmp -s $1 $2
    if [ $? != 0 ]; then
        exit 1
    fi
    exit 0
fi

cp $1 $2
==

A maybe-non-optimal behavior is that both 00000002.history.done and .ready
files can be found at once in the archive_status directory, but that doesn't
practically matter. (Or I faintly remember that it is designed to work
even in that case.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Tatsuro Yamada
Date:
Hi Horiguchi-san,

> (Why me?)

Because the story was also related to PG-REX, which you are
also involved in developing. Perhaps off-list instead of
-hackers would have been better, but I emailed -hackers because
the same problem could be encountered by PostgreSQL users who
do not use PG-REX.

  
>> In a project I helped with, I encountered an issue where
>> the archive command kept failing. I thought this issue was
>> related to the problem in this thread, so I'm sharing it here.
>> If I should create a new thread, please let me know.
>>
>> * Problem
>>    - The archive_command always fails.
> 
> Although I think the configuration is kind of broken, it can be seen
> as mimicking the shared-archive case, where the primary and
> standby share the same archive directory.


To be precise, the environment of this reproduction script is
different from our actual environment. I tried to make it as
simple as possible to reproduce the problem.
(In order to make it look like the actual environment, you have
to build a PG-REX environment.)

A simple replication environment might be enough, so I'll try to
recreate a script that is closer to the actual environment later.

  
> Basically we need to use an archive command like the following for
> that case to avoid this kind of failure. The script returns "success"
> when the target file is found but is identical to the source file. I
> don't find such a description in the documentation, and haven't
> bothered digging into the mailing-list archive.
> 
> ==
> #! /bin/bash
> 
> if [ -f $2 ]; then
>     cmp -s $1 $2
>     if [ $? != 0 ]; then
>         exit 1
>     fi
>     exit 0
> fi
> 
> cp $1 $2
> ==

Thanks for your reply.
Since the above behavior is different from the behavior of the
test command in the following example in postgresql.conf, I think
we should write a note about this example.

# e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

Let me describe the problem we faced.
- When archive_mode=always, archive_command is (sometimes) executed
   in a situation where the history file already exists on the standby
   side.

- In this case, if "test ! -f" is written in the archive_command of
   postgresql.conf on the standby side, the command will keep failing.

   Note that this problem does not occur when archive_mode=on.

So, what should we do for the user? I think we should put some notes
in postgresql.conf or in the documentation. For example, something
like this:

====
Note: If you use archive_mode=always, the archive_command on the standby side should not use "test ! -f".
====
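
Just as an illustration, a portable Perl equivalent of the shell script
quoted above (an archive_command that treats an already-archived,
byte-identical file as success) might look roughly like this; it is a
sketch, not a tested recommendation:

#!/usr/bin/perl
# Archive %p as %f, but if the target already exists and is identical,
# report success instead of failing, so re-archiving the same history
# file under archive_mode=always does not keep erroring out.
use strict;
use warnings;
use File::Copy;
use File::Compare;

die "usage: $0 source target\n" unless @ARGV == 2;
my ($source, $target) = @ARGV;

if (-f $target)
{
    # Identical content already archived: success; different content: fail.
    exit(compare($source, $target) == 0 ? 0 : 1);
}
copy($source, $target) or exit 1;
exit 0;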



Regards,
Tatsuro Yamada





Duplicate history file?

From
Kyotaro Horiguchi
Date:
So, I started a thread for this topic diverged from the following
thread.

https://www.postgresql.org/message-id/4698027d-5c0d-098f-9a8e-8cf09e36a555@nttcom.co.jp_1

At Mon, 31 May 2021 11:52:05 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in 
> Since the above behavior is different from the behavior of the
> test command in the following example in postgresql.conf, I think
> we should write a note about this example.
> 
> # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p
> # /mnt/server/archivedir/%f'
>
> Let me describe the problem we faced.
> - When archive_mode=always, archive_command is (sometimes) executed
>   in a situation where the history file already exists on the standby
>   side.
> 
> - In this case, if "test ! -f" is written in the archive_command of
>   postgresql.conf on the standby side, the command will keep failing.
> 
>   Note that this problem does not occur when archive_mode=on.
> 
> So, what should we do for the user? I think we should put some notes
> in postgresql.conf or in the documentation. For example, something
> like this:

I'm not sure about the exact configuration you have in mind, but that
would happen on the cascaded standby in the case where the upstream
promotes. In this case, the history file for the new timeline is
archived twice.  walreceiver triggers archiving of the new history
file at the time of the promotion, then startup does the same when it
restores the file from the archive.  Is that what you complained about?

The same workaround using the alternative archive script works for the
case.

We could check pg_wal before fetching from the archive; however, archiving is
not controlled so strictly that duplicate archiving never happens, and
I think we prefer possible duplicate archiving to having holes in the
archive. (So we suggest the "test ! -f" script.)

> ====
> Note: If you use archive_mode=always, the archive_command on the
> standby side should not use "test ! -f".
> ====

It could be one workaround. However, I would suggest not overwriting
existing files (with a file with different content), to protect the archive
from corruption.

We might need to write that in the documentation...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
Moved to another thread.

https://www.postgresql.org/message-id/20210531.165825.921389284096975508.horikyota.ntt@gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Duplicate history file?

From
Tatsuro Yamada
Date:
Hi Horiguchi-san,

On 2021/05/31 16:58, Kyotaro Horiguchi wrote:
> So, I started a thread for this topic diverged from the following
> thread.
> 
> https://www.postgresql.org/message-id/4698027d-5c0d-098f-9a8e-8cf09e36a555@nttcom.co.jp_1
> 
>> So, what should we do for the user? I think we should put some notes
>> in postgresql.conf or in the documentation. For example, something
>> like this:
> 
> I'm not sure about the exact configuration you have in mind, but that
> would happen on the cascaded standby in the case where the upstream
> promotes. In this case, the history file for the new timeline is
> archived twice.  walreceiver triggers archiving of the new history
> file at the time of the promotion, then startup does the same when it
> restores the file from the archive.  Is that what you complained about?


Thank you for creating a new thread and explaining this.
We are not using cascade replication in our environment, but I think
the situation is similar. As an overview, when I do a promote,
the archive_command fails due to the history file.

I've created a reproduction script that includes building replication,
and I'll share it with you. (I used Robert's test.sh as a reference
for creating the reproduction script. Thanks)

The scenario (sr_test_historyfile.sh) is as follows.

#1 Start pgprimary as a main
#2 Create standby
#3 Start pgstandby as a standby
#4 Execute archive command
#5 Shutdown pgprimary
#6 Start pgprimary as a standby
#7 Promote pgprimary
#8 Execute archive_command again, but it fails since a duplicate history
    file exists (see pgstandby.log)

Note that this may not be appropriate if you consider it as a recovery
procedure for replication configuration. However, I'm sharing it as it is
because this seems to be the procedure used in the customer's environment (PG-REX).

  
> The same workaround using the alternative archive script works for the
> case.
> 
> We could check pg_wal before fetching archive, however, archiving is
> not controlled so strictly that duplicate archiving never happens and
> I think we choose possible duplicate archiving than having holes in
> archive. (so we suggest the "test ! -f" script)
> 
>> ====
>> Note: If you use archive_mode=always, the archive_command on the
>> standby side should not be used "test ! -f".
>> ====
> 
> It could be one workaround. However, I would suggest not to overwrite
> existing files (with a file with different content) to protect archive
> from corruption.
> 
> We might need to write that in the documentation...

I think you're right: replacing it with an alternative archive script
that includes the cmp command will resolve the error (see the sketch after
the diff output below). The reason is that I confirmed with the diff
command that the history files are identical.

=====
$ diff -s pgprimary/arc/00000002.history  pgstandby/arc/00000002.history
Files pgprimary/arc/00000002.history and pgstandby/arc/00000002.history are identical
=====
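
For example, a minimal sketch of such a duplicate-tolerant archive helper
could look like the following, written in Perl with File::Compare standing
in for the cmp command; the script name and the archive path in the usage
comment are just placeholders:

#!/usr/bin/perl
# Sketch of an archive_command helper that refuses to overwrite an existing
# archive file unless the new file is byte-identical to it.
# Possible usage: archive_command = 'archive_if_new.pl "%p" "/mnt/server/archivedir/%f"'
use strict;
use warnings;
use File::Copy;
use File::Compare;

my ($src, $dst) = @ARGV;
if (-f $dst)
{
    # The destination was archived before: succeed only if the contents match.
    exit(compare($src, $dst) == 0 ? 0 : 1);
}
copy($src, $dst) or die "could not copy $src to $dst: $!";
exit 0;

(Like the cp-based examples, this does not fsync the archived file, so it is
only a sketch of the duplicate handling, not of a complete archiving tool.)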

Regarding "test ! -f",
I am wondering how many people are using the test command for
archive_command. If I remember correctly, the guide provided by
NTT OSS Center that we are using does not recommend using the test command.


Regards,
Tatsuro Yamada


Attachment

Re: Race condition in recovery?

From
Robert Haas
Date:
On Fri, May 28, 2021 at 2:05 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> Mmmm.  That looks like meaning that we don't intend to support the
> Dilip's case, and means that we support the use of
> archive-command-copies-only-other-than-wal-segments?

Actually, I think Dilip's case ought to be supported, but I also think
that somebody else might disagree, so it's better for me if the test
doesn't need to rely on it.

> Agreed. I often annoyed by a long-lasting TAP script when I wanted to
> do one of the test items in it. However, I was not sure which is our
> policy here, consolidating all related tests into one script or having
> separate scripts containing tests up to a "certain" number or a set of
> tests that would take a certain time, or limiting by number the of
> lines.  I thought that we are on the first way as I have told several
> times to put new tests into an existing script.

Different people might have different opinions about this, but my
opinion is that when it's possible to combine the test cases in a way
that feels natural, it's good to do. For example if I have two tests
that require the same setup and teardown but do different things in
the middle, and if those things seem related, then it's great to set
up once, try both things, and tear down once. However I don't support
combining test cases where it's just concatenating them one after
another, because that sort of thing seems to have no benefit. Fewer
files in the source tree is not a goal of itself.

> No. Thanks for the words, Robert. I might be a bit too naive, but I
> had an anxious feeling that I might have been totally pointless or my
> words might have been too cryptic/broken (my fingers are quite fat),
> or I might have done something wrong or anything other.  Anyway I
> thought I might have done something wrong here.

No, I don't think so. I think the difficulty is more that the three of
us who are mostly involved in this conversation all have different
native languages, and we are trying to discuss an issue which is very
subtle. Sometimes I am having difficulty understanding precisely what
either you or Dilip are intending to say, and it would not surprise me
to learn that there are difficulties in the other direction also. If
we seem to be covering the same topics multiple times or if any
important points seem to be getting ignored, that's probably the
reason.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Noah Misch
Date:
On Tue, Jun 01, 2021 at 04:45:52PM -0400, Robert Haas wrote:
> On Fri, May 28, 2021 at 2:05 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > Agreed. I often annoyed by a long-lasting TAP script when I wanted to
> > do one of the test items in it. However, I was not sure which is our
> > policy here, consolidating all related tests into one script or having
> > separate scripts containing tests up to a "certain" number or a set of
> > tests that would take a certain time, or limiting by number the of
> > lines.  I thought that we are on the first way as I have told several
> > times to put new tests into an existing script.
> 
> Different people might have different opinions about this, but my
> opinion is that when it's possible to combine the test cases in a way
> that feels natural, it's good to do. For example if I have two tests
> that require the same setup and teardown but do different things in
> the middle, and if those things seem related, then it's great to set
> up once, try both things, and tear down once. However I don't support
> combining test cases where it's just concatenating them one after
> another, because that sort of thing seems to have no benefit. Fewer
> files in the source tree is not a goal of itself.

I agree, particularly for the recovery and subscription TAP suites.  When one
of those tests fails on the buildfarm, it's often not obvious to me which log
messages are relevant to the failure.  Smaller test files simplify the
investigation somewhat.



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Tue, 1 Jun 2021 16:45:52 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Fri, May 28, 2021 at 2:05 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > Mmmm.  That looks like meaning that we don't intend to support the
> > Dilip's case, and means that we support the use of
> > archive-command-copies-only-other-than-wal-segments?
> 
> Actually, I think Dilip's case ought to be supported, but I also think
> that somebody else might disagree, so it's better for me if the test
> doesn't need to rely on it.

Understood.

> > Agreed. I often annoyed by a long-lasting TAP script when I wanted to
> > do one of the test items in it. However, I was not sure which is our
> > policy here, consolidating all related tests into one script or having
> > separate scripts containing tests up to a "certain" number or a set of
> > tests that would take a certain time, or limiting by number the of
> > lines.  I thought that we are on the first way as I have told several
> > times to put new tests into an existing script.
> 
> Different people might have different opinions about this, but my
> opinion is that when it's possible to combine the test cases in a way
> that feels natural, it's good to do. For example if I have two tests
> that require the same setup and teardown but do different things in
> the middle, and if those things seem related, then it's great to set
> up once, try both things, and tear down once. However I don't support
> combining test cases where it's just concatenating them one after
> another, because that sort of thing seems to have no benefit. Fewer
> files in the source tree is not a goal of itself.

Sounds like a reasonable criterion.

> > No. Thanks for the words, Robert. I might be a bit too naive, but I
> > had an anxious feeling that I might have been totally pointless or my
> > words might have been too cryptic/broken (my fingers are quite fat),
> > or I might have done something wrong or anything other.  Anyway I
> > thought I might have done something wrong here.
> 
> No, I don't think so. I think the difficulty is more that the three of
> us who are mostly involved in this conversation all have different
> native languages, and we are trying to discuss an issue which is very
> subtle. Sometimes I am having difficulty understanding precisely what
> either you or Dilip are intending to say, and it would not surprise me
> to learn that there are difficulties in the other direction also. If
> we seem to be covering the same topics multiple times or if any
> important points seem to be getting ignored, that's probably the
> reason.

That convinces me. Thanks for the thought, and sorry for bothering
you with the complaint.


Anyway, I now agree with the overall direction here.

Thanks!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Duplicate history file?

From
Kyotaro Horiguchi
Date:
At Tue, 01 Jun 2021 13:03:22 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in 
> Hi Horiguchi-san,
> 
> On 2021/05/31 16:58, Kyotaro Horiguchi wrote:
> > So, I started a thread for this topic diverged from the following
> > thread.
> > https://www.postgresql.org/message-id/4698027d-5c0d-098f-9a8e-8cf09e36a555@nttcom.co.jp_1
> > 
> >> So, what should we do for the user? I think we should put some notes
> >> in postgresql.conf or in the documentation. For example, something
> >> like this:
> > I'm not sure about the exact configuration you have in mind, but that
> > would happen on the cascaded standby in the case where the upstream
> > promotes. In this case, the history file for the new timeline is
> > archived twice.  walreceiver triggers archiving of the new history
> > file at the time of the promotion, then startup does the same when it
> > restores the file from archive.  Is it what you complained about?
> 
> 
> Thank you for creating a new thread and explaining this.
> We are not using cascade replication in our environment, but I think
> the situation is similar. As an overview, when I do a promote,
> the archive_command fails due to the history file.

Ah, I remembered that PG-REX starts a primary as a standby then
promotes it.

> I've created a reproduction script that includes building replication,
> and I'll share it with you. (I used Robert's test.sh as a reference
> for creating the reproduction script. Thanks)
> 
> The scenario (sr_test_historyfile.sh) is as follows.
> 
> #1 Start pgprimary as a main
> #2 Create standby
> #3 Start pgstandby as a standby
> #4 Execute archive command
> #5 Shutdown pgprimary
> #6 Start pgprimary as a standby
> #7 Promote pgprimary
> #8 Execute archive_command again, but failed since duplicate history
>    file exists (see pgstandby.log)

OK, I clearly understand what you meant. (However, it is not a legitimate
state for a standby to be running while the primary is not running.)
Anyway, "test ! -f" can be problematic in this case.

> Note that this may not be appropriate if you consider it as a recovery
> procedure for replication configuration. However, I'm sharing it as it
> is
> because this seems to be the procedure used in the customer's
> environment (PG-REX).

Understood.

> Regarding "test ! -f",
> I am wondering how many people are using the test command for
> archive_command. If I remember correctly, the guide provided by
> NTT OSS Center that we are using does not recommend using the test
> command.

I think, as the PG-REX documentation says, the simple cp works well as
long as the assumptions of PG-REX - that no double failure happens, and
that the instructions are followed - hold.


On the other hand, I found that the behavior happens more generally.

If a standby with archive_mode=always crashes, it starts recovery from
the last checkpoint. If the checkpoint is in an already-archived segment,
the restarted standby will fetch that segment from the archive and then
fail to archive it. (See the first attached file.)

So, the concern you stated upthread applies to wider situations. The
following suggestion is rather harmful with the archive_mode=always
setting.

https://www.postgresql.org/docs/14/continuous-archiving.html
> The archive command should generally be designed to refuse to
> overwrite any pre-existing archive file. This is an important safety
> feature to preserve the integrity of your archive in case of
> administrator error (such as sending the output of two different
> servers to the same archive directory).

I'm not sure how we should treat this.  Since the archive must store the
files actually applied to the server data, simply having been archived
already cannot be a reason for omitting archiving.  We need to make sure the
new file is byte-identical to the already-archived version. We could
compare just the *restored* file to the same file in pg_wal, but it might
be too much of a penalty for the benefit. (Attached second file.)

Otherwise the documentation would need something like the following if
we assume the current behavior.

> The archive command should generally be designed to refuse to
> overwrite any pre-existing archive file. This is an important safety
> feature to preserve the integrity of your archive in case of
> administrator error (such as sending the output of two different
> servers to the same archive directory).
+ For a standby with archive_mode=always, there are cases where
+ the same file is archived more than once.  For safety, it is
+ recommended that, when the destination file already exists, the
+ archive_command return zero if it is byte-identical to the source file.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
# Copyright (c) 2021, PostgreSQL Global Development Group

#
# Tests related to WAL archiving and recovery.
#
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 1;
use Config;

my $backup_name='mybackup';

my $primary = get_new_node('primary');
$primary->init(
    has_archiving    => 1,
    allows_streaming => 1);
$primary->append_conf('postgresql.conf', qq[
wal_keep_size=128MB
archive_mode=always
log_checkpoints=yes

]);
my $primary_archive = $primary->archive_dir;
$primary->start;

$primary->backup($backup_name);
my $standby = get_new_node('standby');
my $standby_archive = $standby->archive_dir;
$standby->init_from_backup($primary, $backup_name, has_streaming=>1);
$standby->append_conf('postgresql.conf', qq[
restore_command='cp $primary_archive/%f  %p'
archive_command='test ! -f $standby_archive/%f && cp %p $standby_archive/%f'
]);
$standby->start;

$primary->psql('postgres', 'CHECKPOINT; SELECT pg_switch_wal(); CREATE TABLE t(); SELECT pg_switch_wal();');
$standby->psql('postgres', 'CHECKPOINT');
$standby->stop('immediate');
$standby->start;

$primary->psql('postgres', 'CHECKPOINT;SELECT pg_switch_wal();CHECKPOINT');
$standby->psql('postgres', 'CHECKPOINT');

my $result;
while (1) {

    $result = 
      $standby->safe_psql('postgres',
                          "SELECT last_archived_wal, last_failed_wal FROM pg_stat_archiver");
    sleep(0.1);
    last  if ($result ne "|");
}

ok($result =~ /^[^|]+\|$/, 'archive check 1');
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e754..037da0aa3d 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -382,6 +382,7 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
 {
     char        xlogfpath[MAXPGPATH];
     bool        reload = false;
+    bool        skip_archive = false;
     struct stat statbuf;
 
     snprintf(xlogfpath, MAXPGPATH, XLOGDIR "/%s", xlogfname);
@@ -416,6 +417,56 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
         /* same-size buffers, so this never truncates */
         strlcpy(oldpath, xlogfpath, MAXPGPATH);
 #endif
+        /*
+         * On a standby with archive_mode=always, there's the case where the
+         * same file is archived more than once. If the archive_command rejects
+         * overwriting, WAL-archiving won't go further than the file forever.
+         * Avoid duplicate archiving attempts when the file is known to have
+         * been archived and the content doesn't change.
+         */
+        if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS &&
+            XLogArchiveCheckDone(xlogfname))
+        {
+            unsigned char srcbuf[XLOG_BLCKSZ];
+            unsigned char dstbuf[XLOG_BLCKSZ];
+            int fd1 = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+            int fd2 = BasicOpenFile(oldpath, O_RDONLY | PG_BINARY);
+            uint32 i;
+            uint32 off = 0;
+
+            /*
+             * Compare the two files' contents.  We don't bother completing if
+             * something's wrong meanwhile.
+             */
+            for (i = 0 ; i < wal_segment_size / XLOG_BLCKSZ ; i++)
+            {
+                if (pg_pread(fd1, srcbuf, XLOG_BLCKSZ, (off_t) off)
+                    != XLOG_BLCKSZ)
+                    break;
+                
+                if (pg_pread(fd2, dstbuf, XLOG_BLCKSZ, (off_t) off)
+                    != XLOG_BLCKSZ)
+                    break;
+
+                if (memcmp(srcbuf, dstbuf, XLOG_BLCKSZ) != 0)
+                    break;
+
+                off += XLOG_BLCKSZ;
+            }
+
+            close(fd1);
+            close(fd2);
+            
+            if (i == wal_segment_size / XLOG_BLCKSZ)
+            {
+                skip_archive = true;
+
+                ereport(LOG,
+                        (errmsg("log file \"%s\" has already been archived, skipping archiving",
+                                 xlogfname)));
+            }
+        }
+
         if (unlink(oldpath) != 0)
             ereport(FATAL,
                     (errcode_for_file_access(),
@@ -430,7 +481,7 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
      * Create .done file forcibly to prevent the restored segment from being
      * archived again later.
      */
-    if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+    if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS || skip_archive)
         XLogArchiveForceDone(xlogfname);
     else
         XLogArchiveNotify(xlogfname);

Re: Race condition in recovery?

From
Robert Haas
Date:
On Thu, May 27, 2021 at 2:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Changed as suggested.

I don't think the code as written here is going to work on Windows,
because your code doesn't duplicate enable_restoring's call to
perl2host or its backslash-escaping logic. It would really be better
if we could use enable_restoring directly. Also, I discovered that the
'return' in cp_history_files should really say 'exit', because
otherwise it generates a complaint every time it's run. It should also
have 'use strict' and 'use warnings' at the top.
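
For context, a hypothetical sketch of what such a history-file-only copy
script could look like with those points applied (the real cp_history_files
attached here may differ in detail):

#!/usr/bin/perl
# Hypothetical sketch: archive only timeline history files and silently
# skip WAL segments, so that segments never reach the archive.
# Invoked from archive_command as: cp_history_files <source %p> <destination>
use strict;
use warnings;
use File::Copy;

exit 0 if $ARGV[0] !~ /\.history$/;
copy($ARGV[0], $ARGV[1])
  or die "couldn't copy $ARGV[0] to $ARGV[1]: $!";
exit 0;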

Here's a version of your test case patch with the 1-line code fix
added, the above issues addressed, and a bunch of cosmetic tweaks.
Unfortunately, it doesn't pass for me consistently. I'm not sure if
that's because I broke something with my changes, or because the test
contains an underlying race condition which we need to address.
Attached also are the log files from a failed run if you want to look
at them. The key lines seem to be:

2021-06-03 16:16:53.984 EDT [47796] LOG:  restarted WAL streaming at
0/3000000 on timeline 2
2021-06-03 16:16:54.197 EDT [47813] 025_stuck_on_old_timeline.pl LOG:
statement: SELECT count(*) FROM tab_int
2021-06-03 16:16:54.197 EDT [47813] 025_stuck_on_old_timeline.pl
ERROR:  relation "tab_int" does not exist at character 22

Or from the main log:

Waiting for replication conn cascade's replay_lsn to pass '0/3000000' on standby
done
error running SQL: 'psql:<stdin>:1: ERROR:  relation "tab_int" does not exist
LINE 1: SELECT count(*) FROM tab_int
                             ^'

I wonder whether that problem points to an issue with this incantation:

$node_standby->wait_for_catchup($node_cascade, 'replay',
        $node_standby->lsn('replay'));

But I'm not sure, and I'm out of time to investigate for today.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: Duplicate history file?

From
Kyotaro Horiguchi
Date:
At Thu, 03 Jun 2021 21:52:08 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> 
> https://www.postgresql.org/docs/14/continuous-archiving.html
> > The archive command should generally be designed to refuse to
> > overwrite any pre-existing archive file. This is an important safety
> > feature to preserve the integrity of your archive in case of
> > administrator error (such as sending the output of two different
> > servers to the same archive directory).
> 
> I'm not sure how we should treat this..  Since archive must store
> files actually applied to the server data, just being already archived
> cannot be the reason for omitting archiving.  We need to make sure the
> new file is byte-identical to the already-archived version. We could
> compare just *restored* file to the same file in pg_wal but it might
> be too much of penalty for for the benefit. (Attached second file.)

(To recap: in a replication setup using an archive, startup tries to
restore WAL files from the archive before checking the pg_wal directory for
the desired file.  That behavior itself is intentionally designed and
reasonable. However, the restore code notifies the archiver of a restored
file regardless of whether it has already been archived or not.  If
archive_command is written so as to return an error on overwriting, as we
suggest in the documentation, that behavior causes archive failure.)

After playing with this, I can see the problem just by restarting a
standby, even in a simple archive-replication setup, with no special
prerequisites.  So I think this is worth fixing.

With this patch, KeepFileRestoredFromArchive compares the contents of the
just-restored file and the existing file for the same segment only
when:

     - archive_mode = always
 and - the file to restore already exists in pg_wal
 and - it has a .done and/or .ready status file.

which doesn't usually happen.  The function then skips the archive
notification if the contents are identical.  The included TAP test
works on both Linux and Windows.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/test/recovery/t/020_archive_status.pl b/src/test/recovery/t/020_archive_status.pl
index 777bf05274..ed5e80327b 100644
--- a/src/test/recovery/t/020_archive_status.pl
+++ b/src/test/recovery/t/020_archive_status.pl
@@ -8,7 +8,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 16;
+use Test::More tests => 17;
 use Config;
 
 my $primary = get_new_node('primary');
@@ -176,6 +176,7 @@ ok( -f "$standby1_data/$segment_path_2_done",
 # will cause archiving failures.
 my $standby2 = get_new_node('standby2');
 $standby2->init_from_backup($primary, 'backup', has_restoring => 1);
+$standby2->enable_archiving;
 $standby2->append_conf('postgresql.conf', 'archive_mode = always');
 my $standby2_data = $standby2->data_dir;
 $standby2->start;
@@ -234,3 +235,49 @@ ok( -f "$standby2_data/$segment_path_1_done"
       && -f "$standby2_data/$segment_path_2_done",
     ".done files created after archive success with archive_mode=always on standby"
 );
+
+
+# Check that restarting an archive-replicated standby doesn't archive the
+# same file twice.
+
+# Use a duplicate-resistant archive_command
+my $archpath = TestLib::perl2host($standby2->archive_dir);
+$archpath =~ s{\\}{\\\\}g if ($TestLib::windows_os);
+my $nodup_command =
+      $TestLib::windows_os
+      ? qq{if not exist "$archpath\\\\%f" (copy "%p" "$archpath\\\\%f") else (exit 1)}
+      : qq{test ! -f "$archpath/%f" && cp "%p" "$archpath/%f"};
+$standby2->safe_psql(
+    'postgres', qq{
+    ALTER SYSTEM SET archive_command TO '$nodup_command';
+    SELECT pg_reload_conf();
+});
+
+# Remember the current segment file.  The +1 below is an adjustment to
+# avoid segment borders.
+my $cursegfile = $primary->safe_psql(
+    'postgres',
+    'SELECT pg_walfile_name(pg_current_wal_lsn() + 1)');
+$primary->psql('postgres', 'CHECKPOINT; SELECT pg_switch_wal();');
+$standby2->psql('postgres', 'CHECKPOINT');
+$standby2->poll_query_until(
+    'postgres',
+    "SELECT last_archived_wal = '$cursegfile' FROM pg_stat_archiver")
+  or die "Timed out waiting for the segment to be archived";
+
+$standby2->restart;
+
+$primary->psql('postgres', 'CHECKPOINT; SELECT pg_switch_wal();');
+
+$standby2->poll_query_until(
+    'postgres', qq[
+    SELECT last_archived_wal IS NOT NULL OR
+           last_failed_wal IS NOT NULL
+    FROM pg_stat_archiver]);
+
+my $result =
+  $standby2->safe_psql(
+      'postgres',
+      "SELECT last_archived_wal, last_failed_wal FROM pg_stat_archiver");
+
+ok($result =~ /^[^|]+\|$/, 'check if archiving proceeds');
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 26b023e754..7d94e7f963 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -382,6 +382,7 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
 {
     char        xlogfpath[MAXPGPATH];
     bool        reload = false;
+    bool        already_archived = false;
     struct stat statbuf;
 
     snprintf(xlogfpath, MAXPGPATH, XLOGDIR "/%s", xlogfname);
@@ -416,6 +417,68 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
         /* same-size buffers, so this never truncates */
         strlcpy(oldpath, xlogfpath, MAXPGPATH);
 #endif
+        /*
+         * On a standby with archive_mode=always, there's the case where the
+         * same file is archived more than once. If the archive_command rejects
+         * overwriting, WAL-archiving won't go further than the file forever.
+         * Avoid duplicate archiving attempts when the file with the same
+         * content is known to have been already archived or notified.
+         */
+        if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS &&
+            XLogArchiveIsReadyOrDone(xlogfname))
+        {
+            int fd1;
+            int fd2 = -1;
+
+            fd1 = BasicOpenFile(path, O_RDONLY | PG_BINARY);
+            if (fd1 >= 0)
+                fd2 = BasicOpenFile(oldpath, O_RDONLY | PG_BINARY);
+
+            if (fd1 < 0 || fd2 < 0)
+            {
+                ereport(WARNING,
+                        (errcode_for_file_access(),
+                         errmsg("could not open file \"%s\", skip duplicate check: %m",
+                                fd1 < 0 ? path : oldpath)));
+                if (fd1 >= 0)
+                    close(fd1);
+            }
+            else
+            {
+                unsigned char srcbuf[XLOG_BLCKSZ];
+                unsigned char dstbuf[XLOG_BLCKSZ];
+                uint32 i;
+
+                /*
+                 * Compare the two files' contents.  We don't bother
+                 * completing if something's wrong meanwhile.
+                 */
+                for (i = 0 ; i < wal_segment_size / XLOG_BLCKSZ ; i++)
+                {
+                    if (read(fd1, srcbuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+                        break;
+
+                    if (read(fd2, dstbuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+                        break;
+
+                    if (memcmp(srcbuf, dstbuf, XLOG_BLCKSZ) != 0)
+                        break;
+                }
+
+                close(fd1);
+                close(fd2);
+
+                if (i == wal_segment_size / XLOG_BLCKSZ)
+                {
+                    already_archived = true;
+
+                    ereport(LOG,
+                            (errmsg("log file \"%s\" has already been archived, skipping archiving",
+                                    xlogfname)));
+                }
+            }
+        }
+
         if (unlink(oldpath) != 0)
             ereport(FATAL,
                     (errcode_for_file_access(),
@@ -432,7 +495,7 @@ KeepFileRestoredFromArchive(const char *path, const char *xlogfname)
      */
     if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
         XLogArchiveForceDone(xlogfname);
-    else
+    else if (!already_archived)
         XLogArchiveNotify(xlogfname);
 
     /*

Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Fri, Jun 4, 2021 at 2:03 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, May 27, 2021 at 2:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Changed as suggested.
>
> I don't think the code as written here is going to work on Windows,
> because your code doesn't duplicate enable_restoring's call to
> perl2host or its backslash-escaping logic. It would really be better
> if we could use enable_restoring directly. Also, I discovered that the
> 'return' in cp_history_files should really say 'exit', because
> otherwise it generates a complaint every time it's run. It should also
> have 'use strict' and 'use warnings' at the top.

Ok

> Here's a version of your test case patch with the 1-line code fix
> added, the above issues addressed, and a bunch of cosmetic tweaks.
> Unfortunately, it doesn't pass for me consistently. I'm not sure if
> that's because I broke something with my changes, or because the test
> contains an underlying race condition which we need to address.
> Attached also are the log files from a failed run if you want to look
> at them. The key lines seem to be:

I could not reproduce this but I think I got the issue, I think I used
the wrong target LSN in wait_for_catchup, instead of checking the last
"insert LSN" of the standby I was waiting for last "replay LSN" of
standby which was wrong.  Changed as below in the attached patch.

diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl
b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index 09eb3eb..ee7d78d 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -78,7 +78,7 @@ $node_standby->safe_psql('postgres', "CREATE TABLE
tab_int AS SELECT 1 AS a");

 # Wait for the replication to catch up
 $node_standby->wait_for_catchup($node_cascade, 'replay',
-       $node_standby->lsn('replay'));
+       $node_standby->lsn('insert'));

 # Check that cascading standby has the new content
 my $result =

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Fri, 4 Jun 2021 13:21:08 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> On Fri, Jun 4, 2021 at 2:03 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, May 27, 2021 at 2:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > Changed as suggested.
> >
> > I don't think the code as written here is going to work on Windows,
> > because your code doesn't duplicate enable_restoring's call to
> > perl2host or its backslash-escaping logic. It would really be better
> > if we could use enable_restoring directly. Also, I discovered that the
> > 'return' in cp_history_files should really say 'exit', because
> > otherwise it generates a complaint every time it's run. It should also
> > have 'use strict' and 'use warnings' at the top.
> 
> Ok
> 
> > Here's a version of your test case patch with the 1-line code fix
> > added, the above issues addressed, and a bunch of cosmetic tweaks.
> > Unfortunately, it doesn't pass for me consistently. I'm not sure if
> > that's because I broke something with my changes, or because the test
> > contains an underlying race condition which we need to address.
> > Attached also are the log files from a failed run if you want to look
> > at them. The key lines seem to be:
> 
> I could not reproduce this but I think I got the issue, I think I used
> the wrong target LSN in wait_for_catchup, instead of checking the last
> "insert LSN" of the standby I was waiting for last "replay LSN" of
> standby which was wrong.  Changed as below in the attached patch.

I think that's right. And the test script detects the issue for me
on Linux, but it doesn't work on Windows.

'"C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_files"' is not recognized as an internal command or
externalcommand ..
 

That is because Windows' cmd.exe doesn't have the shebang feature. On
Windows, maybe archive_command should be something like

'".../perl" "$FindBin../cp_history_files" "%p"...

If I did this I got another error.

"couldn't copy pg_wal\00000002.history to C:/../Documents/work/postgresql/src
est^Mecovery/tmp_check/t_000_a_primary_data/archives/00000002.history: at
C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_filesline 10.^M"
 

("^M" are the replacement for carrage return)
So... I'm not sure what is happening beyond what the error messages say.
Anyway, I don't have time to investigate it right now.


+ # clean up
+ $node_primary->teardown_node;
+ $node_standby->teardown_node;
+ $node_cascade->teardown_node;

I don't think explicit teardown is useless as the final cleanup.

By the way the attached patch is named as "Fix-corner-case..." but
doesn't contain the fix. Is it intentional?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Robert Haas
Date:
On Fri, Jun 4, 2021 at 3:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I could not reproduce this but I think I got the issue, I think I used
> the wrong target LSN in wait_for_catchup, instead of checking the last
> "insert LSN" of the standby I was waiting for last "replay LSN" of
> standby which was wrong.  Changed as below in the attached patch.

Yeah, that fixes it for me. Thanks.

With that change, this test reliably passes for me with the fix, and
reliably fails for me without the fix. Woohoo!

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Fri, Jun 4, 2021 at 5:25 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> I think that's right. And the test script detects the issue for me
> both on Linux but doesn't work for Windows.
>
> '"C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_files"' is not recognized as an internal command or
externalcommand ..
 

Hmm, that's a problem. Can you try the attached version?

> + # clean up
> + $node_primary->teardown_node;
> + $node_standby->teardown_node;
> + $node_cascade->teardown_node;
>
> I don't think explicit teardown is useless as the final cleanup.

I don't know what you mean by this. If it's not useless, good, because
we're doing it. Or do you mean that you think it is useless, and we
should remove it?

> By the way the attached patch is named as "Fix-corner-case..." but
> doesn't contain the fix. Is it intentional?

No, that was a goof.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Fri, 4 Jun 2021 10:56:12 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Fri, Jun 4, 2021 at 5:25 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > I think that's right. And the test script detects the issue for me
> > both on Linux but doesn't work for Windows.
> >
> > '"C:/../Documents/work/postgresql/src/test/recovery/t/cp_history_files"' is not recognized as an internal command
orexternal command ..
 
> 
> Hmm, that's a problem. Can you try the attached version?

Unfortunately no. The backslashes in the binary path need to be
escaped. (taken from PostgresNode.pm:1008)

> (my $perlbin = $^X) =~ s{\\}{\\\\}g if ($TestLib::windows_os);
> $node_primary->append_conf(
>     'postgresql.conf', qq(
> archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
> ));

This works for me.

> > + # clean up
> > + $node_primary->teardown_node;
> > + $node_standby->teardown_node;
> > + $node_cascade->teardown_node;
> >
> > I don't think explicit teardown is useless as the final cleanup.
> 
> I don't know what you mean by this. If it's not useless, good, because
> we're doing it. Or do you mean that you think it is useless, and we
> should remove it?

Ugh! Sorry. I meant "The explicit teardowns are useless". They are not
harmful, but teardown is done automatically (implicitly) by PostgresNode.pm,
and we don't do that in the existing scripts.

> > By the way the attached patch is named as "Fix-corner-case..." but
> > doesn't contain the fix. Is it intentional?
> 
> No, that was a goof.

As I said upthread the relationship between receiveTLI and
recoveryTargetTLI is not confirmed yet at the point.
findNewestTimeLine() simply searches for the history file with the
largest timeline id so the returned there's a case where the timeline
id that the function returns is not a future of the latest checkpoint
TLI.  I think that the fact that rescanLatestTimeLine() checks the
relationship is telling us that we need to do the same in the path as
well.

In my previous proposal, it is done just after the line the patch
touches but it can be in the if (fetching_ckpt) branch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
Sorry, some stray words were left in.

At Mon, 07 Jun 2021 13:57:35 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> As I said upthread the relationship between receiveTLI and
> recoveryTargetTLI is not confirmed yet at the point.
- findNewestTimeLine() simply searches for the history file with the
- largest timeline id so the returned there's a case where the timeline
+ findNewestTimeLine() simply searches for the history file with the
+ largest timeline id so there's a case where the timeline
> id that the function returns is not a future of the latest checkpoint
> TLI.  I think that the fact that rescanLatestTimeLine() checks the
> relationship is telling us that we need to do the same in the path as
> well.
> 
> In my previous proposal, it is done just after the line the patch
> touches but it can be in the if (fetching_ckpt) branch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Duplicate history file?

From
Tatsuro Yamada
Date:
Hi Horiguchi-san,

>> Regarding "test ! -f",
>> I am wondering how many people are using the test command for
>> archive_command. If I remember correctly, the guide provided by
>> NTT OSS Center that we are using does not recommend using the test
>> command.
> 
> I think, as the PG-REX documentation says, the simple cp works well as
> far as the assumption of PG-REX - no double failure happenes, and
> following the instruction - holds.


I believe that this assumption stopped holding once archive_mode=always
was introduced. As far as I can tell, the problem doesn't happen with
archive_mode=on.


> On the other hand, I found that the behavior happens more generally.
> 
> If a standby with archive_mode=always craches, it starts recovery from
> the last checkpoint. If the checkpoint were in a archived segment, the
> restarted standby will fetch the already-archived segment from archive
> then fails to archive it. (The attached).
> 
> So, your fear stated upthread is applicable for wider situations. The
> following suggestion is rather harmful for the archive_mode=always
> setting.
> 
> https://www.postgresql.org/docs/14/continuous-archiving.html
>> The archive command should generally be designed to refuse to
>> overwrite any pre-existing archive file. This is an important safety
>> feature to preserve the integrity of your archive in case of
>> administrator error (such as sending the output of two different
>> servers to the same archive directory).
> 
> I'm not sure how we should treat this..  Since archive must store
> files actually applied to the server data, just being already archived
> cannot be the reason for omitting archiving.  We need to make sure the
> new file is byte-identical to the already-archived version. We could
> compare just *restored* file to the same file in pg_wal but it might
> be too much of penalty for for the benefit. (Attached second file.)


Thanks for creating the patch!

  
> Otherwise the documentation would need someting like the following if
> we assume the current behavior.
> 
>> The archive command should generally be designed to refuse to
>> overwrite any pre-existing archive file. This is an important safety
>> feature to preserve the integrity of your archive in case of
>> administrator error (such as sending the output of two different
>> servers to the same archive directory).
> + For standby with the setting archive_mode=always, there's a case where
> + the same file is archived more than once.  For safety, it is
> + recommended that when the destination file exists, the archive_command
> + returns zero if it is byte-identical to the source file.


Agreed.
That is the same solution as I mentioned earlier.
If possible, it would also be good to mention it in postgresql.conf (though
that might be overkill?!).


Regards,
Tatsuro Yamada





Re: Duplicate history file?

From
Tatsuro Yamada
Date:
Hi Horiguchi-san,


> (To recap: In a replication set using archive, startup tries to
> restore WAL files from archive before checking pg_wal directory for
> the desired file.  The behavior itself is intentionally designed and
> reasonable. However, the restore code notifies of a restored file
> regardless of whether it has been already archived or not.  If
> archive_command is written so as to return error for overwriting as we
> suggest in the documentation, that behavior causes archive failure.)
> 
> After playing with this, I see the problem just by restarting a
> standby even in a simple archive-replication set after making
> not-special prerequisites.  So I think this is worth fixing.
> 
> With this patch, KeepFileRestoredFromArchive compares the contents of
> just-restored file and the existing file for the same segment only
> when:
> 
>       - archive_mode = always
>   and - the file to restore already exists in pgwal
>   and - it has a .done and/or .ready status file.
> 
> which doesn't happen usually.  Then the function skips archive
> notification if the contents are identical.  The included TAP test is
> working both on Linux and Windows.


Thank you for the analysis and the patch.
I'll try the patch tomorrow.

I just noticed that this thread is still tied to another thread
(it's not an independent thread). To fix that, it may be better to
create a new thread again.


Regards,
Tatsuro Yamada





Re: Duplicate history file?

From
Kyotaro Horiguchi
Date:
At Mon, 07 Jun 2021 16:13:08 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in 
> I just noticed that this thread is still tied to another thread
> (it's not an independent thread). To fix that, it may be better to
> create a new thread again.

Mmm. Maybe my mailer automatically inserted an In-Reply-To field for the
cited message.  Do we (the two of us) need to bother re-launching a new
thread?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Duplicate history file?

From
Kyotaro Horiguchi
Date:
At Mon, 07 Jun 2021 15:57:00 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in 
> Hi Horiguchi-san,
> 
> >> Regarding "test ! -f",
> >> I am wondering how many people are using the test command for
> >> archive_command. If I remember correctly, the guide provided by
> >> NTT OSS Center that we are using does not recommend using the test
> >> command.
> > I think, as the PG-REX documentation says, the simple cp works well as
> > far as the assumption of PG-REX - no double failure happenes, and
> > following the instruction - holds.
> 
> 
> I believe that this assumption started to be wrong after
> archive_mode=always was introduced. As far as I can tell, it doesn't
> happen when it's archive_mode=on.

?? Doesn't *simple* cp (without "test") work for you?  I meant that
the operating assumption of PG-REX ensures that overwriting doesn't
cause a problem.

> > Otherwise the documentation would need someting like the following if
> > we assume the current behavior.
> > 
> >> The archive command should generally be designed to refuse to
> >> overwrite any pre-existing archive file. This is an important safety
> >> feature to preserve the integrity of your archive in case of
> >> administrator error (such as sending the output of two different
> >> servers to the same archive directory).
> > + For standby with the setting archive_mode=always, there's a case
> > where
> > + the same file is archived more than once.  For safety, it is
> > + recommended that when the destination file exists, the
> > archive_command
> > + returns zero if it is byte-identical to the source file.
> 
> 
> Agreed.
> That is same solution as I mentioned earlier.
> If possible, it also would better to write it postgresql.conf (that
> might
> be overkill?!).

Mmmm, I hadn't noticed that. I don't think such a complex caveat fits
the configuration file. And if we need such a caveat there, it might
be a sign that we need to fix the underlying behavior...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Duplicate history file?

From
Tatsuro Yamada
Date:
On 2021/06/07 16:31, Kyotaro Horiguchi wrote:
> At Mon, 07 Jun 2021 16:13:08 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in
>> I just noticed that this thread is still tied to another thread
>> (it's not an independent thread). To fix that, it may be better to
>> create a new thread again.
> 
> Mmm. Maybe my mailer automatically inserted In-Reply-To field for the
> cited messsage.  Do we (the two of us) bother re-launching a new
> thread?


The reason I suggested it was because I thought it might be
confusing if the threads were not independent when registered in
a commitfest. If that is not a problem, then I'm fine with it as is. :-D

Regards,
Tatsuro Yamada





Re: Duplicate history file?

From
Kyotaro Horiguchi
Date:
(Sorry for the noise on the old thread..)

At Mon, 07 Jun 2021 16:54:49 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in 
> On 2021/06/07 16:31, Kyotaro Horiguchi wrote:
> > At Mon, 07 Jun 2021 16:13:08 +0900, Tatsuro Yamada
> > <tatsuro.yamada.tf@nttcom.co.jp> wrote in
> >> I just noticed that this thread is still tied to another thread
> >> (it's not an independent thread). To fix that, it may be better to
> >> create a new thread again.
> > Mmm. Maybe my mailer automatically inserted In-Reply-To field for the
> > cited messsage.  Do we (the two of us) bother re-launching a new
> > thread?
> 
> 
> The reason I suggested it was because I thought it might be
> confusing if the threads were not independent when registered in
> a commitfest. If that is not a problem, then I'm fine with it as
> is. :-D

(You can freely do that, too:p)

Hmm. I found that the pgsql-hackers archive treats the new thread as a
part of the old thread, so CF-app would do the same.

Anyway I re-launched a new standalone thread.

https://www.postgresql.org/message-id/20210607.173108.348241508233844279.horikyota.ntt%40gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Robert Haas
Date:
On Mon, Jun 7, 2021 at 12:57 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> Unfortunately no. The backslashes in the binary path need to be
> escaped. (taken from PostgresNode.pm:1008)
>
> > (my $perlbin = $^X) =~ s{\\}{\\\\}g if ($TestLib::windows_os);
> > $node_primary->append_conf(
> >       'postgresql.conf', qq(
> > archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
> > ));
>
> This works for me.

Hmm, OK. Do you think we also need to use perl2host in this case?

> Ugh! Sorry. I meant "The explicit teardowns are useless". That's not
> harmful but it is done by PostgresNode.pm automatically(implicitly)
> and we don't do that in the existing scripts.

OK. I don't think it's a big deal, but we can remove them.

> As I said upthread the relationship between receiveTLI and
> recoveryTargetTLI is not confirmed yet at the point.
> findNewestTimeLine() simply searches for the history file with the
> largest timeline id so the returned there's a case where the timeline
> id that the function returns is not a future of the latest checkpoint
> TLI.  I think that the fact that rescanLatestTimeLine() checks the
> relationship is telling us that we need to do the same in the path as
> well.
>
> In my previous proposal, it is done just after the line the patch
> touches but it can be in the if (fetching_ckpt) branch.

I went back and looked at your patch again, now that I understand the
issue better. I believe it's not necessary to do this here, because
StartupXLOG() already contains a check for the same thing:

    /*
     * If the location of the checkpoint record is not on the expected
     * timeline in the history of the requested timeline, we cannot proceed:
     * the backup is not part of the history of the requested timeline.
     */
    Assert(expectedTLEs);       /* was initialized by reading checkpoint
                                 * record */
    if (tliOfPointInHistory(checkPointLoc, expectedTLEs) !=
        checkPoint.ThisTimeLineID)
...

This code is always run after ReadCheckpointRecord() returns. And I
think that your only concern here is about the case where the
checkpoint record is being fetched, because otherwise expectedTLEs
must already be set.

By the way, I also noticed that your version of the patch contains a
few words which are spelled incorrectly: hearafter, and incosistent.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Duplicate history file?

From
Stephen Frost
Date:
Greetings,

* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> So, this is the new new thread.

This is definitely not the way I would recommend starting up a new
thread as you didn't include the actual text of the prior discussion for
people to be able to read and respond to, instead making them go hunt
for the prior discussion on the old thread and negating the point of
starting a new thread..

Still, I went and found the other email-

* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> At Mon, 31 May 2021 11:52:05 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in
> > Since the above behavior is different from the behavior of the
> > test command in the following example in postgresql.conf, I think
> > we should write a note about this example.
> >
> > # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p
> > # /mnt/server/archivedir/%f'
> >
> > Let me describe the problem we faced.
> > - When archive_mode=always, archive_command is (sometimes) executed
> >   in a situation where the history file already exists on the standby
> >   side.
> >
> > - In this case, if "test ! -f" is written in the archive_command of
> >   postgresql.conf on the standby side, the command will keep failing.
> >
> >   Note that this problem does not occur when archive_mode=on.
> >
> > So, what should we do for the user? I think we should put some notes
> > in postgresql.conf or in the documentation. For example, something
> > like this:

First off, we should tell them to not use test or cp in their actual
archive command because they don't do things like make sure that the WAL
that's been archived has actually been fsync'd.  Multiple people have
tried to make improvements in this area but the long and short of it is
that trying to provide a simple archive command in the documentation
that actually *works* isn't enough- you need a real tool.  Maybe someone
will write one some day that's part of core, but it's not happened yet
and instead there's external solutions which actually do the correct
things.

The existing documentation should be taken as purely "this is how the
variables which are passed in get expanded" not as "this is what you
should do", because it's very much not the latter in any form.

Thanks,

Stephen

Attachment

Re: Race condition in recovery?

From
Robert Haas
Date:
Hi,

I tried back-porting my version of this patch to 9.6 to see what would
happen there. One problem is that some of the functions have different
names before v10. So 9.6 needs this:

-    "SELECT pg_walfile_name(pg_current_wal_lsn());");
+    "SELECT pg_xlogfile_name(pg_current_xlog_location());");

But there's also another problem, which is that this doesn't work before v12:

$node_standby->psql('postgres', 'SELECT pg_promote()');

So I tried changing it to this:

$node_standby->promote;

But then the test fails, because pg_promote() has logic built into it
to wait until the promotion actually happens, but ->promote doesn't,
so SELECT pg_walfile_name(pg_current_wal_lsn()) errors out because the
system is still in recovery. I'm not sure what to do about that. I
quickly tried adding -w to 'sub promote' in PostgresNode.pm, but that
didn't fix it, so I guess we'll have to find some other way to wait
until the promotion is complete.
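
One possible way around that (a sketch only, assuming poll_query_until()
is available and behaves on the back-branches as it does in the tests
discussed above) would be to poll until the server reports that recovery
has ended:

# Promote without waiting, then poll until pg_is_in_recovery() turns false.
$node_standby->promote;
$node_standby->poll_query_until('postgres', 'SELECT NOT pg_is_in_recovery()')
  or die "timed out waiting for promotion of standby";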

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Mon, 7 Jun 2021 10:40:27 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Mon, Jun 7, 2021 at 12:57 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > Unfortunately no. The backslashes in the binary path need to be
> > escaped. (taken from PostgresNode.pm:1008)
> >
> > > (my $perlbin = $^X) =~ s{\\}{\\\\}g if ($TestLib::windows_os);
> > > $node_primary->append_conf(
> > >       'postgresql.conf', qq(
> > > archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
> > > ));
> >
> > This works for me.
> 
> Hmm, OK. Do you think we also need to use perl2host in this case?

I understand that perl2host converts a '/some/where' style path to the
native Windows path 'X:/any/where' if needed. Since Perl's $^X is
already in native style, I think we don't need to use it.

> > Ugh! Sorry. I meant "The explicit teardowns are useless". That's not
> > harmful but it is done by PostgresNode.pm automatically(implicitly)
> > and we don't do that in the existing scripts.
> 
> OK. I don't think it's a big deal, but we can remove them.

Thanks.

> I went back and looked at your patch again, now that I understand the
> issue better. I believe it's not necessary to do this here, because
> StartupXLOG() already contains a check for the same thing:
> 
>     /*
>      * If the location of the checkpoint record is not on the expected
>      * timeline in the history of the requested timeline, we cannot proceed:
>      * the backup is not part of the history of the requested timeline.
>      */
>     Assert(expectedTLEs);       /* was initialized by reading checkpoint
>                                  * record */
>     if (tliOfPointInHistory(checkPointLoc, expectedTLEs) !=
>         checkPoint.ThisTimeLineID)
> ...
> 
> This code is always run after ReadCheckpointRecord() returns. And I
> think that your only concern here is about the case where the
> checkpoint record is being fetched, because otherwise expectedTLEs
> must already be set.

Sure. Thanks for confirming that, and agreed.

> By the way, I also noticed that your version of the patch contains a
> few words which are spelled incorrectly: hearafter, and incosistent.

Mmm. Sorry for them..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Duplicate history file?

From
Tatsuro Yamada
Date:
On 2021/06/07 17:32, Kyotaro Horiguchi wrote:
>>>> I just noticed that this thread is still tied to another thread
>>>> (it's not an independent thread). To fix that, it may be better to
>>>> create a new thread again.
>>> Mmm. Maybe my mailer automatically inserted In-Reply-To field for the
>>> cited messsage.  Do we (the two of us) bother re-launching a new
>>> thread?
>>
>>
>> The reason I suggested it was because I thought it might be
>> confusing if the threads were not independent when registered in
>> a commitfest. If that is not a problem, then I'm fine with it as
>> is. :-D
> 
> (You can freely do that, too:p)

I should have told you that I would be happy to create a new thread.

The reason I didn't create a new thread is that I didn't want people to
think I had hijacked the thread. :)


> Hmm. I found that the pgsql-hackers archive treats the new thread as a
> part of the old thread, so CF-app would do the same.
> 
> Anyway I re-launched a new standalone thread.
> 
> https://www.postgresql.org/message-id/20210607.173108.348241508233844279.horikyota.ntt%40gmail.com

Thank you!


Regards,
Tatsuro Yamada





Re: Duplicate history file?

From
Kyotaro Horiguchi
Date:
(Mmm. thunderbird or gmail connects this thread to the previous one..)

At Mon, 7 Jun 2021 14:20:38 -0400, Stephen Frost <sfrost@snowman.net> wrote in 
> Greetings,
> 
> * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > So, this is the new new thread.
> 
> This is definitely not the way I would recommend starting up a new
> thread as you didn't include the actual text of the prior discussion for
> people to be able to read and respond to, instead making them go hunt
> for the prior discussion on the old thread and negating the point of
> starting a new thread..

Sorry for that. I'll do that next time.

> Still, I went and found the other email-

Thanks!

> * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > At Mon, 31 May 2021 11:52:05 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in 
> > > Since the above behavior is different from the behavior of the
> > > test command in the following example in postgresql.conf, I think
> > > we should write a note about this example.
> > > 
> > > # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p
> > > # /mnt/server/archivedir/%f'
> > >
> > > Let me describe the problem we faced.
> > > - When archive_mode=always, archive_command is (sometimes) executed
> > >   in a situation where the history file already exists on the standby
> > >   side.
> > > 
> > > - In this case, if "test ! -f" is written in the archive_command of
> > >   postgresql.conf on the standby side, the command will keep failing.
> > > 
> > >   Note that this problem does not occur when archive_mode=on.
> > > 
> > > So, what should we do for the user? I think we should put some notes
> > > in postgresql.conf or in the documentation. For example, something
> > > like this:
> 
> First off, we should tell them to not use test or cp in their actual
> archive command because they don't do things like make sure that the WAL
> that's been archived has actually been fsync'd.  Multiple people have
> tried to make improvements in this area but the long and short of it is
> that trying to provide a simple archive command in the documentation
> that actually *works* isn't enough- you need a real tool.  Maybe someone
> will write one some day that's part of core, but it's not happened yet
> and instead there's external solutions which actually do the correct
> things.

Ideally, I agree that that is right. But the documentation doesn't say
anything like "don't use a simple copy command in any case" (or at
least in the cases where more than a certain level of durability and
integrity guarantee is required).

Actually, many people are satisfied with just "cp/copy", and I think
they know that such a command doesn't guarantee the integrity of
archived files in, say, some disastrous situation like a sudden power
cut.

However, the use of "test !  -f..." is in a bit different kind of
suggestion.

https://www.postgresql.org/docs/13/continuous-archiving.html
| The archive command should generally be designed to refuse to
| overwrite any pre-existing archive file. This is an important safety
| feature to preserve the integrity of your archive in case of
| administrator error (such as sending the output of two different
| servers to the same archive directory)

This implies that no WAL segment is archived more than once, at least
under any valid operation. Some people follow this suggestion to
prevent the archive from being broken by *wrong* operations.

> The existing documentation should be taken as purely "this is how the
> variables which are passed in get expanded" not as "this is what you
> should do", because it's very much not the latter in any form.

It describes "how archive_command should be like" and showing examples
among the description implies that the example conforms the
should-be's.

Nevertheless, the issue here is that there's a case where archiving
stalls when following the suggestion above under a certain condition.
Even if it is written premising "set .. archive_mode to on", I don't
believe that people can surmise that the same archive_command might
fail when setting archive_mode to always, because the description
implies


So I think we need to revise the documentation, or need to *fix* the
revealed problem that is breaking the assumption of the documentation.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Duplicate history file?

From
Kyotaro Horiguchi
Date:
Yeah, it's hot these days...

At Tue, 08 Jun 2021 12:04:43 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> (Mmm. thunderbird or gmail connects this thread to the previous one..)
> 
> At Mon, 7 Jun 2021 14:20:38 -0400, Stephen Frost <sfrost@snowman.net> wrote in 
> > Greetings,
> > 
> > * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > > So, this is the new new thread.
> > 
> > This is definitely not the way I would recommend starting up a new
> > thread as you didn't include the actual text of the prior discussion for
> > people to be able to read and respond to, instead making them go hunt
> > for the prior discussion on the old thread and negating the point of
> > starting a new thread..
> 
> Sorry for that. I'll do that next time.
> 
> > Still, I went and found the other email-
> 
> Thanks!
> 
> > * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > > At Mon, 31 May 2021 11:52:05 +0900, Tatsuro Yamada <tatsuro.yamada.tf@nttcom.co.jp> wrote in 
> > > > Since the above behavior is different from the behavior of the
> > > > test command in the following example in postgresql.conf, I think
> > > > we should write a note about this example.
> > > > 
> > > > # e.g. 'test ! -f /mnt/server/archivedir/%f && cp %p
> > > > # /mnt/server/archivedir/%f'
> > > >
> > > > Let me describe the problem we faced.
> > > > - When archive_mode=always, archive_command is (sometimes) executed
> > > >   in a situation where the history file already exists on the standby
> > > >   side.
> > > > 
> > > > - In this case, if "test ! -f" is written in the archive_command of
> > > >   postgresql.conf on the standby side, the command will keep failing.
> > > > 
> > > >   Note that this problem does not occur when archive_mode=on.
> > > > 
> > > > So, what should we do for the user? I think we should put some notes
> > > > in postgresql.conf or in the documentation. For example, something
> > > > like this:
> > 
> > First off, we should tell them to not use test or cp in their actual
> > archive command because they don't do things like make sure that the WAL
> > that's been archived has actually been fsync'd.  Multiple people have
> > tried to make improvements in this area but the long and short of it is
> > that trying to provide a simple archive command in the documentation
> > that actually *works* isn't enough- you need a real tool.  Maybe someone
> > will write one some day that's part of core, but it's not happened yet
> > and instead there's external solutions which actually do the correct
> > things.
> 
> Ideally I agree that it is definitely right. But the documentation
> doesn't say a bit of "don't use the simple copy command in any case
> (or at least the cases where more than a certain level of durability
> and integrity guarantee is required).".
> 
> Actually many people are satisfied with just "cp/copy" and I think
> they know that the command doesn't guarantee on the integrity of
> archived files on , say, some disastrous situation like a sudden power
> cut.
> 
> However, the use of "test !  -f..." is in a bit different kind of
> suggestion.
> 
> https://www.postgresql.org/docs/13/continuous-archiving.html
> | The archive command should generally be designed to refuse to
> | overwrite any pre-existing archive file. This is an important safety
> | feature to preserve the integrity of your archive in case of
> | administrator error (such as sending the output of two different
> | servers to the same archive directory)
> 
> This implies that no WAL segment are archived more than once at least
> under any valid operation. Some people are following this suggestion
> to prevent archive from breaking by some *wrong* operations.
> 
> > The existing documentation should be taken as purely "this is how the
> > variables which are passed in get expanded" not as "this is what you
> > should do", because it's very much not the latter in any form.
> 

- It describes "how archive_command should be like" and showing examples
+ It describes "what archive_command should be like" and showing examples

> among the description implies that the example conforms the
> should-be's.
> 
> Nevertheless, the issue here is that there's a case where archiving
> stalls when following the suggestion above under a certain condition.
> Even if it is written premising "set .. archive_mode to on", I don't
> believe that people can surmise that the same archive_command might
- fail when setting archive_mode to always, because the description
- implies
+ fail when setting archive_mode to always.

> 
> So I think we need to revise the documentation, or need to *fix* the
> revealed problem that is breaking the assumption of the documentation.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Tue, Jun 8, 2021 at 12:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I tried back-porting my version of this patch to 9.6 to see what would
> happen there. One problem is that some of the functions have different
> names before v10. So 9.6 needs this:
>
> -    "SELECT pg_walfile_name(pg_current_wal_lsn());");
> +    "SELECT pg_xlogfile_name(pg_current_xlog_location());");
>
> But there's also another problem, which is that this doesn't work before v12:
>
> $node_standby->psql('postgres', 'SELECT pg_promote()');
>
> So I tried changing it to this:
>
> $node_standby->promote;
>
> But then the test fails, because pg_promote() has logic built into it
> to wait until the promotion actually happens, but ->promote doesn't,
> so SELECT pg_walfile_name(pg_current_wal_lsn()) errors out because the
> system is still in recovery. I'm not sure what to do about that. I
> quickly tried adding -w to 'sub promote' in PostgresNode.pm, but that
> didn't fix it, so I guess we'll have to find some other way to wait
> until the promotion is complete.
>

Maybe we can use this?

# Wait until the node exits recovery.
$standby->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
or die "Timed out while waiting for promotion";

I will try to generate a version for 9.6 based on this idea and see how it goes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Tue, Jun 8, 2021 at 11:13 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> # Wait until the node exits recovery.
> $standby->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
> or die "Timed out while waiting for promotion";
>
> I will try to generate a version for 9.6 based on this idea and see how it goes

I have adapted it for 9.6, but I am seeing a crash (both with and
without the fix). I could not figure out the reason; it did not
generate any core dump, even though I changed pg_ctl in PostgresNode.pm
to use "-c" so that it could generate core files, no core file was
produced.

This is the log from the cascading node (025_stuck_on_old_timeline_cascade.log)
-------------
cp: cannot stat
‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_52dW/archives/000000010000000000000003’:
No such file or directory
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
FATAL:  could not receive database system identifier and timeline ID
from the primary server: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
--------------

The attached logs are when I ran without a fix.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Robert Haas
Date:
On Tue, Jun 8, 2021 at 4:47 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have changed for as per 9.6 but I am seeing some crash (both
> with/without fix), I could not figure out the reason, it did not
> generate any core dump, although I changed pg_ctl in PostgresNode.pm
> to use "-c" so that it can generate core but it did not generate any
> core file.

I think the problem is here:

Can't locate object method "lsn" via package "PostgresNode" at
t/025_stuck_on_old_timeline.pl line 84.

When that happens, it bails out, and cleans everything up, doing an
immediate shutdown of all the nodes. The 'lsn' method was added by
commit fb093e4cb36fe40a1c3f87618fb8362845dae0f0, so it only appears in
v10 and later. I think maybe we can think of back-porting that to 9.6.
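A 9.6 back-port of that method would presumably just need the pre-v10
function names; a rough, untested sketch:

sub lsn
{
	my ($self, $mode) = @_;
	# 9.6 spellings of the WAL-position functions
	my %modes = (
		insert  => 'pg_current_xlog_insert_location()',
		write   => 'pg_current_xlog_location()',
		receive => 'pg_last_xlog_receive_location()',
		replay  => 'pg_last_xlog_replay_location()');
	$mode = 'write' unless defined $mode;
	die "unknown mode for 'lsn': '$mode'" unless exists $modes{$mode};
	return $self->safe_psql('postgres', "SELECT $modes{$mode}");
}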

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Tue, Jun 8, 2021 at 12:26 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think the problem is here:
>
> Can't locate object method "lsn" via package "PostgresNode" at
> t/025_stuck_on_old_timeline.pl line 84.
>
> When that happens, it bails out, and cleans everything up, doing an
> immediate shutdown of all the nodes. The 'lsn' method was added by
> commit fb093e4cb36fe40a1c3f87618fb8362845dae0f0, so it only appears in
> v10 and later. I think maybe we can think of back-porting that to 9.6.

Here's an updated set of patches. I removed the extra teardown_node
calls per Kyotaro Horiguchi's request. I adopted his suggestion for
setting a $perlbin variable from $^X, but found that $perlbin was
undefined, so I split the incantation into two lines to fix that. I
updated the code to use ->promote() instead of calling pg_promote(),
and to use poll_query_until() afterwards to wait for promotion as
suggested by Dilip. Also, I added a comment to the change in xlog.c.
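(Concretely, the two lines are roughly:

my $perlbin = $^X;
$perlbin =~ s{\\}{\\\\}g if ($TestLib::windows_os);

with the backslash-escaping only applied on Windows.)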

Then I tried to get things working on 9.6. There's a patch attached to
back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also
a version of the main patch attached with the necessary wal->xlog,
lsn->location renaming. Unfortunately ... the new test case still
fails on 9.6 in a way that looks an awful lot like the bug isn't
actually fixed:

LOG:  primary server contains no more WAL on requested timeline 1
cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/000000010000000000000003:
No such file or directory
(repeated many times)

I find that the same failure happens if I back-port the master version
of the patch to v10 or v11, but if I back-port it to v12 or v13 then
the test passes as expected. I haven't figured out what the issue is
yet. I also noticed that if I back-port it to v12 and then revert the
code change, the test still passes. So I think there may be something
subtly wrong with this test case yet. Or maybe a code bug.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Wed, Jun 9, 2021 at 2:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Then I tried to get things working on 9.6. There's a patch attached to
> back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also
> a version of the main patch attached with the necessary wal->xlog,
> lsn->location renaming. Unfortunately ... the new test case still
> fails on 9.6 in a way that looks an awful lot like the bug isn't
> actually fixed:
>
> LOG:  primary server contains no more WAL on requested timeline 1
> cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/000000010000000000000003: No such file or directory
> (repeated many times)
>
> I find that the same failure happens if I back-port the master version
> of the patch to v10 or v11,

I think this fails because prior to v12 the recovery target timeline was not set to 'latest' by default, since it was not a GUC at that time.  So after the fix below, it started passing on v11 (only tested on v11 so far).


diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index 842878a..b3ce5da 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -50,6 +50,9 @@ my $node_cascade = get_new_node('cascade');
 $node_cascade->init_from_backup($node_standby, $backup_name,
        has_streaming => 1);
 $node_cascade->enable_restoring($node_primary);
+$node_cascade->append_conf('recovery.conf', qq(
+recovery_target_timeline='latest'
+));
 
But now it started passing even without the fix, and the log shows that it never tried to stream from the primary using TL 1, so it never hit the defect location.

2021-06-09 12:11:08.618 IST [122456] LOG:  entering standby mode
2021-06-09 12:11:08.622 IST [122456] LOG:  restored log file "00000002.history" from archive
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000002’: No such file or directory
2021-06-09 12:11:08.627 IST [122456] LOG:  redo starts at 0/2000028
2021-06-09 12:11:08.627 IST [122456] LOG:  consistent recovery state reached at 0/3000000

Next, I will investigate why, without the fix, it is not hitting the defect location at all on v11 (and maybe v12, v10 ..).  After that, I will check the status on the other older versions.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Wed, Jun 9, 2021 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 9, 2021 at 2:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
> 2021-06-09 12:11:08.618 IST [122456] LOG:  entering standby mode
> 2021-06-09 12:11:08.622 IST [122456] LOG:  restored log file "00000002.history" from archive
> cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000002’: No such file or directory
> 2021-06-09 12:11:08.627 IST [122456] LOG:  redo starts at 0/2000028
> 2021-06-09 12:11:08.627 IST [122456] LOG:  consistent recovery state reached at 0/3000000
>
> Next, I will investigate, without a fix on v11 (maybe v12, v10..) why it is not hitting the defect location at all. And after that, I will check the status on other older versions.

The reason for the problem was that the "-Xnone" parameter was not
accepted by "sub backup" in PostgresNode.pm, so I created a back-patch
for that.  With the attached patches I am able to make the test pass in
v12, v11, and v10 (with the fix) and fail (without the fix).  However,
we will have to make some change for 9.6 because pg_basebackup doesn't
support -Xnone on 9.6; maybe we can delete the contents of pg_wal after
the backup.  If that approach looks fine, then I will make the changes
for 9.6 as well.
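Something along these lines, perhaps (untested sketch; $backup_name
stands for whatever name the test uses for the base backup):

# 9.6's pg_basebackup has no -Xnone, so empty the copied pg_xlog
# directory after taking the backup instead.
my $backup_path = $node_primary->backup_dir . '/' . $backup_name;
unlink glob "$backup_path/pg_xlog/0*";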

Note: for the parameter back-port, the same patch applies to v12 and
v11, but for v10 we need a separate patch due to a conflict (both
attached).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Wed, Jun 9, 2021 at 1:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Jun 9, 2021 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Jun 9, 2021 at 2:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > 2021-06-09 12:11:08.618 IST [122456] LOG:  entering standby mode
> > 2021-06-09 12:11:08.622 IST [122456] LOG:  restored log file "00000002.history" from archive
> > cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000002’: No such file or directory
> > 2021-06-09 12:11:08.627 IST [122456] LOG:  redo starts at 0/2000028
> > 2021-06-09 12:11:08.627 IST [122456] LOG:  consistent recovery state reached at 0/3000000
> >
> > Next, I will investigate, without a fix on v11 (maybe v12, v10..) why it is not hitting the defect location at all. And after that, I will check the status on other older versions.
>
> Reason for the problem was that the "-Xnone" parameter was not
> accepted by "sub backup" in PostgresNode.pm so I created that for
> backpatch.  With attached patches I am to make it pass in v12,v11,v10
> (with fix) and fail (without fix).  However, we will have to make some
> change for 9.6 because pg_basebackup doesn't support -Xnone on 9.6,
> maybe we can delete the content from pg_wal after the backup, if we
> think that approach looks fine then I will make the changes for 9.6 as
> well.
>
> Note: for param backport for v12 and v11 same patch getting applied
> but for v10 due to some conflict we need a separate patch (both
> attached).

I have fixed it for 9.6 as well by removing the WAL from the xlog
directory. I am attaching all the patches in a single mail to avoid
confusion.

Note:
v7-0001 applies to master, v13, and v12 (but for v12 we need to apply
the backport patch first)
v12-v8-0001-Backport is the same as v11-v8-0001-Backport (duplicated
for version-wise separation)
v11-v8-0002 is the same as v10-v8-0002

Basically, the same backport patch works for v12 and v11, and the same
main patch works for v11 and v10; I still duplicated them to avoid confusion.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Race condition in recovery?

From
Robert Haas
Date:
On Wed, Jun 9, 2021 at 4:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Reason for the problem was that the "-Xnone" parameter was not
> accepted by "sub backup" in PostgresNode.pm so I created that for
> backpatch.  With attached patches I am to make it pass in v12,v11,v10
> (with fix) and fail (without fix).  However, we will have to make some
> change for 9.6 because pg_basebackup doesn't support -Xnone on 9.6,
> maybe we can delete the content from pg_wal after the backup, if we
> think that approach looks fine then I will make the changes for 9.6 as
> well.

Ah. I looked into this and found that this is because commit
081876d75ea15c3bd2ee5ba64a794fd8ea46d794 is new in master, so actually
that change is absent in all the back-branches. I have now back-ported
that portion of that commit to v13, v12, v11, and v10.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Robert Haas
Date:
On Wed, Jun 9, 2021 at 4:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Reason for the problem was that the "-Xnone" parameter was not
> accepted by "sub backup" in PostgresNode.pm so I created that for
> backpatch.  With attached patches I am to make it pass in v12,v11,v10
> (with fix) and fail (without fix).  However, we will have to make some
> change for 9.6 because pg_basebackup doesn't support -Xnone on 9.6,
> maybe we can delete the content from pg_wal after the backup, if we
> think that approach looks fine then I will make the changes for 9.6 as
> well.

Got it. I have now committed the patch to all branches, after adapting
your changes just a little bit.

Thanks to you and Kyotaro-san for all the time spent on this. What a slog!

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Got it. I have now committed the patch to all branches, after adapting
> your changes just a little bit.
> Thanks to you and Kyotaro-san for all the time spent on this. What a slog!

conchuela failed its first encounter with this test case:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25

That machine has a certain, er, history of flakiness; so this may
not mean anything.  Still, we'd better keep an eye out to see if
the test needs more stabilization.

            regards, tom lane



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Wed, 09 Jun 2021 19:09:54 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Robert Haas <robertmhaas@gmail.com> writes:
> > Got it. I have now committed the patch to all branches, after adapting
> > your changes just a little bit.
> > Thanks to you and Kyotaro-san for all the time spent on this. What a slog!
> 
> conchuela failed its first encounter with this test case:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25
> 
> That machine has a certain, er, history of flakiness; so this may
> not mean anything.  Still, we'd better keep an eye out to see if
> the test needs more stabilization.


https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25&stg=recovery-check

> ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_cascade.log ==~_~===-=-===~_~==
....
> 2021-06-09 23:31:10.439 CEST [893820:1] LOG:  started streaming WAL from primary at 0/2000000 on timeline 1
> 2021-06-09 23:31:10.439 CEST [893820:2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000002 has already been removed

The script 025_stuck_on_old_timeline.pl (and I) forgot to set
wal_keep_size (wal_keep_segments on older branches).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index 0d96bb3c15..25c2dff437 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -27,6 +27,7 @@ $perlbin =~ s{\\}{\\\\}g if ($TestLib::windows_os);
 my $archivedir_primary = $node_primary->archive_dir;
 $node_primary->append_conf('postgresql.conf', qq(
 archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
+wal_keep_size=128MB
 ));
 $node_primary->start;

diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index 0d96bb3c15..8099571299 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -27,6 +27,7 @@ $perlbin =~ s{\\}{\\\\}g if ($TestLib::windows_os);
 my $archivedir_primary = $node_primary->archive_dir;
 $node_primary->append_conf('postgresql.conf', qq(
 archive_command = '$perlbin "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
+wal_keep_segments=8
 ));
 $node_primary->start;


Re: Race condition in recovery?

From
Robert Haas
Date:
On Wed, Jun 9, 2021 at 9:12 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25&stg=recovery-check
>
> > ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_cascade.log ==~_~===-=-===~_~==
> ....
> > 2021-06-09 23:31:10.439 CEST [893820:1] LOG:  started streaming WAL from primary at 0/2000000 on timeline 1
> > 2021-06-09 23:31:10.439 CEST [893820:2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000002 has already been removed
>
> The script 025_stuck_on_olde_timeline.pl (and I) forgets to set
> wal_keep_size(segments).

Thanks for the analysis and the patches. I have committed them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Thu, 10 Jun 2021 09:56:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Wed, Jun 9, 2021 at 9:12 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25&stg=recovery-check
> >
> > > ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_cascade.log ==~_~===-=-===~_~==
> > ....
> > > 2021-06-09 23:31:10.439 CEST [893820:1] LOG:  started streaming WAL from primary at 0/2000000 on timeline 1
> > > 2021-06-09 23:31:10.439 CEST [893820:2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000002 has already been removed
> >
> > The script 025_stuck_on_olde_timeline.pl (and I) forgets to set
> > wal_keep_size(segments).
> 
> Thanks for the analysis and the patches. I have committed them.

Thanks for committing it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Tom Lane
Date:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> At Thu, 10 Jun 2021 09:56:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
>> Thanks for the analysis and the patches. I have committed them.

> Thanks for committing it.

Please note that conchuela and jacana are still failing ...

conchuela's failure is evidently not every time, but this test
definitely postdates the "fix":

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-10%2014%3A09%3A08

            regards, tom lane



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > At Thu, 10 Jun 2021 09:56:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> >> Thanks for the analysis and the patches. I have committed them.
> 
> > Thanks for committing it.
> 
> Please note that conchuela and jacana are still failing ...
> 
> conchuela's failure is evidently not every time, but this test
> definitely postdates the "fix":
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-10%2014%3A09%3A08

A different test is failing there.  Maybe it's from a different issue.


> ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/regress_log_002_archiving ==~_~===-=-===~_~==
...
> # Postmaster PID for node "standby2" is 342349
> ### Promoting node "standby2"
> # Running: pg_ctl -D /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_standby2_data/pgdata -l /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/log/002_archiving_standby2.log promote
> waiting for server to promote................................................................................................................ stopped waiting
> pg_ctl: server did not promote in time
> Bail out!  system pg_ctl failed

> ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/002_archiving_standby2.log ==~_~===-=-===~_~==
...
> 2021-06-10 16:21:21.870 CEST [342350:9] LOG:  received promote request
> 2021-06-10 16:21:21.870 CEST [342350:10] LOG:  redo done at 0/3030200 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.07 s
> 2021-06-10 16:21:21.870 CEST [342350:11] LOG:  last completed transaction was at log time 2021-06-10 16:21:21.010599+02
> 2021-06-10 16:21:21.893 CEST [342350:12] LOG:  restored log file "000000010000000000000003" from archive
> cp: /home/pgbf/buildroot/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/00000003.history: No such file or directory
> 2021-06-10 16:21:21.896 CEST [342350:13] LOG:  selected new timeline ID: 3
(log ends here)


> ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/002_archiving_primary.log ==~_~===-=-===~_~==
...
> 2021-06-10 16:21:21.107 CEST [342322:4] 002_archiving.pl LOG:  disconnection: session time: 0:00:00.022 user=pgbf database=postgres host=[local]
> 2021-06-10 16:23:21.965 CEST [342279:4] LOG:  received immediate shutdown request

So standby2 was stuck after selecting the new timeline and before
updating the control file, and its postmaster couldn't even respond to
SIGQUIT.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Fri, 11 Jun 2021 14:07:45 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> > conchuela's failure is evidently not every time, but this test
> > definitely postdates the "fix":

conchuela failed recovery_check this time, and

> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-10%2014%3A09%3A08
> So the standby2 was stuck after selecting the new timeline and before
> updating control file and its postmaster couldn't even respond to
> SIGQUIT.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25

  This is before the "fix"

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-08%2014%3A07%3A46

  failed in pg_verifybackupCheck

> ==~_~===-=-===~_~== pgsql.build/src/bin/pg_verifybackup/tmp_check/log/regress_log_003_corruption ==~_~===-=-===~_~==
...
> #   Failed test 'base backup ok'
> #   at t/003_corruption.pl line 115.
> # Running: pg_verifybackup /home/pgbf/buildroot/HEAD/pgsql.build/src/bin/pg_verifybackup/tmp_check/t_003_corruption_primary_data/backup/open_directory_fails
> pg_verifybackup: fatal: could not open file "/home/pgbf/buildroot/HEAD/pgsql.build/src/bin/pg_verifybackup/tmp_check/t_003_corruption_primary_data/backup/open_directory_fails/backup_manifest": No such file or directory
> not ok 38 - intact backup verified

The manifest file is missing from the backup. In this case, too, the
server failed to handle SIGQUIT.

> ==~_~===-=-===~_~== pgsql.build/src/bin/pg_verifybackup/tmp_check/log/003_corruption_primary.log ==~_~===-=-===~_~==
...
> 2021-06-08 16:17:41.706 CEST [51792:9] 003_corruption.pl LOG:  received replication command: START_REPLICATION SLOT "pg_basebackup_51792" 0/B000000 TIMELINE 1
> 2021-06-08 16:17:41.706 CEST [51792:10] 003_corruption.pl STATEMENT:  START_REPLICATION SLOT "pg_basebackup_51792" 0/B000000 TIMELINE 1
(log ends here)

This seems like it might be some hardware failure?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
tgl> Please note that conchuela and jacana are still failing ...

I forgot jacana's case..

It is failing for the issue the first patch should have fixed.

> ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_primary.log ==~_~===-=-===~_~==
...
> The system cannot find the path specified.
> 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:1] LOG:  archive command failed with exit code 1
> 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:2] DETAIL:  The failed archive command was: /usr/bin/perl "/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files" "pg_wal\\000000010000000000000001" "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000001"

cp_history_files exits with just "exit" for files with that name, which
should set the exit status to 0. ActivePerl did so.

If I specify a nonexistent command like /hoge/perl, %ERRORLEVEL% is
set to 3, not 1.

I haven't figured out what is happening there so far.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Dilip Kumar
Date:
On Fri, Jun 11, 2021 at 11:45 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 10 Jun 2021 21:53:18 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> tgl> Please note that conchuela and jacana are still failing ...
>
> I forgot jacana's case..
>
> It is failing for the issue the first patch should have fixed.
>
> > ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_primary.log ==~_~===-=-===~_~==
> ...
> > The system cannot find the path specified.
> > 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:1] LOG:  archive command failed with exit code 1
> > 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:2] DETAIL:  The failed archive command was: /usr/bin/perl "/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files" "pg_wal\\000000010000000000000001" "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000001"

WAL file copying will not create a problem for us, but I noticed that
it is failing to copy the history files as well, and that is
creating a problem.

2021-06-10 22:56:28.940 EDT [60c2d0db.1208:1] LOG:  archive command
failed with exit code 1
2021-06-10 22:56:28.940 EDT [60c2d0db.1208:2] DETAIL:  The failed
archive command was: /usr/bin/perl
"/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files"
"pg_wal\\00000002.history"

"/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/00000002.history"

I have noticed that the archive command is failing in some other test
case too (002_archiving_standby2.log), see below logs.

==~_~===-=-===~_~==
pgsql.build/src/test/recovery/tmp_check/log/002_archiving_standby2.log
==~_~===-=-===~_~==
...

        0 file(s) copied.
2021-06-10 22:44:34.467 EDT [60c2ce10.1270:1] LOG:  archive command
failed with exit code 1
2021-06-10 22:44:34.467 EDT [60c2ce10.1270:2] DETAIL:  The failed
archive command was: copy "pg_wal\\00000003.history"

"c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives\\00000003.history"
The system cannot find the path specified.
        0 file(s) copied.
2021-06-10 22:44:35.478 EDT [60c2ce10.1270:3] LOG:  archive command
failed with exit code 1
2021-06-10 22:44:35.478 EDT [60c2ce10.1270:4] DETAIL:  The failed
archive command was: copy "pg_wal\\00000003.history"

"c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives\\00000003.history"
2021-06-10 22:44:36.113 EDT [60c2ce0c.283c:5] LOG:  received immediate
shutdown request
2021-06-10 22:44:36.129 EDT [60c2ce0c.283c:6] LOG:  database system is shut down

I am not able to figure out why the archive command is failing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Tom Lane
Date:
Dilip Kumar <dilipbalaut@gmail.com> writes:
> On Fri, Jun 11, 2021 at 11:45 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>>> ==~_~===-=-===~_~== pgsql.build/src/test/recovery/tmp_check/log/025_stuck_on_old_timeline_primary.log ==~_~===-=-===~_~==
>>> ...
>>> The system cannot find the path specified.
>>> 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:1] LOG:  archive command failed with exit code 1
>>> 2021-06-10 22:56:17.754 EDT [60c2d0cf.54c:2] DETAIL:  The failed archive command was: /usr/bin/perl "/home/pgrunner/bf/root/HEAD/pgsql/src/test/recovery/t/cp_history_files" "pg_wal\\000000010000000000000001" "/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000001"

> Wal file copying will not create a problem for us, but I noticed that
> it is failing in copying the history files as well and that is
> creating a problem.

I think jacana uses msys[2?], so this likely indicates a problem
in path sanitization for the archive command.  Andrew, any advice?

            regards, tom lane



Re: Race condition in recovery?

From
Tom Lane
Date:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
>> ==~_~===-=-===~_~== pgsql.build/src/bin/pg_verifybackup/tmp_check/log/003_corruption_primary.log ==~_~===-=-===~_~==
>> ...
>> 2021-06-08 16:17:41.706 CEST [51792:9] 003_corruption.pl LOG:  received replication command: START_REPLICATION SLOT "pg_basebackup_51792" 0/B000000 TIMELINE 1
>> 2021-06-08 16:17:41.706 CEST [51792:10] 003_corruption.pl STATEMENT:  START_REPLICATION SLOT "pg_basebackup_51792" 0/B000000 TIMELINE 1
>> (log ends here)

> There seems like some hardware failure?

conchuela has definitely evinced flakiness before.  Not sure what's
up with it, but I have no problem with writing off non-repeatable
failures from that machine.  In any case, it's now passed half a
dozen times in a row on HEAD, so I think we can say that it's okay
with this test.  That leaves jacana, which I'm betting has a
Windows portability issue with the new test.

            regards, tom lane



Re: Race condition in recovery?

From
Michael Paquier
Date:
On Fri, Jun 11, 2021 at 10:46:45AM -0400, Tom Lane wrote:
> I think jacana uses msys[2?], so this likely indicates a problem
> in path sanitization for the archive command.  Andrew, any advice?

Err, something around TestLib::perl2host()?
--
Michael

Attachment

Re: Race condition in recovery?

From
Andrew Dunstan
Date:
On 6/12/21 3:48 AM, Michael Paquier wrote:
> On Fri, Jun 11, 2021 at 10:46:45AM -0400, Tom Lane wrote:
>> I think jacana uses msys[2?], so this likely indicates a problem
>> in path sanitization for the archive command.  Andrew, any advice?
> Err, something around TestLib::perl2host()?


I'm working on a fix for this. Yes it includes perl2host, but that's not
enough :-)


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Race condition in recovery?

From
Andrew Dunstan
Date:
On 6/12/21 7:31 AM, Andrew Dunstan wrote:
> On 6/12/21 3:48 AM, Michael Paquier wrote:
>> On Fri, Jun 11, 2021 at 10:46:45AM -0400, Tom Lane wrote:
>>> I think jacana uses msys[2?], so this likely indicates a problem
>>> in path sanitization for the archive command.  Andrew, any advice?
>> Err, something around TestLib::perl2host()?
>
> I'm working on a fix for this. Yes it includes perl2host, but that's not
> enough :-)
>
>

I have pushed a fix, tested on a replica of fairywren/drongo,


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Race condition in recovery?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> I have pushed a fix, tested on a replica of fairywren/drongo,

This bit seems a bit random:

 # WAL segment, this is enough to guarantee that the history file was
 # archived.
 my $archive_wait_query =
-  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
+  "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " .
+  "FROM pg_stat_archiver";
 $node_standby->poll_query_until('postgres', $archive_wait_query)
   or die "Timed out while waiting for WAL segment to be archived";
 my $last_archived_wal_file = $walfile_to_be_archived;

I wonder whether that is a workaround for the poll_query_until bug
I proposed to fix at [1].

            regards, tom lane

[1] https://www.postgresql.org/message-id/2130215.1623450521%40sss.pgh.pa.us



Re: Race condition in recovery?

From
Andrew Dunstan
Date:
On 6/12/21 10:20 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> I have pushed a fix, tested on a replica of fairywren/drongo,
> This bit seems a bit random:
>
>  # WAL segment, this is enough to guarantee that the history file was
>  # archived.
>  my $archive_wait_query =
> -  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
> +  "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " .
> +  "FROM pg_stat_archiver";
>  $node_standby->poll_query_until('postgres', $archive_wait_query)
>    or die "Timed out while waiting for WAL segment to be archived";
>  my $last_archived_wal_file = $walfile_to_be_archived;
>
> I wonder whether that is a workaround for the poll_query_until bug
> I proposed to fix at [1].
>
>             regards, tom lane
>
> [1] https://www.postgresql.org/message-id/2130215.1623450521%40sss.pgh.pa.us



No, it's because I found it annoying and confusing that there was an
invisible result when last_archived_wal is null.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Race condition in recovery?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 6/12/21 10:20 AM, Tom Lane wrote:
>> I wonder whether that is a workaround for the poll_query_until bug
>> I proposed to fix at [1].

> No, it's because I found it annoying and confusing that there was an
> invisible result when last_archived_wal is null.

OK.  But it makes me itch a bit that this one wait-for-wal-to-be-
processed query looks different from all the other ones.

            regards, tom lane



Re: Race condition in recovery?

From
Andrew Dunstan
Date:
On 6/12/21 1:07 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 6/12/21 10:20 AM, Tom Lane wrote:
>>> I wonder whether that is a workaround for the poll_query_until bug
>>> I proposed to fix at [1].
>> No, it's because I found it annoying and confusing that there was an
>> invisible result when last_archived_wal is null.
> OK.  But it makes me itch a bit that this one wait-for-wal-to-be-
> processed query looks different from all the other ones.
>
>             


I'm happy to bring the other two queries that look like this into line
with this one if you like.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Race condition in recovery?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 6/12/21 1:07 PM, Tom Lane wrote:
>> OK.  But it makes me itch a bit that this one wait-for-wal-to-be-
>> processed query looks different from all the other ones.

> I'm happy to bring the other two queries that look like this into line
> with this one if you like.

I see a lot more than two --- grepping for poll_query_until with
a test involving a LSN comparison finds a bunch.  Are we sure that
there are only three in which the LSN could be null?  How much
does it really matter if it is?

            regards, tom lane



Re: Race condition in recovery?

From
Andrew Dunstan
Date:
On 6/12/21 1:54 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 6/12/21 1:07 PM, Tom Lane wrote:
>>> OK.  But it makes me itch a bit that this one wait-for-wal-to-be-
>>> processed query looks different from all the other ones.
>> I'm happy to bring the other two queries that look like this into line
>> with this one if you like.
> I see a lot more than two --- grepping for poll_query_until with
> a test involving a LSN comparison finds a bunch.  Are we sure that
> there are only three in which the LSN could be null?  


Well, I'm counting places that specifically compare it with
pg_stat_archiver.last_archived_wal.



> How much
> does it really matter if it is?
>
>             


It makes it harder to tell if there was any result at all when there's a
failure. If it bugs you that much I can revert just that line. Now that
I have fixed the immediate issue it matters less. I'm not prepared to
put in a lot of effort here, though.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Race condition in recovery?

From
Mikael Kjellström
Date:

On 2021-06-10 01:09, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Got it. I have now committed the patch to all branches, after adapting
>> your changes just a little bit.
>> Thanks to you and Kyotaro-san for all the time spent on this. What a slog!
> 
> conchuela failed its first encounter with this test case:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2021-06-09%2021%3A12%3A25
> 
> That machine has a certain, er, history of flakiness; so this may
> not mean anything.  Still, we'd better keep an eye out to see if
> the test needs more stabilization.

Yes, the flakiness is caused by the rather unusual filesystem (HAMMERFS),
which has some odd garbage-collection handling that sometimes fills up
the disk and then never recovers automatically.

I have tried putting the cleanup utility for HAMMERFS in cron to run
on a schedule, but it isn't 100% foolproof.

So I am going to upgrade to a newer version of DragonflyBSD in the near 
future.

/Mikael



Re: Race condition in recovery?

From
Robert Haas
Date:
On Sat, Jun 12, 2021 at 10:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > I have pushed a fix, tested on a replica of fairywren/drongo,
>
> This bit seems a bit random:
>
>  # WAL segment, this is enough to guarantee that the history file was
>  # archived.
>  my $archive_wait_query =
> -  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
> +  "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " .
> +  "FROM pg_stat_archiver";
>  $node_standby->poll_query_until('postgres', $archive_wait_query)
>    or die "Timed out while waiting for WAL segment to be archived";
>  my $last_archived_wal_file = $walfile_to_be_archived;
>
> I wonder whether that is a workaround for the poll_query_until bug
> I proposed to fix at [1].

I found that a bit random too, but it wasn't the only part of the
patch I found a bit random. Like, what can this possibly be doing?

+if ($^O eq 'msys')
+{
+    $perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X);
+}

The idea here is apparently that on msys, the directory name that is
part of $^X needs to be passed through perl2host, but the file name
that is part of the same $^X needs to NOT be passed through perl2host.
Is $^X really that broken? If so, I think some comments are in order.

+local $ENV{PERL_BADLANG}=0;

Similarly here. There's not a single other reference to PERL_BADLANG
in the repository, so if we need this one here, there should be a
comment explaining why this is different from all the places where we
don't need it.

On those occasions when I commit TAP test cases, I do try to think
about whether they are going to be portable, but I find these kinds of
changes indistinguishable from magic.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Andrew Dunstan
Date:
On 6/14/21 11:52 AM, Robert Haas wrote:
> On Sat, Jun 12, 2021 at 10:20 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> I have pushed a fix, tested on a replica of fairywren/drongo,
>> This bit seems a bit random:
>>
>>  # WAL segment, this is enough to guarantee that the history file was
>>  # archived.
>>  my $archive_wait_query =
>> -  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
>> +  "SELECT coalesce('$walfile_to_be_archived' <= last_archived_wal, false) " .
>> +  "FROM pg_stat_archiver";
>>  $node_standby->poll_query_until('postgres', $archive_wait_query)
>>    or die "Timed out while waiting for WAL segment to be archived";
>>  my $last_archived_wal_file = $walfile_to_be_archived;
>>
>> I wonder whether that is a workaround for the poll_query_until bug
>> I proposed to fix at [1].



This has been reverted.


> I found that a bit random too, but it wasn't the only part of the
> patch I found a bit random. Like, what can this possibly be doing?
>
> +if ($^O eq 'msys')
> +{
> +    $perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X);
> +}
>
> The idea here is apparently that on msys, the directory name that is
> part of $^X needs to be passed through perl2host, but the file name
> that is part of the same $^X needs to NOT be passed through perl2host.
> Is $^X really that broken? If so, I think some comments are in order.


$^X is not at all broken.


The explanation here is pretty simple - the argument to perl2host is
meant to be a directory. If we're going to accommodate plain files then
we have some more work to do in TestLib.


> +local $ENV{PERL_BADLANG}=0;
>
> Similarly here. There's not a single other reference to PERL_BADLANG
> in the repository, so if we need this one here, there should be a
> comment explaining why this is different from all the places where we
> don't need it.


Here's why this is different: this is the only place that we invoke the
msys perl in this way (i.e. from a non-msys aware environment - the
binaries we build are not msys-aware). We need to do that if for no
other reason than that it might well be the only perl available. Doing
so makes it complain loudly about missing locale info. Setting this
variable makes it shut up. I can add a comment on that if you like.


> On those occasions when I commit TAP test cases, I do try to think
> about whether they are going to be portable, but I find these kinds of
> changes indistinguishable from magic.



Part of the trouble is that I've been living and breathing some of these
issues so much recently that I forget that what might be fairly obvious
to me isn't to others. I assure you there is not the faintest touch of
pixy dust involved.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Race condition in recovery?

From
Robert Haas
Date:
On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> $^X is not at all broken.
>
> The explanation here is pretty simple - the argument to perl2host is
> meant to be a directory. If we're going to accomodate plain files then
> we have some more work to do in TestLib.

This explanation seems to contradict the documentation in TestLib.pm,
which makes no mention of any such restriction.

> > +local $ENV{PERL_BADLANG}=0;
> >
> > Similarly here. There's not a single other reference to PERL_BADLANG
> > in the repository, so if we need this one here, there should be a
> > comment explaining why this is different from all the places where we
> > don't need it.
>
> Here's why this is different: this is the only place that we invoke the
> msys perl in this way (i.e. from a non-msys aware environment - the
> binaries we build are not msys-aware). We need to do that if for no
> other reason than that it might well be the only perl available. Doing
> so makes it complain loudly about missing locale info. Setting this
> variable makes it shut up. I can add a comment on that if you like.

Yes, please, but perhaps you'd like to post patches for discussion
first instead of committing directly.

> Part of the trouble is that I've been living and breathing some of these
> issues so much recently that I forget that what might be fairly obvious
> to me isn't to others. I assure you there is not the faintest touch of
> pixy dust involved.

Every pixie with whom I've spoken today says otherwise!

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Andrew Dunstan
Date:
On 6/14/21 1:11 PM, Robert Haas wrote:
> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>> $^X is not at all broken.
>>
>> The explanation here is pretty simple - the argument to perl2host is
>> meant to be a directory. If we're going to accomodate plain files then
>> we have some more work to do in TestLib.
> This explanation seems to contradict the documentation in TestLib.pm,
> which makes no mention of any such restriction.


Here's a snippet:


    sub perl2host
    {
        my ($subject) = @_;
        ...
        if (chdir $subject)
     

Last time I looked you can't chdir to anything except a directory.


>
>>> +local $ENV{PERL_BADLANG}=0;
>>>
>>> Similarly here. There's not a single other reference to PERL_BADLANG
>>> in the repository, so if we need this one here, there should be a
>>> comment explaining why this is different from all the places where we
>>> don't need it.
>> Here's why this is different: this is the only place that we invoke the
>> msys perl in this way (i.e. from a non-msys aware environment - the
>> binaries we build are not msys-aware). We need to do that if for no
>> other reason than that it might well be the only perl available. Doing
>> so makes it complain loudly about missing locale info. Setting this
>> variable makes it shut up. I can add a comment on that if you like.
> Yes, please, but perhaps you'd like to post patches for discussion
> first instead of committing directly.


I was trying to get the buildfarm green again. There have been plenty of
times when small patches for such fixes have been committed directly.
And that's the only circumstance in which I do so.



cheers


andrew




Re: Race condition in recovery?

From
Robert Haas
Date:
On Mon, Jun 14, 2021 at 1:50 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> Heres a snippet:
>
>     sub perl2host
>     {
>         my ($subject) = @_;
>         ...
>         if (chdir $subject)
>
> Last time I looked you can't chdir to anything except a directory.

OK, but like I said, you can't tell that from the documentation. The
documentation says: "Translate a virtual file name to a host file
name. Currently, this is a no-op except for the case of Perl=msys and
host=mingw32.  The subject need not exist, but its parent or
grandparent directory must exist unless cygpath is available." If you
look just at that, there's nothing that would lead you to believe that
it has to be a directory name.

> I was trying to get the buildfarm green again. There have been plenty of
> times when small patches directly for such fixes have been committed
> directly. And that's the only circumstance when I do.

I wasn't intending to criticize your work on this. I really appreciate
it, in fact, as I also said to you off-list. I do think that there
were some small things in those patches where a little bit of quick
discussion might have been useful: e.g. should the archive_command
change have gone in in the first place? Do we need any comments to
explain the fixes? But it's not like it's a big deal either. I'm
certainly not disagreeing with the goodness of making the buildfarm
green as expediently as possible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Andrew Dunstan
Date:
On 6/14/21 1:50 PM, Andrew Dunstan wrote:
> On 6/14/21 1:11 PM, Robert Haas wrote:
>> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>>> $^X is not at all broken.
>>>
>>> The explanation here is pretty simple - the argument to perl2host is
>>> meant to be a directory. If we're going to accommodate plain files then
>>> we have some more work to do in TestLib.
>> This explanation seems to contradict the documentation in TestLib.pm,
>> which makes no mention of any such restriction.
>
> Here's a snippet:
>
>
>     sub perl2host
>     {
>         my ($subject) = @_;
>         ...
>         if (chdir $subject)
>      
>
> Last time I looked you can't chdir to anything except a directory.



Actually, I take it back, it does work for a file. I'll change it. I
probably did this when something else wasn't working.
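
A hypothetical sketch, not the actual TestLib code, of how a chdir-based
translation can accept a plain file as well (it assumes an msys shell is
on the PATH for "pwd -W"): fall back to the parent directory and
re-append the leaf name.

    use strict;
    use warnings;
    use Cwd qw(cwd);
    use File::Basename qw(basename dirname);

    sub translate_path
    {
        my ($subject) = @_;
        my $here = cwd;
        my $leaf = '';
        unless (chdir $subject)
        {
            # $subject is not a directory: translate its parent instead
            # and remember the leaf to put back at the end.
            $leaf = '/' . basename($subject);
            chdir dirname($subject) or die "could not chdir: $!";
        }
        # 'pwd -W' asks the msys shell for the Windows-style path.
        chomp(my $dir = qx{sh -c "pwd -W"});
        chdir $here or die "could not chdir back: $!";
        return $dir . $leaf;
    }
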


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Race condition in recovery?

From
Andrew Dunstan
Date:
On 6/14/21 3:32 PM, Andrew Dunstan wrote:
> On 6/14/21 1:50 PM, Andrew Dunstan wrote:
>> On 6/14/21 1:11 PM, Robert Haas wrote:
>>> On Mon, Jun 14, 2021 at 12:56 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>>>> $^X is not at all broken.
>>>>
>>>> The explanation here is pretty simple - the argument to perl2host is
>>>> meant to be a directory. If we're going to accommodate plain files then
>>>> we have some more work to do in TestLib.
>>> This explanation seems to contradict the documentation in TestLib.pm,
>>> which makes no mention of any such restriction.
>> Here's a snippet:
>>
>>
>>     sub perl2host
>>     {
>>         my ($subject) = @_;
>>         ...
>>         if (chdir $subject)
>>      
>>
>> Last time I looked you can't chdir to anything except a directory.
>
>
> Actually, I take it back, it does work for a file. I'll change it. I
> probably did this when something else wasn't working.




So, will you feel happier with this applied? I haven't tested it yet but
I'm confident it will work.


diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index e4e58cb8ab..3e19bc4c50 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -24,11 +24,11 @@ my $node_primary = get_new_node('primary');
 # the timeline history file reaches the archive but before any of the WAL files
 # get there.
 $node_primary->init(allows_streaming => 1, has_archiving => 1);
-my $perlbin = $^X;
-if ($^O eq 'msys')
-{
-   $perlbin = TestLib::perl2host(dirname($^X)) . '\\' . basename($^X);
-}
+
+# Note: consistent use of forward slashes here avoids any escaping problems
+# that arise from use of backslashes. That means we need to double-quote all
+# the paths in the archive_command
+my $perlbin = TestLib::perl2host($^X);
 $perlbin =~ s!\\!/!g if $TestLib::windows_os;
 my $archivedir_primary = $node_primary->archive_dir;
 $archivedir_primary =~ s!\\!/!g if $TestLib::windows_os;
@@ -36,6 +36,8 @@ $node_primary->append_conf('postgresql.conf', qq(
 archive_command = '"$perlbin" "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
 wal_keep_size=128MB
 ));
+# make sure that Msys perl doesn't complain about difficulty in setting locale
+# when called this way.
 local $ENV{PERL_BADLANG}=0;
 $node_primary->start;
 


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
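
For what it is worth, a small illustration of the archive_command string
that the forward-slash-plus-double-quote scheme in the patch ends up
producing (all paths below are made-up examples standing in for
perl2host($^X), $FindBin::RealBin and $node_primary->archive_dir):

    use strict;
    use warnings;

    my $perlbin    = 'C:/msys64/usr/bin/perl';
    my $script_dir = 'C:/build/src/test/recovery/t';
    my $archivedir = 'C:/build/tmp_check/archives_primary';

    # Double quotes keep Windows paths (possibly containing spaces)
    # intact; forward slashes avoid any backslash-escaping trouble.
    my $archive_command =
      qq{'"$perlbin" "$script_dir/cp_history_files" "%p" "$archivedir/%f"'};
    print "archive_command = $archive_command\n";
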




Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> I think jacana uses msys[2?], so this likely indicates a problem
> in path sanitization for the archive command.  Andrew, any advice?

Thanks for fixing it.

# I still haven't succeeded in running the TAP tests in an MSYS2 environment;
# I cannot install IPC::Run for the msys perl.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Race condition in recovery?

From
Andrew Dunstan
Date:
On 6/15/21 2:16 AM, Kyotaro Horiguchi wrote:
> At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
>> I think jacana uses msys[2?], so this likely indicates a problem
>> in path sanitization for the archive command.  Andrew, any advice?
> Thanks for fixing it.
>
> # I still haven't succeeded in running the TAP tests in an MSYS2 environment;
> # I cannot install IPC::Run for the msys perl.
>
> regards.
>


Unpack the attached somewhere and point your PERL5LIB at it. That's all
I do.
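
In case it helps, a quick check that the msys perl can then load
IPC::Run; the library path below is only an example of wherever the
attached tree gets unpacked (normally it would go into PERL5LIB rather
than a "use lib" line):

    use strict;
    use warnings;

    use lib 'C:/perl5lib';    # example location of the unpacked tree
    use IPC::Run;

    print 'IPC::Run ', IPC::Run->VERSION, " is loadable\n";
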


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Attachment

Re: Race condition in recovery?

From
Robert Haas
Date:
On Mon, Jun 14, 2021 at 3:47 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> So, will you feel happier with this applied? I haven't tested it yet but
> I'm confident it will work.

I'm not all that unhappy now, but yeah, that looks like an improvement
to me. I'm still afraid that I will keep writing tests that blow up on
Windows but that's a bigger problem than we can hope to fix on this
thread, and I do think this discussion has helped.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Race condition in recovery?

From
Kyotaro Horiguchi
Date:
At Tue, 15 Jun 2021 07:54:49 -0400, Andrew Dunstan <andrew@dunslane.net> wrote in 
> 
> On 6/15/21 2:16 AM, Kyotaro Horiguchi wrote:
> > At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> >> I think jacana uses msys[2?], so this likely indicates a problem
> >> in path sanitization for the archive command.  Andrew, any advice?
> > Thanks for fixing it.
> >
> > # I still haven't succeeded in running the TAP tests in an MSYS2 environment;
> > # I cannot install IPC::Run for the msys perl.
> >
> > regards.
> >
> 
> 
> Unpack the attached somewhere and point your PERL5LIB at it. That's all
> I do.

Thanks a lot, Andrew.  I got the TAP test to run with it and saw the
same error as jacana.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center