Re: Race condition in recovery? - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Race condition in recovery?
Date
Msg-id 20210507.115302.1371322012690532443.horikyota.ntt@gmail.com
In response to Re: Race condition in recovery?  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: Race condition in recovery?
List pgsql-hackers
At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in 
> I think the fix for the problem will be that, after reading/validating
> the checkpoint record, we can free the current value of expectedTLEs
> and reinitialize it based on the recoveryTargetTLI as shown in the
> attached patch?

I'm not sure I understand the issue here.  I think the attached
script should reproduce the issue mentioned here, but it didn't for me.

The result of running the attached test script is shown below.  The
TLIs in your description cited below are adjusted to match.  The lines
prefixed by NodeN> are server log lines written while running the
attached test script.

> 1. Node2 got promoted to new primary, and node 2 picked new TL 2 in
>   the middle of the segment 3.

Node2>  LOG:  selected new timeline ID: 2

> 2. Node3, restarted with new primary info of Node2 and restore command

Node2> node_3 LOG:  received replication command: IDENTIFY_SYSTEM

> 3. Node3, found the newest TL2 in validateRecoveryParameters() Because
>   the latest TL was requested in recovery.conf (history file restored
>   from TL2) and set recoveryTargetTLI to 2 So point to note is
>   recoveryTargetTLI is set to 2 but expectedTLEs is not yet set.

This means you specified recovery_target_timeline?  Either way,
expectedTLEs is not relevant to the behavior here.  Even if
recovery_target_timeline is set to latest, findNewestTimeLine doesn't
look at it.

Node3> LOG:  restored log file "00000002.history" from archive
  
> 4. Node3, entered into the standby mode.

Node3> LOG:  entering standby mode

> 5. Node3, tries to read the checkpoint Record, on Node3 still the
>   checkpoint TL (ControlFile->checkPointCopy.ThisTimeLineID) is 1.

expectedTLEs is loaded just before fetching the last checkpoint.

ReadCheckpointRecord doesn't consider checkPointCopy.ThisTimeLineID.

The checkpoint record carries TLI=1 here because the segment file
actually read was that of the newest TLI in expectedTLEs that was
found in the pg_wal directory.  If the segment for TLI=2 containing
the last checkpoint had already been archived, the checkpoint record
would be read as TLI=2, and replication would start at TLI=2 because
archive recovery has reached that timeline.  (Turn on the optional
section in the attached test script to see this behavior.)  This is
the expected behavior, since we assume that the segment files for
TLI=n and n+1 are identical in the TLI=n part.

Anyway, the checkpoint that is read is on TLI=1 in this case, and
replication starts at TLI=1.

Node3> LOG:  Checkpoint record: TLI=1, 0/3014F78
  
> 6. Node3, tries to get the checkpoint record file using new TL2 from
>   the archive which it should get ideally but it may not if the Node2
>   haven't yet archived it.

This doesn't happen for me.  Instead, node3 runs recovery from the
checkpoint up to the end of the archived WAL. In this case the end
point is 3014FF0@TLI=1.

Node3> LOG:  invalid record length at 0/3014FF0: wanted 24, got 0

Then, node3 connects to node2 requesting TLI=1, because the history
file (or expectedTLEs) says that the LSN belongs to TLI=1.

Node3> LOG:  0/3014FF0 is on TLI 1
Node3> LOG:  started streaming WAL from primary at 0/3000000 on timeline 1

After a while node2 finds a timeline switch and disconnects the
replication.

Node3> LOG:  replication terminated by primary server
Node3> DETAIL:  End of WAL reached on timeline 1 at 0/3029A68.

After the scan of the archive and pg_wal ends in failure, node3
correctly requests TLI=2 from node2, because expectedTLEs says that
the current LSN belongs to TLI=2.

Node3> LOG:  0/3029A68 is on TLI 2
Node3> LOG:  restarted WAL streaming at 0/3000000 on timeline 2

Finally, the items below don't happen for me, because node3 no longer
needs to go back to the last checkpoint.  Perhaps the script fails to
reproduce your issue correctly.

> 7. Node3, tries to stream from primary but using TL1 because
>   ControlFile->checkPointCopy.ThisTimeLineID is 1.

As mentioned above, checkPointCopy.ThisTimeLineID on either the
primary or the standby is irrelevant to the timeline the primary
sends.  The primary streams the timeline requested by the standby.

> 8. Node3, get it because walsender of Node2 read it from TL2 and send
> it and Node2 write in the new WAL file but with TL1.

Walsender streams the TLI requested by the walreceiver, then
disconnects at the end of that TLI, notifying node3 of the next TLI.
Node3 re-establishes replication with the new TLI.  Looking into
pg_wal of node3, segment 3 for both TLI=1 and TLI=2 is filled with the
correct content.

So... I don't understand what you are saying the race condition is.

An issue that may be slightly relevant to this case has been raised
[1]. But it is about writing the end-of-recovery checkpoint onto the
older timeline.

Could you please fix the test script so that it reproduces your issue
correctly?  And/or elaborate a bit more?

The first attached file is the debugging-aid logging patch.  The
second is the test script, to be placed in src/test/recovery/t.


1: https://www.postgresql.org/message-id/CAE-ML%2B_EjH_fzfq1F3RJ1%3DXaaNG%3D-Jz-i3JqkNhXiLAsM3z-Ew%40mail.gmail.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..9483fd055c 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -55,6 +55,7 @@ restoreTimeLineHistoryFiles(TimeLineID begin, TimeLineID end)
 
     for (tli = begin; tli < end; tli++)
     {
+        elog(LOG, "Trying restoring history file for TLI=%d", tli);
         if (tli == 1)
             continue;
 
@@ -95,6 +96,7 @@ readTimeLineHistory(TimeLineID targetTLI)
 
     if (ArchiveRecoveryRequested)
     {
+        elog(LOG, "Trying reading history file for TLI=%d", targetTLI);
         TLHistoryFileName(histfname, targetTLI);
         fromArchive =
             RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
@@ -231,6 +233,7 @@ existsTimeLineHistory(TimeLineID probeTLI)
 
     if (ArchiveRecoveryRequested)
     {
+        elog(LOG, "Probing history file for TLI=%d", probeTLI);
         TLHistoryFileName(histfname, probeTLI);
         RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
     }
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adfc6f67e2..e31e1f0ce3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3732,6 +3732,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
             snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
                      xlogfname);
             set_ps_display(activitymsg);
+            elog(LOG, "Trying fetching history file for TLI=%d", tli);
             restoredFromArchive = RestoreArchivedFile(path, xlogfname,
                                                       "RECOVERYXLOG",
                                                       wal_segment_size,
@@ -3825,7 +3826,10 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
     if (expectedTLEs)
         tles = expectedTLEs;
     else
+    {
+        elog(LOG, "Loading history file for TLI=%d (2)", recoveryTargetTLI);
         tles = readTimeLineHistory(recoveryTargetTLI);
+    }
 
     foreach(cell, tles)
     {
@@ -3839,6 +3843,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
          * Skip scanning the timeline ID that the logfile segment to read
          * doesn't belong to
          */
+        elog(LOG, "scanning segment %lX TLI %d, source %d", segno, tli, source);
+
         if (hent->begin != InvalidXLogRecPtr)
         {
             XLogSegNo    beginseg = 0;
@@ -3865,6 +3871,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
                               XLOG_FROM_ARCHIVE, true);
             if (fd != -1)
             {
+                elog(LOG, "found segment %lX TLI %d, from archive", segno, tli);
                 elog(DEBUG1, "got WAL segment from archive");
                 if (!expectedTLEs)
                     expectedTLEs = tles;
@@ -3878,6 +3885,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
                               XLOG_FROM_PG_WAL, true);
             if (fd != -1)
             {
+                elog(LOG, "found segment %lX TLI %d, from PG_WAL", segno, tli);
                 if (!expectedTLEs)
                     expectedTLEs = tles;
                 return fd;
@@ -8421,7 +8429,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 
     XLogBeginRead(xlogreader, RecPtr);
     record = ReadRecord(xlogreader, LOG, true);
-
+    elog(LOG, "Checkpoint record: TLI=%d, %X/%X, rectargetTLI=%d, exptles=%p", xlogreader->seg.ws_tli, LSN_FORMAT_ARGS(xlogreader->ReadRecPtr), recoveryTargetTLI, expectedTLEs);
 
     if (record == NULL)
     {
         if (!report)
@@ -12628,7 +12636,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                              * TLI, rather than the position we're reading.
                              */
                             tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
-
+                            elog(LOG, "%X/%X is on TLI %X", LSN_FORMAT_ARGS(tliRecPtr), tli);
                             if (curFileTLI > 0 && tli < curFileTLI)
                                 elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but
previousrecovered WAL file came from timeline %u",
 
                                      LSN_FORMAT_ARGS(tliRecPtr),
@@ -12697,7 +12705,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                         if (readFile < 0)
                         {
                             if (!expectedTLEs)
+                            {
+                                elog(LOG, "Loading expectedTLEs for %d", receiveTLI);
                                 expectedTLEs = readTimeLineHistory(receiveTLI);
+                            }
                             readFile = XLogFileRead(readSegNo, PANIC,
                                                     receiveTLI,
                                                     XLOG_FROM_STREAM, false);
# Minimal test testing streaming replication
use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 2;

my $primary = get_new_node('primary');
$primary->init(allows_streaming => 1);
$primary->start;
my $backup_name = 'my_backup';

$primary->backup($backup_name);

my $node_2 = get_new_node('node_2');
$node_2->init_from_backup($primary, $backup_name,
      has_streaming => 1,
      allows_streaming => 1);
$node_2->append_conf('postgresql.conf', "archive_mode = always");
my $archdir = $node_2->data_dir . "/../archive";
$node_2->append_conf('postgresql.conf', "archive_command = 'cp %p $archdir/%f'");

mkdir($archdir);

$node_2->start;

# Create streaming standby linking to primary
my $node_3 = get_new_node('node_3');
$node_3->init_from_backup($primary, $backup_name,
    has_streaming => 1);
$node_3->append_conf('postgresql.conf', "restore_command = 'cp $archdir/%f %p'");
$node_3->start;

$primary->psql('postgres', 'SELECT pg_switch_wal(); CREATE TABLE t(); DROP TABLE t; CHECKPOINT;');
$primary->wait_for_catchup($node_2, 'write',
                                $primary->lsn('insert'));
$primary->wait_for_catchup($node_3, 'write',
                                $primary->lsn('insert'));
$node_3->stop;

# put node3 a bit behind to cause streaming on the old timeline
$primary->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;');
$primary->wait_for_catchup($node_2, 'write',
                           $primary->lsn('insert'));

$primary->stop;

# promote node2
$node_2->psql('postgres', 'SELECT pg_promote()');

# optional: archive segment 3 of TLI=2 on node2 and advance one more segment
if (0)
{
    my $lastwal = $node_2->safe_psql('postgres', 'select last_archived_wal from pg_stat_archiver');
    $node_2->psql('postgres', 'SELECT pg_switch_wal();');
    $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;');
    $node_2->poll_query_until('postgres', "SELECT '$lastwal' <> last_archived_wal from pg_stat_archiver");

    $lastwal = $node_2->safe_psql('postgres', 'select last_archived_wal from pg_stat_archiver');
    $node_2->psql('postgres', 'SELECT pg_switch_wal();');
    $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;');
    $node_2->poll_query_until('postgres', "SELECT '$lastwal' <> last_archived_wal from pg_stat_archiver");
}

# attach node3 as a standby of node2
$node_3->enable_streaming($node_2);
$node_3->append_conf('postgresql.conf', "recovery_target_timeline = 2");

# *restart* node3, not just reload it, to trigger archive recovery
$node_3->start;
$node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;');
$node_2->psql('postgres', 'SELECT pg_switch_wal();');

# XXX: another defect comes out without this X(
$node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;');

$node_2->wait_for_catchup($node_3, 'write',
                                $node_2->lsn('insert'));
my $result = 
  $node_2->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn() = write_lsn FROM pg_stat_replication');

ok($result eq 't', 'check');

# set 0 to leave data directories after a successful run
ok(1, 'break');

