Re: Race condition in recovery? - Mailing list pgsql-hackers
From: Kyotaro Horiguchi
Subject: Re: Race condition in recovery?
Msg-id: 20210507.115302.1371322012690532443.horikyota.ntt@gmail.com
In response to: Re: Race condition in recovery? (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: Race condition in recovery?
List: pgsql-hackers
At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> I think the fix for the problem will be that, after reading/validating
> the checkpoint record, we can free the current value of expectedTLEs
> and reinitialize it based on the recoveryTargetTLI as shown in the
> attached patch?

I'm not sure I understand the issue here. I think the attached script
should reproduce the issue mentioned here, but it didn't for me. The
result of running the attached test script is shown below. TLIs are
adjusted in your descriptions cited below. The lines prefixed by
"NodeN>" are the server log lines written while running the attached
test script.

> 1. Node2 got promoted to new primary, and node 2 picked new TL 2 in
> the middle of the segment 3.

Node2> LOG: selected new timeline ID: 2

> 2. Node3, restarted with new primary info of Node2 and restore command

Node2> node_3 LOG: received replication command: IDENTIFY_SYSTEM

> 3. Node3, found the newest TL2 in validateRecoveryParameters() Because
> the latest TL was requested in recovery.conf (history file restored
> from TL2) and set recoveryTargetTLI to 2 So point to note is
> recoveryTargetTLI is set to 2 but expectedTLEs is not yet set.

This means you specified recovery_target_timeline? Either way,
expectedTLEs is not relevant to the behavior here. Even if
recovery_target_timeline is set to latest, findNewestTimeLine doesn't
look at it.

Node3> LOG: restored log file "00000002.history" from archive

> 4. Node3, entered into the standby mode.

Node3> LOG: entering standby mode

> 5. Node3, tries to read the checkpoint Record, on Node3 still the
> checkpoint TL (ControlFile->checkPointCopy.ThisTimeLineID) is 1.

expectedTLEs is loaded just before fetching the last checkpoint, and
ReadCheckpointRecord doesn't consider checkPointCopy.ThisTimeLineID.
The reason the checkpoint was read with TLI=1 is that the segment file
used was the one for the newest TLI in expectedTLEs that could be found
in the pg_wal directory. If the segment for TLI=2 containing the last
checkpoint had already been archived, the checkpoint record would be
read as TLI=2, and replication would then start at TLI=2 because
archive recovery has reached that timeline. (Turn on the optional
section in the attached test script to see this behavior.) This is the
expected behavior, since we assume that the segment files for TLI=n and
n+1 are identical in the TLI=n part. Anyway, the checkpoint that is
read is on TLI=1 in this case, and replication starts at TLI=1.

Node3> LOG: Checkpoint record: TLI=1, 0/3014F78
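To make the scan order concrete: XLogFileReadAnyTLI() walks the
timeline list from the newest entry down and, for each timeline, tries
the archive first and then pg_wal (the same branches the debugging
patch below instruments). The following is a simplified, self-contained
sketch of just that scan order; TliEntry, segment_exists() and
pick_tli_for_segment() are reduced stand-ins for illustration, not the
real xlog.c code:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef uint32_t TimeLineID;

    /* reduced stand-in for the entries kept in expectedTLEs, newest TLI first */
    typedef struct
    {
        TimeLineID  tli;
    } TliEntry;

    /*
     * Hypothetical probe standing in for XLogFileRead(): in this scenario only
     * the TLI=1 copy of the segment exists, and only in pg_wal (nothing archived).
     */
    static int
    segment_exists(TimeLineID tli, const char *source)
    {
        return tli == 1 && strcmp(source, "pg_wal") == 0;
    }

    /* Scan timelines newest-first, archive before pg_wal, as XLogFileReadAnyTLI does. */
    static TimeLineID
    pick_tli_for_segment(const TliEntry *tles, int ntles)
    {
        for (int i = 0; i < ntles; i++)
        {
            if (segment_exists(tles[i].tli, "archive"))
                return tles[i].tli;
            if (segment_exists(tles[i].tli, "pg_wal"))
                return tles[i].tli;
        }
        return 0;               /* segment not found on any timeline */
    }

    int
    main(void)
    {
        /* list built from 00000002.history: TLI=2 first, then TLI=1 */
        TliEntry    tles[] = {{2}, {1}};

        /* with segment 3 of TLI=2 not yet archived, the TLI=1 copy wins */
        printf("checkpoint segment read from TLI=%u\n",
               pick_tli_for_segment(tles, 2));
        return 0;
    }

Compiled standalone this prints TLI=1, matching the log line above; if
segment_exists() also reported an archived TLI=2 copy, it would print
TLI=2, which corresponds to the optional section of the test script.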
> 6. Node3, tries to get the checkpoint record file using new TL2 from
> the archive which it should get ideally but it may not if the Node2
> haven't yet archived it.

This doesn't happen for me. Instead, node3 runs recovery from the
checkpoint up to the end of the archived WAL. In this case the end
point is 3014FF0@TLI=1.

Node3> LOG: invalid record length at 0/3014FF0: wanted 24, got 0

Then node3 connects to node2 requesting TLI=1, because the history file
(or expectedTLEs) says that this LSN belongs to TLI=1.

Node3> LOG: 0/3014FF0 is on TLI 1
Node3> LOG: started streaming WAL from primary at 0/3000000 on timeline 1

After a while node2 finds a timeline switch and disconnects the
replication.

Node3> LOG: replication terminated by primary server
Node3> DETAIL: End of WAL reached on timeline 1 at 0/3029A68.

After scanning the archive and pg_wal ends in failure, node3 correctly
requests TLI=2 from node2, because expectedTLEs says that the current
LSN belongs to TLI=2.

Node3> LOG: 0/3029A68 is on TLI 2
Node3> LOG: restarted WAL streaming at 0/3000000 on timeline 2

Finally, the items below don't happen for me, because node3 no longer
needs to go back to the last checkpoint. Perhaps the script fails to
reproduce your issue correctly.

> 7. Node3, tries to stream from primary but using TL1 because
> ControlFile->checkPointCopy.ThisTimeLineID is 1.

As mentioned above, checkPointCopy.ThisTimeLineID on either the primary
or the standby is irrelevant to the timeline the primary sends. The
primary streams the timeline requested by the standby.

> 8. Node3, get it because walsender of Node2 read it from TL2 and send
> it and Node2 write in the new WAL file but with TL1.

The walsender streams the TLI requested by the walreceiver, then
disconnects at the end of that TLI, notifying node3 of the next TLI.
Node3 re-establishes replication with the new TLI. Looking into pg_wal
of node3, segment 3 for both TLI=1 and TLI=2 is filled with the correct
content.
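For reference, the timeline that node3 asks node2 for comes purely from
mapping the current LSN onto the history list, which is what
tliOfPointInHistory() (instrumented in the debugging patch below) does.
Here is a simplified, self-contained sketch of that lookup; HistEntry,
tli_of_point() and the literal LSNs are reduced stand-ins for
illustration, not the real timeline.c code:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t TimeLineID;
    typedef uint64_t XLogRecPtr;

    #define InvalidXLogRecPtr   ((XLogRecPtr) 0)

    /* reduced stand-in for TimeLineHistoryEntry: the [begin, end) range of a TLI */
    typedef struct
    {
        TimeLineID  tli;
        XLogRecPtr  begin;      /* inclusive; invalid means "from the start" */
        XLogRecPtr  end;        /* exclusive; invalid means "to infinity" */
    } HistEntry;

    /* Return the TLI whose range contains ptr, as tliOfPointInHistory() does. */
    static TimeLineID
    tli_of_point(XLogRecPtr ptr, const HistEntry *history, int n)
    {
        for (int i = 0; i < n; i++)
        {
            if ((history[i].begin == InvalidXLogRecPtr || history[i].begin <= ptr) &&
                (history[i].end == InvalidXLogRecPtr || ptr < history[i].end))
                return history[i].tli;
        }
        return 0;               /* not reached with a contiguous history */
    }

    int
    main(void)
    {
        /* built from 00000002.history: TLI=1 ends at 0/3029A68, TLI=2 follows */
        HistEntry   history[] = {
            {2, 0x3029A68, InvalidXLogRecPtr},
            {1, InvalidXLogRecPtr, 0x3029A68},
        };

        /* 0/3014FF0 maps to TLI 1 and 0/3029A68 to TLI 2, matching the logs above */
        printf("0/3014FF0 is on TLI %u\n", tli_of_point(0x3014FF0, history, 2));
        printf("0/3029A68 is on TLI %u\n", tli_of_point(0x3029A68, history, 2));
        return 0;
    }

This is why the behavior in points 7 and 8 follows from the history
file alone: whatever checkPointCopy.ThisTimeLineID says, the standby
requests the timeline this lookup returns for the LSN it needs next.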
So... I don't understand what you are saying the race condition is. An
issue that may be slightly relevant to this case has been raised [1],
but it is about writing the end-of-recovery checkpoint into the older
timeline.

Could you please fix the test script so that it causes your issue
correctly? And/or elaborate a bit more?

The first attached file is the debugging-aid logging. The second is the
test script, to be placed in src/test/recovery/t.

1: https://www.postgresql.org/message-id/CAE-ML%2B_EjH_fzfq1F3RJ1%3DXaaNG%3D-Jz-i3JqkNhXiLAsM3z-Ew%40mail.gmail.com

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 8d0903c175..9483fd055c 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -55,6 +55,7 @@ restoreTimeLineHistoryFiles(TimeLineID begin, TimeLineID end)
 	for (tli = begin; tli < end; tli++)
 	{
+		elog(LOG, "Trying restoring history file for TLI=%d", tli);
 		if (tli == 1)
 			continue;
@@ -95,6 +96,7 @@ readTimeLineHistory(TimeLineID targetTLI)
 	if (ArchiveRecoveryRequested)
 	{
+		elog(LOG, "Trying reading history file for TLI=%d", targetTLI);
 		TLHistoryFileName(histfname, targetTLI);
 		fromArchive =
 			RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
@@ -231,6 +233,7 @@ existsTimeLineHistory(TimeLineID probeTLI)
 	if (ArchiveRecoveryRequested)
 	{
+		elog(LOG, "Probing history file for TLI=%d", probeTLI);
 		TLHistoryFileName(histfname, probeTLI);
 		RestoreArchivedFile(path, histfname, "RECOVERYHISTORY", 0, false);
 	}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index adfc6f67e2..e31e1f0ce3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3732,6 +3732,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 	snprintf(activitymsg, sizeof(activitymsg), "waiting for %s",
 			 xlogfname);
 	set_ps_display(activitymsg);
+	elog(LOG, "Trying fetching history file for TLI=%d", tli);
 	restoredFromArchive = RestoreArchivedFile(path, xlogfname,
 											  "RECOVERYXLOG",
 											  wal_segment_size,
@@ -3825,7 +3826,10 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 	if (expectedTLEs)
 		tles = expectedTLEs;
 	else
+	{
+		elog(LOG, "Loading history file for TLI=%d (2)", recoveryTargetTLI);
 		tles = readTimeLineHistory(recoveryTargetTLI);
+	}
 	foreach(cell, tles)
 	{
@@ -3839,6 +3843,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 		 * Skip scanning the timeline ID that the logfile segment to read
 		 * doesn't belong to
 		 */
+		elog(LOG, "scanning segment %lX TLI %d, source %d", segno, tli, source);
+
 		if (hent->begin != InvalidXLogRecPtr)
 		{
 			XLogSegNo	beginseg = 0;
@@ -3865,6 +3871,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 							 XLOG_FROM_ARCHIVE, true);
 			if (fd != -1)
 			{
+				elog(LOG, "found segment %lX TLI %d, from archive", segno, tli);
 				elog(DEBUG1, "got WAL segment from archive");
 				if (!expectedTLEs)
 					expectedTLEs = tles;
@@ -3878,6 +3885,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 							 XLOG_FROM_PG_WAL, true);
 			if (fd != -1)
 			{
+				elog(LOG, "found segment %lX TLI %d, from PG_WAL", segno, tli);
 				if (!expectedTLEs)
 					expectedTLEs = tles;
 				return fd;
@@ -8421,7 +8429,7 @@ ReadCheckpointRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr,
 	XLogBeginRead(xlogreader, RecPtr);
 	record = ReadRecord(xlogreader, LOG, true);
-
+	elog(LOG, "Checkpoint record: TLI=%d, %X/%X, rectargetTLI=%d, exptles=%p", xlogreader->seg.ws_tli, LSN_FORMAT_ARGS(xlogreader->ReadRecPtr), recoveryTargetTLI, expectedTLEs);
 	if (record == NULL)
 	{
 		if (!report)
@@ -12628,7 +12636,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * TLI, rather than the position we're reading.
 					 */
 					tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
-
+					elog(LOG, "%X/%X is on TLI %X", LSN_FORMAT_ARGS(tliRecPtr), tli);
 					if (curFileTLI > 0 && tli < curFileTLI)
 						elog(ERROR, "according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came from timeline %u",
 							 LSN_FORMAT_ARGS(tliRecPtr),
@@ -12697,7 +12705,10 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (readFile < 0)
 				{
 					if (!expectedTLEs)
+					{
+						elog(LOG, "Loading expectedTLEs for %d", receiveTLI);
 						expectedTLEs = readTimeLineHistory(receiveTLI);
+					}
 					readFile = XLogFileRead(readSegNo, PANIC,
 											receiveTLI,
 											XLOG_FROM_STREAM, false);
"/../archive"; $node_2->append_conf('postgresql.conf', "archive_command = 'cp %p $archdir/%f'"); mkdir($archdir); $node_2->start; # Create streaming standby linking to primary my $node_3 = get_new_node('node_3'); $node_3->init_from_backup($primary, $backup_name, has_streaming => 1); $node_3->append_conf('postgresql.conf', "restore_command = 'cp $archdir/%f %p'"); $node_3->start; $primary->psql('postgres', 'SELECT pg_switch_wal(); CREATE TABLE t(); DROP TABLE t; CHECKPOINT;'); $primary->wait_for_catchup($node_2, 'write', $primary->lsn('insert')); $primary->wait_for_catchup($node_3, 'write', $primary->lsn('insert')); $node_3->stop; # put node3 a bit behind to cause streaming on the old timeline $primary->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;'); $primary->wait_for_catchup($node_2, 'write', $primary->lsn('insert')); $primary->stop; # promote node2 $node_2->psql('postgres', 'SELECT pg_promote()'); # optional: archive segment 3 of TLI=2 on node2 and advance one more segment if (0) { my $lastwal = $node_2->safe_psql('postgres', 'select last_archived_wal from pg_stat_archiver'); $node_2->psql('postgres', 'SELECT pg_switch_wal();'); $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;'); $node_2->poll_query_until('postgres', "SELECT '$lastwal' <> last_archived_wal from pg_stat_archiver"); $lastwal = $node_2->safe_psql('postgres', 'select last_archived_wal from pg_stat_archiver'); $node_2->psql('postgres', 'SELECT pg_switch_wal();'); $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;'); $node_2->poll_query_until('postgres', "SELECT '$lastwal' <> last_archived_wal from pg_stat_archiver"); } # attach node3 as a standby of node2 $node_3->enable_streaming($node_2); $node_3->append_conf('postgresql.conf', "recovery_target_timeline = 2"); # *restart# node3, not just reloading to trigger archive recovery $node_3->start; $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;'); $node_2->psql('postgres', 'SELECT pg_switch_wal();'); # XXX: another defect comes out without this X( $node_2->psql('postgres', 'CREATE TABLE t(); DROP TABLE t;'); $node_2->wait_for_catchup($node_3, 'write', $node_2->lsn('insert')); my $result = $node_2->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn() = write_lsn FROM pg_stat_replication'); ok($result eq 't', 'check'); # set 0 to leave data directories after a successful run ok(1, 'break');