Re: Race condition in recovery? - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Race condition in recovery?
Msg-id CAFiTN-tPh8eR1zHc7WCMbBMKn4bOfwvKK0fqKKhY6phVV4ENpg@mail.gmail.com
In response to Re: Race condition in recovery?  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Race condition in recovery?
List pgsql-hackers
On Wed, Jun 9, 2021 at 2:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Then I tried to get things working on 9.6. There's a patch attached to
> back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also
> a version of the main patch attached with the necessary wal->xlog,
> lsn->location renaming. Unfortunately ... the new test case still
> fails on 9.6 in a way that looks an awful lot like the bug isn't
> actually fixed:
>
> LOG:  primary server contains no more WAL on requested timeline 1
> cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/000000010000000000000003:
> No such file or directory
> (repeated many times)
>
> I find that the same failure happens if I back-port the master version
> of the patch to v10 or v11,

I think this fails because, prior to v12, recovery_target_timeline was not a GUC and was not set to 'latest' by default. With the test change below, it started passing on v11 (only tested on v11 so far).


diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
index 842878a..b3ce5da 100644
--- a/src/test/recovery/t/025_stuck_on_old_timeline.pl
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -50,6 +50,9 @@ my $node_cascade = get_new_node('cascade');
 $node_cascade->init_from_backup($node_standby, $backup_name,
        has_streaming => 1);
 $node_cascade->enable_restoring($node_primary);
+$node_cascade->append_conf('recovery.conf', qq(
+recovery_target_timeline='latest'
+));
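
For reference, the version difference behind this change can be sketched as a configuration fragment. This is only an illustration, not part of the patch: the restore_command value is elided because it is generated by the test framework, and the comments reflect my understanding of the defaults on each branch.

```
# v11 and earlier: recovery settings live in recovery.conf, and the timeline
# to follow must be requested explicitly; otherwise recovery stays on the
# timeline that was current when the base backup was taken.
standby_mode = 'on'
restore_command = '...'    # written by the test's enable_restoring call; elided
recovery_target_timeline = 'latest'

# v12 and later: recovery.conf no longer exists. The same parameter is a GUC
# in postgresql.conf and defaults to 'latest', so the test needs no extra line.
```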
 
But now it passes even without the bug fix applied, and the log shows that it never tried to stream from the primary on timeline 1, so it never reached the defect location.

2021-06-09 12:11:08.618 IST [122456] LOG:  entering standby mode
2021-06-09 12:11:08.622 IST [122456] LOG:  restored log file "00000002.history" from archive
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000002’: No such file or directory
2021-06-09 12:11:08.627 IST [122456] LOG:  redo starts at 0/2000028
2021-06-09 12:11:08.627 IST [122456] LOG:  consistent recovery state reached at 0/3000000

Next, I will investigate why, without the fix, v11 (and maybe v12, v10, ...) does not hit the defect location at all. After that, I will check the status on the other older versions.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
