Re: Race condition in recovery? - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Race condition in recovery? |
Date | |
Msg-id | CA+TgmoatTeD++2p_9-9N=YxFxrHYCvFOV7gtMk8BQr9N-dqrKQ@mail.gmail.com Whole thread Raw |
In response to | Re: Race condition in recovery? (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses |
Re: Race condition in recovery?
Re: Race condition in recovery? |
List | pgsql-hackers |
On Wed, May 26, 2021 at 2:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think we need to create some content on promoted standby and check > whether the cascade standby is able to get that or not, that will > guarantee that it is actually following the promoted standby, I have > added the test for that so that it matches the comments. OK, but I ran this test against an unpatched server and it passed. That's not so great, because the test should fail without the bug fix. It seems to be because there's only one actual test in this test case. Looking at the log file, src/test/recovery/tmp_check/log/regress_log_025_timeline_issue, the only "ok" nor "not ok" line is: ok 1 - check streamed content on cascade standby So either that test is not right or some other test is needed. I think there's something else going wrong here, because when I run my original test.sh script, I see this: 2021-05-26 11:37:47.794 EDT [57961] LOG: restored log file "00000002.history" from archive ... 2021-05-26 11:37:47.916 EDT [57961] LOG: redo starts at 0/2000028 ... 2021-05-26 11:37:47.927 EDT [57966] LOG: replication terminated by primary server 2021-05-26 11:37:47.927 EDT [57966] DETAIL: End of WAL reached on timeline 1 at 0/3000000 But in the src/test/recovery/tmp_check/log/025_timeline_issue_cascade.log file generated by this test case: cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/t_025_timeline_issue_primary_data/archives/00000002.history: No such file or directory ... 2021-05-26 11:41:08.149 EDT [63347] LOG: fetching timeline history file for timeline 2 from primary server ... 2021-05-26 11:41:08.288 EDT [63344] LOG: new target timeline is 2 ... 2021-05-26 11:41:08.303 EDT [63344] LOG: redo starts at 0/2000028 ... 2021-05-26 11:41:08.331 EDT [63347] LOG: restarted WAL streaming at 0/3000000 on timeline 2 So it doesn't seem like the test is actually reproducing the problem correctly. The timeline history file isn't available from the archive, so it streams it, and then the problem doesn't occur. I guess that's because there's nothing to guarantee that the history file reaches the archive before 'cascade' is started. The code just does: # Promote the standby. $node_standby->psql('postgres', 'SELECT pg_promote()'); # Start cascade node $node_cascade->start; ...which has a clear race condition. src/test/recovery/t/023_pitr_prepared_xact.pl has logic to wait for a WAL file to be archived, so maybe we can steal that logic and use it here. I suggest we rename the test to something a bit more descriptive. Like instead of 025_timeline_issue.pl, perhaps 025_stuck_on_old_timeline.pl? Or I'm open to other suggestions, but "timeline issue" is a bit too vague for my taste. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: