Race condition in recovery? - Mailing list pgsql-hackers
From | Dilip Kumar |
---|---|
Subject | Race condition in recovery? |
Date | |
Msg-id | CAFiTN-sE-jr=LB8jQuxeqikd-Ux+jHiXyh4YDiZMPedgQKup0g@mail.gmail.com Whole thread Raw |
Responses |
Re: Race condition in recovery?
|
List | pgsql-hackers |
While analyzing one of the customer issues, based on the log it appeared that there is a race condition in the recovery process. The summary of the issue is, that one of the standby is promoted as the new primary (Node2) and another standby (Node3) is restarted and set the primary_info and the restore_command so that it can stream/restore from Node2 (new primary). The problem is that during the promotion the timeline switch happened in the middle of the segment in Node2 and the Node3 is able to restore the newTLI.history file from the archive but the WAL file with the new TLI is not yet archived. Now, we will try to stream the wal file from the primary but if we are fetching the checkpoint that time we will not use the latest timeline instead we will try with the checkpoint timeline and walsender will send the WAL from the new timeline file because requested WAL recptr is before the TLI switch. And once that happened the expectedTLEs will be set based on the oldTLI.history file. Now, whenever we try to restore the required WAL file and recoveryTargetTimeLineGoal is set to the latest we again try to rescan the latest timeline (rescanLatestTimeLine) but the problem is recoveryTargetTLI was already set to the latest timeline. So now the problem is expectedTLEs is set to oldTLI and recoveryTargetTLI is set to newTLI and rescanLatestTimeLine will never update the expectedTLEs. Now, walsender will fail to stream new WAL using the old TLI and the archiver process will also fail because that file doesn't not exists anymore (converted to .partial). Basically, now we will never try with the newTLI. I have given the sequence of the events based on my analysis. Refer to the sequence of event ----------------------------------------- Node1 primary, Node2 standby1, Node3 standby2 1. Node2 got promoted to new primary, and node 2 picked new TL 13 in the middle of the segment. 2. Node3, restarted with new primary info of Node2 and restore command 3. Node3, found the newest TL13 in validateRecoveryParameters() Because the latest TL was requested in recovery.conf (history file restored from TL13) and set recoveryTargetTLI to 13 So point to note is recoveryTargetTLI is set to 13 but expectedTLEs is not yet set. 4. Node3, entered into the standby mode. 5. Node3, tries to read the checkpoint Record, on Node3 still the checkpoint TL (ControlFile->checkPointCopy.ThisTimeLineID) is 12. 6. Node3, tries to get the checkpoint record file using new TL13 from the archive which it should get ideally but it may not if the Node2 haven't yet archived it. 7. Node3, tries to stream from primary but using TL12 because ControlFile->checkPointCopy.ThisTimeLineID is 12. 8. Node3, get it because walsender of Node2 read it from TL13 and send it and Node2 write in the new WAL file but with TL12. WalSndSegmentOpen() { /*------- * When reading from a historic timeline, and there is a timeline switch * within this segment, read from the WAL segment belonging to the new * timeline. } 9. Node3, now set the expectedTLEs to 12 because that is what walreceiver has streamed the WAL using. 10. Node3, now recoveryTargetTLI is 13 and expectedTLEs is 12. So whenever it tries to find the latest TLE (rescanLatestTimeLine ) it finds it is 13 which is the same as recoveryTargetTLI so nothing to change but expectedTLEs is 12 using which it will try to restore/stream further WAL and fail every time. rescanLatestTimeLine(void) { .... newtarget = findNewestTimeLine(recoveryTargetTLI); if (newtarget == recoveryTargetTLI) { /* No new timelines found */ return false; } 11. So now the situation is that ideally in rescanLatestTimeLine() we should get the latest TLI but it assumes that it is already on the latest TLI because recoveryTargetTLI is on the latest TLI. Other points to be noted: - The actual issue happened on 9.6.11 but based on the code comparison it appeared that same can occur on the latest code as well. - After Node3 is shutdown wal from its pg_wal/ directory were removed so that it can follow the new primary. Based on the sequence of events It is quite clear that something is wrong in rescanLatestTimeLine, maybe after selecting the latest TLI we should compare it with the head of the expectedTLEs as well instead of just comparing it to the recoveryTargetTLI? Log from Node2: 2020-12-22 04:49:02 UTC [1019]: [9-1] user=; db=; app=; client=; SQLcode=00000 LOG: received promote request 2020-12-22 04:49:02 UTC [1019]: [10-1] user=; db=; app=; client=; SQLcode=00000 LOG: redo done at 0/F8000028 2020-12-22 04:49:02 UTC [1019]: [11-1] user=; db=; app=; client=; SQLcode=00000 LOG: last completed transaction was at log time 2020-12-22 04:48:01.833476+00 rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000D.history" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] 2020-12-22 04:49:02 UTC [1019]: [12-1] user=; db=; app=; client=; SQLcode=00000 LOG: selected new timeline ID: 13 2020-12-22 04:49:02 UTC [1019]: [13-1] user=; db=; app=; client=; SQLcode=00000 LOG: archive recovery complete Log From Node3 (with pointwise analysis): 1. Node3 restarted, restored 0000000D.history from archive and recoveryTargetTLI is set to 13 2020-12-22 04:49:40 UTC [2896]: [2-1] user=; db=; app=; client=; SQLcode=00000 LOG: database system is shut down 2020-12-22 04:49:40 UTC [2872]: [6-1] user=; db=; app=; client=; SQLcode=00000 LOG: database system is shut down 2020-12-22 04:49:41 UTC [9082]: [1-1] user=; db=; app=; client=; SQLcode=00000 LOG: database system was shut down in recovery at 2020-12-22 04:49:40 UTC 2020-12-22 04:49:41 UTC [9082]: [2-1] user=; db=; app=; client=; SQLcode=00000 LOG: creating missing WAL directory "pg_xlog/archive_status" 2020-12-22 04:49:41 UTC [9082]: [3-1] user=; db=; app=; client=; SQLcode=00000 LOG: restored log file "0000000D.history" from archive rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] 2020-12-22 04:49:41 UTC [9082]: [4-1] user=; db=; app=; client=; SQLcode=00000 LOG: entering standby mode 2020-12-22 04:49:41 UTC [9082]: [5-1] user=; db=; app=; client=; SQLcode=00000 LOG: restored log file "0000000D.history" from archive 2. Tries to get the WAL file with different timelines from the archive but did not get so expectedTLEs is not yet set rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000D00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000B00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000100000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] 3. Since fetching the checkpoint record so use the checkpoint TLI which is 12, although primary doesn’t have 0000000C00000000000000F8 file as it renamed it to 0000000C00000000000000F8.partial But there is the logic in walsender that if requested wal is there in the current TLI then send from their so it will stream from 0000000D00000000000000F8 file 2020-12-22 04:49:42 UTC [9105]: [1-1] user=; db=; app=; client=; SQLcode=00000 LOG: fetching timeline history file for timeline 12 from primary server 2020-12-22 04:49:42 UTC [9105]: [2-1] user=; db=; app=; client=; SQLcode=00000 LOG: started streaming WAL from primary at 0/F8000000 on timeline 12 2020-12-22 04:49:42 UTC [9105]: [3-1] user=; db=; app=; client=; SQLcode=00000 LOG: replication terminated by primary server 2020-12-22 04:49:42 UTC [9105]: [4-1] user=; db=; app=; client=; SQLcode=00000 DETAIL: End of WAL reached on timeline 12 at 0/F8000098. 4. Now since walreciver assumes that it has restore the WAL from the TL 12 the recieveTLI is 12 and the expectedTLEs is set base on the 0000000C.history. See below Logic in WaitForWalToBecomeAvailable if (readFile < 0) { if (!expectedTLEs) expectedTLEs = readTimeLineHistory(receiveTLI); 2020-12-22 04:49:42 UTC [9082]: [6-1] user=; db=; app=; client=; SQLcode=00000 LOG: restored log file "0000000C.history" from archive rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] 2020-12-22 04:49:42 UTC [9082]: [7-1] user=; db=; app=; client=; SQLcode=00000 LOG: restored log file "0000000C.history" from archive 2020-12-22 04:49:42 UTC [9082]: [8-1] user=; db=; app=; client=; SQLcode=00000 LOG: consistent recovery state reached at 0/F8000098 2020-12-22 04:49:42 UTC [9082]: [9-1] user=; db=; app=; client=; SQLcode=00000 LOG: invalid record length at 0/F8000098: wanted 24, got 0 2020-12-22 04:49:42 UTC [9074]: [3-1] user=; db=; app=; client=; SQLcode=00000 LOG: database system is ready to accept read only connections 2020-12-22 04:49:42 UTC [9105]: [5-1] user=; db=; app=; client=; SQLcode=00000 LOG: restarted WAL streaming at 0/F8000000 on timeline 12 2020-12-22 04:49:42 UTC [9105]: [6-1] user=; db=; app=; client=; SQLcode=00000 LOG: replication terminated by primary server 2020-12-22 04:49:42 UTC [9105]: [7-1] user=; db=; app=; client=; SQLcode=00000 DETAIL: End of WAL reached on timeline 12 at 0/F8000098. 4. Now, expectedTLEs head is as 12 and recoveryTargetTLI is 13 so in rescanLatestTimeLine we always assume we are at the latest Ali but we try to archive from expectedTLEs which is old TLI. rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] 2020-12-22 04:49:47 UTC [9105]: [8-1] user=; db=; app=; client=; SQLcode=00000 LOG: restarted WAL streaming at 0/F8000000 on timeline 12 2020-12-22 04:49:47 UTC [9105]: [9-1] user=; db=; app=; client=; SQLcode=00000 LOG: replication terminated by primary server 2020-12-22 04:49:47 UTC [9105]: [10-1] user=; db=; app=; client=; SQLcode=00000 DETAIL: End of WAL reached on timeline 12 at 0/F8000098. rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000E.history" failed: No such file or directory (2) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1179) [sender=3.1.2] rsync: link_stat "/wal_archive/ins30wfm02dbs/0000000C00000000000000F8" failed: No such file or directory (2) -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
pgsql-hackers by date: