Re: Race condition in recovery? - Mailing list pgsql-hackers
From | Dilip Kumar |
---|---|
Subject | Re: Race condition in recovery? |
Date | |
Msg-id | CAFiTN-spAMc6WsobbphZDDz+QuwNOmWfTeR6d2BX3W=_NMmP9g@mail.gmail.com |
In response to | Re: Race condition in recovery? (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Responses | Re: Race condition in recovery? |
List | pgsql-hackers |
On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> Could you please fix the test script so that it causes your issue
> correctly? And/or elaborate a bit more?
>
> The attached first file is the debugging aid logging. The second is
> the test script, to be placed in src/test/recovery/t.

I will look into your test case and try to see whether we can
reproduce the issue. But let me summarise what the exact issue is.

Basically, the issue is that in validateRecoveryParameters, if the
recovery target is the latest, we fetch the latest history file and set
recoveryTargetTLI to the latest available timeline (assume it's 2), but
we delay updating expectedTLEs (as per commit
ee994272ca50f70b53074f0febaec97e28f83c4e). Now, while reading the
checkpoint record, if we don't get the required WAL from the archive
then we try to get it from the primary, and while getting the checkpoint
from the primary we use "ControlFile->checkPointCopy.ThisTimeLineID";
suppose that is the older timeline 1. After reading the checkpoint we
set expectedTLEs based on the timeline from which we got the checkpoint
record.

See the logic below in WaitForWalToBecomeAvailable:

    if (readFile < 0)
    {
        if (!expectedTLEs)
            expectedTLEs = readTimeLineHistory(receiveTLI);

Now, the first problem is that we are breaking the sanity of
expectedTLEs, because by definition it should start with
recoveryTargetTLI, but here it starts with the older TLI. Then, in
rescanLatestTimeLine, we try to fetch the latest TLI, which is still 2,
so this logic returns without reinitializing expectedTLEs because it
assumes that if recoveryTargetTLI is pointing to 2 then expectedTLEs
must be correct and need not be changed.

See the logic below:

    rescanLatestTimeLine(void)
    {
        ....
        newtarget = findNewestTimeLine(recoveryTargetTLI);
        if (newtarget == recoveryTargetTLI)
        {
            /* No new timelines found */
            return false;
        }
        ...
        newExpectedTLEs = readTimeLineHistory(newtarget);
        ...
        expectedTLEs = newExpectedTLEs;

Solution:
1. Find a better way to fix the problem addressed by commit
(ee994272ca50f70b53074f0febaec97e28f83c4e), which is breaking the sanity
of expectedTLEs.
2. Assume we have to live with fix 1 and we have to initialize
expectedTLEs with an older timeline for validating the checkpoint in the
absence of the tl.history file (as this commit claims). Then, as soon as
we read and validate the checkpoint, fix expectedTLEs and set it based
on the history file of recoveryTargetTLI.

Does this explanation make sense? If not, please let me know which part
of the explanation is not clear so I can point to that code.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
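The sequence described in the message above can be modelled in a few lines of standalone C. This is only a sketch, not PostgreSQL code: RecoveryModel, the model_* functions and run_scenario are hypothetical names, expectedTLEs is reduced to its first entry, and timelines 1 and 2 are the values assumed in the message. It shows that once expectedTLEs is initialised from the older timeline, the rescan step leaves it stale, and that resetting it after the checkpoint is validated (solution 2) restores the invariant.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for recovery state; NOT the real xlog.c types. */
typedef unsigned int TimeLineID;

typedef struct RecoveryModel
{
    TimeLineID recoveryTargetTLI; /* timeline that recovery is aiming for */
    TimeLineID expectedHead;      /* first entry of expectedTLEs, 0 = unset */
    TimeLineID newestAvailable;   /* newest timeline with a history file */
} RecoveryModel;

/* Target is "latest": the target TLI is chosen, expectedTLEs stays unset. */
static void
model_validate_recovery_parameters(RecoveryModel *m)
{
    m->recoveryTargetTLI = m->newestAvailable;
    m->expectedHead = 0;
}

/* Checkpoint arrives from the primary on receiveTLI; lazy initialisation. */
static void
model_read_checkpoint(RecoveryModel *m, TimeLineID receiveTLI)
{
    if (m->expectedHead == 0)
        m->expectedHead = receiveTLI;
}

/* Rescan: only rebuilds expectedTLEs when a newer target timeline appears. */
static bool
model_rescan_latest_timeline(RecoveryModel *m)
{
    TimeLineID newtarget = m->newestAvailable;

    if (newtarget == m->recoveryTargetTLI)
        return false;           /* no new timelines found */

    m->recoveryTargetTLI = newtarget;
    m->expectedHead = newtarget;
    return true;
}

/* Sketch of solution 2: re-derive expectedTLEs from the target timeline. */
static void
model_fix_after_checkpoint(RecoveryModel *m)
{
    m->expectedHead = m->recoveryTargetTLI;
}

/* Returns true if expectedTLEs starts with recoveryTargetTLI at the end. */
static bool
run_scenario(bool apply_fix)
{
    RecoveryModel m = {0, 0, 2};    /* newest available timeline is 2 */
    bool        rescanned;

    model_validate_recovery_parameters(&m);
    model_read_checkpoint(&m, 1);   /* checkpoint comes from timeline 1 */
    if (apply_fix)
        model_fix_after_checkpoint(&m);

    rescanned = model_rescan_latest_timeline(&m);
    assert(!rescanned);             /* the target was already 2 */
    (void) rescanned;

    return m.expectedHead == m.recoveryTargetTLI;
}
```

In this model, run_scenario(false) leaves the head of expectedTLEs at timeline 1 while the target is 2, and run_scenario(true) brings the two back in line. Whether the real fix belongs right after checkpoint validation or elsewhere is the open question in the thread.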