Re: Race condition in recovery? - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Race condition in recovery?
Date
Msg-id CAFiTN-spAMc6WsobbphZDDz+QuwNOmWfTeR6d2BX3W=_NMmP9g@mail.gmail.com
In response to Re: Race condition in recovery?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Race condition in recovery?
List pgsql-hackers
On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> Could you please fix the test script so that it causes your issue
> correctly? And/or elaborate a bit more?
>
> The attached first file is the debugging aid logging. The second is
> the test script, to be placed in src/test/recovery/t.

I will look into your test case and try to see whether we can
reproduce the issue.  But let me first summarise the exact issue.
In validateRecoveryParameters, if the recovery target is the latest
timeline, we fetch the latest history file and set recoveryTargetTLI
to the latest available timeline (assume it's 2), but we delay updating
expectedTLEs (as per commit ee994272ca50f70b53074f0febaec97e28f83c4e).
Then, while reading the checkpoint record, if we don't get the required
WAL from the archive we try to get it from the primary, and to fetch the
checkpoint from the primary we use
"ControlFile->checkPointCopy.ThisTimeLineID"; suppose that is the older
timeline 1.  After reading the checkpoint we set expectedTLEs based on
the timeline from which we got the checkpoint record.

See the logic below in WaitForWalToBecomeAvailable:
                        if (readFile < 0)
                        {
                            if (!expectedTLEs)
                                expectedTLEs = readTimeLineHistory(receiveTLI);

Now, the first problem is that we are breaking the sanity of
expectedTLEs: by definition it should start with recoveryTargetTLI,
but here it starts with the older TLI.  Later, in rescanLatestTimeLine,
we try to fetch the latest TLI, which is still 2, so the function
returns without reinitializing expectedTLEs because it assumes that if
recoveryTargetTLI already points to 2 then expectedTLEs must be
correct and need not be changed.

See the logic below:
rescanLatestTimeLine(void)
{
    ....
    newtarget = findNewestTimeLine(recoveryTargetTLI);
    if (newtarget == recoveryTargetTLI)
    {
        /* No new timelines found */
        return false;
    }
    ...
    newExpectedTLEs = readTimeLineHistory(newtarget);
    ...
    expectedTLEs = newExpectedTLEs;
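
To make the sequence concrete, here is a small toy model of the scenario. It is only a sketch and not the real xlog.c code: the struct, its fields, and the functions lazy_init_expected, toy_rescan, expected_is_sane and replay_scenario are hypothetical names, and expectedTLEs is reduced to just its newest (first) timeline ID.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int TimeLineID;

/* Toy stand-in for the recovery state; NOT the real xlog.c globals. */
typedef struct RecoveryState
{
    TimeLineID recoveryTargetTLI;   /* timeline recovery is aiming for */
    TimeLineID expectedHeadTLI;     /* first TLI of expectedTLEs, 0 if unset */
    TimeLineID newestAvailableTLI;  /* what a history-file scan would find */
} RecoveryState;

/* Models the lazy "if (!expectedTLEs)" initialization from receiveTLI. */
void
lazy_init_expected(RecoveryState *st, TimeLineID receiveTLI)
{
    if (st->expectedHeadTLI == 0)
        st->expectedHeadTLI = receiveTLI;
}

/* Models the early return in rescanLatestTimeLine. */
bool
toy_rescan(RecoveryState *st)
{
    TimeLineID newtarget = st->newestAvailableTLI;

    if (newtarget == st->recoveryTargetTLI)
        return false;               /* no new timeline: expectedTLEs untouched */

    st->recoveryTargetTLI = newtarget;
    st->expectedHeadTLI = newtarget;
    return true;
}

/* The invariant discussed above: expectedTLEs starts with recoveryTargetTLI. */
bool
expected_is_sane(const RecoveryState *st)
{
    return st->expectedHeadTLI == st->recoveryTargetTLI;
}

/* Replays the scenario: target TLI 2, checkpoint streamed on TLI 1. */
bool
replay_scenario(void)
{
    RecoveryState st = {
        .recoveryTargetTLI = 2,
        .expectedHeadTLI = 0,
        .newestAvailableTLI = 2
    };

    lazy_init_expected(&st, 1);     /* checkpoint came from the older timeline */
    (void) toy_rescan(&st);         /* returns false, nothing is repaired */
    return expected_is_sane(&st);
}
```

In this model replay_scenario() ends with the invariant violated: the rescan sees that the newest timeline equals recoveryTargetTLI and returns early, so the stale head of expectedTLEs is never corrected.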


Solution:
1. Find a better way to fix the problem addressed by commit
ee994272ca50f70b53074f0febaec97e28f83c4e, one that does not break the
sanity of expectedTLEs.
2. Assume we have to live with that commit's fix, so we must initialize
expectedTLEs with an older timeline to validate the checkpoint in the
absence of the timeline history file (as the commit claims).  Then, as
soon as we read and validate the checkpoint, fix expectedTLEs by setting
it based on the history file of recoveryTargetTLI.
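
As a rough illustration of option 2, the sketch below uses a toy model rather than the real xlog.c code. ToyRecovery, fix_expected_after_checkpoint and run_option2 are hypothetical names, expectedTLEs is reduced to its newest timeline ID, and the assignment marked in the comment merely stands in for re-reading the history file of recoveryTargetTLI.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int TimeLineID;

/* Toy state; field names are illustrative, not the real xlog.c variables. */
typedef struct ToyRecovery
{
    TimeLineID recoveryTargetTLI;   /* timeline recovery is aiming for */
    TimeLineID expectedHeadTLI;     /* newest TLI in expectedTLEs, 0 if unset */
    bool       checkpointValidated; /* set once the checkpoint has been read */
} ToyRecovery;

/*
 * Hypothetical hook, called once the checkpoint record has been read and
 * validated: if expectedTLEs was provisionally built from an older
 * timeline, rebuild it from the recovery target timeline.
 */
void
fix_expected_after_checkpoint(ToyRecovery *st)
{
    st->checkpointValidated = true;

    if (st->expectedHeadTLI != st->recoveryTargetTLI)
    {
        /* stands in for expectedTLEs = readTimeLineHistory(recoveryTargetTLI) */
        st->expectedHeadTLI = st->recoveryTargetTLI;
    }
}

/* Runs the scenario and returns the resulting head of expectedTLEs. */
TimeLineID
run_option2(TimeLineID targetTLI, TimeLineID checkpointTLI)
{
    ToyRecovery st = { targetTLI, 0, false };

    st.expectedHeadTLI = checkpointTLI;     /* provisional, older timeline */
    fix_expected_after_checkpoint(&st);
    return st.expectedHeadTLI;
}
```

With a target of 2 and a checkpoint read on timeline 1, run_option2(2, 1) leaves expectedTLEs headed by timeline 2, which is the state that rescanLatestTimeLine assumes when it returns early.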

Does this explanation make sense?  If not, please let me know which
part is not clear so I can point to the relevant code.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


