Re: Race condition in recovery? - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Race condition in recovery?
Date
Msg-id CAFiTN-t5L5PYbZW5K1OcdK0qcfgvi570KBTjReZPwYgVz4A56g@mail.gmail.com
Whole thread Raw
In response to Re: Race condition in recovery?  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Race condition in recovery?
List pgsql-hackers
On Tue, May 18, 2021 at 1:28 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Sorry, you're right. It couldn't be uninitialized, but it could be a
> fake 1-element list saying there are no ancestors rather than the real
> value. So I think the point was to avoid that.

Yeah, it will be a fake 1-element list.  But just to be clear that
1-element can only be "ControlFile->checkPointCopy.ThisTimeLineID" and
nothing else, do you agree to this?  Because we initialize
recoveryTargetTLI to this value and we might change it in
readRecoveryCommandFile() but for that, we must get the history file,
so if we are talking about the case where we don't have the history
file then "recoveryTargetTLI" will still be
"ControlFile->checkPointCopy.ThisTimeLineID".

>
> I think the issue here is: If expectedTLEs was initialized before the
> history file was available, and contained a dummy 1-element list, then
> tliOfPointInHistory() is going to say that every LSN is on that
> timeline rather than any previous timeline. And if we are supposed to
> be switching timelines then that will lead to this sanity check
> failing.

You are talking about the sanity check of validating the timeline of
the checkpoint record right?  but as I mentioned earlier the only
entry in expectedTLEs will be the TLE of the checkpoint record so how
the sanity check will fail?

>
> I agree, but that's actually bad, isn't it?

Yes, it is bad.

 I mean if we want the
> sanity check to never fail we can just take it out. What we want to
> happen is that the sanity check should pass if the startup timeline if
> the TLI of the startup checkpoint is in the history of the recovery
> target timeline, but fail if it isn't. The only way to achieve that
> behavior is if expectedTLEs is initialized from the recovery target
> timeline.

Yes, I agree, with this.  So initializing expectedTLEs with the
recovery target timeline is the right fix.

Conclusion:
- I think now we agree on the point that initializing expectedTLEs
with the recovery target timeline is the right fix.
- We still have some differences of opinion about what was the
original problem in the base code which was fixed by the commit
(ee994272ca50f70b53074f0febaec97e28f83c4e).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: Testing autovacuum wraparound (including failsafe)
Next
From: Peter Geoghegan
Date:
Subject: Re: Testing autovacuum wraparound (including failsafe)