Re: Race condition in recovery? - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Race condition in recovery?
Date
Msg-id CAFiTN-tJ8gKs0+f7wsybdd3dUX73ZxiSEKN9vjso2=GnhgTJjw@mail.gmail.com
In response to Re: Race condition in recovery?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
On Tue, May 18, 2021 at 12:22 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

> And finally I think I could reach the situation the commit wanted to fix.
>
> I took a basebackup from a standby just before replaying the first
> checkpoint of the new timeline (by using debugger), without copying
> pg_wal.  In this backup, the control file contains checkPointCopy of
> the previous timeline.
>
> I modified StartXLOG so that expectedTLEs is set just after first
> determining recoveryTargetTLI, then started the grandchild node.  I
> have the following error and the server fails to continue replication.

> [postmaster] LOG:  starting PostgreSQL 14beta1 on x86_64-pc-linux-gnu...
> [startup] LOG:  database system was interrupted while in recovery at log...
> [startup] LOG:  set expectedtles tli=6, length=1
> [startup] LOG:  Probing history file for TLI=7
> [startup] LOG:  entering standby mode
> [startup] LOG:  scanning segment 3 TLI 6, source 0
> [startup] LOG:  Trying fetching history file for TLI=6
> [walreceiver] LOG:  fetching timeline history file for timeline 5 from pri...
> [walreceiver] LOG:  fetching timeline history file for timeline 6 from pri...
> [walreceiver] LOG:  started streaming ... primary at 0/3000000 on timeline 5
> [walreceiver] DETAIL:  End of WAL reached on timeline 5 at 0/30006E0.
> [startup] LOG:  unexpected timeline ID 1 in log segment 000000050000000000000003, offset 0
> [startup] LOG:  Probing history file for TLI=7
> [startup] LOG:  scanning segment 3 TLI 6, source 0
> (repeats forever)

So IIUC, this log shows that
"ControlFile->checkPointCopy.ThisTimeLineID" is 6 but the
"ControlFile->checkPoint" record is on TL 5?  I think if you had the
old version of the code (before the commit), or the code below [1]
right after initializing expectedTLEs, then you would have hit the
FATAL that the patch fixed.

While debugging, did you compare the "ControlFile->checkPoint" LSN
with the first LSN of the first segment on TL 6?

    expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
[1]
    if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) !=
        ControlFile->checkPointCopy.ThisTimeLineID)
    {
        report(FATAL..
    }
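
To make the check in [1] concrete, here is a minimal standalone sketch of
the lookup it relies on.  This is not the xlog.c code: HistoryEntry,
tli_of_point() and checkpoint_matches() are simplified, hypothetical
stand-ins for TimeLineHistoryEntry, tliOfPointInHistory() and the
comparison above, and the history is a plain array rather than a List.
I am assuming each entry covers [begin, end), with end = 0 meaning the
timeline is still open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;

/* Simplified, hypothetical stand-in for TimeLineHistoryEntry. */
typedef struct
{
    TimeLineID tli;
    XLogRecPtr begin;   /* inclusive start of this timeline */
    XLogRecPtr end;     /* exclusive end; 0 means "still open" */
} HistoryEntry;

/*
 * Return the timeline whose [begin, end) range contains ptr,
 * or 0 if no entry in the history covers it.
 */
static TimeLineID
tli_of_point(XLogRecPtr ptr, const HistoryEntry *hist, int n)
{
    assert(hist != NULL && n > 0);

    for (int i = 0; i < n; i++)
    {
        if (hist[i].begin <= ptr &&
            (hist[i].end == 0 || ptr < hist[i].end))
            return hist[i].tli;
    }
    return 0;
}

/*
 * The condition from [1], inverted: returns 1 when the checkpoint LSN
 * lies on the timeline recorded for the checkpoint, 0 when startup
 * would report the FATAL.
 */
static int
checkpoint_matches(XLogRecPtr checkpoint_lsn, TimeLineID checkpoint_tli,
                   const HistoryEntry *hist, int n)
{
    return tli_of_point(checkpoint_lsn, hist, n) == checkpoint_tli;
}
```

With a made-up history in which TL 5 ends and TL 6 begins at 0/30006E0, a
checkpoint LSN below that point maps to TL 5, so pairing it with a
recorded timeline of 6 fails the comparison.  That mismatch is the
situation I am asking about above.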

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


