Re: Skip checkpoint on promoting from streaming replication - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: Skip checkpoint on promoting from streaming replication
Msg-id 20120619.173046.88698848.horiguchi.kyotaro@lab.ntt.co.jp
In response to Re: Skip checkpoint on promoting from streaming replication  (Fujii Masao <masao.fujii@gmail.com>)
Responses Re: Skip checkpoint on promoting from streaming replication
List pgsql-hackers
Thank you.

> What happens if the server skips an end-of-recovery checkpoint,
> is promoted to the master, runs some write transactions,
> crashes and restarts automatically before it completes
> checkpoint? In this case, the server needs to do crash recovery
> from the last checkpoint record with old timeline ID to the
> latest WAL record with new timeline ID. How does crash recovery
> do recovery beyond timeline?

Basically the same way as archive recovery, as far as I can see.
It is already implemented to work that way.

With this patch applied, StartupXLOG() gets its
recoveryTargetTLI from the new field latestTLI in the control
file instead of from the latest checkpoint. The latest checkpoint
record still carries its TLI and WAL location as before, but its
TLI no longer has a significant meaning in the recovery sequence.

Consider the following case:
      |seg 1     | seg 2    |
TLI 1 |...c......|....000000|
          C           P  X
TLI 2            |........00|

* C - checkpoint, P - promotion, X - crash just after here

This shows the situation where the latest checkpoint
(restartpoint) was taken at TLI=1/SEG=1/OFF=4, the server was
promoted at TLI=1/SEG=2/OFF=5, and it then crashed just after
TLI=2/SEG=2/OFF=8. Promotion itself inserts no WAL records but
creates a copy of the current segment for the new TLI. The file
for TLI=2/SEG=1 should not exist. (Who would create it?)

The control file will look as follows:

latest checkpoint : TLI=1/SEG=1/OFF=4
latest TLI        : 2

So the crash recovery sequence starts from SEG=1/OFF=4.
expectedTLIs will be (2, 1), so 1 will naturally be selected as
the TLI for SEG=1 and 2 for SEG=2 in XLogFileReadAnyTLI().
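
As a rough illustration of the selection described above, here is
a toy model in Python. It is not the actual C code in xlog.c; the
function name pick_tli and the set-of-tuples representation of
pg_xlog are made up for this sketch.

```python
# Toy model of per-segment timeline selection. This is NOT the real
# XLogFileReadAnyTLI(); names and data structures are hypothetical.

def pick_tli(expected_tlis, existing_files, seg):
    """Return the first TLI in expected_tlis (newest first) for which
    a WAL file for segment `seg` exists, or None if there is none."""
    for tli in expected_tlis:
        if (tli, seg) in existing_files:
            return tli
    return None

# Files present in pg_xlog in the example: TLI=1 has SEG=1 and SEG=2,
# TLI=2 has only SEG=2 (the copy created at promotion).
existing_files = {(1, 1), (1, 2), (2, 2)}
expected_tlis = [2, 1]

seg1_tli = pick_tli(expected_tlis, existing_files, 1)
seg2_tli = pick_tli(expected_tlis, existing_files, 2)
print(seg1_tli, seg2_tli)
```

Because no file exists for TLI=2/SEG=1, the lookup for SEG=1 falls
through to TLI=1, while SEG=2 is served by the TLI=2 file.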

Looking more closely, startup constructs expectedTLIs by reading
the timeline history file corresponding to recoveryTargetTLI. It
then runs the recovery sequence from the redo point of the latest
checkpoint, using in XLogPageRead() the WAL files with the
largest TLI (which is distinguished by file name, not by header)
within expectedTLIs. The only difference from archive recovery is
that XLogFileReadAnyTLI() reads only the WAL files already
sitting in the pg_xlog directory and does not reach for the
archive. The pages with the new TLI will be naturally picked up
in this sequence, as mentioned above, and recovery will then stop
at the last readable record.
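
To make the "by file name, not header" point concrete: a WAL
segment file name consists of 24 hex digits, the first 8 of which
are the TLI. The following is only a small Python sketch of
reading the TLI back from such a name; the helper
split_wal_filename is hypothetical and not part of PostgreSQL.

```python
import string

def split_wal_filename(name):
    """Split a 24-hex-digit WAL segment file name into
    (tli, segment_part). Hypothetical helper for illustration."""
    if len(name) != 24 or any(c not in string.hexdigits for c in name):
        raise ValueError("not a WAL segment file name: %r" % name)
    tli = int(name[:8], 16)      # first 8 hex digits: timeline ID
    segment_part = name[8:]      # remaining 16 hex digits: segment
    return tli, segment_part

# Example names chosen to match the scenario above (assumed values).
files = [
    "000000010000000000000001",
    "000000010000000000000002",
    "000000020000000000000002",
]
parsed = [split_wal_filename(f) for f in files]
print(parsed)
```

Two files with the same segment part but different leading digits
are the same segment on different timelines, which is what lets
recovery choose between them without opening either file.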

The latestTLI field in the control file is updated just after the
TLI is incremented and the new WAL files with the new TLI are
created. So the crash recovery sequence won't stop before
reaching the WAL with the new TLI designated in the control file.
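
The ordering argument can be sketched with a toy model in Python.
It assumes only what is stated above (the WAL file for the new TLI
is created first, the control file is updated afterwards); the
dict-based state and the names new_state, promote and
target_tli_has_wal are invented for this illustration and do not
correspond to real PostgreSQL structures.

```python
# Toy model: promotion creates the new-TLI WAL file first and only
# then records the new TLI in the control file. Hypothetical names.

def new_state():
    return {
        "files": {(1, 1), (1, 2)},      # (tli, seg) pairs in pg_xlog
        "control": {"latest_tli": 1},   # stand-in for the control file
        "current_seg": 2,
    }

def promote(state, crash_after_step=None):
    new_tli = state["control"]["latest_tli"] + 1
    # Step 1: create the WAL file for the new TLI.
    state["files"].add((new_tli, state["current_seg"]))
    if crash_after_step == 1:
        return  # simulated crash before the control file update
    # Step 2: record the new TLI in the control file.
    state["control"]["latest_tli"] = new_tli

def target_tli_has_wal(state):
    """True if some WAL file exists for the TLI named in the
    control file, i.e. recovery can reach that timeline."""
    tli = state["control"]["latest_tli"]
    return any(f_tli == tli for f_tli, _ in state["files"])

crashed = new_state()
promote(crashed, crash_after_step=1)
done = new_state()
promote(done)
print(target_tli_has_wal(crashed), target_tli_has_wal(done))
```

In this model the control file never names a TLI for which no WAL
file exists, whichever step the crash interrupts.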


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

== My e-mail address has been changed since Apr. 1, 2012.

