It seems to me that there is currently a pitfall in the pg_rewind implementation.
Imagine the following situation:
There is a cluster consisting of a primary (configured with wal_level='replica' and archive_mode='on') and a replica.
The primary is not archiving WAL segments fast enough (e.g. because of network issues or high CPU/disk load).
The primary fails
The replica is promoted
We are unlucky: the timelines of the new and the old primary have diverged, so we need to run pg_rewind.
We are even less lucky: the old primary still has some WAL segments with .ready signal files that were generated before the point of divergence and were never archived (e.g. 000000020004D20200000095.done, 000000020004D20200000096.ready, 000000020004D20200000097.ready, 000000020004D20200000098.ready).
The promoted primary runs for some time and recycles the old WAL segments.
We revive the old primary and try to rewind it.
When pg_rewind finishes successfully, we see that the WAL segments with .ready files have been removed, because they were already absent on the promoted replica. We end up completely losing some WAL segments, even though we had a clear sign that they had not been archived and, more importantly, even though pg_rewind read these segments while collecting information about the data blocks.
The old primary fails to start because of the missing WAL segments (more precisely, the missing records between the last common checkpoint and the point of divergence) with the following log record: "ERROR: requested WAL segment 000000020004D20200000096 has already been removed"
In this situation, after pg_rewind, the following segments are archived:
000000020004D20200000095
000000020004D20200000099.partial
000000030004D20200000099
The following segments are lost:
000000020004D20200000096
000000020004D20200000097
000000020004D20200000098
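For illustration, the segments at risk can be spotted before the rewind by looking for .ready markers in the old primary's pg_wal/archive_status directory. The following is only a minimal Python sketch of that check, not part of pg_rewind or of the attached patch; the function name unarchived_segments is hypothetical, and the demo recreates the marker files from the example above in a temporary directory instead of touching a real data directory.

```python
import os
import tempfile


def unarchived_segments(archive_status_dir):
    """Return the sorted names of WAL segments that still have a .ready marker."""
    suffix = ".ready"
    names = [
        entry[: -len(suffix)]
        for entry in os.listdir(archive_status_dir)
        if entry.endswith(suffix)
    ]
    return sorted(names)


if __name__ == "__main__":
    # Recreate the archive_status contents from the example (hypothetical setup).
    markers = (
        "000000020004D20200000095.done",
        "000000020004D20200000096.ready",
        "000000020004D20200000097.ready",
        "000000020004D20200000098.ready",
    )
    with tempfile.TemporaryDirectory() as status_dir:
        for marker in markers:
            open(os.path.join(status_dir, marker), "w").close()
        for segment in unarchived_segments(status_dir):
            print(segment)
```

On a real system the argument would be the archive_status directory under pg_wal of the old primary, inspected while the server is stopped.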
Thus, my thoughts are: why can't pg_rewind be a little wiser when creating the filemap for WAL? Can it preserve on the target the WAL segments that contain those potentially lost records (after the last common checkpoint and before the point of divergence)? (See the attached patch.)
If I am missing something, however, please correct me or explain why this straightforward solution cannot be implemented.
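To make the range concrete, here is a rough Python sketch (again, not the patch itself) of how the names of the segments to preserve could be derived from the two positions. It assumes the default 16 MB WAL segment size and the usual timeline/log/segment file naming. The helper names and the LSN values in the demo are made up for illustration, and the segment containing the point of divergence is included in the result.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default wal_segment_size


def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' (hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline, segno, segment_size=WAL_SEGMENT_SIZE):
    """Build the 24-character WAL file name for a segment number."""
    segments_per_id = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_id,
        segno % segments_per_id,
    )


def segments_to_preserve(timeline, checkpoint_lsn, divergence_lsn,
                         segment_size=WAL_SEGMENT_SIZE):
    """List segments from the last common checkpoint up to the divergence point."""
    first = parse_lsn(checkpoint_lsn) // segment_size
    last = parse_lsn(divergence_lsn) // segment_size
    return [
        wal_file_name(timeline, segno, segment_size)
        for segno in range(first, last + 1)
    ]


if __name__ == "__main__":
    # Hypothetical LSNs chosen to match the segment names in the example.
    for name in segments_to_preserve(2, "4D202/96000028", "4D202/99000100"):
        print(name)
```

With these made-up positions the list runs from 000000020004D20200000096 to 000000020004D20200000099, which covers the three lost segments from the example.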