Then I tried to get things working on 9.6. There's a patch attached to back-port a couple of PostgresNode.pm methods from 10 to 9.6, and also a version of the main patch attached with the necessary wal->xlog, lsn->location renaming. Unfortunately ... the new test case still fails on 9.6 in a way that looks an awful lot like the bug isn't actually fixed:
LOG: primary server contains no more WAL on requested timeline 1
cp: /Users/rhaas/pgsql/src/test/recovery/tmp_check/data_primary_enMi/archives/000000010000000000000003: No such file or directory
(repeated many times)
I find that the same failure happens if I back-port the master version of the patch to v10 or v11,
I think this fails because, prior to v12, the recovery target timeline was not set to latest by default, since it was not a GUC at that time. With the fix below, the test started passing on v11 (only tested on v11 so far).
But now the test passes even without the fix, and the log shows that the standby never tried to stream from the primary on timeline 1, so it never reached the defect location.
2021-06-09 12:11:08.618 IST [122456] LOG: entering standby mode
2021-06-09 12:11:08.622 IST [122456] LOG: restored log file "00000002.history" from archive
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/t_025_stuck_on_old_timeline_primary_data/archives/000000010000000000000002’: No such file or directory
2021-06-09 12:11:08.627 IST [122456] LOG: redo starts at 0/2000028
2021-06-09 12:11:08.627 IST [122456] LOG: consistent recovery state reached at 0/3000000
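A rough way to tell the two situations apart from a standby log is to scan it for the "no more WAL on requested timeline" message seen in the 9.6 failure. The sketch below is only a heuristic: the function names are made up for illustration, the message text is copied from the 9.6 log quoted earlier in this mail, and the absence of that message does not by itself prove the standby never contacted the primary.

```python
import re

# Message text copied from the 9.6 failure log quoted in this thread.
# Assumption: the wording is the same on the branch being checked.
NO_MORE_WAL = re.compile(
    r"primary server contains no more WAL on requested timeline (\d+)"
)


def timelines_requested_from_primary(log_text):
    """Return the timeline IDs named in 'no more WAL on requested timeline' messages."""
    return {int(m.group(1)) for m in NO_MORE_WAL.finditer(log_text)}


def hit_old_timeline(log_text, old_tli=1):
    """True if the log shows the standby asking the primary for the old timeline."""
    return old_tli in timelines_requested_from_primary(log_text)
```

Running hit_old_timeline() over the standby log of a failing run and of a passing run should show whether the test reached the streaming path on timeline 1 at all.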
Next, I will investigate why, without the fix, v11 (and maybe v12, v10, ...) does not hit the defect location at all. After that, I will check the status on the older versions.