Hello!
In our production system with tens of thousands of PostgreSQL clusters, we encounter exactly the same issue and are forced to synchronize upstreams and downstreams via external means, which is quite suboptimal.
I've done some work on the proposed patch and would like to present it for discussion.
There are a number of changes, such as sending just the TLI and segment number instead of the full WAL file name, shifting some work into the archiver, and adding shared memory for walreceiver/archiver synchronization.
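To illustrate the first change: a (TLI, segment number) pair is enough for the receiver to rebuild the WAL file name, because the name is just three 8-digit hex fields. Below is a minimal standalone sketch of that mapping in C. The struct ArchiveReport and the function names are hypothetical and not taken from the patch, and the message layout the patch actually uses may differ.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAL_FNAME_LEN ((size_t) 24)	/* three 8-digit hex fields */

/* Hypothetical report payload: timeline ID plus segment number. */
typedef struct ArchiveReport
{
	uint32_t	tli;
	uint64_t	segno;
} ArchiveReport;

/* Number of WAL segments that fit into one 4 GB "xlog id". */
static uint64_t
segments_per_xlogid(uint32_t wal_segsz_bytes)
{
	assert(wal_segsz_bytes > 0);
	return UINT64_C(0x100000000) / wal_segsz_bytes;
}

/* Rebuild the 24-character WAL file name from a (TLI, segno) pair. */
static void
report_to_wal_fname(const ArchiveReport *r, uint32_t wal_segsz_bytes,
					char *buf, size_t buflen)
{
	uint64_t	per_id = segments_per_xlogid(wal_segsz_bytes);

	snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
			 r->tli,
			 (uint32_t) (r->segno / per_id),
			 (uint32_t) (r->segno % per_id));
}

/* Inverse mapping; returns 1 on success, 0 if fname is not a WAL segment name. */
static int
wal_fname_to_report(const char *fname, uint32_t wal_segsz_bytes,
					ArchiveReport *r)
{
	uint32_t	tli, hi, lo;

	if (strlen(fname) != WAL_FNAME_LEN ||
		strspn(fname, "0123456789ABCDEF") != WAL_FNAME_LEN)
		return 0;
	if (sscanf(fname, "%8" SCNx32 "%8" SCNx32 "%8" SCNx32,
			   &tli, &hi, &lo) != 3)
		return 0;
	r->tli = tli;
	r->segno = (uint64_t) hi * segments_per_xlogid(wal_segsz_bytes) + lo;
	return 1;
}
```

With this, the report shrinks from a 24-byte string to a fixed 12 bytes, and the receiver does not have to validate an arbitrary file name coming from upstream.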
There are also a number of currently unresolved issues, which I would like to discuss.
1. Should we update pg_stat_archiver on the standby to support cascading replication, or should we just resend the report received from upstream? Personally, I lean toward the pg_stat_archiver path, because it requires less `if-else` programming.
2. What should we do with *.history.ready, *.backup.ready and *.partial.ready files on the standby? I think we can just stamp them with .done.
3. Should we keep this awkward part with the switchpoint calculation in the timeline switch case? I think all segments that are not in our server's history should just be stamped with .done.
4. Currently .done is forced by either the walreceiver (on receiving a report from upstream) or the archiver. Should we move this into the archiver entirely?
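For reference in points 2 to 4, "stamping with .done" boils down to renaming a status file from X.ready to X.done. Here is a rough standalone sketch in C of that step. The helper names and the status_dir parameter are hypothetical, and this is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: derive "X.done" from "X.ready".
 * Returns false if the name does not end in ".ready" or does not fit.
 */
static bool
ready_to_done_name(const char *ready, char *done, size_t donelen)
{
	static const char ready_sfx[] = ".ready";
	static const char done_sfx[] = ".done";
	size_t		sfx = sizeof(ready_sfx) - 1;
	size_t		len;
	size_t		stem;

	assert(ready != NULL && done != NULL);
	len = strlen(ready);
	if (len <= sfx || strcmp(ready + len - sfx, ready_sfx) != 0)
		return false;
	stem = len - sfx;
	if (stem + sizeof(done_sfx) > donelen)
		return false;
	memcpy(done, ready, stem);
	memcpy(done + stem, done_sfx, sizeof(done_sfx));	/* copies the NUL too */
	return true;
}

/*
 * Hypothetical helper: rename status_dir/X.ready to status_dir/X.done.
 * Works the same for segment, .history, .backup and .partial entries.
 */
static bool
force_done(const char *status_dir, const char *ready_name)
{
	char		done_name[256];
	char		from[1024];
	char		to[1024];
	int			n;

	if (!ready_to_done_name(ready_name, done_name, sizeof(done_name)))
		return false;
	n = snprintf(from, sizeof(from), "%s/%s", status_dir, ready_name);
	if (n < 0 || (size_t) n >= sizeof(from))
		return false;
	n = snprintf(to, sizeof(to), "%s/%s", status_dir, done_name);
	if (n < 0 || (size_t) n >= sizeof(to))
		return false;
	return rename(from, to) == 0;
}
```

If this ends up living only in the archiver (point 4), the walreceiver would just publish the last reported (TLI, segno) and never touch the status files itself.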
Thank you for your interest in this topic!