I identified the cause of the second issue. When I tried to reproduce the issue, the second standby happened to receive the old timeline's last page-spanning record in full while the first standby was promoting (although recovery had not read it yet). In addition, on the second standby there is a time window in which the timeline ID has been incremented but the first segment of the new timeline is not available yet. In that case, the second standby successfully reads the page-spanning record from the old timeline even after it has noticed that the timeline ID has increased, thanks to the robustness of XLogFileReadAnyTLI().
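To make the window concrete, here is a minimal Python sketch of the fallback behaviour as I understand it. It is a simplified model, not the actual code of XLogFileReadAnyTLI(): the function name `read_segment_any_tli`, the `(tli, segno)` tuples and the segment numbers are all hypothetical, and the only assumption taken from the discussion is that a segment missing on the new timeline can still be found on an older timeline in the history.

```python
from typing import Iterable, Optional, Set, Tuple


def read_segment_any_tli(
    segno: int,
    expected_tlis: Iterable[int],
    available: Set[Tuple[int, int]],
) -> Optional[int]:
    """Return the first timeline ID that has a copy of segment `segno`.

    expected_tlis: timeline IDs from the timeline history, newest first
                   (hypothetical simplification of the real lookup).
    available:     (tli, segno) pairs that exist locally.
    """
    for tli in expected_tlis:
        if (tli, segno) in available:
            return tli
    return None


# The second standby already knows about timeline 2, but only timeline 1's
# copy of segment 5 (holding the page-spanning record) is on disk.
available = {(1, 4), (1, 5)}
chosen_in_window = read_segment_any_tli(5, [2, 1], available)
print(chosen_in_window)  # falls back to the old timeline

# Once the first segment of the new timeline arrives, it takes precedence.
available.add((2, 5))
chosen_after = read_segment_any_tli(5, [2, 1], available)
print(chosen_after)
```

In this model the read inside the window succeeds from timeline 1, which is the situation described above.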
Hmm, I don't think it could really be prevented.
There is always a chance that a standby that is not ahead of the other standbys gets promoted, for reasons like:
1. The HA configuration doesn't allow certain nodes to be promoted.
2. This is an async standby (its name isn't listed in synchronous_standby_names) and it was ahead of the promoted sync standby. There is no data loss from the client's point of view.
Of course, regardless of the changes above, if recovery on the second standby had reached the end of the page-spanning record before redirection to the first standby, it would need pg_rewind to connect to the first standby.
Correct, IMO pg_rewind is the right way to solve it.
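As a rough illustration of that decision, the sketch below compares the LSN a standby has replayed against the point where the new timeline forked and, if the standby has gone past it, builds a pg_rewind command line. The helper names, the LSN values, the data directory and the connection string are all made up for the example; only the `--target-pgdata` and `--source-server` options are real pg_rewind options. Treat it as a sketch of the reasoning, not as a substitute for pg_rewind's own divergence check.

```python
from typing import List


def parse_lsn(text: str) -> int:
    """Convert an LSN in 'X/Y' hex notation into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def needs_rewind(replayed_lsn: str, switch_lsn: str) -> bool:
    """True if replay on the old timeline went past the fork point."""
    return parse_lsn(replayed_lsn) > parse_lsn(switch_lsn)


def pg_rewind_command(target_pgdata: str, source_conninfo: str) -> List[str]:
    """Build the argument list; paths and conninfo are caller-supplied."""
    return [
        "pg_rewind",
        f"--target-pgdata={target_pgdata}",
        f"--source-server={source_conninfo}",
    ]


# Hypothetical values: the second standby replayed the whole page-spanning
# record, which ends after the point where the new timeline forked.
replayed = "0/5000028"
switch_point = "0/4FFFFC8"

if needs_rewind(replayed, switch_point):
    cmd = pg_rewind_command("/path/to/standby2/data", "host=standby1 dbname=postgres")
    print(" ".join(cmd))
```

If recovery had stopped at or before the fork point, `needs_rewind` returns false and the standby could simply follow the new timeline.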