In the first place, it's important to note that we do not guarantee that an async standby can always switch its replication connection to the old primary or another sibling standby. This is due to the variations in replication lag among standbys. pg_rewind is required to resolve such discrepancies.
Sure, I know. But in this case the async standby received and flushed exactly the same amount of WAL as the promoted one.
I might be overlooking something, but I don't understand how this occurs without purposefully tweaking WAL files. The repro script pushes an incomplete WAL file to the archive as a non-partial segment. This shouldn't happen in the real world.
It easily happens if the primary crashed and the standbys didn't receive the next page containing the continuation record.
In the repro script, the replication connection of the second standby is switched from the old primary to the first standby after its promotion. After the switch, replication is expected to continue from the beginning of the last replayed segment.
Well, maybe, but apparently the standby is busy trying to decode a record that spans multiple pages, and it just waits indefinitely for the next page to arrive. Also, the restart "fixes" the problem, because after a restart it does read the file from the beginning.
But with the script, the second standby copies the intentionally broken file, which differs from the data that should be received via streaming.
As I already said, this is a simple way to emulate a primary crash while the standbys are receiving WAL.
It could easily happen that a record that spans multiple pages is not fully received and flushed.
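To make the scenario concrete, here is a toy model in Python. It is only a sketch and not PostgreSQL code: the function names are hypothetical, page headers are ignored, and the only real-world value is the default 8192-byte WAL block size. It shows a record that starts near the end of one page and continues on the next, with only the first page flushed before the primary goes away.

```python
PAGE_SIZE = 8192  # default WAL block size; this toy model ignores page headers


def pages_spanned(start: int, length: int) -> int:
    """Number of pages touched by a record occupying [start, start + length)."""
    first = start // PAGE_SIZE
    last = (start + length - 1) // PAGE_SIZE
    return last - first + 1


def read_record(start: int, length: int, flushed_upto: int) -> str:
    """Hypothetical reader: 'ok' if the whole record has been flushed,
    'wait' if the continuation has not arrived yet."""
    if start + length <= flushed_upto:
        return "ok"
    return "wait"


# A 500-byte record starting at offset 8000 crosses the first page boundary.
print(pages_spanned(8000, 500))
# Only the first page was flushed before the crash: the reader keeps waiting.
print(read_record(8000, 500, flushed_upto=8192))
# Once the continuation is available, the record can be decoded.
print(read_record(8000, 500, flushed_upto=8500))
```

In this model the reader stays in the "wait" state until more bytes show up at the same position, which is the state the second standby appears to be stuck in.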