In this case, doesn't the flush LSN typically catch up to the write LSN on node2 after a few seconds? Even if the walreceiver exits while there's still written but unflushed WAL, it looks like WalRcvDie() ensures everything is flushed by calling XLogWalRcvFlush(). So, isn't it safe to rely on the flush LSN when selecting the most advanced node? No?
I think it is a bit more complex than that. There are also cases where we want to ensure that there are "healthy" standby nodes when a switchover is requested.
The meaning of "healthy" could be something like: "According to the write LSN, it is not lagging by more than 16MB" or similar.
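To make the "healthy" check concrete, here is a minimal sketch of what an HA tool might do. The names parse_lsn() and is_healthy() are hypothetical, and the 16MB limit is just the example figure from above; the only PostgreSQL fact relied on is the textual LSN format "X/Y" (two hex numbers, high and low 32 bits).

```python
# Hypothetical helper names; not part of PostgreSQL or any specific HA tool.
WAL_LAG_LIMIT = 16 * 1024 * 1024  # 16MB, the example threshold from the text


def parse_lsn(text: str) -> int:
    """Convert an LSN in 'X/Y' hex notation to a 64-bit byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def is_healthy(primary_lsn: str, standby_write_lsn: str,
               limit: int = WAL_LAG_LIMIT) -> bool:
    """Return True if the standby's write LSN lags by at most `limit` bytes."""
    lag = parse_lsn(primary_lsn) - parse_lsn(standby_write_lsn)
    return lag <= limit
```

The standby's write LSN would have to come from somewhere that is still available when the walreceiver is not running, which is the point of the proposal.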
Now it is possible to extract this value using pg_stat_get_wal_receiver()/pg_stat_wal_receiver, but it works only when the walreceiver process is alive.
>>> Caveat: we already have a function pg_last_wal_receive_lsn(), which in fact returns flushed LSN, not written. I propose to add a new function which returns LSN actually written. Internals of this function are already implemented (GetWalRcvWriteRecPtr()), but unused.
GetWalRcvWriteRecPtr() returns walrcv->writtenUpto, which can move backward when the walreceiver restarts. Is this behavior OK for your purpose?
IMO, most HA tools are prepared for it. They can't rely only on the write/flush LSN, because a standby may be replaying WAL from the archive using restore_command, in which case only the replay LSN progresses.
That is, they are supposed to be doing something like max(write_lsn, replay_lsn).
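As a rough illustration of that max(write_lsn, replay_lsn) logic, assuming the tool has already fetched both values per standby as text (or None when a value is unavailable, e.g. no walreceiver). The function names and the node names in the usage note are hypothetical.

```python
from typing import Dict, Optional, Tuple


def parse_lsn(text: str) -> int:
    """Convert an LSN in 'X/Y' hex notation to a 64-bit byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def effective_lsn(write_lsn: Optional[str], replay_lsn: Optional[str]) -> int:
    """max(write_lsn, replay_lsn), ignoring values that are not available."""
    known = [parse_lsn(v) for v in (write_lsn, replay_lsn) if v is not None]
    return max(known, default=0)


def most_advanced(standbys: Dict[str, Tuple[Optional[str], Optional[str]]]) -> str:
    """Pick the standby name with the highest effective position."""
    return max(standbys, key=lambda name: effective_lsn(*standbys[name]))
```

For example, with {"node2": ("0/3000000", "0/3000000"), "node3": (None, "0/4000000")} this picks node3, even though node3 reports no write LSN because it is restoring from the archive.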