> It is odd that the standby server crashes when replication fails,
> because the standby would keep retrying to get the next record even in
> such a case.
As I mentioned earlier, when replication fails, the standby retries to establish streaming replication. At that point, the value of walrcv->flushedUpto does not necessarily reflect what has actually been flushed to disk. However, the startup process mistakenly trusts walrcv->flushedUpto as the latest flushed LSN and attempts to open the corresponding WAL file, which does not exist; the open fails and the startup process PANICs.
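For reference, here is a condensed sketch of the code path in question, paraphrased from WaitForWALToBecomeAvailable() in src/backend/access/transam/xlogrecovery.c (variable setup and surrounding logic are omitted, so this is illustrative rather than the literal source):

/* Ask the walreceiver's shared state how far WAL is known to be flushed. */
flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);

if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
    havedata = true;    /* "streamed far enough" -- trusts flushedUpto */

if (havedata && readFile < 0)
{
    /*
     * emode is PANIC here: a segment the walreceiver claims to have
     * flushed is expected to exist on disk.  If flushedUpto is stale
     * from a previous connection, the open fails and the startup
     * process PANICs instead of retrying.
     */
    readFile = XLogFileRead(readSegNo, PANIC, receiveTLI,
                            XLOG_FROM_STREAM, false);
}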
Yugo Nagata <nagata@sraoss.co.jp> wrote on Wed, Aug 21, 2024 at 00:49:
> Is s1 a cascading standby of s2? If instead s1 and s2 are each standbys
> of the primary server, it is not surprising that s2 has progressed
> further than s1 when the primary fails. I believe this is the case in
> which you should use pg_rewind. Even if flushedUpto is reset as proposed
> in your patch, s2 might already have applied a WAL record that s1 has
> not processed yet, and there would be no guarantee that subsequent
> applies succeed.

Thank you for your response. In my scenario, s1 and s2 are both standbys of the primary server, with s1 a synchronous standby and s2 an asynchronous one. You mentioned that if s2's replay progress is ahead of s1's, pg_rewind should be used. However, what I'm trying to address is an issue where s2 crashes during replay after s1 has been promoted to primary, even though s2's progress hasn't surpassed s1's.
I understood your point. It is odd that the standby server crashes when replication fails, because the standby would keep retrying to get the next record even in such a case.
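The retry behavior I have in mind is the loop in WaitForWALToBecomeAvailable(), which keeps cycling between WAL sources and restarting the walreceiver rather than giving up. Roughly (a condensed sketch, not the literal source; variable setup omitted):

for (;;)
{
    if (currentSource == XLOG_FROM_STREAM)
    {
        /*
         * If the walreceiver is not running (e.g. the connection to
         * the primary was lost), ask for a new one to be started at
         * the LSN we still need, rather than giving up.
         */
        if (!WalRcvStreaming())
            RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
                                 PrimarySlotName,
                                 wal_receiver_create_temp_slot);

        /* wait for new WAL to arrive; on failure, fall back below */
    }
    else
    {
        /*
         * Try restoring from the archive or pg_wal; if nothing is
         * found there, switch back to streaming and retry.
         */
        currentSource = XLOG_FROM_STREAM;
    }
}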