Do you happen to have historical host-monitoring data available for the period when the replication interruption happened? It would definitely be worth checking for CPU saturation (on both sides) and I/O saturation (on the receiver/secondary).
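If you also have shell access for ad-hoc checks, something along these lines (assuming the sysstat package is installed; the sar log path is the usual Linux default and varies by distro) would show whether a single core or the receiving disk is pegged:

```shell
# Per-core CPU utilization every 5 seconds -- a single saturated core
# can stall the single-threaded walsender/walreceiver even when the
# averaged CPU load looks fine.
mpstat -P ALL 5

# Extended disk stats on the secondary; %util near 100 or a high
# await value points at I/O saturation.
iostat -dxm 5

# Historical data, if sysstat's sar collector was running at the time
# (-u = CPU; -f selects the binary log for a given day of the month).
sar -u -f /var/log/sa/sa15
```

The per-core view matters here because a walsender is a single process; an 8-core box at 15% average load can still have one core at 100%.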
We do have Grafana and Zenoss data going way back; I'll see if I can get a login there.
I remember when we first set up streaming replication, back then under Postgres 9.0, the replication connection defaulted to using TLS/SSL, with SSL/TLS compression enabled at the time. The huge extra work this incurred on the CPUs involved regularly caused the WAL sender on the primary to break streaming replication, because it couldn't possibly keep up with the volume of data being pushed through the encrypted & compressed TCP connection over a 10G link. (Linux's excellent perf tool proved invaluable in determining the exact cause of the high CPU load inside the postgres processes; once we had re-compiled OpenSSL without compression, the problem went away.)
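For reference, the kind of perf invocation that tracks down this sort of thing looks roughly like the following (the `walsender` process-name match is illustrative; adjust to whatever your `ps` output shows):

```shell
# Live view of where the busiest walsender is spending its CPU time:
perf top -p "$(pgrep -f walsender | head -n1)"

# Or record a 30-second profile with call graphs and inspect it
# afterwards; in the compression case described above, zlib's
# deflate showed up at the top of the report.
perf record -g -p "$(pgrep -f walsender | head -n1)" -- sleep 30
perf report
```

This needs root (or appropriate perf_event_paranoid settings) and kernel symbols installed to give readable output.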
Now of course modern TLS library versions don't implement compression any more, and the streaming ciphers are most probably hardware accelerated for your combination of hard- and software, but the lesson we learned back then may still be worth keeping in mind...
Very interesting read. I just re-examined all of our settings in postgresql.conf, pg_hba.conf and recovery.conf, and we don't have SSL enabled anywhere there. I'm going to assume this isn't the bottleneck in our case, then.
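For what it's worth, you can also confirm at runtime (rather than inferring it from the config files) that the active walsender connections aren't encrypted. On a reasonably recent Postgres (pg_stat_ssl exists since 9.5), a query like this against the primary should show `ssl = f` for the replication backends; the connection parameters here are placeholders:

```shell
# Join pg_stat_ssl to pg_stat_replication to see whether the active
# walsender connections actually use SSL, and if so which cipher.
psql -U postgres -x -c "
  SELECT r.application_name, s.ssl, s.version, s.cipher
    FROM pg_stat_replication r
    JOIN pg_stat_ssl s ON s.pid = r.pid;"
```

If `ssl` comes back `f` for every row, encryption overhead really is off the table.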
Other than that... have you verified that the network link between your hosts can actually live up to your and your manager's expectations in terms of delivered bandwidth? iperf3 could help verify that; if the measured bandwidth for a single TCP stream is what you'd expect, you can probably rule out network-related concerns and concentrate on other potential bottlenecks.
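In case it's useful, a minimal iperf3 run between the two hosts would look something like this (the standby hostname is a placeholder):

```shell
# On the standby (receiver side), start the server:
iperf3 -s

# On the primary, run a single TCP stream for 30 seconds --
# this roughly mimics the one walsender connection:
iperf3 -c standby.example.com -t 30

# Optionally compare against several parallel streams to see whether
# a single connection is the limiting factor (e.g. per-flow shaping):
iperf3 -c standby.example.com -t 30 -P 4
```

A big gap between the single-stream and the parallel result would suggest something per-connection (window sizing, shaping, a middlebox) rather than raw link capacity.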
Thanks, I'll play around with some of these tools.