In the attached files, 22607 is the pid of a waiting session (we have had around 100 sessions in this state)
and 22591 is the pid of the startup process.
--
Victor Yegorov
Some additional notes about this problem:
1) deadlock_timeout cannot resolve this problem, or simply cannot detect it properly.
2) max_standby_streaming_delay doesn't help either (given a sufficiently high rate of new problem queries from the application).
The only way to fix it on a loaded replica is to block all new incoming connections via pg_hba.conf, kill all locked queries, and verify that the offending part of the WAL has been replayed. None of the built-in mechanisms designed to deal with such issues work in that case; only manual intervention does.
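For illustration, the "kill the locked queries, then check replay" part of that procedure could be scripted roughly as below. This is only a sketch under several assumptions that are not from the reports above: it takes any DB-API 2.0 cursor (psycopg2, for example) opened as a superuser on the replica, it uses PostgreSQL 10+ names (wait_event_type, pg_last_wal_replay_lsn), and it assumes the stuck sessions show up with wait_event_type = 'Lock'. The helper names kill_lock_waiters and replay_advanced are made up for this example. Blocking new connections in pg_hba.conf (followed by a reload) still has to be done first, by hand.

```python
# Sketch only: hypothetical helpers, not a built-in PostgreSQL mechanism.
# "cur" is any DB-API 2.0 cursor connected to the replica as a superuser.

# Assumption: the stuck sessions are reported as waiting on a heavyweight lock.
FIND_LOCK_WAITERS = """
SELECT pid
  FROM pg_stat_activity
 WHERE wait_event_type = 'Lock'
   AND pid <> pg_backend_pid()
"""

TERMINATE_BACKEND = "SELECT pg_terminate_backend(%s)"

# Compares the current replay position with one recorded earlier.
REPLAY_MOVED = "SELECT pg_last_wal_replay_lsn() > %s::pg_lsn"


def kill_lock_waiters(cur):
    """Terminate every session waiting on a lock; return the pids signalled."""
    cur.execute(FIND_LOCK_WAITERS)
    pids = [row[0] for row in cur.fetchall()]
    for pid in pids:
        cur.execute(TERMINATE_BACKEND, (pid,))
    return pids


def replay_advanced(cur, lsn_before):
    """Return True if WAL replay has moved past lsn_before (e.g. '0/3000000')."""
    cur.execute(REPLAY_MOVED, (lsn_before,))
    return cur.fetchone()[0]
```

The intended use would be: record pg_last_wal_replay_lsn() while the replica is stuck, call kill_lock_waiters(cur), and then poll replay_advanced(cur, recorded_lsn) until it returns True before letting connections back in.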
I have seen such issues about 10 times over the last year on different projects. Unfortunately, it isn't a once-in-a-lifetime issue.