So, basically, when we go from 1 process to 4, the additional processes spend all of their time waiting rather than doing any useful work, and that's why there is no performance benefit. Presumably, they spend all their time waiting on ClientRead/ClientWrite because the network between the two machines is saturated; adding more processes that try to use it at maximum speed just means more time spent waiting for it to become available.
Do we have the same results for the local backup case, where the patch helped?
Here are the results for the local backup case (100GB data). Attaching the captured logs.
The total number of events (pg_stat_activity) captured during the local runs:
- 82 events for normal backup
- 31 events for parallel backup (-j 4)

BaseBackupRead wait event numbers (newly added):
- 24 in normal backup
- 14 in parallel backup (-j 4)

ClientWrite wait event numbers:
- 8 in normal backup
- 43 in parallel backup

ClientRead wait event numbers:
- 0 in normal backup
- 32 in parallel backup, across the different processes
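For what it's worth, tallying wait events from captured pg_stat_activity samples like these is easy to script. Here is a minimal Python sketch, assuming (hypothetically) that each captured log line ends with the wait event name as its last whitespace-separated field, with "-" for samples that had no wait event:

```python
from collections import Counter

def tally_wait_events(lines):
    """Count wait events from captured pg_stat_activity samples.

    Assumed line format (hypothetical): whitespace-separated fields
    ending with the wait_event name, e.g. "10001 walsender ClientWrite";
    samples with no wait event end with "-".
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        event = fields[-1]
        if event != "-":
            counts[event] += 1
    return counts

# Example with made-up sample lines:
samples = [
    "10001 walsender BaseBackupRead",
    "10001 walsender ClientWrite",
    "10002 walsender ClientWrite",
    "10002 walsender -",
]
print(tally_wait_events(samples))
```

Running something like this over the logs from each run would make it easy to compare the normal and parallel distributions side by side.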