one take-away from the Gitlab Post-Mortem[1] appears to be that after their secondary lost replication, they were confused about what pg_basebackup was doing when they tried to rebuild it. It just sat there and did nothing (even with --verbose), so they assumed something was wrong with either the primary or the connection, and restarted it several times.
AFAICT, it turns out the checkpoint was written on the master (they probably did not use -c fast), but this wasn't obvious to them:
Yeah, I've seen this happen to a number of people. I think that sounds like what's happened here as well. I've considered things in the line of the patch you posted, but never got around to actually doing anything about it.
ISTM that even with WAL streaming, nothing would be written on the client server until the checkpoint is complete, as do_pg_start_backup() runs the checkpoint and only returns the starting WAL location afterwards.
The attached (untested) patch is to kick of a discussion on how to improve the situation, it is supposed to mention the checkpoint when --verbose is used and adds a paragraph about the checkpoint being run to the Notes section of the documentation.
Docs look good to me, other than claiming that pg_basebackup runs on a server (it can run anywhere). I would just say "during which pg_basebackup will appear idle". How does that sound to you?
As for the code, while I haven't tested it, isn't the "checkpoint completed" message in the wrong place? Doesn't PQsendQuery() complete immediately, and the check needs to be put *after* the PQgetResult() call?