When using pg_basebackup with WAL streaming (-X stream), we have observed a
number of times in production that the streaming child exited prematurely
(through no fault of the code it seems, most likely due to network
middleboxes), which causes the backup to fail, but only after it has run to
completion. On long-running backups this can consume a lot of time before
it’s noticed.
By trapping the failure of the streaming process we can instead exit early to
allow the user to fix and/or restart the process.
The attached patch adds a SIGCHLD handler for Unix, and catches the return
value from the Windows thread, in order to break out early from the main loop.
It still needs a test, and proper testing on Windows, but early feedback on the
approach would be appreciated.
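
To illustrate the Unix side of the idea, here is a minimal standalone sketch.
It is not the patch itself: the names streamer_died, sigchld_handler and
backup_aborts_early are hypothetical, the forked child merely stands in for
the WAL streaming process, and the sleep stands in for receiving backup data.
The Windows thread case is not covered.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Set from the signal handler; hypothetical name, not taken from the patch. */
static volatile sig_atomic_t streamer_died = 0;

/* Only sets a flag, which is async-signal-safe. */
static void
sigchld_handler(int signo)
{
	(void) signo;
	streamer_died = 1;
}

/*
 * Simulate a backup loop with a child that exits prematurely.
 * Returns 1 if the loop was abandoned early because the child exited,
 * 0 if every chunk was processed, and -1 on setup failure.
 */
static int
backup_aborts_early(int total_chunks)
{
	struct sigaction sa;
	struct sigaction old_sa;
	pid_t		child;
	int			aborted = 0;
	int			i;

	streamer_died = 0;

	/* Install the handler before forking so the exit cannot be missed. */
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = sigchld_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	if (sigaction(SIGCHLD, &sa, &old_sa) != 0)
		return -1;

	child = fork();
	if (child < 0)
	{
		sigaction(SIGCHLD, &old_sa, NULL);
		return -1;
	}
	if (child == 0)
		_exit(1);				/* stand-in for the streamer dying */

	for (i = 0; i < total_chunks; i++)
	{
		struct timespec ts = {0, 10 * 1000 * 1000};	/* 10 ms */

		/* Check the flag once per iteration and bail out early. */
		if (streamer_died)
		{
			aborted = 1;
			break;
		}

		/* Stand-in for receiving one chunk of the base backup. */
		nanosleep(&ts, NULL);
	}

	/* Reap the child and restore the previous SIGCHLD disposition. */
	waitpid(child, NULL, 0);
	sigaction(SIGCHLD, &old_sa, NULL);

	return aborted;
}
```

Calling backup_aborts_early(1000) should report an early abort well before
all 1000 simulated chunks are processed, since the child exits right away.
The handler does nothing beyond setting the flag, and the main loop decides
how to react, which keeps the signal handler async-signal-safe.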
--
Daniel Gustafsson https://vmware.com/