On Thu, Aug 26, 2021 at 2:55 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> When using pg_basebackup with WAL streaming (-X stream), we have observed on a
> number of times in production that the streaming child exited prematurely (to
> no fault of the code it seems, most likely due to network middleboxes), which
> cause the backup to fail but only after it has run to completion. On long
> running backups this can consume a lot of time before it’s noticed.
Hm.
> By trapping the failure of the streaming process we can instead exit early to
> allow the user to fix and/or restart the process.
>
> The attached adds a SIGCHLD handler for Unix, and catch the returnvalue from
> the Windows thread, in order to break out early from the main loop. It still
> needs a test, and proper testing on Windows, but early feedback on the approach
> would be appreciated.
Here are some comments on the patch:
1) Do we need volatile keyword here to read the value of the variables
always from the memory?
+static volatile sig_atomic_t bgchild_exited = false;
2) Do we need #ifndef WIN32 ... #endif around sigchld_handler function
definition?
3) I'm not sure if the new value of bgchild_exited being set in the
child thread will reflect in the main process on Windows? But
theoretically, I can understand that the memory will be shared between
the main process thread and child thread.
#ifdef WIN32
/*
* In order to signal the main thread of an ungraceful exit we
* set the flag used on Unix to signal SIGCHLD.
*/
bgchild_exited = true;
#endif
4) How about "set the same flag that we use on Unix to signal
SIGCHLD." instead of "* set the flag used on Unix to signal
SIGCHLD."?
5) How about "background WAL receiver terminated unexpectedly" instead
of "log streamer child terminated unexpectedly"? This will be in sync
with the existing message "starting background WAL receiver". "log
streamer" is the word used internally in the code, user doesn't know
it with that name.
6) How about giving the exit code (like postmaster's reaper function
does) instead of just a message saying unexpected termination? It will
be useful to know for what reason the process exited. For Windows, we
can use GetExitCodeThread (I'm referring to the code around waitpid in
pg_basebackup) and for Unix we can use waitpid.
Regards,
Bharath Rupireddy.