Re: Trap errors from streaming child in pg_basebackup to exit early - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Trap errors from streaming child in pg_basebackup to exit early
Date
Msg-id CALj2ACXnS82K2ponhai7_0DBWMxqWcjXa6j2bcL7D9kxgzu1Ug@mail.gmail.com
Whole thread Raw
In response to Trap errors from streaming child in pg_basebackup to exit early  (Daniel Gustafsson <daniel@yesql.se>)
Responses Re: Trap errors from streaming child in pg_basebackup to exit early  (Daniel Gustafsson <daniel@yesql.se>)
List pgsql-hackers
On Thu, Aug 26, 2021 at 2:55 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> When using pg_basebackup with WAL streaming (-X stream), we have observed on a
> number of times in production that the streaming child exited prematurely (to
> no fault of the code it seems, most likely due to network middleboxes), which
> cause the backup to fail but only after it has run to completion.  On long
> running backups this can consume a lot of time before it’s noticed.

Hm.

> By trapping the failure of the streaming process we can instead exit early to
> allow the user to fix and/or restart the process.
>
> The attached adds a SIGCHLD handler for Unix, and catch the returnvalue from
> the Windows thread, in order to break out early from the main loop.  It still
> needs a test, and proper testing on Windows, but early feedback on the approach
> would be appreciated.

Here are some comments on the patch:
1) Do we need volatile keyword here to read the value of the variables
always from the memory?
+static volatile sig_atomic_t bgchild_exited = false;

2) Do we need #ifndef WIN32 ... #endif around sigchld_handler function
definition?

3) I'm not sure if the new value of bgchild_exited being set in the
child thread will reflect in the main process on Windows? But
theoretically, I can understand that the memory will be shared between
the main process thread and child thread.
#ifdef WIN32
/*
* In order to signal the main thread of an ungraceful exit we
* set the flag used on Unix to signal SIGCHLD.
*/
bgchild_exited = true;
#endif

4) How about "set the same flag that we use on Unix to signal
SIGCHLD."  instead of "* set the flag used on Unix to signal
SIGCHLD."?

5) How about "background WAL receiver terminated unexpectedly" instead
of "log streamer child terminated unexpectedly"? This will be in sync
with the existing message "starting background WAL receiver". "log
streamer" is the word used internally in the code, user doesn't know
it with that name.

6) How about giving the exit code (like postmaster's reaper function
does) instead of just a message saying unexpected termination? It will
be useful to know for what reason the process exited. For Windows, we
can use GetExitCodeThread (I'm referring to the code around waitpid in
pg_basebackup) and for Unix we can use waitpid.

Regards,
Bharath Rupireddy.



pgsql-hackers by date:

Previous
From: Andrey Borodin
Date:
Subject: Re: suboverflowed subtransactions concurrency performance optimize
Next
From: David Rowley
Date:
Subject: Re: Avoid choose invalid number of partitions (src/backend/executor/nodeAgg.c)