Re: Trap errors from streaming child in pg_basebackup to exit early - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: Trap errors from streaming child in pg_basebackup to exit early
Date
Msg-id CABUevExTLbYMLcZAvyD1sgqm2MS=bVh-eON+Spm0VMf+SXdDzQ@mail.gmail.com
Whole thread Raw
In response to Re: Trap errors from streaming child in pg_basebackup to exit early  (Daniel Gustafsson <daniel@yesql.se>)
Responses Re: Trap errors from streaming child in pg_basebackup to exit early  (Daniel Gustafsson <daniel@yesql.se>)
List pgsql-hackers
On Fri, Sep 3, 2021 at 11:53 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 1 Sep 2021, at 12:28, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Wed, Sep 1, 2021 at 1:56 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> >> A v2 with the above fixes is attached.
> >
> > Thanks for the updated patch. Here are some comments:
> >
> > 1) Do we need to set bgchild = -1 before the exit(1); in the code
> > below so that we don't kill(bgchild, SIGTERM); unnecessarily in
> > kill_bgchild_atexit?
>
> Good point. We can also inspect bgchild_exited in kill_bgchild_atexit.
>
> > 2) Missing "," after "On Windows, we use a ....."
> > + * that time. On Windows we use a background thread which can communicate
> >
> > 3) How about "/* Flag to indicate whether or not child process exited
> > */" instead of +/* State of child process */?
>
> Fixed.
>
> > 4) Instead of just exiting from the main pg_basebackup process when
> > the child WAL receiver dies, can't we think of restarting the child
> > process, probably with the WAL streaming position where it left off or
> > stream from the beginning? This way, the work that the main
> > pg_basebackup has done so far doesn't get wasted. I'm not sure if this
> > affects the pg_basebackup functionality. We can restart the child
> > process for 1 or 2 times, if it still dies, we can kill the main
> > pg_baasebackup process too. Thoughts?
>
> I was toying with the idea, but I ended up not pursuing it.  This error is well
> into the “really shouldn’t happen, but can” territory and it’s quite likely
> that some level of manual intervention is required to make it successfully
> restart.  I’m not convinced that adding complicated logic to restart (and even
> more complicated tests to simulate and test it) will be worthwhile.
>

I think the restart scenario while nice, definitely means moving the
goalposts quite far. Let's get this detection in first at least, and
then we can always consider that a separate patch in the future.

Might be worth noting in one of the comments the difference in
behaviour if the backend process/thread *crashes* -- that is, on Unix
it will be detected via the signal handler and on Windows the whole
process including the main thread will die, so there is nothing to
detect.

Other places in the code just refers to the background process as "the
background process".  The term "WAL receiver" is new from this patch.
While I agree it's in many ways a better term, I think (1) we should
try to be consistent, and (2) maybe use a different term than "WAL
receiver" specifically because we have a backend component with that
name.

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



pgsql-hackers by date:

Previous
From: Shinya Kato
Date:
Subject: Re: (LOCK TABLE options) “ONLY” and “NOWAIT” are not yet implemented
Next
From: Andrew Dunstan
Date:
Subject: Re: Couldn't we mark enum_in() as immutable?