Re: Trap errors from streaming child in pg_basebackup to exit early - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Trap errors from streaming child in pg_basebackup to exit early
Msg-id YgynUafCyIu3jIhC@paquier.xyz
In response to Re: Trap errors from streaming child in pg_basebackup to exit early  (Daniel Gustafsson <daniel@yesql.se>)
Responses Re: Trap errors from streaming child in pg_basebackup to exit early  (Daniel Gustafsson <daniel@yesql.se>)
List pgsql-hackers
On Wed, Sep 29, 2021 at 01:18:40PM +0200, Daniel Gustafsson wrote:
> So there is one mention of a background WAL receiver already in there, but it's
> pretty inconsistent as to what we call it.  For now I've changed the messaging
> in this patch to say "background process", leaving making this all consistent
> for a follow-up patch.
>
> The attached fixes the above, as well as the typo mentioned off-list and is
> rebased on top of todays HEAD.

I have been looking a bit at this patch, and ran some tests on Windows.
With the patch, pg_basebackup catches the failure of the thread
streaming the WAL segments and stops before completing the base
backup, while HEAD waits until the backup finishes.  Testing this
scenario is actually simple: issue pg_terminate_backend() on the WAL
sender that streams the WAL with START_REPLICATION, while throttling
the base backup.

Could you add a test to automate this scenario?  As far as I can see,
something like the following should be stable even for Windows:
1) Run a pg_basebackup in the background with IPC::Run, using
--max-rate with a minimal value to slow down the base backup, for slow
machines.  013_crash_restart.pl does that as one example with $killme.
2) Find out the WAL sender doing START_REPLICATION in the backend, and
issue pg_terminate_backend() on it.
3) Use a variant of pump_until() on the pg_basebackup process and
check for one or more failure patterns.  We should refactor this
part, actually.  If this new test uses the same logic, that would make
three tests doing that, together with 022_crash_temp_files.pl and
013_crash_restart.pl.  The CI should be fine to provide any feedback
with the test in place, though I am also fine with testing things on
my box.
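
As a rough illustration of the pump-until idea in step 3, here is a
minimal sketch in Python rather than the Perl used by the TAP tests.
It is not the pump_until() from 013_crash_restart.pl: the function
signature, the stand-in child process and the failure pattern
"background process terminated unexpectedly" are all assumptions made
for the example.  The point is only the control flow: keep reading the
child's stderr until a pattern shows up, the stream closes, or a
timeout expires.

```python
import re
import subprocess
import sys
import threading


def pump_until(proc, pattern, timeout):
    """Read proc.stderr until `pattern` matches, EOF is reached, or
    `timeout` seconds elapse.  Returns (matched, collected_text).

    Hypothetical helper for illustration, not the Perl pump_until().
    """
    collected = []
    matched = threading.Event()

    def reader():
        # Accumulate output line by line and test the whole buffer,
        # so a pattern is found no matter which line carries it.
        for line in proc.stderr:
            collected.append(line)
            if re.search(pattern, "".join(collected)):
                matched.set()
                return

    # A reader thread with a bounded join avoids select() on pipes,
    # which does not work on Windows.
    thread = threading.Thread(target=reader, daemon=True)
    thread.start()
    thread.join(timeout)
    return matched.is_set(), "".join(collected)


# Stand-in for a throttled pg_basebackup whose streaming child died.
# The message text is an assumed example, not taken from the patch.
CHILD = (
    "import sys, time\n"
    "time.sleep(0.2)\n"
    "sys.stderr.write('stand-in: background process terminated unexpectedly\\n')\n"
    "sys.exit(1)\n"
)


def run_demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stderr=subprocess.PIPE,
        text=True,
    )
    ok, output = pump_until(
        proc, r"background process terminated unexpectedly", timeout=30
    )
    if not ok:
        proc.kill()  # do not wait forever on a child that never failed
    proc.wait()
    return ok, proc.returncode, output


if __name__ == "__main__":
    ok, returncode, output = run_demo()
    print("pattern seen:", ok, "exit code:", returncode)
```

In a real test the child would be the pg_basebackup run started in
step 1, and the pattern would be whatever error message the patch ends
up emitting.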
--
Michael
