Re: [PATCH] Fix for infinite signal loop in parallel scan - Mailing list pgsql-hackers
From: Chris Travers
Subject: Re: [PATCH] Fix for infinite signal loop in parallel scan
Msg-id: CAN-RpxBYbXho+e7JgAdi0CbUeGObUpgwN2+CGU-Lzwmc78AEcg@mail.gmail.com
In response to: Re: [PATCH] Fix for infinite signal loop in parallel scan (Oleksii Kliukin <alexk@hintbits.com>)
List: pgsql-hackers
On Mon, Sep 17, 2018 at 2:59 PM Oleksii Kliukin <alexk@hintbits.com> wrote:
> On 7. Sep 2018, at 17:57, Chris Travers <chris.travers@adjust.com> wrote:
>
> Hi;
>
> Attached is the patch we are currently testing at Adjust. There are a few non-obvious aspects of the code around where the patch hits. I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures). Automated tests are difficult because this is due to a rare race condition which is difficult (or possibly impossible) to create automatically. Our current approach to testing is to force the issue using a debugger. Any ideas on how to reproduce this automatically are appreciated, but even on our heavily loaded systems only a very small fraction of queries hit this case (the issue happens about once a week for us).
I did some manual testing on it, putting a breakpoint on ResolveRecoveryConflictWithLock in the startup process on a streaming replica (configured with a very low max_standby_streaming_delay, i.e. 100ms) and another on the posix_fallocate call in a normal backend on the same replica. At this point I also instructed gdb not to stop upon receiving SIGUSR1 (handle SIGUSR1 nostop) and resumed execution of both the backend and the startup process.
Then I simulated a conflict by creating a rather big (3GB) table on the master, doing some updates on it, and then running an aggregate on the replica backend (i.e. 'select count(1) from test' with 'force_parallel_mode = true') where I had set the breakpoint. The aggregate and force_parallel_mode ensured that the query was executed as a parallel one, leading to allocation of a DSM segment.
Once the backend reached the posix_fallocate call and was waiting at the breakpoint, I called 'vacuum full test' on the master, which (in the vast majority of cases) led to a conflict on the replica running 'select from test', triggering the breakpoint in ResolveRecoveryConflictWithLock in the startup process, since the startup process tried to cancel the conflicting backend. At that point I continued execution of the startup process (which would loop in CancelVirtualTransaction, sending SIGUSR1 to the backend while the backend waited to be resumed from the breakpoint). Right after that, I changed the size parameter on the backend to something that would make posix_fallocate run for a bit longer, typically ten to a hundred MB. Once the backend process was resumed, it started receiving SIGUSR1 from the startup process, resulting in posix_fallocate exiting with EINTR.
With the patch applied, the posix_fallocate loop terminated right away (because the QueryCancelPending flag was set to true) and the backend went through cleanup, reporting an ERROR about being canceled due to the conflict with recovery. Without the patch, it looped indefinitely in dsm_impl_posix_resize, while the startup process looped forever, trying to send SIGUSR1.
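
For reference, here is a minimal sketch of the kind of interrupt-aware retry loop being described for dsm_impl_posix_resize. It is illustrative only, not the attached patch: the helper name is made up, and it assumes the fix simply stops retrying on EINTR once the standard backend interrupt flags (QueryCancelPending, ProcDiePending) are set.

    /*
     * Illustrative sketch only, not the attached patch: retry
     * posix_fallocate() on EINTR, but stop retrying once the backend has
     * been asked to cancel or die.  ProcDiePending and QueryCancelPending
     * are the backend's standard interrupt flags (miscadmin.h); the helper
     * name is made up.  Without the flag checks in the loop condition, a
     * startup process that keeps re-sending SIGUSR1 can make this loop
     * spin forever, which is the behaviour described above.
     */
    #include "postgres.h"

    #include <fcntl.h>

    #include "miscadmin.h"

    static int
    resize_segment_with_fallocate(int fd, off_t size)
    {
        int         rc;

        do
        {
            /* posix_fallocate() returns an errno value directly, not -1 */
            rc = posix_fallocate(fd, 0, size);
        } while (rc == EINTR && !(ProcDiePending || QueryCancelPending));

        return rc;
    }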
One thing I’m wondering is whether we could do the same by just blocking SIGUSR1
for the duration of posix_fallocate?
Many thanks!
If we were to do that, I would say we should mask all signals we can mask during the call.
I don't have a problem going down that road instead if people think it is better.
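
To illustrate what that alternative might look like (a hypothetical helper, not code from the thread or the patch), one could block all blockable signals around the call and restore the previous mask afterwards; SIGKILL and SIGSTOP cannot be blocked in any case.

    #include <fcntl.h>
    #include <signal.h>

    /*
     * Hypothetical sketch of the signal-masking alternative: block all
     * blockable signals for the duration of posix_fallocate() so it cannot
     * be interrupted with EINTR, then restore the caller's mask.
     */
    static int
    fallocate_with_signals_blocked(int fd, off_t size)
    {
        sigset_t    block_all;
        sigset_t    saved_mask;
        int         rc;

        sigfillset(&block_all);
        sigprocmask(SIG_SETMASK, &block_all, &saved_mask);

        rc = posix_fallocate(fd, 0, size);

        /* Restore the original mask; any pending SIGUSR1 is delivered here. */
        sigprocmask(SIG_SETMASK, &saved_mask, NULL);

        return rc;
    }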
As a note, we have a patched version of PostgreSQL 10.5 ready to deploy to a system affected by this and expect that to be done this week.
> Cheers,
> Oleksii Kliukin
Best Regards,
Chris Travers
Head of Database
Saarbrücker Straße 37a, 10405 Berlin