Re: [PATCH] Fix for infinite signal loop in parallel scan - Mailing list pgsql-hackers
From: Oleksii Kliukin
Subject: Re: [PATCH] Fix for infinite signal loop in parallel scan
Msg-id: 58C9F6AF-253E-4ADA-988D-83C926B608D1@hintbits.com
In response to: [PATCH] Fix for infinite signal loop in parallel scan (Chris Travers <chris.travers@adjust.com>)
Responses: Re: [PATCH] Fix for infinite signal loop in parallel scan
List: pgsql-hackers
> On 7. Sep 2018, at 17:57, Chris Travers <chris.travers@adjust.com> wrote:
>
> Hi;
>
> Attached is the patch we are fully testing at Adjust. There are a few non-obvious aspects of the code around where the patch hits. I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures). Automatic tests are difficult because it is a rare race condition which is difficult (or possibly impossible) to create automatically. Our current approach to testing is to force the issue using a debugger. Any ideas on how to reproduce this automatically are appreciated, but even on our heavily loaded systems only a very small portion of queries hit this case (the issue happens about once a week for us).

I did some manual testing on it, putting breakpoints on ResolveRecoveryConflictWithLock in the startup process on a streaming replica (configured with a very low max_standby_streaming_delay, i.e. 100ms) and on the posix_fallocate call in a normal backend on the same replica. At that point I also instructed gdb not to stop upon receiving SIGUSR1 (handle SIGUSR1 nonstop) and resumed execution of both the backend and the startup process.

Then I simulated a conflict by creating a rather big (3GB) table on the master, doing some updates on it, and then running an aggregate on the replica backend (i.e. 'select count(1) from test' with 'force_parallel_mode = true') where I had set the breakpoint. The aggregate and force_parallel_mode ensured that the query was executed as a parallel one, leading to allocation of the DSM.

Once the backend reached the posix_fallocate call and was waiting on the breakpoint, I ran 'vacuum full test' on the master, which (in the vast majority of cases) led to a conflict with the 'select count(1) from test' running on the replica, triggering the breakpoint in ResolveRecoveryConflictWithLock in the startup process, since the startup process tried to cancel the conflicting backend. At that point I continued execution of the startup process (which would loop in CancelVirtualTransaction, sending SIGUSR1 to the backend, while the backend waited to be resumed from its breakpoint). Right after that, I changed the size parameter in the backend to something that would make posix_fallocate run for a bit longer, typically ten to a hundred MB.

Once the backend process was resumed, it started receiving SIGUSR1 from the startup process, resulting in posix_fallocate exiting with EINTR. With the patch applied, the posix_fallocate loop terminated right away (because the QueryCancelPending flag was set to true) and the backend went through the cleanup, showing an ERROR about cancelling due to the conflict with recovery. Without the patch, it looped indefinitely in dsm_impl_posix_resize, while the startup process kept looping forever, trying to send SIGUSR1.

One thing I'm wondering is whether we could achieve the same by just blocking SIGUSR1 for the duration of posix_fallocate?

Cheers,
Oleksii Kliukin
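
For illustration, a minimal sketch of the retry-loop behaviour described above, assuming the loop sits in dsm_impl_posix_resize() with fd and size in scope and consults the backend's QueryCancelPending/ProcDiePending flags (from miscadmin.h); this is a sketch of the approach, not the exact patch:

    /*
     * Retry posix_fallocate() on EINTR, but stop retrying once a cancel or
     * die interrupt is pending, so the backend can leave the loop, clean up,
     * and report the recovery-conflict error instead of looping forever.
     * Note that posix_fallocate() returns the error number directly rather
     * than setting errno.
     */
    int         rc;

    do
    {
        rc = posix_fallocate(fd, 0, size);
    } while (rc == EINTR && !(QueryCancelPending || ProcDiePending));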
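
And a hypothetical sketch of the alternative raised at the end, blocking SIGUSR1 around the call so posix_fallocate() is never interrupted and cannot return EINTR; plain sigprocmask() is used here for clarity (a backend would presumably manipulate its signal mask through the usual PostgreSQL facilities), and the helper name is made up:

    #include <fcntl.h>
    #include <signal.h>

    /* Hypothetical helper: resize the DSM segment while holding off SIGUSR1. */
    static int
    resize_with_sigusr1_blocked(int fd, off_t size)
    {
        sigset_t    block;
        sigset_t    orig;
        int         rc;

        sigemptyset(&block);
        sigaddset(&block, SIGUSR1);
        sigprocmask(SIG_BLOCK, &block, &orig);  /* hold off the startup process's signal */
        rc = posix_fallocate(fd, 0, size);      /* runs to completion, no EINTR */
        sigprocmask(SIG_SETMASK, &orig, NULL);  /* any pending SIGUSR1 is delivered here */

        return rc;
    }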