Re: [PATCH] Fix for infinite signal loop in parallel scan - Mailing list pgsql-hackers
From: Oleksii Kliukin
Subject: Re: [PATCH] Fix for infinite signal loop in parallel scan
Msg-id: 58C9F6AF-253E-4ADA-988D-83C926B608D1@hintbits.com
In response to: [PATCH] Fix for infinite signal loop in parallel scan (Chris Travers <chris.travers@adjust.com>)
Responses: Re: [PATCH] Fix for infinite signal loop in parallel scan
List: pgsql-hackers
> On 7. Sep 2018, at 17:57, Chris Travers <chris.travers@adjust.com> wrote:
>
> Hi;
>
> Attached is the patch we are fully testing at Adjust. There are a few non-obvious aspects of the code around where the patch hits. I have run make check on Linux and MacOS, and make check-world on Linux (check-world fails on MacOS on all versions and all branches due to ecpg failures). Automatic tests are difficult because it is a rare race condition which is difficult (or possibly impossible) to create automatically. Our current approach to testing is to force the issue using a debugger. Any ideas on how to reproduce this automatically are appreciated, but even on our heavily loaded systems only a very small portion of queries hit this case (the issue happens about once a week for us).

I did some manual testing on it, putting breakpoints on ResolveRecoveryConflictWithLock in the startup process on a streaming replica (configured with a very low max_standby_streaming_delay, i.e. 100ms) and on the posix_fallocate call in a normal backend on the same replica. At that point I also instructed gdb not to stop upon receiving SIGUSR1 (handle SIGUSR1 nonstop) and resumed execution of both the backend and the startup process.

Then I simulated a conflict by creating a rather big (3GB) table on the master, doing some updates on it, and then running an aggregate on the replica backend (i.e. 'select count(1) from test' with 'force_parallel_mode = true') where I had set the breakpoint. The aggregate and force_parallel_mode ensured that the query was executed as a parallel one, leading to allocation of the DSM.

Once the backend reached the posix_fallocate call and was waiting on the breakpoint, I ran 'vacuum full test' on the master, which (in the vast majority of cases) led to a conflict with the 'select count(1) from test' running on the replica, triggering the breakpoint in ResolveRecoveryConflictWithLock in the startup process, since the startup process tried to cancel the conflicting backend. At that point I continued execution of the startup process (which would loop in CancelVirtualTransaction, sending SIGUSR1 to the backend, while the backend waited to be resumed from its breakpoint). Right after that, I changed the size parameter in the backend to something that would make posix_fallocate run for a bit longer, typically ten to a hundred MB.

Once the backend process was resumed, it started receiving SIGUSR1 from the startup process, resulting in posix_fallocate exiting with EINTR. With the patch applied, the posix_fallocate loop terminated right away (because the QueryCancelPending flag was set to true) and the backend went through the cleanup, showing an ERROR about cancelling due to the conflict with recovery. Without the patch, it looped indefinitely in dsm_impl_posix_resize, while the startup process kept looping forever, trying to send SIGUSR1.

One thing I'm wondering is whether we could achieve the same by just blocking SIGUSR1 for the duration of posix_fallocate?

Cheers,
Oleksii Kliukin
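
For illustration, a minimal sketch of the retry-loop behaviour described above, assuming the loop sits in dsm_impl_posix_resize() with fd and size in scope and consults the backend's QueryCancelPending/ProcDiePending flags (from miscadmin.h); this is a sketch of the approach, not the exact patch:

    /*
     * Retry posix_fallocate() on EINTR, but stop retrying once a cancel or
     * die interrupt is pending, so the backend can leave the loop, clean up,
     * and report the recovery-conflict error instead of looping forever.
     * Note that posix_fallocate() returns the error number directly rather
     * than setting errno.
     */
    int         rc;

    do
    {
        rc = posix_fallocate(fd, 0, size);
    } while (rc == EINTR && !(QueryCancelPending || ProcDiePending));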
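
And a hypothetical sketch of the alternative raised at the end, blocking SIGUSR1 around the call so posix_fallocate() is never interrupted and cannot return EINTR; plain sigprocmask() is used here for clarity (a backend would presumably manipulate its signal mask through the usual PostgreSQL facilities), and the helper name is made up:

    #include <fcntl.h>
    #include <signal.h>

    /* Hypothetical helper: resize the DSM segment while holding off SIGUSR1. */
    static int
    resize_with_sigusr1_blocked(int fd, off_t size)
    {
        sigset_t    block;
        sigset_t    orig;
        int         rc;

        sigemptyset(&block);
        sigaddset(&block, SIGUSR1);
        sigprocmask(SIG_BLOCK, &block, &orig);  /* hold off the startup process's signal */
        rc = posix_fallocate(fd, 0, size);      /* runs to completion, no EINTR */
        sigprocmask(SIG_SETMASK, &orig, NULL);  /* any pending SIGUSR1 is delivered here */

        return rc;
    }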