I have applied both patches separately and ran the regression tests with each. No new test cases fail with either patch.
The issue is fixed by both patches; however, the fix from Tom looks more elegant. I haven't done a detailed code review.
On Fri, Jan 31, 2020 at 12:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
vignesh C <vignesh21@gmail.com> writes:
> On Wed, Jan 29, 2020 at 6:54 PM Ahsan Hadi <ahsan.hadi@gmail.com> wrote:
>> Can you share a test case or steps that you are using to reproduce
>> this issue? Are you reproducing this using a debugger?

> I could reproduce with the following steps:
> Make cluster setup.
> Create few tables.
> Take a dump in directory format using pg_dump.
> Restore the dump generated above using pg_restore with very high
> number for --jobs options around 600.
I agree this is quite broken. Another way to observe the crash is to make the fork() call randomly fail, as per booby-trap-fork.patch below (not intended for commit, obviously).
I don't especially like the proposed patch, though, as it introduces a great deal of confusion into what ParallelState.numWorkers means. I think it's better to leave that as being the allocated array size, and instead clean up all the fuzzy thinking about whether workers are actually running or not. Like 0001-fix-worker-status.patch below.