Re: Parallel worker hangs while handling errors. - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Parallel worker hangs while handling errors.
Date
Msg-id CALj2ACX=iLQyY9-htQvvXG1ER_6wM5DELy8k5quA+gRLHDSxDA@mail.gmail.com
Whole thread Raw
In response to Re: Parallel worker hangs while handling errors.  (vignesh C <vignesh21@gmail.com>)
Responses Re: Parallel worker hangs while handling errors.
List pgsql-hackers
On Thu, Jul 23, 2020 at 12:54 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks for reviewing and adding your thoughts, My comments are inline.
>
> On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > The same hang issue can occur(though I'm not able to back it up with a
> > use case), in the cases from wherever the EmitErrorReport() is called
> > from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
> > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
> > postgres.c.
> >
>
> I'm not sure if this can occur in other cases.
>

I checked further on this point: Yes, it can't occur for the other
cases, as mq_putmessage() gets only used for parallel
workers(ParallelWorkerMain()  --> pq_redirect_to_shm_mq()).

>
> > Note that, in all sigsetjmp blocks, we intentionally
> > HOLD_INTERRUPTS(), to not cause any issues while performing error
> > handling, I'm concerned here that now, if we directly call
> > BackgroundWorkerUnblockSignals() which will open up all the signals
> > and our main intention of holding interrupts or block signals may go
> > away.
> >
> > Since the main problem for this hang issue is because of blocking
> > SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
> > instead of unblocking all signals? I tried this with parallel copy
> > hang, the issue is resolved.
> >
>
> On putting further thoughts on this, I feel just unblocking SIGUSR1
> would be the right approach in this case. I'm attaching a new patch
> which unblocks SIGUSR1 signal. I have verified that the original issue
> with WIP parallel copy patch gets fixed. I have made changes only in
> bgworker.c as we require the parallel worker to receive this signal
> and continue processing. I have not included the changes for other
> processes as I'm not sure if this scenario is applicable for other
> processes.
>

Since we are sure that this hang issue can occur only for parallel
workers, and the change in StartBackgroundWorker's sigsetjmp's block
should only be made for parallel worker cases. And also there can be a
lot of other callbacks execution and processing in proc_exit() for
which we might not need the SIGUSR1 unblocked. So, let's undo the
unblocking right after EmitErrorReport() to not cause any new issues.

Attaching a modified v2 patch: it has the unblocking for only for
parallel workers, undoing it after EmitErrorReport(), and some
adjustments in the comment.

I verified this fix for the parallel copy hang issue. And also make
check and make check-world passes.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Amul Sul
Date:
Subject: Re: [Patch] ALTER SYSTEM READ ONLY
Next
From: David Rowley
Date:
Subject: Re: Parallel Seq Scan vs kernel read ahead