Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker - Mailing list pgsql-bugs

From Masahiko Sawada
Subject Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
Date
Msg-id CAD21AoAVW54YBBPCXKubODCBX9+NS0XW5MDuFNZ9SRt_GBHs5Q@mail.gmail.com
Whole thread Raw
In response to Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker  (Andres Freund <andres@anarazel.de>)
Responses Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
List pgsql-bugs
Hi,

On Tue, Jul 25, 2023 at 12:39 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Thanks for finding / debugging this issue!
>
> On 2023-07-21 17:01:11 +0900, Masahiko Sawada wrote:
> > I've reproduced the issue in my environment with the provided script.
> > The crashed process is not a parallel vacuum worker actually but a
> > parallel worker for rebuilding the index. The scenario seems that when
> > detecting a deadlock, the process removes itself from the wait queue
> > by RemoveFromWaitQueue() (called by CheckDeadLock()), and then
> > RemoveFromWaitQueue() is called again by LockErrorCleanup() while
> > aborting the transaction. With commit 5764f611e, in
> > RemoveFromWaitQueue() we remove the process from the wait queue using
> > dclist_delete_from():
> >
> >     /* Remove proc from lock's wait queue */
> >     dclist_delete_from(&waitLock->waitProcs, &proc->links);
> > :
> >     /* Clean up the proc's own state, and pass it the ok/fail signal */
> >     proc->waitLock = NULL;
> >     proc->waitProcLock = NULL;
> >     proc->waitStatus = PROC_WAIT_STATUS_ERROR;
> >
> >  However, since dclist_delete_from() doesn't clear proc->links, in
> > LockErrorCleanup(), dlist_node_is_detached() still returns false:
> >
> >     if (!dlist_node_is_detached(&MyProc->links))
> >     {
> >         /* We could not have been granted the lock yet */
> >         RemoveFromWaitQueue(MyProc, lockAwaited->hashcode);
> >     }
>
> Indeed :(
>
>
> > leading to calling RemoveFromWaitQueue() again. I think we should use
> > dclist_delete_from_thoroughly() instead. With the attached patch, the
> > issue doesn't happen in my environment.
>
> Yep. Do you want to push that fix, or should I?

Thank you for your confirmation. I can push the fix if you're okay.
I've attached the patch.

>
>
> > Another thing I noticed is that the Assert(waitLock) in
> > RemoveFromWaitQueue() is useless actually, since we access *waitLock
> > before that:
> >
> > I think we should fix it as well. This fix is also included in the
> > attached patch.
>
> Don't really have an opinion on that. It's been this way for longer, afaict.

True. I would leave this part alone. We can fix it in a separate
commit later if necessary.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-bugs by date:

Previous
From: Andres Freund
Date:
Subject: Re: BUG #18031: Segmentation fault after deadlock within VACUUM's parallel worker
Next
From: PG Bug reporting form
Date:
Subject: BUG #18035: Assertion failure in jsonb_path_query